Manipulating genomic data with Hail

28/08/2017

hail

Hail is an open source tool for analyzing genomic data. It has a bunch of features of which you can read about here. It has a straightforward python API and is built on top of Apache Spark.

One of the great features of Hail is how simple it makes it manipulate genomic data into a tabular format which can be ingested by traditional relational databases.

Data

We’ll be working with data from the gnomad study. To keep things reasonable but still somewhat realistic, we’ll only look at chromosome 21 which about 1.26Gb compressed.

Process

Setup:

from hail import *
hc = HailContext()

Printing summary information

In:

vds = hc.import_vcf("gnomad.genomes.r2.0.1.sites.21.vcf")
vds.summarize().report()

Out:

         Samples: 0
        Variants: 3185805
       Call Rate: nan
         Contigs: ['21']
   Multiallelics: 198060
            SNPs: 3006007
            MNPs: 0
      Insertions: 182686
       Deletions: 275012
 Complex Alleles: 7
    Star Alleles: 0
     Max Alleles: 7

Down-sample variants

In:

vds = hc.import_vcf("gnomad.genomes.r2.0.1.sites.21.vcf")
vds = vds.sample_variants(0.01)
vds.write("sample.vds")
vds = vds.read("sample.vds")
vds.summarize().report()

Out:

         Samples: 0
        Variants: 34626
       Call Rate: nan
         Contigs: ['21']
   Multiallelics: 0
            SNPs: 30044
            MNPs: 0
      Insertions: 1792
       Deletions: 2790
 Complex Alleles: 0
    Star Alleles: 0
     Max Alleles: 2

Export variants into a tsv

Input:

vds = hc.read("sample.vds")
vds = vds.split_multi()
kt = vds.variants_table()
kt = kt.annotate("rsid = va.rsid") \
       .annotate("qual = va.qual") \
       .annotate("AF = va.info.AF[0]") \
       .annotate("AC = va.info.AC[0]") \
       .drop('va')

kt.export("variants.tsv")

Output:

head variants.tsv
v	rsid	qual	AF	AC
21:9411631:A:G	NA	2.57190e+02	0.00000e+00	0
21:9411764:C:T	NA	9.96130e+02	3.24591e-05	1
21:9411969:T:C	rs557972256	1.22599e+03	9.74152e-05	3
21:9412132:A:C	NA	2.73530e+02	0.00000e+00	0
21:9412339:C:T	rs190851473	3.44782e+03	6.59631e-05	2
21:9412958:G:A	NA	1.80470e+02	0.00000e+00	0
21:9413068:G:A	NA	3.08430e+02	3.24380e-05	1
21:9413867:AG:A	rs199741174	3.48559e+05	1.20240e-02	365
21:9414073:C:A	NA	5.92420e+02	3.23604e-05	1