Manipulating genomic data with Hail
28/08/2017

Hail is an open-source tool for analyzing genomic data. It has a range of features, which you can read about here. It has a straightforward Python API and is built on top of Apache Spark.
One of the great features of Hail is how simple it makes it to manipulate genomic data into a tabular format that can be ingested by traditional relational databases.
Data
We’ll be working with data from the gnomAD study. To keep things reasonable but still somewhat realistic, we’ll only look at chromosome 21, which is about 1.26 GB compressed.
Process
Setup:
from hail import *
hc = HailContext()
Printing summary information
In:
vds = hc.import_vcf("gnomad.genomes.r2.0.1.sites.21.vcf")
vds.summarize().report()
Out:
         Samples: 0
        Variants: 3185805
       Call Rate: nan
         Contigs: ['21']
   Multiallelics: 198060
            SNPs: 3006007
            MNPs: 0
      Insertions: 182686
       Deletions: 275012
 Complex Alleles: 7
    Star Alleles: 0
     Max Alleles: 7
Down-sample variants
In:
vds = hc.import_vcf("gnomad.genomes.r2.0.1.sites.21.vcf")
vds = vds.sample_variants(0.01)
vds.write("sample.vds")
vds = hc.read("sample.vds")
vds.summarize().report()
Out:
         Samples: 0
        Variants: 34626
       Call Rate: nan
         Contigs: ['21']
   Multiallelics: 0
            SNPs: 30044
            MNPs: 0
      Insertions: 1792
       Deletions: 2790
 Complex Alleles: 0
    Star Alleles: 0
     Max Alleles: 2
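To build intuition for what a 1% down-sample does, here is a minimal plain-Python sketch that keeps each record independently with a fixed probability. It is an illustration only and not Hail's implementation; the function name sample_fraction, the seed, and the stand-in list of integers are all made up for this example.

```python
import random

def sample_fraction(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return [r for r in records if rng.random() < fraction]

# Hypothetical stand-in for a collection of variants.
variants = list(range(100000))
subset = sample_fraction(variants, 0.01)

# The size is random: close to 1% of the input, but rarely exactly 1%.
print(len(subset))
```

With this kind of sampling the size of the result varies from run to run unless the seed is fixed, which is why a down-sampled count is rarely a round number.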
Export variants into a tsv
Input:
vds = hc.read("sample.vds")
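# Split multiallelic sites so that each row holds a single alternate allele.
vds = vds.split_multi()
kt = vds.variants_table()
# split_multi leaves the info arrays unsplit; va.aIndex (1-based) picks the entry for this allele.
kt = kt.annotate("rsid = va.rsid") \
       .annotate("qual = va.qual") \
       .annotate("AF = va.info.AF[va.aIndex - 1]") \
       .annotate("AC = va.info.AC[va.aIndex - 1]") \
       .drop('va')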
kt.export("variants.tsv")
Output:
head variants.tsv
v	rsid	qual	AF	AC
21:9411631:A:G	NA	2.57190e+02	0.00000e+00	0
21:9411764:C:T	NA	9.96130e+02	3.24591e-05	1
21:9411969:T:C	rs557972256	1.22599e+03	9.74152e-05	3
21:9412132:A:C	NA	2.73530e+02	0.00000e+00	0
21:9412339:C:T	rs190851473	3.44782e+03	6.59631e-05	2
21:9412958:G:A	NA	1.80470e+02	0.00000e+00	0
21:9413068:G:A	NA	3.08430e+02	3.24380e-05	1
21:9413867:AG:A	rs199741174	3.48559e+05	1.20240e-02	365
21:9414073:C:A	NA	5.92420e+02	3.23604e-05	1
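Once the data is in this shape, loading it into a relational database takes little code. The following is a rough sketch using only the Python 3 standard library and an in-memory SQLite database. The table name, the column names, and the load_variants helper are assumptions made for this example, as is the choice to split the v column into four fields and to store NA as NULL.

```python
import csv
import io
import sqlite3

def _or_null(value, cast):
    """Convert a TSV field, mapping the export's "NA" marker to NULL."""
    return None if value == "NA" else cast(value)

def load_variants(tsv_file, conn):
    """Load a variants TSV into a SQLite table named `variants` (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "contig TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "rsid TEXT, qual REAL, af REAL, ac INTEGER)"
    )
    rows = []
    for rec in csv.DictReader(tsv_file, delimiter="\t"):
        # The `v` column has the form contig:pos:ref:alt.
        contig, pos, ref, alt = rec["v"].split(":")
        rows.append((
            contig, int(pos), ref, alt,
            _or_null(rec["rsid"], str),
            _or_null(rec["qual"], float),
            _or_null(rec["AF"], float),
            _or_null(rec["AC"], int),
        ))
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# Three rows copied from the output above; in practice, pass open("variants.tsv").
sample = io.StringIO(
    "v\trsid\tqual\tAF\tAC\n"
    "21:9411631:A:G\tNA\t2.57190e+02\t0.00000e+00\t0\n"
    "21:9411969:T:C\trs557972256\t1.22599e+03\t9.74152e-05\t3\n"
    "21:9413867:AG:A\trs199741174\t3.48559e+05\t1.20240e-02\t365\n"
)
conn = sqlite3.connect(":memory:")
n_loaded = load_variants(sample, conn)

# Variants that carry an rsid, ordered by position.
named = conn.execute(
    "SELECT rsid, ac FROM variants WHERE rsid IS NOT NULL ORDER BY pos"
).fetchall()
print(n_loaded, named)  # 3 [('rs557972256', 3), ('rs199741174', 365)]
```

Splitting v into separate columns is a design choice rather than a requirement: it makes range queries on position straightforward, at the cost of no longer matching the exported file column for column.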