Manipulating genomic data with Hail
28/08/2017
Hail is an open source tool for analyzing genomic data. It has a bunch of features, which you can read about here. It has a straightforward Python API and is built on top of Apache Spark.
One of the great features of Hail is how simple it makes it to manipulate genomic data into a tabular format that can be ingested by traditional relational databases.
Data
We’ll be working with data from the gnomAD study. To keep things reasonable but still somewhat realistic, we’ll only look at chromosome 21, which is about 1.26 GB compressed.
Process
Setup:
from hail import *
hc = HailContext()
Printing summary information
In:
vds = hc.import_vcf("gnomad.genomes.r2.0.1.sites.21.vcf")
vds.summarize().report()
Out:
Samples: 0
Variants: 3185805
Call Rate: nan
Contigs: ['21']
Multiallelics: 198060
SNPs: 3006007
MNPs: 0
Insertions: 182686
Deletions: 275012
Complex Alleles: 7
Star Alleles: 0
Max Alleles: 7
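The summary buckets each alternate allele by type. As a rough illustration of what those buckets mean, here is a minimal sketch that classifies a REF/ALT pair by comparing the two strings. The `classify` function and its rules are my own simplification for illustration, not Hail's actual logic, which differs in the details.

```python
from collections import Counter

def classify(ref, alt):
    """Rough allele classification by comparing REF and ALT strings.

    Simplified, hypothetical rules; Hail's own classification is more careful.
    """
    if alt == "*":
        return "star"
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) == len(alt):
        return "mnp"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"

# Made-up REF/ALT pairs, one per category.
examples = [("A", "G"), ("AG", "A"), ("C", "CTT"), ("AT", "GC"), ("ATG", "C"), ("A", "*")]
counts = Counter(classify(ref, alt) for ref, alt in examples)
print(counts)
```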
Down-sample variants
In:
vds = hc.import_vcf("gnomad.genomes.r2.0.1.sites.21.vcf")
vds = vds.sample_variants(0.01)
vds.write("sample.vds")
vds = hc.read("sample.vds")
vds.summarize().report()
Out:
Samples: 0
Variants: 34626
Call Rate: nan
Contigs: ['21']
Multiallelics: 0
SNPs: 30044
MNPs: 0
Insertions: 1792
Deletions: 2790
Complex Alleles: 0
Star Alleles: 0
Max Alleles: 2
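A quick sanity check on the sampled count, using only the numbers printed in the two summaries. The sampled total sits much closer to 1% of the alternate-allele total than to 1% of the variant total. Together with `Multiallelics: 0` and `Max Alleles: 2`, that would be consistent with multiallelic sites having been split before sampling. That is an inference from the numbers, not something shown in the code above.

```python
# Counts copied from the full-dataset summary.
full = {
    "variants": 3185805,
    "snps": 3006007,
    "insertions": 182686,
    "deletions": 275012,
    "complex": 7,
}
fraction = 0.01   # the value passed to sample_variants
observed = 34626  # "Variants" in the down-sampled summary

# Every alternate allele becomes its own row once multiallelics are split.
alt_alleles = full["snps"] + full["insertions"] + full["deletions"] + full["complex"]

expected_from_variants = full["variants"] * fraction
expected_from_alleles = alt_alleles * fraction

print("alternate alleles:", alt_alleles)
print("1% of variants:   ", round(expected_from_variants))
print("1% of alleles:    ", round(expected_from_alleles))
print("observed:         ", observed)
```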
Export variants into a tsv
In:
vds = hc.read("sample.vds")
vds = vds.split_multi()
kt = vds.variants_table()
kt = kt.annotate("rsid = va.rsid") \
       .annotate("qual = va.qual") \
       .annotate("AF = va.info.AF[0]") \
       .annotate("AC = va.info.AC[0]") \
       .drop('va')
kt.export("variants.tsv")
Out:
head variants.tsv
v rsid qual AF AC
21:9411631:A:G NA 2.57190e+02 0.00000e+00 0
21:9411764:C:T NA 9.96130e+02 3.24591e-05 1
21:9411969:T:C rs557972256 1.22599e+03 9.74152e-05 3
21:9412132:A:C NA 2.73530e+02 0.00000e+00 0
21:9412339:C:T rs190851473 3.44782e+03 6.59631e-05 2
21:9412958:G:A NA 1.80470e+02 0.00000e+00 0
21:9413068:G:A NA 3.08430e+02 3.24380e-05 1
21:9413867:AG:A rs199741174 3.48559e+05 1.20240e-02 365
21:9414073:C:A NA 5.92420e+02 3.23604e-05 1
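From here, getting the data into a relational database is mostly a matter of parsing the file. Below is a minimal sketch using SQLite from the Python standard library. It assumes the export is tab-separated with the header shown above, that missing values are written as `NA`, and that the `v` key always has the form `contig:pos:ref:alt`. The table name, column names, and the helpers `parse_row` and `load_variants` are my own choices for illustration. To stay self-contained it reads a few rows copied from the output above instead of opening `variants.tsv`.

```python
import csv
import io
import sqlite3

# Three rows copied from the output above, joined with tabs.
SAMPLE_TSV = (
    "v\trsid\tqual\tAF\tAC\n"
    "21:9411631:A:G\tNA\t2.57190e+02\t0.00000e+00\t0\n"
    "21:9411969:T:C\trs557972256\t1.22599e+03\t9.74152e-05\t3\n"
    "21:9413867:AG:A\trs199741174\t3.48559e+05\t1.20240e-02\t365\n"
)

def na(value, cast=str):
    """Map the 'NA' marker to None (SQL NULL), otherwise cast the value."""
    return None if value == "NA" else cast(value)

def parse_row(row):
    """Split the `v` key into separate columns and convert the other fields."""
    # Assumes exactly four colon-separated parts in the key.
    contig, pos, ref, alt = row["v"].split(":")
    return (contig, int(pos), ref, alt,
            na(row["rsid"]), na(row["qual"], float),
            na(row["AF"], float), na(row["AC"], int))

def load_variants(conn, tsv_file):
    """Create the (hypothetical) variants table and insert every row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "contig TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "rsid TEXT, qual REAL, af REAL, ac INTEGER)"
    )
    reader = csv.DictReader(tsv_file, delimiter="\t")
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (parse_row(row) for row in reader),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_variants(conn, io.StringIO(SAMPLE_TSV))

# Variants that have an rsid, ordered by position.
rows = conn.execute(
    "SELECT pos, ref, alt, ac FROM variants WHERE rsid IS NOT NULL ORDER BY pos"
).fetchall()
print(rows)
```

For the real file, pass `open("variants.tsv", newline="")` instead of the `io.StringIO` wrapper, and swap the in-memory connection for a database file.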