Overview

This notebook is designed to provide a broad overview of Hail’s functionality, with an emphasis on manipulating and querying a genetic dataset. We walk through a genome-wide SNP association test and demonstrate the need to control for confounding caused by population stratification.

Each notebook starts the same: we import the hail package and create a HailContext. This object is the entry point for most Hail functionality.

In [1]:
from hail import *
hc = HailContext()
Running on Apache Spark version 2.0.2
SparkUI available at http://10.56.135.40:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-5a67787

If the above cell ran without error, we’re ready to go!

Before using Hail, we import some standard Python libraries for use throughout the notebook.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import Counter
from math import log, isnan
from pprint import pprint
%matplotlib inline

Installing and importing seaborn is optional; it just makes the plots prettier.

In [3]:
# optional
import seaborn

Check for tutorial data or download if necessary

This cell downloads the necessary data from Google Storage if it isn’t found in the current working directory.

In [4]:
import os
if os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt'):
    print('All files are present and accounted for!')
else:
    import sys
    sys.stderr.write('Downloading data (~50M) from Google Storage...\n')
    import urllib
    import tarfile
    urllib.urlretrieve('https://storage.googleapis.com/hail-1kg/tutorial_data.tar',
                       'tutorial_data.tar')
    sys.stderr.write('Download finished!\n')
    sys.stderr.write('Extracting...\n')
    tarfile.open('tutorial_data.tar').extractall()
    if not (os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt')):
        raise RuntimeError('Something went wrong!')
    else:
        sys.stderr.write('Done!\n')
All files are present and accounted for!

Loading data from disk

Hail has its own internal data representation, called a Variant Dataset (VDS). This is both an on-disk file format and a Python object. See the overview for a complete story. Here, we read a VDS from disk.

This dataset was created by downsampling a public 1000 genomes VCF to about 50 MB.

In [5]:
vds = hc.read('data/1kg.vds')
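
Since a VDS is also an on-disk format, the counterpart of read is write, which saves a dataset (including any annotations added later) back to disk. A minimal sketch; the output path is purely illustrative:

# write the dataset back out in VDS format (path is illustrative)
vds.write('output/1kg_copy.vds')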

Getting to know our data

It’s important to have easy ways to slice, dice, query, and summarize a dataset. Some of these methods are demonstrated below.

The summarize method is useful for providing a broad overview of the data contained in a dataset.

In [6]:
vds.summarize().report()

         Samples: 1000
        Variants: 10961
       Call Rate: 0.983163
         Contigs: ['X', '12', '8', '19', '4', '15', '11', '9', '22', '13', '16', '5', '10', '21', '6', '1', '17', '14', '20', '2', '18', '7', '3']
   Multiallelics: 0
            SNPs: 10961
            MNPs: 0
      Insertions: 0
       Deletions: 0
 Complex Alleles: 0
    Star Alleles: 0
     Max Alleles: 2

The query_variants method is our first look at the Hail expression language. The expression language enables a wide range of expressive queries and computations, but it is probably the most complex part of Hail. See the pair of tutorials on the expression language to learn more!

Here, we can use query_variants to pull out 5 variants to see what they look like.

In [7]:
vds.query_variants('variants.take(5)')
Out[7]:
[Variant(contig=1, start=904165, ref=G, alts=[AltAllele(ref=G, alt=A)]),
 Variant(contig=1, start=909917, ref=G, alts=[AltAllele(ref=G, alt=A)]),
 Variant(contig=1, start=986963, ref=C, alts=[AltAllele(ref=C, alt=T)]),
 Variant(contig=1, start=1563691, ref=T, alts=[AltAllele(ref=T, alt=G)]),
 Variant(contig=1, start=1707740, ref=T, alts=[AltAllele(ref=T, alt=G)])]

There are often several ways to do something in Hail. Here are two ways to peek at the first few sample IDs:

In [8]:
vds.query_samples('samples.take(5)')
Out[8]:
[u'HG00096', u'HG00097', u'HG00099', u'HG00100', u'HG00101']
In [9]:
vds.sample_ids[:5]
Out[9]:
[u'HG00096', u'HG00097', u'HG00099', u'HG00100', u'HG00101']

There’s a similar interface for looking at the genotypes in a dataset. We use query_genotypes to look at the first few genotype calls.

In [10]:
vds.query_genotypes('gs.take(5)')
Out[10]:
[Genotype(GT=0, AD=[4, 0], DP=4, GQ=12, PL=[0, 12, 194]),
 Genotype(GT=1, AD=[4, 3], DP=7, GQ=85, PL=[85, 0, 109]),
 Genotype(GT=0, AD=[1, 0], DP=1, GQ=3, PL=[0, 3, 42]),
 Genotype(GT=0, AD=[14, 0], DP=14, GQ=42, PL=[0, 42, 533]),
 Genotype(GT=0, AD=[12, 0], DP=12, GQ=36, PL=[0, 36, 420])]

Integrate sample annotations

Hail treats variant and sample annotations as first-class citizens. Annotations are usually a critical part of any genetic study. Sample annotations are where you’ll store information about sample phenotypes, ancestry, sex, and covariates. Variant annotations can be used to store information like gene membership and functional impact for use in QC or analysis.

In this tutorial, we demonstrate how to take a text file and use it to annotate the samples in a VDS.

IPython supports various cell “magics”. The %%sh magic interprets the cell as a shell script rather than Python. We can use this to look at the first few lines of our annotation file. This file contains the sample ID, the population and “super-population” designations, the sample sex, and two simulated phenotypes (one binary, one discrete).

In [11]:
%%sh
head data/1kg_annotations.txt | column -t
sh: 1: column: not found
head: write error: Broken pipe
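
The column utility isn’t installed in this environment, which is why the shell output above is truncated. A plain-Python peek at the file works just as well:

# print the header line and the first few records of the annotations file
with open('data/1kg_annotations.txt') as f:
    for _ in range(5):
        print(f.readline().rstrip())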

This file can be imported into Hail with HailContext.import_table. This method produces a KeyTable object. Think of this as a Pandas or R dataframe that isn’t limited by the memory on your machine – behind the scenes, it’s distributed with Spark.

In [12]:
table = hc.import_table('data/1kg_annotations.txt', impute=True)\
    .key_by('Sample')
2018-10-18 01:26:28 Hail: INFO: Reading table to impute column types
2018-10-18 01:26:28 Hail: INFO: Finished type imputation
  Loading column `Sample' as type String (imputed)
  Loading column `Population' as type String (imputed)
  Loading column `SuperPopulation' as type String (imputed)
  Loading column `isFemale' as type Boolean (imputed)
  Loading column `PurpleHair' as type Boolean (imputed)
  Loading column `CaffeineConsumption' as type Int (imputed)

A good way to peek at the structure of a KeyTable is to look at its schema.

In [13]:
print(table.schema)
Struct{Sample:String,Population:String,SuperPopulation:String,isFemale:Boolean,PurpleHair:Boolean,CaffeineConsumption:Int}

The Python pprint function makes dense printouts easier to read:

In [14]:
pprint(table.schema)
Struct{
     Sample: String,
     Population: String,
     SuperPopulation: String,
     isFemale: Boolean,
     PurpleHair: Boolean,
     CaffeineConsumption: Int
 }

Although we used the %%sh magic to look at the first lines of the table, there’s a better way. We can convert the table to a Spark DataFrame and use its .show() method:

In [15]:
table.to_dataframe().show(10)
+-------+----------+---------------+--------+----------+-------------------+
| Sample|Population|SuperPopulation|isFemale|PurpleHair|CaffeineConsumption|
+-------+----------+---------------+--------+----------+-------------------+
|NA19784|       MXL|            AMR|   false|     false|                  8|
|NA19102|       YRI|            AFR|    true|     false|                  6|
|HG00141|       GBR|            EUR|   false|     false|                  6|
|HG01890|       ACB|            AFR|   false|     false|                  8|
|HG00263|       GBR|            EUR|    true|      true|                  6|
|NA20908|       GIH|            SAS|    true|      true|                  9|
|HG04075|       STU|            SAS|    true|     false|                  9|
|NA18982|       JPT|            EAS|   false|     false|                  5|
|NA12873|       CEU|            EUR|    true|     false|                  6|
|HG02677|       GWD|            AFR|    true|      true|                  5|
+-------+----------+---------------+--------+----------+-------------------+
only showing top 10 rows
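
The KeyTable interface also supports dataframe-style operations directly. As a minimal sketch (assuming KeyTable.filter accepts an expression-language predicate, as in the Hail 0.1 docs), here is a quick filter and count:

# keep only the GBR samples and count the remaining rows
gbr_table = table.filter('Population == "GBR"')
print(gbr_table.count())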

Now we’ll use this table to add sample annotations to our dataset. First, we’ll print the schema of the sample annotations already there:

In [16]:
pprint(vds.sample_schema)
Empty

We use the annotate_samples_table method to join the table with the VDS.

In [17]:
vds = vds.annotate_samples_table(table, root='sa')
In [18]:
pprint(vds.sample_schema)
Struct{
     Population: String,
     SuperPopulation: String,
     isFemale: Boolean,
     PurpleHair: Boolean,
     CaffeineConsumption: Int
 }

Query functions and the Hail Expression Language

We start by looking at some statistics of the information in our table. The query method uses the expression language to aggregate over the rows of the table.

counter is an aggregation function that counts the number of occurrences of each unique element. We can use this to pull out the population distribution.

In [19]:
pprint(table.query('SuperPopulation.counter()'))
{u'AFR': 1018L, u'AMR': 535L, u'EAS': 617L, u'EUR': 669L, u'SAS': 661L}

stats is an aggregation function that produces some useful statistics about numeric collections. We can use this to see the distribution of the CaffeineConsumption phenotype.

In [20]:
pprint(table.query('CaffeineConsumption.stats()'))
{u'max': 10.0,
 u'mean': 6.219714285714286,
 u'min': 3.0,
 u'nNotMissing': 3500L,
 u'stdev': 1.9387801550964772,
 u'sum': 21769.0}

However, these metrics aren’t perfectly representative of the samples in our dataset. Here’s why:

In [21]:
table.count()
Out[21]:
3500L
In [22]:
vds.num_samples
Out[22]:
1000

Since there are fewer samples in our dataset than in the full thousand genomes cohort, we need to look at annotations on the dataset. We can do this with query_samples.

In [23]:
vds.query_samples('samples.map(s => sa.SuperPopulation).counter()')
Out[23]:
{u'AFR': 101L, u'AMR': 285L, u'EAS': 308L, u'EUR': 298L, u'SAS': 8L}
In [24]:
pprint(vds.query_samples('samples.map(s => sa.CaffeineConsumption).stats()'))
{u'max': 10.0,
 u'mean': 6.783000000000003,
 u'min': 3.0,
 u'nNotMissing': 1000L,
 u'stdev': 1.624780292839619,
 u'sum': 6783.000000000003}

The functionality demonstrated in the last few cells isn’t anything especially new: it’s certainly not difficult to ask these questions with Pandas or R dataframes, or even Unix tools like awk. But Hail can use the same interfaces and query language to analyze collections that are much larger, like the set of variants.

Here we calculate the counts of each of the 12 possible unique SNPs (4 choices for the reference base * 3 choices for the alternate base). To do this, we need to map the variants to their alternate allele, filter to SNPs, and count by unique ref/alt pair:

In [25]:
snp_counts = vds.query_variants('variants.map(v => v.altAllele()).filter(aa => aa.isSNP()).counter()')
pprint(Counter(snp_counts).most_common())
[(AltAllele(ref=C, alt=T), 2436L),
 (AltAllele(ref=G, alt=A), 2387L),
 (AltAllele(ref=A, alt=G), 1944L),
 (AltAllele(ref=T, alt=C), 1879L),
 (AltAllele(ref=C, alt=A), 496L),
 (AltAllele(ref=G, alt=T), 480L),
 (AltAllele(ref=T, alt=G), 468L),
 (AltAllele(ref=A, alt=C), 454L),
 (AltAllele(ref=C, alt=G), 150L),
 (AltAllele(ref=G, alt=C), 112L),
 (AltAllele(ref=T, alt=A), 79L),
 (AltAllele(ref=A, alt=T), 76L)]

It’s nice to see that we can uncover something biological even from this small dataset: these frequencies come in pairs. C/T and G/A are actually the same mutation, just viewed from opposite strands. Likewise, T/A and A/T are the same mutation on opposite strands. There’s a 30x difference between the frequency of C/T and A/T SNPs. Why?
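
We can check the strand pairing directly from the snp_counts dictionary computed above, assuming the AltAllele objects expose their ref and alt bases as the repr suggests. A minimal sketch:

# collapse each SNP with its reverse-strand partner (e.g. C/T with G/A)
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
paired = Counter()
for aa, n in snp_counts.items():
    key = min((aa.ref, aa.alt), (complement[aa.ref], complement[aa.alt]))
    paired[key] += n
pprint(paired.most_common())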

The same Python, R, and Unix tools could do this work as well, but we’re starting to hit a wall - the latest gnomAD release publishes about 250 million variants, and that won’t fit in memory on a single computer.

What about genotypes? Hail can query the collection of all genotypes in the dataset, which is sizable even for our tiny dataset: 1,000 samples and roughly 11,000 variants produce about 11 million genotypes. The gnomAD dataset has about 5 trillion unique genotypes.
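
As a quick sanity check of that arithmetic, the genotype aggregable can simply be counted; the result should be roughly the number of samples times the number of variants:

# total number of genotype records in the dataset
print(vds.query_genotypes('gs.count()'))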

Here we will use the hist aggregator to produce and plot a histogram of DP values for genotypes in our thousand genomes dataset.

In [26]:
dp_hist = vds.query_genotypes('gs.map(g => g.dp).hist(0, 30, 30)')
plt.xlim(0, 31)
plt.bar(dp_hist.binEdges[1:], dp_hist.binFrequencies)
plt.show()
[plot: histogram of genotype DP]

Quality Control

QC is where analysts spend most of their time with sequencing datasets. QC is an iterative process, and is different for every project: there is no “push-button” solution for QC. Each time the Broad collects a new group of samples, it finds new batch effects. However, by practicing open science and discussing the QC process and decisions with others, we can establish a set of best practices as a community.

QC is entirely based on the ability to understand the properties of a dataset. Hail attempts to make this easier by providing the sample_qc method, which produces a set of useful metrics as sample annotations.

In [27]:
pprint(vds.sample_schema)
Struct{
     Population: String,
     SuperPopulation: String,
     isFemale: Boolean,
     PurpleHair: Boolean,
     CaffeineConsumption: Int
 }
In [28]:
vds = vds.sample_qc()
In [29]:
pprint(vds.sample_schema)
Struct{
     Population: String,
     SuperPopulation: String,
     isFemale: Boolean,
     PurpleHair: Boolean,
     CaffeineConsumption: Int,
     qc: Struct{
         callRate: Double,
         nCalled: Int,
         nNotCalled: Int,
         nHomRef: Int,
         nHet: Int,
         nHomVar: Int,
         nSNP: Int,
         nInsertion: Int,
         nDeletion: Int,
         nSingleton: Int,
         nTransition: Int,
         nTransversion: Int,
         dpMean: Double,
         dpStDev: Double,
         gqMean: Double,
         gqStDev: Double,
         nNonRef: Int,
         rTiTv: Double,
         rHetHomVar: Double,
         rInsertionDeletion: Double
     }
 }

Interoperability is a big part of Hail. We can pull out these new metrics to a Pandas dataframe with one line of code:

In [30]:
df = vds.samples_table().to_pandas()
In [31]:
df.head()
Out[31]:
s sa.Population sa.SuperPopulation sa.isFemale sa.PurpleHair sa.CaffeineConsumption sa.qc.callRate sa.qc.nCalled sa.qc.nNotCalled sa.qc.nHomRef ... sa.qc.nTransition sa.qc.nTransversion sa.qc.dpMean sa.qc.dpStDev sa.qc.gqMean sa.qc.gqStDev sa.qc.nNonRef sa.qc.rTiTv sa.qc.rHetHomVar sa.qc.rInsertionDeletion
0 HG00096 GBR EUR False False 6 0.979017 10731 230 6321 ... 5233 1190 4.566828 2.343927 22.506570 22.531111 4410 4.397479 1.190760 None
1 HG00097 GBR EUR True True 6 0.993705 10892 69 6152 ... 5279 1251 9.440750 3.653101 42.334833 28.494689 4740 4.219824 1.648045 None
2 HG00099 GBR EUR True False 6 0.988687 10837 124 6113 ... 5377 1229 8.240004 3.720313 37.523761 28.130359 4724 4.375102 1.510096 None
3 HG00100 GBR EUR True False 6 0.999909 10960 1 6088 ... 5400 1238 12.756478 3.912233 52.740967 27.669909 4872 4.361874 1.758777 None
4 HG00101 GBR EUR False True 8 0.995347 10910 51 6229 ... 5358 1212 6.573118 2.887462 30.568286 25.611360 4681 4.420792 1.478031 None

5 rows × 26 columns

Plotting the QC metrics is a good place to start.

In [32]:
plt.clf()
plt.subplot(1, 2, 1)
plt.hist(df["sa.qc.callRate"], bins=np.arange(.75, 1.01, .01))
plt.xlabel("Call Rate")
plt.ylabel("Frequency")
plt.xlim(.75, 1)

plt.subplot(1, 2, 2)
plt.hist(df["sa.qc.gqMean"], bins = np.arange(0, 105, 5))
plt.xlabel("Mean Sample GQ")
plt.ylabel("Frequency")
plt.xlim(0, 105)

plt.tight_layout()
plt.show()
[plots: histograms of sample call rate and mean sample GQ]

Often, these metrics are correlated.

In [33]:
plt.scatter(df["sa.qc.dpMean"], df["sa.qc.callRate"],
            alpha=0.1)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.xlim(0, 20)
plt.show()
[plot: mean DP vs. call rate scatter]

Removing outliers from the dataset will generally improve association results. We can draw lines on the above plot to indicate outlier cuts. We’ll want to remove all samples that fall in the bottom left quadrant.

In [34]:
plt.scatter(df["sa.qc.dpMean"], df["sa.qc.callRate"],
            alpha=0.1)
plt.xlabel('Mean DP')
plt.ylabel('Call Rate')
plt.xlim(0, 20)
plt.axhline(0.97, c='k')
plt.axvline(4, c='k')
plt.show()
[plot: mean DP vs. call rate scatter with cutoff lines]

It’s easy to filter when we’ve got the cutoff values decided:

In [35]:
vds = vds.filter_samples_expr('sa.qc.dpMean >= 4 && sa.qc.callRate >= 0.97')
print('After filter, %d/1000 samples remain.' % vds.num_samples)
After filter, 843/1000 samples remain.

Next is genotype QC. To start, we’ll print the call rate after sample QC; the printout below labels it the “pre QC” rate because genotype QC hasn’t happened yet. It’s actually gone up since the initial summary: dropping low-quality samples disproportionately removed missing genotypes.

In [36]:
call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled)')
print('pre QC call rate is %.3f' % call_rate)
pre QC call rate is 0.991

It’s a good idea to filter out genotypes where the reads aren’t where they should be: if we find a genotype called homozygous reference with >10% alternate reads, a genotype called homozygous alternate with >10% reference reads, or a genotype called heterozygote without a ref / alt balance near 1:1, it is likely to be an error.

In [37]:
filter_condition_ab = '''let ab = g.ad[1] / g.ad.sum() in
                         ((g.isHomRef && ab <= 0.1) ||
                          (g.isHet && ab >= 0.25 && ab <= 0.75) ||
                          (g.isHomVar && ab >= 0.9))'''
vds = vds.filter_genotypes(filter_condition_ab)
In [38]:
post_qc_call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled)')
print('post QC call rate is %.3f' % post_qc_call_rate)
post QC call rate is 0.955

Variant QC is a bit more of the same: we can use the variant_qc method to produce a variety of useful statistics, plot them, and filter.

In [39]:
pprint(vds.variant_schema)
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     pass: Boolean,
     info: Struct{
         AC: Array[Int],
         AF: Array[Double],
         AN: Int,
         BaseQRankSum: Double,
         ClippingRankSum: Double,
         DP: Int,
         DS: Boolean,
         FS: Double,
         HaplotypeScore: Double,
         InbreedingCoeff: Double,
         MLEAC: Array[Int],
         MLEAF: Array[Double],
         MQ: Double,
         MQ0: Int,
         MQRankSum: Double,
         QD: Double,
         ReadPosRankSum: Double,
         set: String
     }
 }

The cache method keeps the dataset in memory after it is first computed, which speeds up the downstream operations that reuse it.

In [40]:
vds = vds.variant_qc().cache()
In [41]:
pprint(vds.variant_schema)
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     pass: Boolean,
     info: Struct{
         AC: Array[Int],
         AF: Array[Double],
         AN: Int,
         BaseQRankSum: Double,
         ClippingRankSum: Double,
         DP: Int,
         DS: Boolean,
         FS: Double,
         HaplotypeScore: Double,
         InbreedingCoeff: Double,
         MLEAC: Array[Int],
         MLEAF: Array[Double],
         MQ: Double,
         MQ0: Int,
         MQRankSum: Double,
         QD: Double,
         ReadPosRankSum: Double,
         set: String
     },
     qc: Struct{
         callRate: Double,
         AC: Int,
         AF: Double,
         nCalled: Int,
         nNotCalled: Int,
         nHomRef: Int,
         nHet: Int,
         nHomVar: Int,
         dpMean: Double,
         dpStDev: Double,
         gqMean: Double,
         gqStDev: Double,
         nNonRef: Int,
         rHeterozygosity: Double,
         rHetHomVar: Double,
         rExpectedHetFrequency: Double,
         pHWE: Double
     }
 }
In [42]:
variant_df = vds.variants_table().to_pandas()

plt.clf()
plt.subplot(2, 2, 1)
variantgq_means = variant_df["va.qc.gqMean"]
plt.hist(variantgq_means, bins = np.arange(0, 84, 2))
plt.xlabel("Variant Mean GQ")
plt.ylabel("Frequency")
plt.xlim(0, 80)

plt.subplot(2, 2, 2)
variant_mleaf = variant_df["va.qc.AF"]
plt.hist(variant_mleaf, bins = np.arange(0, 1.05, .025))
plt.xlabel("Minor Allele Frequency")
plt.ylabel("Frequency")
plt.xlim(0, 1)

plt.subplot(2, 2, 3)
plt.hist(variant_df['va.qc.callRate'], bins = np.arange(0, 1.05, .01))
plt.xlabel("Variant Call Rate")
plt.ylabel("Frequency")
plt.xlim(.5, 1)

plt.subplot(2, 2, 4)
plt.hist(variant_df['va.qc.pHWE'], bins = np.arange(0, 1.05, .025))
plt.xlabel("Hardy-Weinberg Equilibrium p-value")
plt.ylabel("Frequency")
plt.xlim(0, 1)

plt.tight_layout()
plt.show()

[plots: histograms of variant mean GQ, minor allele frequency, variant call rate, and HWE p-value]

These statistics actually look pretty good: we don’t need to filter this dataset. Most datasets require thoughtful quality control, though. The filter_variants_expr method can help!
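
If filtering were needed, the call would look something like the sketch below; the thresholds are purely illustrative, not a recommendation:

# illustrative thresholds: drop variants with low call rate or extreme HWE departure
filtered_vds = vds.filter_variants_expr('va.qc.callRate >= 0.95 && va.qc.pHWE > 1e-6')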

Let’s do a GWAS!

First, we need to restrict to variants that are:

  • common (we’ll use a cutoff of 1%)
  • uncorrelated (not in linkage disequilibrium)

Both of these are easy in Hail.

In [43]:
common_vds = (vds
              .filter_variants_expr('va.qc.AF > 0.01')
              .ld_prune(memory_per_core=256, num_cores=4))
2018-10-18 01:26:50 Hail: INFO: Running LD prune with nSamples=843, nVariants=9085, nPartitions=4, and maxQueueSize=257123.
2018-10-18 01:26:50 Hail: INFO: LD prune step 1 of 3: nVariantsKept=8478, nPartitions=4, time=351.375ms
2018-10-18 01:26:51 Hail: INFO: LD prune step 2 of 3: nVariantsKept=8478, nPartitions=12, time=1.184s
2018-10-18 01:26:52 Hail: INFO: Coerced sorted dataset
2018-10-18 01:26:52 Hail: INFO: LD prune step 3 of 3: nVariantsKept=8478, time=481.478ms
In [44]:
common_vds.count()
Out[44]:
(843L, 8555L)

These filters removed about 20% of sites (we started with a bit over 10,000 and kept about 8,500). This is NOT representative of most sequencing datasets! We have already downsampled the full thousand genomes dataset to include more common variants than we’d expect by chance.

In Hail, the association tests accept sample annotations for the sample phenotype and covariates. Since we’ve already got our phenotype of interest (caffeine consumption) in the dataset, we are good to go:

In [45]:
gwas = common_vds.linreg('sa.CaffeineConsumption')
pprint(gwas.variant_schema)
2018-10-18 01:26:52 Hail: INFO: Running linear regression on 843 samples with 1 covariate including intercept...
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     pass: Boolean,
     info: Struct{
         AC: Array[Int],
         AF: Array[Double],
         AN: Int,
         BaseQRankSum: Double,
         ClippingRankSum: Double,
         DP: Int,
         DS: Boolean,
         FS: Double,
         HaplotypeScore: Double,
         InbreedingCoeff: Double,
         MLEAC: Array[Int],
         MLEAF: Array[Double],
         MQ: Double,
         MQ0: Int,
         MQRankSum: Double,
         QD: Double,
         ReadPosRankSum: Double,
         set: String
     },
     qc: Struct{
         callRate: Double,
         AC: Int,
         AF: Double,
         nCalled: Int,
         nNotCalled: Int,
         nHomRef: Int,
         nHet: Int,
         nHomVar: Int,
         dpMean: Double,
         dpStDev: Double,
         gqMean: Double,
         gqStDev: Double,
         nNonRef: Int,
         rHeterozygosity: Double,
         rHetHomVar: Double,
         rExpectedHetFrequency: Double,
         pHWE: Double
     },
     linreg: Struct{
         beta: Double,
         se: Double,
         tstat: Double,
         pval: Double
     }
 }

Looking at the bottom of the above printout, you can see the linear regression adds new variant annotations for the beta, standard error, t-statistic, and p-value.
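
To inspect the strongest signals, one option is to pull the flattened variant annotations into pandas, as we did for the sample QC metrics, and sort by p-value. A minimal sketch, assuming the flattened column names mirror those of the sample table and a pandas version with sort_values:

# pull the variant table into pandas and inspect the smallest p-values
hits_df = gwas.variants_table().to_pandas()
print(hits_df.sort_values('va.linreg.pval').head(10))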

In [46]:
def qqplot(pvals, xMax, yMax):
    # drop missing and NaN p-values, then sort ascending
    spvals = sorted(filter(lambda x: x and not isnan(x), pvals))
    # expected vs. observed p-values on the -log10 scale
    exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]
    obs = [-log(p, 10) for p in spvals]
    plt.clf()
    plt.scatter(exp, obs)
    # y = x reference line
    plt.plot(np.arange(0, max(xMax, yMax)), c="red")
    plt.xlabel("Expected p-value (-log10 scale)")
    plt.ylabel("Observed p-value (-log10 scale)")
    plt.xlim(0, xMax)
    plt.ylim(0, yMax)
    plt.show()

Python makes it easy to produce a Q-Q (quantile-quantile) plot.

In [47]:
qqplot(gwas.query_variants('variants.map(v => va.linreg.pval).collect()'),
       5, 6)
[plot: Q-Q plot of GWAS p-values]

Confounded!

The observed p-values drift away from the expectation immediately. Either every SNP in our dataset is causally linked to caffeine consumption (unlikely), or there’s a confounder.
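
One way to quantify this inflation is the genomic control factor lambda_GC: convert each p-value back to a 1-degree-of-freedom chi-square statistic and divide the observed median by its expectation under the null (about 0.455). A minimal sketch, assuming scipy is available (it isn’t used elsewhere in this notebook):

# re-collect the p-values and drop missing/NaN entries
from scipy.stats import chi2
raw_pvals = gwas.query_variants('variants.map(v => va.linreg.pval).collect()')
clean = [p for p in raw_pvals if p is not None and not isnan(p)]

# lambda_GC = median observed chi-square statistic / null median
lambda_gc = np.median(chi2.isf(clean, df=1)) / chi2.ppf(0.5, df=1)
print('lambda GC: %.2f' % lambda_gc)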

We didn’t tell you, but sample ancestry was actually used to simulate this phenotype. This leads to a stratified distribution of the phenotype. The solution is to include ancestry as a covariate in our regression.

The linreg method can also take sample annotations to use as covariates. We already annotated our samples with reported ancestry, but it is good to be skeptical of these labels due to human error. Genomes don’t have that problem! Instead of using reported ancestry, we will use genetic ancestry by including computed principal components in our model.

The pca method produces sample PCs in sample annotations, and can also produce variant loadings and global eigenvalues when asked.

In [48]:
pca = common_vds.pca('sa.pca', k=5, eigenvalues='global.eigen')
2018-10-18 01:26:55 Hail: INFO: Running PCA with 5 components...
In [49]:
pprint(pca.globals)
{u'eigen': {u'PC1': 56.34707905481798,
            u'PC2': 37.8109003010398,
            u'PC3': 16.91974301822238,
            u'PC4': 2.707349935634387,
            u'PC5': 2.0851252187821174}}
In [50]:
pprint(pca.sample_schema)
Struct{
     Population: String,
     SuperPopulation: String,
     isFemale: Boolean,
     PurpleHair: Boolean,
     CaffeineConsumption: Int,
     qc: Struct{
         callRate: Double,
         nCalled: Int,
         nNotCalled: Int,
         nHomRef: Int,
         nHet: Int,
         nHomVar: Int,
         nSNP: Int,
         nInsertion: Int,
         nDeletion: Int,
         nSingleton: Int,
         nTransition: Int,
         nTransversion: Int,
         dpMean: Double,
         dpStDev: Double,
         gqMean: Double,
         gqStDev: Double,
         nNonRef: Int,
         rTiTv: Double,
         rHetHomVar: Double,
         rInsertionDeletion: Double
     },
     pca: Struct{
         PC1: Double,
         PC2: Double,
         PC3: Double,
         PC4: Double,
         PC5: Double
     }
 }

Now that we’ve got principal components per sample, we may as well plot them! Human history exerts a strong effect in genetic datasets. Even with a 50MB sequencing dataset, we can recover the major human populations.

In [51]:
pca_table = pca.samples_table().to_pandas()
colors = {'AFR': 'green', 'AMR': 'red', 'EAS': 'black', 'EUR': 'blue', 'SAS': 'cyan'}
plt.scatter(pca_table["sa.pca.PC1"], pca_table["sa.pca.PC2"],
            c = pca_table["sa.SuperPopulation"].map(colors),
            alpha = .5)
plt.xlim(-0.6, 0.6)
plt.xlabel("PC1")
plt.ylabel("PC2")
legend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]
plt.legend(handles=legend_entries, loc=2)
plt.show()
[plot: PC1 vs. PC2, colored by super-population]

Now we can rerun our linear regression, controlling for the first few principal components and sample sex.

In [52]:
pvals = (common_vds
        .annotate_samples_table(pca.samples_table(), expr='sa.pca = table.pca')
        .linreg('sa.CaffeineConsumption', covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.isFemale'])
        .query_variants('variants.map(v => va.linreg.pval).collect()'))
2018-10-18 01:27:07 Hail: INFO: Running linear regression on 843 samples with 5 covariates including intercept...
In [53]:
qqplot(pvals, 5, 6)
[plot: Q-Q plot of p-values with PC and sex covariates]
In [54]:
pvals = (common_vds
        .annotate_samples_table(pca.samples_table(), expr='sa.pca = table.pca')
        .linreg('sa.CaffeineConsumption',
                covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.isFemale'],
                use_dosages=True)
        .query_variants('variants.map(v => va.linreg.pval).collect()'))
2018-10-18 01:27:08 Hail: INFO: Running linear regression on 843 samples with 5 covariates including intercept...
In [55]:
qqplot(pvals, 5, 6)
[plot: Q-Q plot of p-values with covariates, using dosages]

That’s more like it! We may not be publishing ten new coffee-drinking loci in Nature, but we shouldn’t expect to find anything but the strongest signals from a dataset of 1000 individuals anyway.

Rare variant analysis

Hail doesn’t yet have rare variant kernel-based methods, but we have linear and logistic burden tests.

We won’t be showing those here, though. Instead, we’ll demonstrate how to use the expression language to group and count by arbitrary properties of the variant and sample annotations.

In [56]:
kt = (vds.genotypes_table()
         .aggregate_by_key(key_expr=['pop = sa.SuperPopulation', 'chromosome = v.contig'],
                           agg_expr=['n_het = g.filter(g => g.isHet()).count()']))
In [57]:
kt.to_dataframe().show()
+---+----------+-----+
|pop|chromosome|n_het|
+---+----------+-----+
|EUR|        14|16380|
|SAS|        17|  511|
|EUR|         5|30717|
|AFR|         7|11889|
|EAS|         9|23951|
|AFR|        21| 3529|
|EAS|         X| 7403|
|EAS|         1|49375|
|EUR|        19|13483|
|AMR|        15|18935|
|AMR|         7|31527|
|EUR|        13|17321|
|EUR|        12|26134|
|EUR|        15|15807|
|EUR|         6|33910|
|EAS|        20|17466|
|SAS|        11|  901|
|AFR|         3|15829|
|EAS|         2|45384|
|AMR|        18|18982|
+---+----------+-----+
only showing top 20 rows

What if we want to group by minor allele frequency bin and hair color, and calculate the mean GQ?

In [58]:
kt2 = (vds.genotypes_table()
          .aggregate_by_key(key_expr=['''maf_bin = if (va.qc.AF < 0.01) "< 1%"
                                                   else if (va.qc.AF < 0.05) "1%-5%"
                                                   else "> 5%" ''',
                                     'purple_hair = sa.PurpleHair'],
                           agg_expr=['mean_gq = g.map(g => g.gq).stats().mean',
                                     'mean_dp = g.map(g => g.dp).stats().mean']))
In [59]:
kt2.to_dataframe().show()
+-------+-----------+------------------+-----------------+
|maf_bin|purple_hair|           mean_gq|          mean_dp|
+-------+-----------+------------------+-----------------+
|   > 5%|       true| 36.09305651197578|7.407450459057423|
|   < 1%|       true| 22.68197887434976|7.374254453728496|
|   < 1%|      false|22.986128698357074|7.492131714314245|
|   > 5%|      false|36.341259980753755|7.533399982371768|
|  1%-5%|       true|24.093123033233528|7.269552536649012|
|  1%-5%|      false|  24.3519587208908|7.405582424428774|
+-------+-----------+------------------+-----------------+

We’ve shown that it’s easy to aggregate by a couple of arbitrary statistics. These specific examples may not be especially informative on their own, but the same pattern can be used to detect effects of rare variation (a sketch follows the list):

  • Count the number of heterozygous genotypes per gene by functional category (synonymous, missense, or loss-of-function) to estimate per-gene functional constraint
  • Count the number of singleton loss-of-function mutations per gene in cases and controls to detect genes involved in disease
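
As a sketch of the first bullet: assuming the variants carried a gene annotation (a hypothetical va.gene field, which this tutorial dataset does not include), the per-gene heterozygote count would follow the same aggregate_by_key pattern used above:

# hypothetical: requires a va.gene annotation that is not present in this dataset
per_gene_het = (vds.genotypes_table()
                   .aggregate_by_key(key_expr=['gene = va.gene'],
                                     agg_expr=['n_het = g.filter(g => g.isHet()).count()']))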

Epilogue

Congrats! If you’ve made it this far, you’re perfectly primed to read the Overview, look through the Hail objects representing many core concepts in genetics, and check out the many Hail functions defined in the Python API. If you use Hail for your own science, we’d love to hear from you on Gitter chat or the discussion forum.

There’s also a lot of functionality inside Hail that we didn’t get to in this broad overview. Things like:

  • Flexible import and export to a variety of data and annotation formats (VCF, BGEN, PLINK, JSON, TSV, …)
  • Simulation
  • Burden tests
  • Kinship and pruning (IBD, GRM, RRM)
  • Family-based tests and utilities
  • Distributed file system utilities
  • Interoperability with Python and Spark machine learning libraries
  • More!

For reference, here’s the full workflow, from raw data to all tutorial endpoints, combined into one cell. It may take a minute! It’s doing a lot of work.

In [60]:
table = hc.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')
common_vds = (hc.read('data/1kg.vds')
              .annotate_samples_table(table, root='sa')
              .sample_qc()
              .filter_samples_expr('sa.qc.dpMean >= 4 && sa.qc.callRate >= 0.97')
              .filter_genotypes('''let ab = g.ad[1] / g.ad.sum() in
                         ((g.isHomRef && ab <= 0.1) ||
                          (g.isHet && ab >= 0.25 && ab <= 0.75) ||
                          (g.isHomVar && ab >= 0.9))''')
              .variant_qc()
              .filter_variants_expr('va.qc.AF > 0.01')
              .ld_prune(memory_per_core=512, num_cores=4))

pca = common_vds.pca('sa.pca', k=5, eigenvalues='global.eigen')
pvals = (common_vds
        .annotate_samples_table(pca.samples_table(), expr='sa.pca = table.pca')
        .linreg('sa.CaffeineConsumption', covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.isFemale'])
        .query_variants('variants.map(v => va.linreg.pval).collect()'))
2018-10-18 01:27:14 Hail: INFO: Reading table to impute column types
2018-10-18 01:27:14 Hail: INFO: Finished type imputation
  Loading column `Sample' as type String (imputed)
  Loading column `Population' as type String (imputed)
  Loading column `SuperPopulation' as type String (imputed)
  Loading column `isFemale' as type Boolean (imputed)
  Loading column `PurpleHair' as type Boolean (imputed)
  Loading column `CaffeineConsumption' as type Int (imputed)
2018-10-18 01:27:21 Hail: INFO: Running LD prune with nSamples=843, nVariants=9085, nPartitions=4, and maxQueueSize=514245.
2018-10-18 01:27:27 Hail: INFO: LD prune step 1 of 3: nVariantsKept=8478, nPartitions=4, time=6.067s
2018-10-18 01:27:28 Hail: INFO: LD prune step 2 of 3: nVariantsKept=8478, nPartitions=12, time=655.317ms
2018-10-18 01:27:28 Hail: INFO: Coerced sorted dataset
2018-10-18 01:27:28 Hail: INFO: LD prune step 3 of 3: nVariantsKept=8478, time=255.964ms
2018-10-18 01:27:28 Hail: INFO: Running PCA with 5 components...
2018-10-18 01:27:49 Hail: INFO: Running linear regression on 843 samples with 5 covariates including intercept...