{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the expression language to slice, dice, and query genetic data\n", "\n", "This notebook uses the Hail expression language to query, filter, and annotate the same thousand genomes dataset from the overview. We also cover how to compute aggregate statistics from a dataset using the expression language.\n", "\n", "Every Hail practical notebook starts the same: import the necessary modules, and construct a [HailContext](https://hail.is/hail/hail.HailContext.html#hail.HailContext). This is the entry point for Hail functionality. This object also wraps a SparkContext, which can be accessed with `hc.sc`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from hail import *\n", "hc = HailContext()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the above cell ran without error, we're ready to go!\n", "\n", "Before using Hail, we import some standard Python libraries for use throughout the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pprint import pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for tutorial data or download if necessary\n", "\n", "This cell downloads the necessary data from Google Storage if it isn't found in the current working directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "if os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt'):\n", " print('All files are present and accounted for!')\n", "else:\n", " import sys\n", " sys.stderr.write('Downloading data (~50M) from Google Storage...\\n')\n", " import urllib\n", " import tarfile\n", " urllib.urlretrieve('https://storage.googleapis.com/hail-1kg/tutorial_data.tar',\n", " 'tutorial_data.tar')\n", " sys.stderr.write('Download finished!\\n')\n", " sys.stderr.write('Extracting...\\n')\n", " tarfile.open('tutorial_data.tar').extractall()\n", " if not (os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt')):\n", " raise RuntimeError('Something went wrong!')\n", " else:\n", " sys.stderr.write('Done!\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will read a dataset from disk, and print some summary statistics about it to re-familiarize ourselves." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds = hc.read('data/1kg.vds')\n", "vds.summarize().report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types in action\n", "\n", "We'll produce some sample annotations with the [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) method, then use these annotations to demonstrate some of the expression language features." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "vds = vds.variant_qc().cache().sample_qc()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering with expressions\n", "\n", "The schema printed above is the type of the sample annotations, which are given the variable name 'sa' wherever they appear. 
Here, we use the `filter_samples_expr` method to filter samples based on these annotations. If we want to filter on the `dpMean` field above, we need to select the 'qc' field from the 'sa' struct, then select the 'dpMean' field from the 'qc' struct. These selections are done with dots.\n", "\n", "There are four Hail methods that use the expression language to filter a dataset:\n", " - [filter_variants_expr](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_variants_expr)\n", " - [filter_samples_expr](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_samples_expr)\n", " - [filter_genotypes](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_genotypes)\n", " - [filter_alleles](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_alleles)\n", "\n", "All these methods take a Hail expression as a string argument and return a filtered dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# unfiltered\n", "vds.num_samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.filter_samples_expr('sa.qc.dpMean > 5', keep=True).num_samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.filter_samples_expr('sa.qc.dpMean <= 5', keep=False).num_samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.filter_samples_expr('sa.qc.callRate > 0.95', keep=True).num_samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.filter_samples_expr('sa.qc.callRate > 0.95 && sa.qc.dpMean > 5', keep=True).num_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add some sample annotations from the metadata file to allow for some more interesting expressions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "kt = hc.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')\n", "vds = vds.annotate_samples_table(kt, root='sa.metadata')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply conditional filters on things like population with `if/else`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.filter_samples_expr('if (sa.metadata.Population == \"EAS\") sa.qc.dpMean > 8 else sa.qc.dpMean > 4').num_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering variants and genotypes\n", "\n", "One of the advantages of Hail's filtering interface is that it's equally easy to filter samples, variants, or genotypes. If one is handed a fresh VCF text file, it's pretty easy to write a program to filter variants, but much harder to filter samples or genotypes. Other data representations may make a different one of these operations easy and the rest hard. In Hail, we've abstracted away all of this -- it's easy to filter anything!"
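] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make that symmetry concrete, here's a quick sketch chaining all three kinds of filters. The thresholds are arbitrary, chosen only to show that the three methods share one interface:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Chain sample, variant, and genotype filters (illustrative thresholds only).\n", "filtered = (vds.filter_samples_expr('sa.qc.callRate > 0.95')\n", "    .filter_variants_expr('va.qc.AF > 0.01')\n", "    .filter_genotypes('g.gq >= 20'))\n", "filtered.summarize().report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Back to our dataset: first, the unfiltered variant count."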
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.count_variants()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Filter on allele frequency\n", "vds.filter_variants_expr('va.qc.AF > 0.1', keep=True).count_variants()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Filter on allele frequency and GQ mean\n", "vds.filter_variants_expr('va.qc.AF > 0.1 && va.qc.gqMean > 20').count_variants()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Genotype call rate across the entire dataset\n", "vds.summarize().call_rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see in the previous cell, the overall call rate of this dataset is 98.7%." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.filter_genotypes('g.gq >= 20', keep=True).summarize().call_rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, about 40% of those called genotypes have GQ below 20! A GQ of 20 corresponds to 99% confidence, so these are calls made with less than 99% confidence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Annotating with expressions\n", "\n", "It is also possible to produce new annotations with the expression language. These take an expression of the form:\n", "```\n", "<new annotation name> = <expression>\n", "```\n", "\n", "To annotate samples, the new annotation name must always begin with `sa`. To annotate variants, it must always begin with `va`.\n", "\n", "Here are some simple examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "(vds.annotate_samples_expr('sa.keepThisSample = sa.qc.callRate > 0.95 && sa.qc.dpMean > 5')\n", " .filter_samples_expr('sa.keepThisSample', keep=True).num_samples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "(vds.annotate_variants_expr('va.keepThisVariant = va.qc.AF > 0.1 && va.qc.gqMean > 20')\n", " .filter_variants_expr('va.keepThisVariant').count_variants())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Key tables also have an [annotate](https://hail.is/hail/hail.KeyTable.html#hail.KeyTable.annotate) method. We can use this to produce new columns or redefine old ones:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kt.to_dataframe().show(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kt.annotate('is_american = SuperPopulation == \"AMR\"').to_dataframe().show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregables\n", "\n", "We've now seen how it's possible to use the Hail expression language to manipulate various things like numbers and arrays. We can compute the mean of an array of numbers with `.mean()`, find their max with `.max()`, and so on.\n", "\n", "But what if we wanted to compute the mean of **5 trillion** numbers? That's a lot of data, but it turns out to be roughly the number of genotypes in the preprocessed [gnomAD](http://gnomad.broadinstitute.org/) VCF, which contained about 20 thousand samples and 250 million variants. 
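That is $2 \\times 10^4 \\times 2.5 \\times 10^8 = 5 \\times 10^{12}$ genotypes. 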
Hail is designed to handle datasets of this size and larger, and does so by computing in parallel on many computers using [Apache Spark](http://spark.apache.org/).\n", "\n", "But we still want a simple programming model that allows us to query and transform such distributed data. That is where the `Aggregable` comes in. First, an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.map(g => g.gq).stats()').mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above statement computes the mean GQ of all genotypes in a dataset. This code can compute the mean GQ of a megabyte-scale Thousand Genomes subset on a laptop, or the mean GQ of a 300 TB .vcf on a massive cloud cluster. Hail is scalable!\n", "\n", "An `Aggregable[T]` is a distributed collection of elements of type `T`. The interface is modeled on `Array[T]`, but aggregables can be arbitrarily large and they are unordered, so they don't support operations like indexing.\n", "\n", "Aggregables support `map` and `filter`. Like `sum`, `max`, etc. on arrays, aggregables support operations which we call \"aggregators\" that operate on the entire aggregable collection and produce a summary or derived statistic. See the [documentation](https://hail.is/hail/types.html#aggregable) for a complete list of aggregators.\n", "\n", "Aggregables are available in expressions on various methods on [VariantDataset](https://hail.is/hail/hail.VariantDataset.html). Above, [query_genotypes](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.query_genotypes) exposes the aggregable `gs: Aggregable[Genotype]`, which is the collection of all the genotypes in the dataset.\n", "\n", "First, we map the genotypes to their GQ values. Then, we use the `stats()` aggregator to compute a struct with information like mean and standard deviation. We can see the other values in the struct produced as well:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "pprint(vds.query_genotypes('gs.map(g => g.gq).stats()'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Count\n", "\n", "The `count` aggregator is pretty simple -- it counts the number of elements in the aggregable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.count()')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.num_samples * vds.count_variants()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's one genotype per sample per variant, so the count of `gs` is equal to the number of samples times the number of variants, or about 11 million. How can we make this more useful? With `filter`!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.filter(g => g.isHet()).count()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of the 11 million genotypes in the dataset, about 2.5M are heterozygous.\n", "\n", "What about combining sample annotations with genotype information? How many heterozygous genotypes are found in the American samples? A simple way to implement this is by filtering to American samples first and then running the same query."
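] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, here's a sketch (reusing the `query_samples` method from above) that counts how many samples are annotated as American:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Count the samples whose SuperPopulation annotation is AMR.\n", "vds.query_samples('samples.filter(s => sa.metadata.SuperPopulation == \"AMR\").count()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, filtering to those samples and re-running the het-count query:"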
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "(vds.filter_samples_expr('sa.metadata.SuperPopulation == \"AMR\"')\n", " .query_genotypes('gs.filter(g => g.isHet()).count()'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a bit tricky -- aggregables have an extra \"context\" that they carry around. We can actually access the sample, sample annotations, variant, and variant annotations inside of operations on `gs`. We don't need to filter samples first; we can do it inside the query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.filter(g => g.isHet() && sa.metadata.SuperPopulation == \"AMR\").count()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an example where we use the variant annotations to count the number of heterozygous genotypes in Americans at rare loci." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('''gs.filter(g => g.isHet()\n", " && sa.metadata.SuperPopulation == \"AMR\"\n", " && va.qc.AF < 0.01).count()''')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sum\n", "\n", "The `sum` aggregator can be used to compute useful statistics per sample or variant. For example, we may want to count the total number of non-reference alleles per sample:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "(vds.annotate_samples_expr('sa.nNonRefAlleles = gs.map(g => g.nNonRefAlleles()).sum()')\n", " .query_samples('samples.map(s => sa.nNonRefAlleles).take(10)'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fraction\n", "\n", "The `fraction` aggregator can actually be implemented in terms of `filter` and two `count`s, but it's a common enough operation that we have a separate function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.summarize().call_rate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.fraction(g => g.isCalled())')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stats\n", "\n", "The `stats` aggregator computes six useful statistics about a numeric aggregable. We can get quality distributions per sample or variant easily with this function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint((vds.annotate_variants_expr('va.gq_stats = gs.map(g => g.gq).stats()')\n", " .query_variants('variants.map(v => {v: v, stats: va.gq_stats}).take(3)')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counter\n", "\n", "The `counter` aggregator counts the number of occurrences of each unique key in an aggregable. You may have seen it in the overview:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kt.query('SuperPopulation.counter()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's useful for other computations, too. 
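For instance, here's a quick sketch (reusing the `nNonRefAlleles` method from the Sum section) that tallies genotypes by the number of non-reference alleles they carry:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Tally genotypes by number of non-reference alleles (0, 1, or 2).\n", "vds.query_genotypes('gs.map(g => g.nNonRefAlleles()).counter()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "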
We can also compute the het counts by population per variant:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint((vds.annotate_variants_expr('''\n", " va.pop_counts = gs.filter(g => g.isHet())\n", " .map(g => sa.metadata.SuperPopulation).counter()''')\n", " .query_variants('variants.map(v => {at: str(v), populations: va.pop_counts}).take(3)')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## FlatMap\n", "\n", "`flatMap` is not an aggregator, but a transformer. It is like `map`, but takes a lambda function that returns either an `Array[T]` or `Set[T]`, and flattens the resulting collections into a single aggregable. Here's a didactic example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_samples('samples.count()')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_samples('samples.flatMap(s => [1,2,3]).count()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Take\n", "\n", "`take` is an aggregator that takes `n` elements of an aggregable, non-deterministically." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.take(5)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collect\n", "\n", "`collect` is an aggregator that collects all elements of an aggregable into an array. It's usually not a good idea to use this without `filter`ing the aggregable first. For example, we can collect the set of sample IDs that are called non-reference at each variant." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint((vds.filter_variants_expr('va.qc.AF < 0.01')\n", " .annotate_variants_expr('''\n", " va.nonref_samples = gs.filter(g => g.isCalledNonRef())\n", " .map(g => s).collect()''')\n", " .query_variants('variants.map(v => {at: str(v), nonref: va.nonref_samples}).take(3)')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## takeBy\n", "\n", "`takeBy` is an aggregator that takes `n` elements of an aggregable, ordered by the result of a lambda function (smallest to largest). We can easily select the variants with the lowest p-values after regression:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "top_5_pvals = (vds.linreg('sa.metadata.CaffeineConsumption')\n", " .query_variants('variants.map(v => {at: str(v), pval: va.linreg.pval}).takeBy(x => x.pval, 5)'))\n", "pprint(top_5_pvals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating by key\n", "\n", "The [aggregate_by_key](https://hail.is/hail/hail.KeyTable.html#hail.KeyTable.aggregate_by_key) method is likely the most powerful piece of query functionality in Hail. It's a method on [KeyTable](https://hail.is/hail/hail.KeyTable.html). You can produce key tables from a [VariantDataset](https://hail.is/hail/hail.VariantDataset.html) with three methods:\n", "\n", " - [variants_table()](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variants_table): a key table with the variant and variant annotations as columns. There is one row per variant.\n", " - [samples_table()](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.samples_table): a key table with the sample and sample annotations as columns. 
There is one row per sample.\n", " - [genotypes_table()](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.genotypes_table): a key table that is the coordinate representation of the genetic matrix. The columns are the variant, variant annotations, sample, sample annotations, and genotype. There is one row per variant/sample combination: `(N * M)` total rows!\n", "\n", "Using [aggregate_by_key](https://hail.is/hail/hail.KeyTable.html#hail.KeyTable.aggregate_by_key) with [genotypes_table](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.genotypes_table) can produce counts of loss-of-function variants in cases and controls per gene, compute the mean depth per sample per exon, and much more. You define the aggregation keys, and you define how to combine the rows. This method produces another [KeyTable](https://hail.is/hail/hail.KeyTable.html).\n", "\n", "We use it here to compute the mean depth and quality by alt allele type and population. This particular aggregation isn't very exciting, but it illustrates the complete flexibility of the model. You can group by gene, or by gene and consequence, or by frequency bin and gene, or any combination of groupings you think may be scientifically useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "agg = (vds.genotypes_table()\n", "          .aggregate_by_key(key_expr=['alt = v.altAllele()', 'pop = sa.metadata.Population'],\n", "                            agg_expr=['meanDP = g.map(g => g.dp).stats().mean',\n", "                                      'meanGQ = g.map(g => g.gq).stats().mean']))\n", "\n", "agg.to_dataframe().show(20)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }