{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "This notebook is designed to provide a broad overview of Hail's functionality, with emphasis on the functionality to manipulate and query a genetic dataset. We walk through a genome-wide SNP association test, and demonstrate the need to control for confounding caused by population stratification.\n", "\n", "Each notebook starts the same: we import the `hail` package and create a [HailContext](https://hail.is/hail/hail.HailContext.html). This object is the entry point for most Hail functionality." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from hail import *\n", "hc = HailContext()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the above cell ran without error, we're ready to go! \n", "\n", "Before using Hail, we import some standard Python libraries for use throughout the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import matplotlib.patches as mpatches\n", "from collections import Counter\n", "from math import log, isnan\n", "from pprint import pprint\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Installing and importing [seaborn](http://seaborn.pydata.org/installing.html) is optional; it just makes the plots prettier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# optional\n", "import seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for tutorial data or download if necessary\n", "\n", "This cell downloads the necessary data from Google Storage if it isn't found in the current working directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "if os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt'):\n", " print('All files are present and accounted for!')\n", "else:\n", " import sys\n", " sys.stderr.write('Downloading data (~50M) from Google Storage...\\n')\n", " import urllib\n", " import tarfile\n", " urllib.urlretrieve('https://storage.googleapis.com/hail-1kg/tutorial_data.tar',\n", " 'tutorial_data.tar')\n", " sys.stderr.write('Download finished!\\n')\n", " sys.stderr.write('Extracting...\\n')\n", " tarfile.open('tutorial_data.tar').extractall()\n", " if not (os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt')):\n", " raise RuntimeError('Something went wrong!')\n", " else:\n", " sys.stderr.write('Done!\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data from disk\n", "\n", "Hail has its own internal data representation, called a Variant Dataset (VDS). This is both an on-disk file format and a [Python object](https://hail.is/hail/hail.VariantDataset.html). See the [overview](https://hail.is/hail/overview.html) for a complete story. Here, we read a VDS from disk.\n", "\n", "This dataset was created by downsampling a public 1000 genomes VCF to about 50 MB." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds = hc.read('data/1kg.vds')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting to know our data\n", "\n", "It's important to have easy ways to slice, dice, query, and summarize a dataset. Some of these methods are demonstrated below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [summarize](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.summarize) method is useful for providing a broad overview of the data contained in a dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.summarize().report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [query_variants](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.query_variants) method is the first time we'll see the [Hail expression language](https://hail.is/hail/exprlang.html). The expression language allows for a variety of incredibly expressive queries and computations, but is probably the most complex part of Hail. See the pair of tutorials on the expression language to learn more!\n", "\n", "Here, we can use `query_variants` to pull out 5 variants to see what they look like." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_variants('variants.take(5)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are often several ways to do something in Hail. Here are two ways to peek at the first few sample IDs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "vds.query_samples('samples.take(5)')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.sample_ids[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a similar interface for looking at the genotypes in a dataset. We use [query_genotypes](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.query_genotypes) to look at the first few genotype calls." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "vds.query_genotypes('gs.take(5)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrate sample annotations\n", "\n", "Hail treats variant and sample annotations as first-class citizens. Annotations are usually a critical part of any genetic study. Sample annotations are where you'll store information about sample phenotypes, ancestry, sex, and covariates. Variant annotations can be used to store information like gene membership and functional impact for use in QC or analysis.\n", "\n", "In this tutorial, we demonstrate how to take a text file and use it to annotate the samples in a VDS. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "iPython supports various cell \"magics\". The `%%sh` magic is one which interprets the cell with bash, rather than Python. We can use this to look at the first few lines of our annotation file. This file contains the sample ID, the population and \"super-population\" designations, the sample sex, and two simulated phenotypes (one binary, one discrete)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%sh\n", "head data/1kg_annotations.txt | column -t " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file can be imported into Hail with [HailContext.import_table](https://hail.is/hail/hail.HailContext.html#hail.HailContext.import_table). This method produces a [KeyTable](https://hail.is/hail/hail.KeyTable.html#hail.KeyTable) object. Think of this as a Pandas or R dataframe that isn't limited by the memory on your machine -- behind the scenes, it's distributed with Spark." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "table = hc.import_table('data/1kg_annotations.txt', impute=True)\\\n", " .key_by('Sample')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good way to peek at the structure of a `KeyTable` is to look at its `schema`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "print(table.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python `pprint` method makes illegible printouts pretty:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(table.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although we used the `%%sh` magic to look at the first lines of the table, there's a better way. We can convert the table to a [Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html) and use its `.show()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "table.to_dataframe().show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll use this table to add sample annotations to our dataset. First, we'll print the schema of the sample annotations already there:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "pprint(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the [annotate_samples_table](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.annotate_samples_table) method to join the table with the VDS." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds = vds.annotate_samples_table(table, root='sa')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "pprint(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query functions and the Hail Expression Language\n", "\n", "We start by looking at some statistics of the information in our table. The [query](https://hail.is/hail/hail.KeyTable.html#hail.KeyTable.query) method uses the expression language to aggregate over the rows of the table." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`counter` is an aggregation function that counts the number of occurrences of each unique element. We can use this to pull out the population distribution." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(table.query('SuperPopulation.counter()'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`stats` is an aggregation function that produces some useful statistics about numeric collections. We can use this to see the distribution of the CaffeineConsumption phenotype." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(table.query('CaffeineConsumption.stats()'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, these metrics aren't perfectly representative of the samples in our dataset. Here's why:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "table.count()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.num_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are fewer samples in our dataset than in the full thousand genomes cohort, we need to look at annotations on the dataset. We can do this with [query_samples](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.query_samples)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds.query_samples('samples.map(s => sa.SuperPopulation).counter()')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "pprint(vds.query_samples('samples.map(s => sa.CaffeineConsumption).stats()'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The functionality demonstrated in the last few cells isn't anything especially new: it's certainly not difficult to ask these questions with Pandas or R dataframes, or even Unix tools like `awk`. But Hail can use the same interfaces and query language to analyze collections that are much larger, like the set of variants. \n", "\n", "Here we calculate the counts of each of the 12 possible unique SNPs (4 choices for the reference base * 3 choices for the alternate base). To do this, we need to map the variants to their alternate allele, filter to SNPs, and count by unique ref/alt pair:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "snp_counts = vds.query_variants('variants.map(v => v.altAllele()).filter(aa => aa.isSNP()).counter()')\n", "pprint(Counter(snp_counts).most_common())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's nice to see that we can actually uncover something biological from this small dataset: we see that these frequencies come in pairs. C/T and G/A are actually the same mutation, just viewed from from opposite strands. Likewise, T/A and A/T are the same mutation on opposite strands. There's a 30x difference between the frequency of C/T and A/T SNPs. Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same Python, R, and Unix tools could do this work as well, but we're starting to hit a wall - the latest [gnomAD release](http://gnomad.broadinstitute.org/) publishes about 250 million variants, and that won't fit in memory on a single computer.\n", "\n", "What about genotypes? Hail can query the collection of all genotypes in the dataset, and this is getting large even for our tiny dataset. 
Our 1,000 samples and 10,000 variants produce 10 million unique genotypes. The gnomAD dataset has about **5 trillion** unique genotypes.\n", "\n", "Here we will use the `hist` aggregator to produce and plot a histogram of DP values for genotypes in our thousand genomes dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dp_hist = vds.query_genotypes('gs.map(g => g.dp).hist(0, 30, 30)')\n", "plt.xlim(0, 31)\n", "plt.bar(dp_hist.binEdges[1:], dp_hist.binFrequencies)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quality Control\n", "\n", "QC is where analysts spend most of their time with sequencing datasets. QC is an iterative process, and is different for every project: there is no \"push-button\" solution for QC. Each time the Broad collects a new group of samples, it finds new batch effects. However, by practicing open science and discussing the QC process and decisions with others, we can establish a set of best practices as a community." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "QC is entirely based on the ability to understand the properties of a dataset. Hail attempts to make this easier by providing the [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) method, which produces a set of useful metrics as sample annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(vds.sample_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "vds = vds.sample_qc()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(vds.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interoperability is a big part of Hail. We can pull out these new metrics to a Pandas dataframe with one line of code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = vds.samples_table().to_pandas()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the QC metrics is a good place to start." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.clf()\n", "plt.subplot(1, 2, 1)\n", "plt.hist(df[\"sa.qc.callRate\"], bins=np.arange(.75, 1.01, .01))\n", "plt.xlabel(\"Call Rate\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlim(.75, 1)\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.hist(df[\"sa.qc.gqMean\"], bins = np.arange(0, 105, 5))\n", "plt.xlabel(\"Mean Sample GQ\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlim(0, 105)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, these metrics are correlated." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(df[\"sa.qc.dpMean\"], df[\"sa.qc.callRate\"], \n", " alpha=0.1)\n", "plt.xlabel('Mean DP')\n", "plt.ylabel('Call Rate')\n", "plt.xlim(0, 20)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Removing outliers from the dataset will generally improve association results. We can draw lines on the above plot to indicate outlier cuts. 
We'll want to remove all samples that fall in the bottom left quadrant." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "plt.scatter(df[\"sa.qc.dpMean\"], df[\"sa.qc.callRate\"], \n", " alpha=0.1)\n", "plt.xlabel('Mean DP')\n", "plt.ylabel('Call Rate')\n", "plt.xlim(0, 20)\n", "plt.axhline(0.97, c='k')\n", "plt.axvline(4, c='k')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's easy to filter when we've got the cutoff values decided:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds = vds.filter_samples_expr('sa.qc.dpMean >= 4 && sa.qc.callRate >= 0.97')\n", "print('After filter, %d/1000 samples remain.' % vds.num_samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next is genotype QC. To start, we'll print the post-sample-QC call rate. It's actually gone _up_ since the initial summary - dropping low-quality samples disproportionally removed missing genotypes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled)')\n", "print('pre QC call rate is %.3f' % call_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a good idea to filter out genotypes where the reads aren't where they should be: if we find a genotype called homozygous reference with >10% alternate reads, a genotype called homozygous alternate with >10% reference reads, or a genotype called heterozygote without a ref / alt balance near 1:1, it is likely to be an error." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "filter_condition_ab = '''let ab = g.ad[1] / g.ad.sum() in\n", " ((g.isHomRef && ab <= 0.1) ||\n", " (g.isHet && ab >= 0.25 && ab <= 0.75) ||\n", " (g.isHomVar && ab >= 0.9))'''\n", "vds = vds.filter_genotypes(filter_condition_ab)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "post_qc_call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled)')\n", "print('post QC call rate is %.3f' % post_qc_call_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variant QC is a bit more of the same: we can use the [variant_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variant_qc) method to produce a variety of useful statistics, plot them, and filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(vds.variant_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [cache](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.cache) is used to optimize some of the downstream operations." 
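, " Caching asks Spark to keep the annotated dataset in memory, so the table conversion and plots below don't recompute the QC annotations each time they're touched.\n", "\n", "An aside, shown as a sketch rather than a runnable cell (this assumes `persist`, the memory-and-disk counterpart of `cache`, is available on `VariantDataset` in your Hail version):\n", "\n", "```python\n", "# like cache(), but can spill to disk when memory runs short\n", "vds = vds.variant_qc().persist()\n", "```"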
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vds = vds.variant_qc().cache()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(vds.variant_schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "variant_df = vds.variants_table().to_pandas()\n", "\n", "plt.clf()\n", "plt.subplot(2, 2, 1)\n", "variantgq_means = variant_df[\"va.qc.gqMean\"]\n", "plt.hist(variantgq_means, bins = np.arange(0, 84, 2))\n", "plt.xlabel(\"Variant Mean GQ\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlim(0, 80)\n", "\n", "plt.subplot(2, 2, 2)\n", "variant_mleaf = variant_df[\"va.qc.AF\"]\n", "plt.hist(variant_mleaf, bins = np.arange(0, 1.05, .025))\n", "plt.xlabel(\"Minor Allele Frequency\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlim(0, 1)\n", "\n", "plt.subplot(2, 2, 3)\n", "plt.hist(variant_df['va.qc.callRate'], bins = np.arange(0, 1.05, .01))\n", "plt.xlabel(\"Variant Call Rate\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlim(.5, 1)\n", "\n", "plt.subplot(2, 2, 4)\n", "plt.hist(variant_df['va.qc.pHWE'], bins = np.arange(0, 1.05, .025))\n", "plt.xlabel(\"Hardy-Weinberg Equilibrium p-value\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlim(0, 1)\n", "\n", "plt.tight_layout()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These statistics actually look pretty good: we don't need to filter this dataset. Most datasets require thoughtful quality control, though. The [filter_variants_expr](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_variants_expr) method can help!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's do a GWAS!\n", "\n", "First, we need to restrict to variants that are : \n", "\n", " - common (we'll use a cutoff of 1%)\n", " - uncorrelated (not in linkage disequilibrium)\n", " \n", "Both of these are easy in Hail." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "common_vds = (vds\n", " .filter_variants_expr('va.qc.AF > 0.01')\n", " .ld_prune(memory_per_core=256, num_cores=4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "common_vds.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These filters removed about 15% of sites (we started with a bit over 10,000). This is _NOT_ representative of most sequencing datasets! We have already downsampled the full thousand genomes dataset to include more common variants than we'd expect by chance.\n", "\n", "In Hail, the association tests accept sample annotations for the sample phenotype and covariates. Since we've already got our phenotype of interest (caffeine consumption) in the dataset, we are good to go:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gwas = common_vds.linreg('sa.CaffeineConsumption')\n", "pprint(gwas.variant_schema) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the bottom of the above printout, you can see the linear regression adds new variant annotations for the beta, standard error, t-statistic, and p-value." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def qqplot(pvals, xMax, yMax):\n", " spvals = sorted(filter(lambda x: x and not(isnan(x)), pvals))\n", " exp = [-log(float(i) / len(spvals), 10) for i in np.arange(1, len(spvals) + 1, 1)]\n", " obs = [-log(p, 10) for p in spvals]\n", " plt.clf()\n", " plt.scatter(exp, obs)\n", " plt.plot(np.arange(0, max(xMax, yMax)), c=\"red\")\n", " plt.xlabel(\"Expected p-value (-log10 scale)\")\n", " plt.ylabel(\"Observed p-value (-log10 scale)\")\n", " plt.xlim(0, xMax)\n", " plt.ylim(0, yMax)\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python makes it easy to make a [Q-Q (quantile-quantile) plot](https://en.wikipedia.org/wiki/Q-Q_plot)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "qqplot(gwas.query_variants('variants.map(v => va.linreg.pval).collect()'), \n", " 5, 6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confounded!\n", "\n", "The observed p-values drift away from the expectation immediately. Either every SNP in our dataset is causally linked to caffeine consumption (unlikely), or there's a confounder.\n", "\n", "We didn't tell you, but sample ancestry was actually used to simulate this phenotype. This leads to a [stratified](https://en.wikipedia.org/wiki/Population_stratification) distribution of the phenotype. The solution is to include ancestry as a covariate in our regression. \n", "\n", "The [linreg](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.linreg) method can also take sample annotations to use as covariates. We already annotated our samples with reported ancestry, but it is good to be skeptical of these labels due to human error. Genomes don't have that problem! Instead of using reported ancestry, we will use genetic ancestry by including computed principal components in our model.\n", "\n", "The [pca](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.pca) method produces sample PCs in sample annotations, and can also produce variant loadings and global eigenvalues when asked." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pca = common_vds.pca('sa.pca', k=5, eigenvalues='global.eigen')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(pca.globals)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pprint(pca.sample_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've got principal components per sample, we may as well plot them! Human history exerts a strong effect in genetic datasets. Even with a 50MB sequencing dataset, we can recover the major human populations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pca_table = pca.samples_table().to_pandas()\n", "colors = {'AFR': 'green', 'AMR': 'red', 'EAS': 'black', 'EUR': 'blue', 'SAS': 'cyan'}\n", "plt.scatter(pca_table[\"sa.pca.PC1\"], pca_table[\"sa.pca.PC2\"], \n", " c = pca_table[\"sa.SuperPopulation\"].map(colors), \n", " alpha = .5)\n", "plt.xlim(-0.6, 0.6)\n", "plt.xlabel(\"PC1\")\n", "plt.ylabel(\"PC2\")\n", "legend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]\n", "plt.legend(handles=legend_entries, loc=2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can rerun our linear regression, controlling for the first few principal components and sample sex." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pvals = (common_vds\n", " .annotate_samples_table(pca.samples_table(), expr='sa.pca = table.pca')\n", " .linreg('sa.CaffeineConsumption', covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.isFemale'])\n", " .query_variants('variants.map(v => va.linreg.pval).collect()'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "qqplot(pvals, 5, 6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pvals = (common_vds\n", " .annotate_samples_table(pca.samples_table(), expr='sa.pca = table.pca')\n", " .linreg('sa.CaffeineConsumption', \n", " covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.isFemale'],\n", " use_dosages=True)\n", " .query_variants('variants.map(v => va.linreg.pval).collect()'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "qqplot(pvals, 5, 6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's more like it! We may not be publishing ten new coffee-drinking loci in _Nature_, but we shouldn't expect to find anything but the strongest signals from a dataset of 1000 individuals anyway. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rare variant analysis\n", "\n", "Hail doesn't yet have rare variant kernel-based methods, but we have [linear](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.linreg_burden) and [logistic](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.logreg_burden) burden tests. \n", "\n", "We won't be showing those here, though. Instead, we'll demonstrate how one can use the expression language to group and count by any arbitrary properties in variant or sample annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "kt = (vds.genotypes_table()\n", " .aggregate_by_key(key_expr=['pop = sa.SuperPopulation', 'chromosome = v.contig'],\n", " agg_expr=['n_het = g.filter(g => g.isHet()).count()']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kt.to_dataframe().show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we want to group by minor allele frequency bin and hair color, and calculate the mean GQ?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kt2 = (vds.genotypes_table()\n", " .aggregate_by_key(key_expr=['''maf_bin = if (va.qc.AF < 0.01) \"< 1%\" \n", " else if (va.qc.AF < 0.05) \"1%-5%\"\n", " else \"> 5%\" ''',\n", " 'purple_hair = sa.PurpleHair'],\n", " agg_expr=['mean_gq = g.map(g => g.gq).stats().mean',\n", " 'mean_dp = g.map(g => g.dp).stats().mean']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kt2.to_dataframe().show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've shown that it's easy to aggregate by a couple of arbitrary statistics. This specific examples may not provide especially useful pieces of information, but this same pattern can be used to detect effects of rare variation:\n", "\n", " - Count the number of heterozygous genotypes per gene by functional category (synonymous, missense, or loss-of-function) to estimate per-gene functional constraint\n", " - Count the number of singleton loss-of-function mutations per gene in cases and controls to detect genes involved in disease" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eplilogue\n", "\n", "Congrats! If you've made it this far, you're perfectly primed to read the [Overview](https://hail.is/hail/overview.html), look through the [Hail objects](https://hail.is/hail/types.html) representing many core concepts in genetics, and check out the many Hail functions defined in the [Python API](https://hail.is/hail/api.html). If you use Hail for your own science, we'd love to hear from you on [Gitter chat](https://gitter.im/hail-is/hail) or the [discussion forum](http://discuss.hail.is).\n", "\n", "There's also a lot of functionality inside Hail that we didn't get to in this broad overview. Things like:\n", "\n", " - Flexible import and export to a variety of data and annotation formats (VCF, BGEN, PLINK, JSON, TSV, ...)\n", " - Simulation\n", " - Burden tests\n", " - Kinship and pruning (IBD, GRM, RRM)\n", " - Family-based tests and utilities\n", " - Distributed file system utilities\n", " - Interoperability with Python and Spark machine learning libraries\n", " - More!\n", "\n", "For reference, here's the full workflow to all tutorial endpoints combined into one cell. It may take a minute! It's doing a lot of work." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "table = hc.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')\n", "common_vds = (hc.read('data/1kg.vds')\n", " .annotate_samples_table(table, root='sa')\n", " .sample_qc()\n", " .filter_samples_expr('sa.qc.dpMean >= 4 && sa.qc.callRate >= 0.97')\n", " .filter_genotypes('''let ab = g.ad[1] / g.ad.sum() in\n", " ((g.isHomRef && ab <= 0.1) ||\n", " (g.isHet && ab >= 0.25 && ab <= 0.75) ||\n", " (g.isHomVar && ab >= 0.9))''')\n", " .variant_qc()\n", " .filter_variants_expr('va.qc.AF > 0.01')\n", " .ld_prune(memory_per_core=512, num_cores=4))\n", "\n", "pca = common_vds.pca('sa.pca', k=5, eigenvalues='global.eigen')\n", "pvals = (common_vds\n", " .annotate_samples_table(pca.samples_table(), expr='sa.pca = table.pca')\n", " .linreg('sa.CaffeineConsumption', covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.isFemale'])\n", " .query_variants('variants.map(v => va.linreg.pval).collect()'))" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }