MatrixTable Tutorial

If you’ve gotten this far, you’re probably thinking:

  • “Can’t I do all of this in pandas or R?”
  • “What does this have to do with biology?”

The two crucial features that Hail adds are scalability and the domain-specific primitives needed to work easily with biological data. Fear not! You’ve learned most of the basic concepts of Hail and now are ready for the bit that makes it possible to represent and compute on genetic matrices: the MatrixTable.

In the last example of the Table Joins Tutorial, the ratings table had a compound key: movie_id and user_id. The ratings were secretly a movie-by-user matrix!

However, since this matrix is very sparse, it is reasonably represented in a so-called “coordinate form” Table, where each row of the table is an entry of the sparse matrix. For large and dense matrices (like sequencing data), the per-row overhead of coordinate representations is untenable. That’s why we built MatrixTable, a 2-dimensional generalization of Table.
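To make the storage trade-off concrete, here is a small pure-Python sketch (not Hail code) contrasting the two representations; the field names echo the movie-ratings example and are purely illustrative.

```python
# Coordinate ("long") form: one record per entry, with the keys
# (movie_id, user_id) repeated in every record.
coord = [
    {"movie_id": 1, "user_id": "a", "rating": 5},
    {"movie_id": 1, "user_id": "b", "rating": 3},
    {"movie_id": 2, "user_id": "a", "rating": 4},
]

# Dense matrix form: each key is stored exactly once, and entries are
# addressed by position. Missing entries cost one cell, not zero records.
movie_ids = [1, 2]
user_ids = ["a", "b"]
ratings = [
    [5, 3],       # movie 1
    [4, None],    # movie 2: user "b" has no rating
]

# Both forms encode the same information:
assert coord[1]["rating"] == 3
assert ratings[0][1] == 3
```

For a sparse matrix the coordinate form wins (absent entries cost nothing); for a dense matrix the repeated keys dominate, which is the overhead MatrixTable avoids.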

MatrixTable Anatomy

Recall that Table has two kinds of fields:

  • global fields
  • row fields

MatrixTable has four kinds of fields:

  • global fields
  • row fields
  • column fields
  • entry fields

Row fields are fields that are stored once per row. These can contain information about the rows, or summary data calculated per row.

Column fields are stored once per column. These can contain information about the columns, or summary data calculated per column.

Entry fields are the piece that makes this structure a matrix – there is an entry for each (row, column) pair.
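The four namespaces can be pictured with a toy pure-Python model (illustrative only; this is not how Hail actually stores a MatrixTable, and the field values are made up):

```python
# A toy model of MatrixTable's four field namespaces.
toy_mt = {
    "globals": {"dataset": "toy-example"},                       # stored once
    "rows":    [{"locus": "1:904165"}, {"locus": "1:909917"}],   # one per row
    "cols":    [{"s": "HG00096"}, {"s": "HG00099"}],             # one per column
    # one entry struct per (row, column) pair:
    "entries": [[{"GT": "0/0"}, {"GT": "0/1"}],
                [{"GT": "0/0"}, {"GT": "1/1"}]],
}

n_rows, n_cols = len(toy_mt["rows"]), len(toy_mt["cols"])

# The entry fields form an n_rows x n_cols grid:
assert len(toy_mt["entries"]) == n_rows
assert all(len(row) == n_cols for row in toy_mt["entries"])
```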

Importing and Reading

Like tables, matrix tables can be imported from a variety of formats: VCF, (B)GEN, PLINK, TSV, etc. Matrix tables can also be read from a “native” matrix table format. Let’s read a sample of prepared 1KG data.

In [1]:
import hail as hl
import seaborn
from bokeh.io import output_notebook, show
output_notebook()
seaborn.set()

hl.utils.get_1kg('data/')
Loading BokehJS ...
Initializing Spark and Hail with default parameters...
Running on Apache Spark version 2.2.0
SparkUI available at http://10.32.4.4:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.5-2595d91d83e0
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181215-1614-0.2.5-2595d91d83e0.log
2018-12-15 16:14:03 Hail: INFO: 1KG files found
In [2]:
mt = hl.read_matrix_table('data/1kg.mt')
mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        BaseQRankSum: float64,
        ClippingRankSum: float64,
        DP: int32,
        DS: bool,
        FS: float64,
        HaplotypeScore: float64,
        InbreedingCoeff: float64,
        MLEAC: array<int32>,
        MLEAF: array<float64>,
        MQ: float64,
        MQ0: int32,
        MQRankSum: float64,
        QD: float64,
        ReadPosRankSum: float64,
        set: str
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

There are a few things to note:

  • There is a single column field s. This is the sample ID from the VCF. It is also the column key.
  • There is a compound row key: locus and alleles.
    • locus has type locus<GRCh37>
    • alleles has type array<str>
  • GT has type call. That’s a genotype call!

Whereas table expressions could be indexed by nothing or indexed by rows, matrix table expressions have four options: nothing, indexed by row, indexed by column, or indexed by row and column (the entries). Let’s see some examples.

In [3]:
mt.s.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fe575dc6898>
Index:
    ['column']
--------------------------------------------------------
In [4]:
mt.GT.describe()
--------------------------------------------------------
Type:
    call
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fe575dc6898>
Index:
    ['column', 'row']
--------------------------------------------------------

MatrixTable operations

We belabored the operations on tables because they all have natural analogs (sometimes several) on matrix tables. For example:

  • count => count_{rows, cols} (and count which returns both)
  • filter => filter_{rows, cols, entries}
  • annotate => annotate_{rows, cols, entries} (and globals for both)
  • select => select_{rows, cols, entries} (and globals for both)
  • transmute => transmute_{rows, cols, entries} (and globals for both)
  • group_by => group_{rows, cols}_by
  • explode => explode_{rows, cols}
  • aggregate => aggregate_{rows, cols, entries}

Some operations are unique to MatrixTable:

  • The row fields can be accessed as a Table with rows
  • The column fields can be accessed as a Table with cols.
  • The entire field space of a MatrixTable can be accessed as a coordinate-form Table with entries. Be careful with this! While it’s fast to aggregate or query, trying to write this Table to disk could produce files thousands of times larger than the corresponding MatrixTable.

Let’s explore mt using these tools, starting with the size of the dataset.

In [5]:
mt.count() # (rows, cols)
Out[5]:
(10961, 284)
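These dimensions also quantify the earlier warning about entries(): the coordinate-form Table materializes one row per (row, column) pair, with the locus, alleles, and sample keys repeated in every one of those rows. A quick back-of-the-envelope check in plain Python:

```python
# Dimensions reported by mt.count() above.
n_rows, n_cols = 10961, 284

# entries() produces one Table row per (row, column) pair:
n_entry_rows = n_rows * n_cols
assert n_entry_rows == 3_112_924
```

So writing the entries Table to disk means storing the row and column keys more than three million times instead of 10,961 + 284 times.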

Let’s look at the first few row keys (variants) and column keys (sample IDs).

In [6]:
mt.rows().select().show()
+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh37> | array<str> |
+---------------+------------+
| 1:904165      | ["G","A"]  |
| 1:909917      | ["G","A"]  |
| 1:986963      | ["C","T"]  |
| 1:1563691     | ["T","G"]  |
| 1:1707740     | ["T","G"]  |
| 1:2252970     | ["C","T"]  |
| 1:2284195     | ["T","C"]  |
| 1:2779043     | ["T","C"]  |
| 1:2944527     | ["G","A"]  |
| 1:3761547     | ["C","A"]  |
+---------------+------------+
showing top 10 rows

In [7]:
mt.s.show()
+-----------+
| s         |
+-----------+
| str       |
+-----------+
| "HG00096" |
| "HG00099" |
| "HG00105" |
| "HG00118" |
| "HG00129" |
| "HG00148" |
| "HG00177" |
| "HG00182" |
| "HG00242" |
| "HG00254" |
+-----------+
showing top 10 rows

Let’s investigate the genotypes and the call rate, starting with the first few genotypes:

In [8]:
mt.GT.show()
+---------------+------------+-----------+------+
| locus         | alleles    | s         | GT   |
+---------------+------------+-----------+------+
| locus<GRCh37> | array<str> | str       | call |
+---------------+------------+-----------+------+
| 1:904165      | ["G","A"]  | "HG00096" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00099" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00105" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00118" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00129" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00148" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00177" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00182" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00242" | 0/0  |
| 1:904165      | ["G","A"]  | "HG00254" | 0/0  |
+---------------+------------+-----------+------+
showing top 10 rows

All homozygous reference, which is not surprising. Let’s look at the distribution of genotype calls:

In [9]:
mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))
Out[9]:
{0: 1741192, 1: 747875, 2: 582108, None: 41749}
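This counter already contains enough information to recover the overall call rate computed in the next cell: defined genotypes divided by total entries. A quick check in plain Python:

```python
# Genotype counts from the counter aggregation above; the None key
# counts missing calls.
counts = {0: 1741192, 1: 747875, 2: 582108, None: 41749}

total = sum(counts.values())
defined = total - counts[None]
call_rate = defined / total

assert total == 10961 * 284          # one entry per (variant, sample) pair
assert round(call_rate, 6) == 0.986588
```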

Let’s compute the overall call rate directly, and then plot the distribution of call rate per variant.

In [10]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
Out[10]:
0.986588493647773

Here’s a nice trick: you can use an aggregator inside annotate_rows and it will aggregate over columns, that is, summarize the values in the row using the aggregator. Let’s compute and plot call rate per variant.

In [11]:
mt2 = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
mt2.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        BaseQRankSum: float64,
        ClippingRankSum: float64,
        DP: int32,
        DS: bool,
        FS: float64,
        HaplotypeScore: float64,
        InbreedingCoeff: float64,
        MLEAC: array<int32>,
        MLEAF: array<float64>,
        MQ: float64,
        MQ0: int32,
        MQRankSum: float64,
        QD: float64,
        ReadPosRankSum: float64,
        set: str
    }
    'call_rate': float64
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
In [12]:
p = hl.plot.histogram(mt2.call_rate, range=(0,1.0), bins=100,
                      title='Variant Call Rate Histogram', legend='Call Rate')
show(p)
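The "aggregate across columns within each row" trick can be mimicked on a tiny matrix in plain Python (illustrative only, not Hail code; the genotype values are made up):

```python
# A tiny genotype matrix: rows are variants, columns are samples,
# and None marks a missing call.
gt = [
    [0, 1, None, 2],
    [0, 0, 0, 0],
    [None, None, 1, 0],
]

# annotate_rows with an aggregator summarizes across the columns within
# each row; here that means a per-row (per-variant) call rate.
call_rate = [sum(g is not None for g in row) / len(row) for row in gt]

assert call_rate == [0.75, 1.0, 0.5]
```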

Exercise: GQ vs DP

In this exercise, you’ll use Hail to investigate a strange property of sequencing datasets.

The DP field is the sequencing depth (the number of reads).

Let’s first plot a histogram of DP:

In [13]:
p = hl.plot.histogram(mt.DP, range=(0,40), bins=40, title='DP Histogram', legend='DP')
show(p)

Now, let’s do the same thing for GQ.

The GQ field is the phred-scaled “genotype quality”. The formula to convert to a linear-scale confidence (0 to 1) is 10 ** -(mt.GQ / 10). GQ is truncated to lie between 0 and 99.
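As a sanity check on the formula, here is the phred-to-linear conversion in plain Python (the function name is just for illustration):

```python
def phred_to_error_prob(gq):
    """Convert a phred-scaled quality to a linear-scale error probability."""
    return 10 ** -(gq / 10)

# Every 10 points of GQ is another factor of 10 in confidence:
assert abs(phred_to_error_prob(10) - 0.1) < 1e-12
assert abs(phred_to_error_prob(20) - 0.01) < 1e-12
assert abs(phred_to_error_prob(30) - 0.001) < 1e-12
```

So a call with GQ 99 (the truncation ceiling) has an error probability of roughly 1.3e-10.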

In [14]:
p = hl.plot.histogram(mt.GQ, range=(0,100), bins=100, title='GQ Histogram', legend='GQ')
show(p)

Whoa! That’s a strange distribution! There’s a big spike at 100. The rest of the values have roughly the same shape as the DP distribution, but form a Dimetrodon. Use Hail to figure out what’s going on!