MatrixTable Overview

A MatrixTable is a distributed two-dimensional extension of a Table.

Unlike a table, which has two field groups (row fields and global fields), a matrix table consists of four components:

a two-dimensional matrix of entry fields where each entry is indexed by row key(s) and column key(s)
a corresponding rows table that stores all of the row fields that are constant for every column in the dataset
a corresponding columns table that stores all of the column fields that are constant for every row in the dataset
a set of global fields that are constant for every entry in the dataset

There are different operations on the matrix for each field group. For instance, Table has Table.select() and Table.select_globals(), while MatrixTable has MatrixTable.select_rows(), MatrixTable.select_cols(), MatrixTable.select_entries(), and MatrixTable.select_globals().

It is possible to represent matrix data by coordinate in a table, storing one record per entry of the matrix. However, the MatrixTable represents this data far more efficiently and exposes natural interfaces for computing on it.

The MatrixTable.rows() and MatrixTable.cols() methods return the row and column fields as separate tables. The MatrixTable.entries() method returns the matrix as a table in coordinate form. Use this object with caution, because this representation is costly to compute and is significantly larger in memory.
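
To make the size difference concrete, here is a minimal plain-Python model of a tiny matrix table and its coordinate form. It is not Hail code: the field values and the helper to_coordinate_form are made up for illustration. Every row field and column field is repeated once per entry, which is why the coordinate table grows so quickly.

```python
# Plain-Python model of the two representations (not Hail code).
rows = [{"locus": "20:10019093", "rsid": "rs1"},
        {"locus": "20:10026348", "rsid": "rs2"}]
cols = [{"s": "sample_A"}, {"s": "sample_B"}, {"s": "sample_C"}]
entries = [[{"DP": 10}, {"DP": 7}, {"DP": 3}],     # entries[i][j] belongs to
           [{"DP": 22}, {"DP": 0}, {"DP": 15}]]    # row i and column j

def to_coordinate_form(rows, cols, entries):
    """Flatten to one record per (row, column) pair, copying row and column fields."""
    records = []
    for i, row in enumerate(rows):
        for j, col in enumerate(cols):
            records.append({**row, **col, **entries[i][j]})
    return records

coordinate = to_coordinate_form(rows, cols, entries)
print(len(coordinate))  # 6 records: 2 rows x 3 columns
```

In this model the row field rsid is stored twice in the matrix form but six times in the coordinate form.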

Keys

Matrix tables have keys just as tables do. However, instead of one key, matrix tables have two keys: a row key and a column key. Row fields are indexed by the row key, column fields are indexed by the column key, and entry fields are indexed by the row key and the column key. The key structs can be accessed with MatrixTable.row_key and MatrixTable.col_key. It is possible to change the keys with MatrixTable.key_rows_by() and MatrixTable.key_cols_by().

Due to the data representation of a matrix table, changing a row key is often an expensive operation.

Referencing Fields

All fields (row, column, global, entry) are top-level and exposed as attributes on the MatrixTable object. For example, if the matrix table mt had a row field locus, this field could be referenced with either mt.locus or mt['locus']. The former access pattern does not work with field names with spaces or punctuation.

The result of referencing a field from a matrix table is an Expression which knows its type, its source matrix table, and whether it is a row field, column field, entry field, or global field. Hail uses this context to know which operations are allowed for a given expression.

When evaluated in a Python interpreter, we can see mt.locus is a LocusExpression with type locus<GRCh37>.

>>> mt  
<hail.matrixtable.MatrixTable at 0x1107e54a8>

>>> mt.locus  
<LocusExpression of type locus<GRCh37>>

Likewise, mt.DP is an Int32Expression with type int32 and is an entry field of mt.

Hail expressions can also describe themselves with Expression.describe(), providing information about their source matrix table or table and which keys index the expression, if any. For example, mt.DP.describe() tells us that mt.DP has type int32 and is an entry field of mt, since it is indexed by both rows and columns:

>>> mt.DP.describe()  
--------------------------------------------------------
Type:
    int32
--------------------------------------------------------
Source:
    <class 'hail.matrixtable.MatrixTable'>
Index:
    ['row', 'column']
--------------------------------------------------------

Import

Text files may be imported with import_matrix_table(). Additionally, Hail provides functions to import genetic datasets as matrix tables from a variety of file formats: import_vcf(), import_plink(), import_bgen(), and import_gen().

>>> mt = hl.import_vcf('data/sample.vcf.bgz')

The MatrixTable.describe() method prints all fields in the table and their types, as well as the keys.

>>> mt.describe()  
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        NEGATIVE_TRAIN_SITE: bool,
        AC: array<int32>,
        ...
        DS: bool
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key:
    's': str
Row key:
    'locus': locus<GRCh37>
    'alleles': array<str>
----------------------------------------

Common Operations

Like tables, Hail provides a number of methods for manipulating data in a matrix table.

Filter

MatrixTable has three methods to filter based on expressions: MatrixTable.filter_rows(), MatrixTable.filter_cols(), and MatrixTable.filter_entries().

Filter methods take a BooleanExpression argument. These expressions are generated by applying computations to the fields of the matrix table:

>>> filt_mt = mt.filter_rows(hl.len(mt.alleles) == 2)

>>> filt_mt = mt.filter_cols(hl.agg.mean(mt.GQ) < 20)

>>> filt_mt = mt.filter_entries(mt.DP < 5)

These expressions can compute arbitrarily over the data: the MatrixTable.filter_cols() example above aggregates entries per column of the matrix table to compute the mean of the GQ field, and keeps only the columns where the result is smaller than 20. Filter methods keep the rows, columns, or entries for which the expression evaluates to true.
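
The per-column aggregation behind a column filter can be pictured with a small plain-Python model. This is only a sketch, not Hail code: the data, the helper filter_cols_by_mean, and the handling of all-missing columns are assumptions for illustration. Columns are kept when the predicate on their mean is true.

```python
from statistics import mean

# Plain-Python model (not Hail code); None stands in for a missing entry.
samples = ["s1", "s2", "s3"]        # column keys
gq = [                              # one inner list per row, one value per column
    [10, 50, None],
    [20, 70, 15],
    [12, 90, 18],
]

def filter_cols_by_mean(samples, matrix, predicate):
    """Keep the columns whose mean over non-missing entries satisfies predicate."""
    kept = []
    for j, sample in enumerate(samples):
        values = [row[j] for row in matrix if row[j] is not None]
        # Assumption: a column with no defined values has a missing mean and is dropped.
        if values and predicate(mean(values)):
            kept.append(sample)
    return kept

low_gq_samples = filter_cols_by_mean(samples, gq, lambda m: m < 20)
print(low_gq_samples)  # ['s1', 's3']
```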

Annotate

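MatrixTable has four methods to add new fields or update existing fields: MatrixTable.annotate_globals(), MatrixTable.annotate_rows(), MatrixTable.annotate_cols(), and MatrixTable.annotate_entries().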

Annotate methods take keyword arguments where the key is the name of the new field to add and the value is an expression specifying what should be added.

The simplest example is adding a new global field foo that just contains the constant 5.

>>> mt_new = mt.annotate_globals(foo = 5)
>>> print(mt_new.globals.dtype.pretty())
struct {
    foo: int32
}

Another example is adding a new row field call_rate which computes, per row, the fraction of entries with a non-missing GT:

>>> mt_new = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))

Annotate methods are also useful for updating values. For example, to update the GT entry field to be missing if GQ is less than 20, we can do the following:

>>> mt_new = mt.annotate_entries(GT = hl.or_missing(mt.GQ >= 20, mt.GT))
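
The two annotations above interact: masking low-quality genotypes lowers the call rate. The following plain-Python sketch models that for a single row. It is not Hail code; the helpers or_missing and fraction_defined are hypothetical stand-ins, and treating a missing condition as "not true" is an assumption of this model.

```python
# Plain-Python model (not Hail code); None stands in for a missing value.
gt = ["0/1", None, "1/1", "0/0"]   # GT entries for one row
gq = [35, 40, 12, None]            # matching GQ entries

def or_missing(condition, value):
    """Return value when condition is True; otherwise (False or missing) None."""
    return value if condition is True else None

def fraction_defined(values):
    """Fraction of values that are not missing."""
    return sum(v is not None for v in values) / len(values)

call_rate_before = fraction_defined(gt)

# Keep GT only where GQ >= 20; a missing GQ gives a missing condition.
masked_gt = [or_missing(None if q is None else q >= 20, g)
             for g, q in zip(gt, gq)]
call_rate_after = fraction_defined(masked_gt)

print(call_rate_before, call_rate_after)  # 0.75 0.25
```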

Select

Select is used to create a new schema for a dimension of the matrix table. Key fields are always preserved even when not selected. For example, following the matrix table schemas from importing a VCF file (shown above), to create a hard calls dataset where each entry only contains the GT field we can do the following:

>>> mt_new = mt.select_entries('GT')
>>> print(mt_new.entry.dtype.pretty())
struct {
    GT: call
}

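MatrixTable has four select methods that select and create new fields: MatrixTable.select_globals(), MatrixTable.select_rows(), MatrixTable.select_cols(), and MatrixTable.select_entries().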

Each method accepts strings referring to top-level fields, attribute references (useful for accessing nested fields), and keyword arguments KEY=VALUE to compute new fields. The Python unpack operator ** can be used to specify that all fields of a Struct should become top-level fields. However, be aware that all top-level field names must be unique. In the following example, **mt['info'] would fail if DP already exists as an entry field.

>>> mt_new = mt.select_rows(**mt['info']) 

The example below creates two new row fields. Keys are always preserved, so the row keys locus and alleles will also be present in the new matrix table. AC = mt.info.AC turns the subfield AC into a top-level field.

>>> mt_new = mt.select_rows(AC = mt.info.AC,
...                         n_filters = hl.len(mt['filters']))

The order of the fields entered as arguments will be maintained in the new matrix table.

Drop

As the complement of the select methods, MatrixTable.drop() can remove any top-level field. An example of removing the GQ entry field is:

>>> mt_new = mt.drop('GQ')

Explode

Explode operations are used to unpack a row or column field that is of type array or set.

One use case of explode is to duplicate rows:

>>> mt_new = mt.annotate_rows(replicate_num = [1, 2])
>>> mt_new = mt_new.explode_rows(mt_new['replicate_num'])
>>> mt.count_rows()
346

>>> mt_new.count_rows()
692

>>> mt_new.replicate_num.show() 
+---------------+------------+---------------+
| locus         | alleles    | replicate_num |
+---------------+------------+---------------+
| locus<GRCh37> | array<str> |         int32 |
+---------------+------------+---------------+
| 20:10019093   | ["A","G"]  |             1 |
| 20:10019093   | ["A","G"]  |             2 |
| 20:10026348   | ["A","G"]  |             1 |
| 20:10026348   | ["A","G"]  |             2 |
| 20:10026357   | ["T","C"]  |             1 |
| 20:10026357   | ["T","C"]  |             2 |
| 20:10030188   | ["T","A"]  |             1 |
| 20:10030188   | ["T","A"]  |             2 |
| 20:10030452   | ["G","A"]  |             1 |
| 20:10030452   | ["G","A"]  |             2 |
+---------------+------------+---------------+
showing top 10 rows
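
The effect of an explode can be modeled in a few lines of plain Python. This is a sketch, not Hail code: the helper explode_rows and the sample rows are hypothetical, and dropping rows whose array is empty or missing is an assumption of this model.

```python
# Plain-Python model (not Hail code) of exploding a row field of type array.
rows = [
    {"locus": "20:10019093", "replicate_num": [1, 2]},
    {"locus": "20:10026348", "replicate_num": [1, 2]},
    {"locus": "20:10026357", "replicate_num": []},   # empty array
]

def explode_rows(rows, field):
    """Emit one copy of each row per element of row[field]."""
    exploded = []
    for row in rows:
        # Assumption: an empty or missing (None) collection yields no output rows.
        for element in row[field] or []:
            exploded.append({**row, field: element})
    return exploded

exploded = explode_rows(rows, "replicate_num")
print(len(exploded))  # 4
```

Each output row carries a single element in place of the original array, so the field's type changes from an array of int to an int.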

Aggregation

MatrixTable has three methods to compute aggregate statistics: MatrixTable.aggregate_rows(), MatrixTable.aggregate_cols(), and MatrixTable.aggregate_entries().

These methods take an aggregated expression and evaluate it, returning a Python value.

An example of querying entries is to compute the global mean of field GQ:

>>> mt.aggregate_entries(hl.agg.mean(mt.GQ))  
67.73196915777027

It is possible to compute multiple values simultaneously by creating a tuple or struct. This is encouraged, because grouping two computations together is far more efficient: the dataset is traversed only once rather than twice.

>>> mt.aggregate_entries((hl.agg.stats(mt.DP), hl.agg.stats(mt.GQ)))  
(Struct(mean=41.83915800445897, stdev=41.93057654787303, min=0.0, max=450.0, n=34537, sum=1444998.9999999995),
Struct(mean=67.73196915777027, stdev=29.80840934057741, min=0.0, max=99.0, n=33720, sum=2283922.0000000135))
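
The benefit of grouping aggregations can be illustrated with a plain-Python sketch that computes two means in one loop over the entries. This is not Hail code and says nothing about how Hail executes the query; the data and the helper mean_dp_and_gq are hypothetical.

```python
# Plain-Python model (not Hail code); None stands in for a missing value.
entries = [
    {"DP": 10, "GQ": 99},
    {"DP": 30, "GQ": None},
    {"DP": None, "GQ": 45},
]

def mean_dp_and_gq(entries):
    """Compute both means in a single traversal, skipping missing values."""
    dp_sum = dp_n = gq_sum = gq_n = 0
    for entry in entries:
        if entry["DP"] is not None:
            dp_sum += entry["DP"]
            dp_n += 1
        if entry["GQ"] is not None:
            gq_sum += entry["GQ"]
            gq_n += 1
    mean_dp = dp_sum / dp_n if dp_n else None
    mean_gq = gq_sum / gq_n if gq_n else None
    return mean_dp, mean_gq

print(mean_dp_and_gq(entries))  # (20.0, 72.0)
```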

See the Aggregators page for the complete list of aggregator functions.

Group-By

Matrix tables can be aggregated along the row or column axis to produce a new matrix table.

First let’s add a random phenotype as a new column field case_status and then compute statistics about the entry field GQ for each grouping of case_status.

>>> mt_ann = mt.annotate_cols(case_status = hl.if_else(hl.rand_bool(0.5),
...                                                    "CASE",
...                                                    "CONTROL"))

Next we group the columns by case_status and aggregate:

>>> mt_grouped = (mt_ann.group_cols_by(mt_ann.case_status)
...                 .aggregate(gq_stats = hl.agg.stats(mt_ann.GQ)))
>>> print(mt_grouped.entry.dtype.pretty())
struct {
    gq_stats: struct {
        mean: float64,
        stdev: float64,
        min: float64,
        max: float64,
        n: int64,
        sum: float64
    }
}
>>> print(mt_grouped.col.dtype)
struct{case_status: str}
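
Conceptually, grouping columns replaces the column axis with one column per group, and each new entry aggregates the old entries of that row within the group. The plain-Python sketch below models this with a mean instead of the full stats struct. It is not Hail code; the data and the helper group_cols_and_mean are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Plain-Python model (not Hail code); None stands in for a missing entry.
case_status = ["CASE", "CONTROL", "CASE"]   # one label per column
gq = [                                      # one inner list per row
    [10, 50, 30],
    [20, 70, 40],
]

def group_cols_and_mean(labels, matrix):
    """Return the group names and a rows-by-groups matrix of per-group means."""
    groups = sorted(set(labels))
    result = []
    for row in matrix:
        by_group = defaultdict(list)
        for label, value in zip(labels, row):
            if value is not None:
                by_group[label].append(value)
        result.append([mean(by_group[g]) if by_group[g] else None for g in groups])
    return groups, result

groups, grouped = group_cols_and_mean(case_status, gq)
print(groups)   # ['CASE', 'CONTROL']
print(grouped)  # [[20, 50], [30, 70]]
```

The number of rows is unchanged, while the number of columns shrinks from three samples to two groups.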

Joins

Joins on two-dimensional data are significantly more complicated than joins in one dimension, and Hail does not yet support the full range of joins on both dimensions of a matrix table.

MatrixTable has two methods for concatenating rows or columns: MatrixTable.union_rows() and MatrixTable.union_cols().

MatrixTable.union_cols() joins matrix tables together by performing an inner join on rows while concatenating columns together (similar to paste in Unix). Likewise, MatrixTable.union_rows() performs an inner join on columns while concatenating rows together (similar to cat in Unix).
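
The column union described above can be sketched in plain Python. This is a toy model, not Hail code: the dictionary layout and the helper union_cols are assumptions for illustration. Only row keys present in both inputs survive, and the entries of each surviving row are concatenated.

```python
# Plain-Python model (not Hail code): each matrix holds its column keys and a
# mapping from row key to that row's entries, one entry per column.
left = {"cols": ["s1", "s2"], "rows": {"v1": [1, 2], "v2": [3, 4]}}
right = {"cols": ["s3"], "rows": {"v2": [5], "v3": [6]}}

def union_cols(left, right):
    """Inner join on row keys, concatenating columns and their entries."""
    shared = [key for key in left["rows"] if key in right["rows"]]
    return {
        "cols": left["cols"] + right["cols"],
        "rows": {key: left["rows"][key] + right["rows"][key] for key in shared},
    }

combined = union_cols(left, right)
print(combined["cols"])  # ['s1', 's2', 's3']
print(combined["rows"])  # {'v2': [3, 4, 5]}
```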

In addition, Hail provides support for joining data from multiple sources together if the keys of each source are compatible. Keys are compatible if they have the same types and, when there are multiple key fields, those fields appear in the same order.

If the keys are compatible, joins can then be performed using Python’s bracket notation []. This looks like right_table[left_table.key]. The argument inside the brackets is the key of the destination (left) table as a single value, or a tuple if there are multiple destination keys.

For example, we can join a matrix table and a table in order to annotate the rows of the matrix table with a field from the table. Let gnomad_data be a Table keyed by two row fields with type locus and array<str>, which matches the row keys of mt:

>>> mt_new = mt.annotate_rows(gnomad_ann = gnomad_data[mt.locus, mt.alleles])

If we only cared about adding one new row field such as AF from gnomad_data, we could do the following:

>>> mt_new = mt.annotate_rows(gnomad_af = gnomad_data[mt.locus, mt.alleles]['AF'])

To add all fields as top-level row fields, the following syntax unpacks the gnomad_data row as keyword arguments to MatrixTable.annotate_rows():

>>> mt_new = mt.annotate_rows(**gnomad_data[mt.locus, mt.alleles])
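
The bracket-style join behaves much like a keyed lookup. The plain-Python sketch below models a table keyed by (locus, alleles) as a dictionary. It is not Hail code: the AF values and the helper annotate_rows_from_table are hypothetical, and using None for rows whose key is absent from the table is an assumption of this model.

```python
# Plain-Python model (not Hail code) of annotating rows by key lookup.
gnomad_data = {                       # keyed by (locus, alleles); values are made up
    ("20:10019093", ("A", "G")): {"AF": 0.12},
    ("20:10026357", ("T", "C")): {"AF": 0.03},
}
mt_rows = [
    {"locus": "20:10019093", "alleles": ("A", "G")},
    {"locus": "20:10026348", "alleles": ("A", "G")},
]

def annotate_rows_from_table(rows, table, name):
    """Add field `name` to every row by looking up the row key in `table`."""
    return [
        # Assumption: a key that is absent from the table yields a missing value.
        {**row, name: table.get((row["locus"], row["alleles"]))}
        for row in rows
    ]

annotated = annotate_rows_from_table(mt_rows, gnomad_data, "gnomad_ann")
print(annotated[0]["gnomad_ann"])  # {'AF': 0.12}
print(annotated[1]["gnomad_ann"])  # None
```

No rows are added or removed in this model; the lookup only attaches a value to each existing row.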

Interacting with Matrix Tables Locally

Some useful methods to interact with matrix tables locally are MatrixTable.describe(), MatrixTable.head(), and MatrixTable.sample_rows(). describe prints out the schema for all row fields, column fields, entry fields, and global fields as well as the row keys and column keys. head returns a new matrix table with only the first N rows. sample_rows returns a new matrix table where the rows are randomly sampled with frequency p.

To get the dimensions of the matrix table, use MatrixTable.count_rows() and MatrixTable.count_cols().