MatrixTable Overview
A MatrixTable is a distributed two-dimensional extension of a Table.
Unlike a table, which has two field groups (row fields and global fields), a matrix table consists of four components:
a two-dimensional matrix of entry fields where each entry is indexed by row key(s) and column key(s)
a corresponding rows table that stores all of the row fields that are constant for every column in the dataset
a corresponding columns table that stores all of the column fields that are constant for every row in the dataset
a set of global fields that are constant for every entry in the dataset
There are different operations on the matrix for each field group.
For instance, Table has Table.select() and Table.select_globals(), while
MatrixTable has MatrixTable.select_rows(), MatrixTable.select_cols(),
MatrixTable.select_entries(), and MatrixTable.select_globals().
It is possible to represent matrix data by coordinate in a table, storing one
record per entry of the matrix. However, the MatrixTable represents this data
far more efficiently and exposes natural interfaces for computing on it.
The MatrixTable.rows() and MatrixTable.cols() methods return the row and column
fields as separate tables. The MatrixTable.entries() method returns the matrix
as a table in coordinate form. Use this object with caution, because this
representation is costly to compute and is significantly larger in memory.
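To make the size difference concrete, here is a minimal plain-Python sketch (not Hail code; the toy data and all variable names are invented for illustration) that stores the same 2 x 3 matrix in both layouts. In coordinate form every record repeats its row and column fields, while the matrix layout stores each row field and column field once.

```python
# Toy model of a matrix table using plain dicts (these are not Hail objects).
rows = {"1:100": {"rsid": "rs1"}, "1:200": {"rsid": "rs2"}}          # row key -> row fields
cols = {"s1": {"pop": "A"}, "s2": {"pop": "B"}, "s3": {"pop": "A"}}  # column key -> column fields
entries = {(r, c): {"DP": 10 * i + j}
           for i, r in enumerate(rows) for j, c in enumerate(cols)}

# Coordinate form: one record per entry, with row and column fields copied in.
coordinate = [
    {"row_key": r, "col_key": c, **rows[r], **cols[c], **entries[(r, c)]}
    for (r, c) in entries
]

# Count the stored (non-key) field values in each layout.
matrix_values = (sum(len(f) for f in rows.values())
                 + sum(len(f) for f in cols.values())
                 + sum(len(f) for f in entries.values()))
coordinate_values = sum(len(rec) - 2 for rec in coordinate)  # minus the two key slots

print(matrix_values, coordinate_values)
```

The gap widens as the matrix grows, since the coordinate form copies every row field once per column and every column field once per row.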
Keys
Matrix tables have keys just as tables do. However, instead of one key, matrix
tables have two keys: a row key and a column key. Row fields are indexed by the
row key, column fields are indexed by the column key, and entry fields are
indexed by the row key and the column key. The key structs can be accessed with
MatrixTable.row_key and MatrixTable.col_key. It is possible to change the keys
with MatrixTable.key_rows_by() and MatrixTable.key_cols_by().
Due to the data representation of a matrix table, changing a row key is often an expensive operation.
Referencing Fields
All fields (row, column, global, entry) are top-level and exposed as attributes
on the MatrixTable object. For example, if the matrix table mt had a row field
locus, this field could be referenced with either mt.locus or mt['locus']. The
former access pattern does not work with field names with spaces or punctuation.
The result of referencing a field from a matrix table is an Expression
which knows its type, its source matrix table, and whether it is a row field,
column field, entry field, or global field. Hail uses this context to know which
operations are allowed for a given expression.
When evaluated in a Python interpreter, we can see mt.locus is a
LocusExpression with type locus<GRCh37>.
>>> mt
<hail.matrixtable.MatrixTable at 0x1107e54a8>
>>> mt.locus
<LocusExpression of type locus<GRCh37>>
Likewise, mt.DP is an Int32Expression with type int32 and is an entry field of mt.
Hail expressions can also describe themselves with Expression.describe(), which
reports their source matrix table or table and which keys index the expression,
if any. For example, mt.DP.describe() tells us that mt.DP has type int32 and is
an entry field of mt, since it is indexed by both rows and columns:
>>> mt.DP.describe()
--------------------------------------------------------
Type:
    int32
--------------------------------------------------------
Source:
    <class 'hail.matrixtable.MatrixTable'>
Index:
    ['row', 'column']
--------------------------------------------------------
Import
Text files may be imported with import_matrix_table(). Additionally, Hail
provides functions to import genetic datasets as matrix tables from a variety
of file formats: import_vcf(), import_plink(), import_bgen(), and import_gen().
>>> mt = hl.import_vcf('data/sample.vcf.bgz')
The MatrixTable.describe() method prints all fields in the matrix table and
their types, as well as the keys.
>>> mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        NEGATIVE_TRAIN_SITE: bool,
        AC: array<int32>,
        ...
        DS: bool
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key:
    's': str
Row key:
    'locus': locus<GRCh37>
    'alleles': array<str>
----------------------------------------
Common Operations
As with tables, Hail provides a number of methods for manipulating data in a matrix table.
Filter
MatrixTable has three methods to filter based on expressions:
MatrixTable.filter_rows(), MatrixTable.filter_cols(), and
MatrixTable.filter_entries().
Filter methods take a BooleanExpression argument. These expressions are
generated by applying computations to the fields of the matrix table:
>>> filt_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> filt_mt = mt.filter_cols(hl.agg.mean(mt.GQ) < 20)
>>> filt_mt = mt.filter_entries(mt.DP < 5)
These expressions can compute arbitrarily over the data: the MatrixTable.filter_cols()
example above aggregates entries per column of the matrix table to compute the
mean of the GQ field, and keeps only the columns where the result is smaller than 20.
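The per-column aggregation inside a column filter can be modeled in plain Python. This is only a sketch of the semantics, not Hail code: the sample names and GQ values are invented, and the predicate selects the columns that are kept.

```python
from statistics import mean

# Toy entries: gq[row][col]; columns are samples (hypothetical names).
samples = ["s1", "s2", "s3"]
gq = [
    [10, 50, 15],
    [20, 60, 25],
    [12, 70, 5],
]

# Aggregate the entries of each column, as hl.agg.mean(mt.GQ) does in filter_cols.
mean_gq = {s: mean(row[j] for row in gq) for j, s in enumerate(samples)}

# Keep only the columns for which the predicate (mean GQ < 20) is True.
kept = [s for s in samples if mean_gq[s] < 20]
print(mean_gq, kept)
```

Here s2 has a mean GQ of 60, so it is the only column dropped.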
Annotate
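MatrixTable has four methods to add new fields or update existing fields:
MatrixTable.annotate_globals(), MatrixTable.annotate_rows(),
MatrixTable.annotate_cols(), and MatrixTable.annotate_entries().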
Annotate methods take keyword arguments where the key is the name of the new field to add and the value is an expression specifying what should be added.
The simplest example is adding a new global field foo that just contains the constant 5.
>>> mt_new = mt.annotate_globals(foo = 5)
>>> print(mt_new.globals.dtype.pretty())
struct {
    foo: int32
}
Another example is adding a new row field call_rate, which computes the fraction of entries with a non-missing GT per row:
>>> mt_new = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
Annotate methods are also useful for updating values. For example, to update the GT entry field to be missing if GQ is less than 20, we can do the following:
>>> mt_new = mt.annotate_entries(GT = hl.or_missing(mt.GQ >= 20, mt.GT))
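As a rough illustration of what these two annotations compute, the following plain-Python sketch models missing values as None. It is not Hail code, the genotype and GQ values are invented, and it does not model a missing GQ.

```python
# None stands in for a missing value; all data below is made up.
gt = [
    ["0/1", None, "0/0", "1/1"],
    [None, None, "0/1", "0/0"],
]
gq = [
    [30, 5, 25, 10],
    [40, 15, 20, 99],
]

# Model of hl.agg.fraction(hl.is_defined(mt.GT)), evaluated once per row.
call_rate = [sum(g is not None for g in row) / len(row) for row in gt]

# Model of hl.or_missing(mt.GQ >= 20, mt.GT): keep GT when the condition
# holds, otherwise set the entry to missing.
gt_filtered = [
    [g if q >= 20 else None for g, q in zip(g_row, q_row)]
    for g_row, q_row in zip(gt, gq)
]
print(call_rate)
print(gt_filtered)
```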
Select
Select is used to create a new schema for a dimension of the matrix table. Key fields are always preserved even when not selected. For example, following the matrix table schemas from importing a VCF file (shown above), to create a hard calls dataset where each entry only contains the GT field we can do the following:
>>> mt_new = mt.select_entries('GT')
>>> print(mt_new.entry.dtype.pretty())
struct {
    GT: call
}
MatrixTable has four select methods that select and create new fields:
MatrixTable.select_globals(), MatrixTable.select_rows(),
MatrixTable.select_cols(), and MatrixTable.select_entries().
Each method can take strings referring to top-level fields, attribute
references (useful for accessing nested fields), and keyword arguments
KEY=VALUE to compute new fields. The Python unpack operator ** can be used to
specify that all fields of a Struct should become top-level fields. However, be
aware that all top-level field names must be unique. In the following example,
**mt['info'] would fail if DP already exists as an entry field.
>>> mt_new = mt.select_rows(**mt['info'])
The example below adds two new row fields. Keys are always preserved, so the
row keys locus and alleles will also be present in the new matrix table.
AC = mt.info.AC turns the subfield AC into a top-level field.
>>> mt_new = mt.select_rows(AC = mt.info.AC,
...                         n_filters = hl.len(mt['filters']))
The order of the fields entered as arguments will be maintained in the new matrix table.
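Key preservation and argument ordering can be pictured with a small plain-Python model. This is a sketch only: the select helper and the sample row are hypothetical and are not part of Hail.

```python
# One toy row, with the row key fields listed separately (values are made up).
row = {"locus": "20:10019093", "alleles": ["A", "G"], "qual": 50.0,
       "filters": {"LowQual"}, "info": {"AC": [2], "DS": False}}
row_key = ("locus", "alleles")

def select(record, key, **new_fields):
    """Keep the key fields, then add the selected fields in argument order."""
    return {**{k: record[k] for k in key}, **new_fields}

new_row = select(row, row_key,
                 AC=row["info"]["AC"],
                 n_filters=len(row["filters"]))
print(list(new_row))
```

Fields that are neither keys nor selected, such as qual and info here, do not appear in the result.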
Drop
The complement of the select methods, MatrixTable.drop() can remove any
top-level field. An example of removing the GQ entry field is:
>>> mt_new = mt.drop('GQ')
Explode
Explode operations unpack a row or column field that is of type array or set, producing one row or column per element.
One use case of explode is to duplicate rows:
>>> mt_new = mt.annotate_rows(replicate_num = [1, 2])
>>> mt_new = mt_new.explode_rows(mt_new['replicate_num'])
>>> mt.count_rows()
346
>>> mt_new.count_rows()
692
>>> mt_new.replicate_num.show()
+---------------+------------+---------------+
| locus | alleles | replicate_num |
+---------------+------------+---------------+
| locus<GRCh37> | array<str> | int32 |
+---------------+------------+---------------+
| 20:10019093 | ["A","G"] | 1 |
| 20:10019093 | ["A","G"] | 2 |
| 20:10026348 | ["A","G"] | 1 |
| 20:10026348 | ["A","G"] | 2 |
| 20:10026357 | ["T","C"] | 1 |
| 20:10026357 | ["T","C"] | 2 |
| 20:10030188 | ["T","A"] | 1 |
| 20:10030188 | ["T","A"] | 2 |
| 20:10030452 | ["G","A"] | 1 |
| 20:10030452 | ["G","A"] | 2 |
+---------------+------------+---------------+
showing top 10 rows
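The row duplication above follows from the shape of the operation: each row is emitted once per element of the exploded field. A minimal plain-Python sketch (not Hail code; the explode helper and the data are illustrative only):

```python
# Toy rows table: each dict is one row (values are illustrative).
rows = [
    {"locus": "20:10019093", "replicate_num": [1, 2]},
    {"locus": "20:10026348", "replicate_num": [1, 2]},
    {"locus": "20:10026357", "replicate_num": []},
]

def explode(records, field):
    """Emit one copy of each record per element of record[field]."""
    return [{**rec, field: element} for rec in records for element in rec[field]]

exploded = explode(rows, "replicate_num")
print(len(rows), len(exploded))
```

In this sketch a row whose collection is empty contributes no output rows.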
Aggregation
MatrixTable has three methods to compute aggregate statistics:
MatrixTable.aggregate_rows(), MatrixTable.aggregate_cols(), and
MatrixTable.aggregate_entries().
These methods take an aggregated expression and evaluate it, returning a Python value.
An example of querying entries is to compute the global mean of field GQ:
>>> mt.aggregate_entries(hl.agg.mean(mt.GQ))
67.73196915777027
It is possible to compute multiple values simultaneously by creating a tuple or struct. This is encouraged, because grouping two computations together traverses the dataset only once rather than twice, which is far more efficient.
>>> mt.aggregate_entries((hl.agg.stats(mt.DP), hl.agg.stats(mt.GQ)))
(Struct(mean=41.83915800445897, stdev=41.93057654787303, min=0.0, max=450.0, n=34537, sum=1444998.9999999995),
Struct(mean=67.73196915777027, stdev=29.80840934057741, min=0.0, max=99.0, n=33720, sum=2283922.0000000135))
See the Aggregators page for the complete list of aggregator functions.
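The benefit of combining aggregations can be sketched in plain Python: a single loop over the data updates several accumulators at once. This is an illustration of the idea rather than Hail's implementation, and the values are invented.

```python
# Toy entry values; None marks a missing entry.
dp = [12, 30, None, 7]
gq = [99, 45, 20, None]

# One traversal that updates both accumulators together.
dp_sum = dp_n = gq_sum = gq_n = 0
for d, g in zip(dp, gq):
    if d is not None:
        dp_sum, dp_n = dp_sum + d, dp_n + 1
    if g is not None:
        gq_sum, gq_n = gq_sum + g, gq_n + 1

# Both results come out of the same pass, mirroring a tuple of aggregators.
means = (dp_sum / dp_n, gq_sum / gq_n)
print(means)
```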
Group-By
Matrix tables can be aggregated along the row or column axis to produce a new matrix table.
First let’s add a random phenotype as a new column field case_status and then compute statistics about the entry field GQ for each grouping of case_status.
>>> mt_ann = mt.annotate_cols(case_status = hl.if_else(hl.rand_bool(0.5),
...                                                     "CASE",
...                                                     "CONTROL"))
Next we group the columns by case_status and aggregate:
>>> mt_grouped = (mt_ann.group_cols_by(mt_ann.case_status)
... .aggregate(gq_stats = hl.agg.stats(mt_ann.GQ)))
>>> print(mt_grouped.entry.dtype.pretty())
struct {
    gq_stats: struct {
        mean: float64,
        stdev: float64,
        min: float64,
        max: float64,
        n: int64,
        sum: float64
    }
}
>>> print(mt_grouped.col.dtype)
struct{case_status: str}
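Grouping columns collapses the column axis: the result has one column per distinct group value and one aggregated entry per row and group. A plain-Python sketch of that shape change (not Hail code; the samples, statuses, and GQ values are made up):

```python
from collections import defaultdict
from statistics import mean

# Toy data: column key -> case_status, and gq[row][col].
samples = ["s1", "s2", "s3", "s4"]
case_status = {"s1": "CASE", "s2": "CONTROL", "s3": "CASE", "s4": "CONTROL"}
gq = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
]

# Group column indices by the new column key.
groups = defaultdict(list)
for j, s in enumerate(samples):
    groups[case_status[s]].append(j)

# One aggregated entry per (row, group): four columns become two.
grouped = [
    {status: mean(row[j] for j in idx) for status, idx in groups.items()}
    for row in gq
]
print(grouped)
```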
Joins
Joins on two-dimensional data are significantly more complicated than joins in one dimension, and Hail does not yet support the full range of joins on both dimensions of a matrix table.
MatrixTable has methods for concatenating rows or columns:
MatrixTable.union_rows() and MatrixTable.union_cols().
MatrixTable.union_cols() joins matrix tables together by performing an inner
join on rows while concatenating columns together (similar to paste in Unix).
Likewise, MatrixTable.union_rows() performs an inner join on columns while
concatenating rows together (similar to cat in Unix).
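The "inner join on rows, concatenate columns" behavior can be modeled with nested dicts. The helper below is a hypothetical plain-Python stand-in, not Hail's implementation, and the keys and entry values are invented.

```python
# Two toy matrix tables as {row_key: {col_key: entry}}.
left = {"1:100": {"s1": 0, "s2": 1}, "1:200": {"s1": 2, "s2": 0}}
right = {"1:200": {"s3": 1}, "1:300": {"s3": 2}}

def union_cols_model(a, b):
    """Inner join on row keys, concatenating the columns of matching rows."""
    return {r: {**a[r], **b[r]} for r in a if r in b}

combined = union_cols_model(left, right)
print(combined)
```

Only the row key present in both inputs survives, and that row gains the columns of both.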
In addition, Hail provides support for joining data from multiple sources together if the keys of each source are compatible. Keys are compatible if they are the same type, and share the same ordering in the case where tables have multiple keys.
If the keys are compatible, joins can then be performed using Python’s bracket
notation []. This looks like right_table[left_table.key]. The argument inside
the brackets is the key of the destination (left) table as a single value, or a
tuple if there are multiple destination keys.
For example, we can join a matrix table and a table in order to annotate the
rows of the matrix table with a field from the table. Let gnomad_data be a
Table keyed by two row fields with type locus and array<str>, which matches the
row keys of mt:
>>> mt_new = mt.annotate_rows(gnomad_ann = gnomad_data[mt.locus, mt.alleles])
If we only cared about adding one new row field such as AF from gnomad_data, we could do the following:
>>> mt_new = mt.annotate_rows(gnomad_af = gnomad_data[mt.locus, mt.alleles]['AF'])
To add all fields as top-level row fields, the following syntax unpacks the gnomad_data
row as keyword arguments to MatrixTable.annotate_rows():
>>> mt_new = mt.annotate_rows(**gnomad_data[mt.locus, mt.alleles])
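Conceptually, the bracket join behaves like a keyed lookup from each row of the left dataset into the right table. The plain-Python sketch below models this with a dict keyed by (locus, alleles); it is not Hail code, the data is invented, and it assumes that a key with no match yields a missing value (None here).

```python
# Toy "right" table keyed by (locus, alleles).
gnomad_like = {
    ("20:10019093", ("A", "G")): {"AF": 0.12},
    ("20:10026357", ("T", "C")): {"AF": 0.40},
}

# Toy row keys of the "left" matrix table.
row_keys = [
    ("20:10019093", ("A", "G")),
    ("20:10026348", ("A", "G")),
    ("20:10026357", ("T", "C")),
]

# Model of right_table[left.key]: look up each left key in the right table.
annotated = [
    {"locus": locus, "alleles": alleles, "ann": gnomad_like.get((locus, alleles))}
    for locus, alleles in row_keys
]

# Model of selecting a single field from the joined struct.
afs = [rec["ann"]["AF"] if rec["ann"] is not None else None for rec in annotated]
print(afs)
```

Every left row is retained; the lookup only decides whether its annotation is present.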
Interacting with Matrix Tables Locally
Some useful methods to interact with matrix tables locally are
MatrixTable.describe(), MatrixTable.head(), and MatrixTable.sample_rows().
describe prints out the schema for all row fields, column fields, entry fields,
and global fields as well as the row keys and column keys. head returns a new
matrix table with only the first N rows. sample_rows returns a new matrix table
in which each row is kept at random with probability p.
To get the dimensions of the matrix table, use MatrixTable.count_rows() and
MatrixTable.count_cols().