--------------------
MatrixTable Overview
--------------------

A :class:`.MatrixTable` is a distributed two-dimensional extension of a
:class:`.Table`.

Unlike a table, which has two field groups (row fields and global fields), a
matrix table consists of four components:

1. a two-dimensional matrix of **entry fields** where each entry is indexed by
   row key(s) and column key(s)
2. a corresponding rows table that stores all of the **row fields** that are
   constant for every column in the dataset
3. a corresponding columns table that stores all of the **column fields** that
   are constant for every row in the dataset
4. a set of **global fields** that are constant for every entry in the dataset

There are different operations on the matrix for each field group.
For instance, :class:`.Table` has :meth:`.Table.select` and
:meth:`.Table.select_globals`, while :class:`.MatrixTable` has
:meth:`.MatrixTable.select_rows`, :meth:`.MatrixTable.select_cols`,
:meth:`.MatrixTable.select_entries`, and :meth:`.MatrixTable.select_globals`.

It is possible to represent matrix data by coordinate in a table, storing one
record per entry of the matrix. However, the :class:`.MatrixTable` represents
this data far more efficiently and exposes natural interfaces for computing on
it.

The :meth:`.MatrixTable.rows` and :meth:`.MatrixTable.cols` methods return the
row and column fields as separate tables. The :meth:`.MatrixTable.entries`
method returns the matrix as a table in coordinate form. Use this object with
caution, because this representation is costly to compute and is significantly
larger in memory.

Keys
====

Matrix tables have keys just as tables do. However, instead of one key, matrix
tables have two keys: a row key and a column key. Row fields are indexed by the
row key, column fields are indexed by the column key, and entry fields are
indexed by the row key and the column key. The key structs can be accessed with
:attr:`.MatrixTable.row_key` and :attr:`.MatrixTable.col_key`.
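
As a rough mental model of how the two keys index the different field groups,
the sketch below represents each component with a plain Python dictionary and
then flattens the matrix into coordinate form. The field names and values are
hypothetical, and the dictionaries are only an illustration, not how Hail
stores or distributes data.

.. code-block:: python

    # Toy model of a matrix table with 2 rows and 2 columns. All field names
    # and values are hypothetical; this is not Hail's storage format.
    global_fields = {"cohort": "demo"}

    # Row fields are indexed by the row key, here a (locus, alleles) pair.
    row_fields = {
        ("20:10019093", ("A", "G")): {"rsid": "rs_a"},
        ("20:10026348", ("A", "G")): {"rsid": "rs_b"},
    }

    # Column fields are indexed by the column key, here a sample ID.
    col_fields = {
        "sample_1": {"is_case": True},
        "sample_2": {"is_case": False},
    }

    # Entry fields are indexed by the row key and the column key together.
    entry_fields = {
        (row_key, col_key): {"DP": 10 * i + j}
        for i, row_key in enumerate(row_fields)
        for j, col_key in enumerate(col_fields)
    }

    # Coordinate form: one record per entry, with the row and column fields
    # copied into every record.
    coordinate_records = [
        {"row_key": rk, "col_key": ck, **row_fields[rk], **col_fields[ck], **entry}
        for (rk, ck), entry in entry_fields.items()
    ]

    print(len(coordinate_records))  # 4 records: 2 rows x 2 columns

Each row field value is repeated once per column in the coordinate form, and
each column field value once per row, which is one way to see why that
representation grows so much larger than the matrix table itself.
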
It is possible to change the keys with :meth:`.MatrixTable.key_rows_by` and
:meth:`.MatrixTable.key_cols_by`. Due to the data representation of a matrix
table, changing a row key is often an expensive operation.

Referencing Fields
==================

All fields (row, column, global, entry) are top-level and exposed as attributes
on the :class:`.MatrixTable` object. For example, if the matrix table `mt` had a
row field `locus`, this field could be referenced with either ``mt.locus`` or
``mt['locus']``. The former access pattern does not work with field names with
spaces or punctuation.

The result of referencing a field from a matrix table is an
:class:`.Expression` which knows its type, its source matrix table, and whether
it is a row field, column field, entry field, or global field. Hail uses this
context to know which operations are allowed for a given expression.

When evaluated in a Python interpreter, we can see ``mt.locus`` is a
:class:`.LocusExpression` with type ``locus<GRCh37>``.

    >>> mt # doctest: +SKIP_OUTPUT_CHECK
    <hail.matrixtable.MatrixTable at 0x...>

    >>> mt.locus # doctest: +SKIP_OUTPUT_CHECK
    <LocusExpression of type locus<GRCh37>>

Likewise, ``mt.DP`` is an :class:`.Int32Expression` with type ``int32`` and is
an entry field of ``mt``.

Hail expressions can also :meth:`.Expression.describe` themselves, providing
information about their source matrix table or table and which keys index the
expression, if any. For example, ``mt.DP.describe()`` tells us that ``mt.DP``
has type ``int32`` and is an entry field of ``mt``, since it is indexed by both
rows and columns:

    >>> mt.DP.describe() # doctest: +SKIP_OUTPUT_CHECK
    --------------------------------------------------------
    Type:
        int32
    --------------------------------------------------------
    Source:
        <hail.matrixtable.MatrixTable object at 0x...>
    Index:
        ['row', 'column']
    --------------------------------------------------------

Import
======

Text files may be imported with :func:`.import_matrix_table`.
Additionally, Hail provides functions to import genetic datasets as matrix
tables from a variety of file formats: :func:`.import_vcf`,
:func:`.import_plink`, :func:`.import_bgen`, and :func:`.import_gen`.

    >>> mt = hl.import_vcf('data/sample.vcf.bgz')

The :meth:`.MatrixTable.describe` method prints all fields in the matrix table
and their types, as well as the keys.

    >>> mt.describe() # doctest: +SKIP_OUTPUT_CHECK
    ----------------------------------------
    Global fields:
        None
    ----------------------------------------
    Column fields:
        's': str
    ----------------------------------------
    Row fields:
        'locus': locus<GRCh37>
        'alleles': array<str>
        'rsid': str
        'qual': float64
        'filters': set<str>
        'info': struct {
            NEGATIVE_TRAIN_SITE: bool,
            AC: array<int32>,
            ...
            DS: bool
        }
    ----------------------------------------
    Entry fields:
        'GT': call
        'AD': array<int32>
        'DP': int32
        'GQ': int32
        'PL': array<int32>
    ----------------------------------------
    Column key:
        's': str
    Row key:
        'locus': locus<GRCh37>
        'alleles': array<str>
    ----------------------------------------

Common Operations
=================

As with tables, Hail provides a number of methods for manipulating data in a
matrix table.

**Filter**

:class:`.MatrixTable` has three methods to filter based on expressions:

- :meth:`.MatrixTable.filter_rows`
- :meth:`.MatrixTable.filter_cols`
- :meth:`.MatrixTable.filter_entries`

Filter methods take a :class:`.BooleanExpression` argument. These expressions
are generated by applying computations to the fields of the matrix table:

    >>> filt_mt = mt.filter_rows(hl.len(mt.alleles) == 2)

    >>> filt_mt = mt.filter_cols(hl.agg.mean(mt.GQ) < 20)

    >>> filt_mt = mt.filter_entries(mt.DP < 5)

These expressions can compute arbitrarily over the data: the
:meth:`.MatrixTable.filter_cols` example above aggregates entries per column of
the matrix table to compute the mean of the `GQ` field, and keeps only the
columns where the result is smaller than 20. All other columns are removed.
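
A self-contained sketch of the filter methods follows. It assumes a working
Hail installation and uses simulated data from
:func:`.balding_nichols_model` rather than the VCF above, so the `sample_idx`
column field and the dataset dimensions are properties of that simulation, not
of ``mt`` as imported earlier. The ``keep`` argument is included to make
explicit which side of the predicate is retained.

.. code-block:: python

    import hail as hl

    # Simulated dataset: 10 columns (samples) and 20 rows (variants).
    mt = hl.balding_nichols_model(n_populations=3, n_samples=10, n_variants=20)

    # By default, columns for which the expression is True are kept.
    kept = mt.filter_cols(mt.sample_idx < 4)

    # keep=False inverts the filter: matching columns are removed instead.
    others = mt.filter_cols(mt.sample_idx < 4, keep=False)

    n_kept = kept.count_cols()      # 4 columns with sample_idx in 0..3
    n_others = others.count_cols()  # the remaining 6 columns

    # Predicates may also aggregate over entries, here once per column.
    n_alt = hl.agg.sum(mt.GT.n_alt_alleles())
    many_alt = mt.filter_cols(n_alt > 10)
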

**Annotate**

:class:`.MatrixTable` has four methods to add new fields or update existing
fields:

- :meth:`.MatrixTable.annotate_globals`
- :meth:`.MatrixTable.annotate_rows`
- :meth:`.MatrixTable.annotate_cols`
- :meth:`.MatrixTable.annotate_entries`

Annotate methods take keyword arguments where the key is the name of the new
field to add and the value is an expression specifying what should be added.

The simplest example is adding a new global field `foo` that just contains the
constant 5.

    >>> mt_new = mt.annotate_globals(foo = 5)
    >>> print(mt_new.globals.dtype.pretty())
    struct {
        foo: int32
    }

Another example is adding a new row field `call_rate` which computes the
fraction of non-missing `GT` entries per row:

    >>> mt_new = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))

Annotate methods are also useful for updating values. For example, to update
the `GT` entry field to be missing if `GQ` is less than 20, we can do the
following:

    >>> mt_new = mt.annotate_entries(GT = hl.or_missing(mt.GQ >= 20, mt.GT))

**Select**

Select is used to create a new schema for a dimension of the matrix table. Key
fields are always preserved even when not selected. For example, using the
matrix table schema from the imported VCF file (shown above), to create a hard
calls dataset where each entry only contains the `GT` field we can do the
following:

    >>> mt_new = mt.select_entries('GT')
    >>> print(mt_new.entry.dtype.pretty())
    struct {
        GT: call
    }

:class:`.MatrixTable` has four select methods that select and create new fields:

- :meth:`.MatrixTable.select_globals`
- :meth:`.MatrixTable.select_rows`
- :meth:`.MatrixTable.select_cols`
- :meth:`.MatrixTable.select_entries`

Each method accepts strings referring to top-level fields, attribute references
(useful for accessing nested fields), and keyword arguments ``KEY=VALUE`` to
compute new fields.
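
The sketch below combines a string argument with a keyword argument in a single
:meth:`.MatrixTable.select_rows` call. It assumes a working Hail installation
and uses simulated data from :func:`.balding_nichols_model`, so the row fields
`ancestral_af` and `af` belong to that simulation rather than to the VCF
imported above; `mean_af` is a name chosen here for illustration.

.. code-block:: python

    import hail as hl

    # Simulated dataset whose row fields include `ancestral_af` and `af`.
    mt = hl.balding_nichols_model(n_populations=3, n_samples=10, n_variants=20)

    # A string selects an existing top-level row field; a keyword argument
    # computes a new one. `af` is an array with one value per population.
    mt_sel = mt.select_rows('ancestral_af', mean_af=hl.mean(mt.af))

    # The row key fields are retained alongside the two selected fields,
    # while the unselected `af` field is gone.
    row_field_names = list(mt_sel.row)
    print(row_field_names)
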
The Python unpack operator ``**`` can be used to specify that all fields of a
Struct should become top-level fields. However, be aware that all top-level
field names must be unique. In the following example, `**mt['info']` would fail
if `DP` already exists as an entry field.

    >>> mt_new = mt.select_rows(**mt['info']) # doctest: +SKIP

The example below creates two new row fields. Keys are always preserved, so the
row keys ``locus`` and ``alleles`` will also be present in the new matrix
table. ``AC = mt.info.AC`` turns the subfield ``AC`` into a top-level field.

    >>> mt_new = mt.select_rows(AC = mt.info.AC,
    ...                         n_filters = hl.len(mt['filters']))

The order of the fields entered as arguments will be maintained in the new
matrix table.

**Drop**

The complement of the `select` methods, :meth:`.MatrixTable.drop` can remove
any top-level field. An example of removing the `GQ` entry field is:

    >>> mt_new = mt.drop('GQ')

**Explode**

Explode operations unpack a row or column field that is of type array or set.

- :meth:`.MatrixTable.explode_rows`
- :meth:`.MatrixTable.explode_cols`

One use case of explode is to duplicate rows:

    >>> mt_new = mt.annotate_rows(replicate_num = [1, 2])
    >>> mt_new = mt_new.explode_rows(mt_new['replicate_num'])
    >>> mt.count_rows()
    346
    >>> mt_new.count_rows()
    692

    >>> mt_new.replicate_num.show() # doctest: +SKIP_OUTPUT_CHECK
    +---------------+------------+---------------+
    | locus         | alleles    | replicate_num |
    +---------------+------------+---------------+
    | locus<GRCh37> | array<str> |         int32 |
    +---------------+------------+---------------+
    | 20:10019093   | ["A","G"]  |             1 |
    | 20:10019093   | ["A","G"]  |             2 |
    | 20:10026348   | ["A","G"]  |             1 |
    | 20:10026348   | ["A","G"]  |             2 |
    | 20:10026357   | ["T","C"]  |             1 |
    | 20:10026357   | ["T","C"]  |             2 |
    | 20:10030188   | ["T","A"]  |             1 |
    | 20:10030188   | ["T","A"]  |             2 |
    | 20:10030452   | ["G","A"]  |             1 |
    | 20:10030452   | ["G","A"]  |             2 |
    +---------------+------------+---------------+
    showing top 10 rows

Aggregation
===========

:class:`.MatrixTable` has three methods to compute aggregate statistics.

- :meth:`.MatrixTable.aggregate_rows`
- :meth:`.MatrixTable.aggregate_cols`
- :meth:`.MatrixTable.aggregate_entries`

These methods take an aggregated expression and evaluate it, returning a Python
value.

An example of querying entries is to compute the global mean of field `GQ`:

    >>> mt.aggregate_entries(hl.agg.mean(mt.GQ)) # doctest: +SKIP_OUTPUT_CHECK
    67.73196915777027

It is possible to compute multiple values simultaneously by creating a tuple or
struct. This is encouraged, because grouping two computations together
traverses the dataset only once rather than twice, which is far more efficient.

    >>> mt.aggregate_entries((hl.agg.stats(mt.DP), hl.agg.stats(mt.GQ))) # doctest: +SKIP_OUTPUT_CHECK
    (Struct(mean=41.83915800445897, stdev=41.93057654787303, min=0.0, max=450.0, n=34537, sum=1444998.9999999995),
     Struct(mean=67.73196915777027, stdev=29.80840934057741, min=0.0, max=99.0, n=33720, sum=2283922.0000000135))

See the :ref:`sec-aggregators` page for the complete list of aggregator
functions.

Group-By
========

Matrix tables can be aggregated along the row or column axis to produce a new
matrix table.

- :meth:`.MatrixTable.group_rows_by`
- :meth:`.MatrixTable.group_cols_by`

First let's add a random phenotype as a new column field `case_status` and then
compute statistics about the entry field `GQ` for each grouping of
`case_status`.

    >>> mt_ann = mt.annotate_cols(case_status = hl.if_else(hl.rand_bool(0.5),
    ...                                                    "CASE",
    ...                                                    "CONTROL"))

Next we group the columns by `case_status` and aggregate:

    >>> mt_grouped = (mt_ann.group_cols_by(mt_ann.case_status)
    ...                 .aggregate(gq_stats = hl.agg.stats(mt_ann.GQ)))
    >>> print(mt_grouped.entry.dtype.pretty())
    struct {
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64,
            n: int64,
            sum: float64
        }
    }
    >>> print(mt_grouped.col.dtype)
    struct{case_status: str}

Joins
=====

Joins on two-dimensional data are significantly more complicated than joins in
one dimension, and Hail does not yet support the full range of joins on both
dimensions of a matrix table.

:class:`.MatrixTable` has methods for concatenating rows or columns:

- :meth:`.MatrixTable.union_cols`
- :meth:`.MatrixTable.union_rows`

:meth:`.MatrixTable.union_cols` joins matrix tables together by performing an
inner join on rows while concatenating columns together (similar to `paste` in
Unix). Likewise, :meth:`.MatrixTable.union_rows` performs an inner join on
columns while concatenating rows together (similar to `cat` in Unix).

In addition, Hail provides support for joining data from multiple sources
together if the keys of each source are compatible.
Keys are compatible if they are the same type, and share the same ordering in
the case where tables have multiple keys. If the keys are compatible, joins can
then be performed using Python's bracket notation ``[]``. This looks like
``right_table[left_table.key]``. The argument inside the brackets is the key of
the destination (left) table as a single value, or a tuple if there are
multiple destination keys.

For example, we can join a matrix table and a table in order to annotate the
rows of the matrix table with a field from the table. Let `gnomad_data` be a
:class:`.Table` keyed by two row fields with type ``locus<GRCh37>`` and
``array<str>``, which matches the row keys of `mt`:

    >>> mt_new = mt.annotate_rows(gnomad_ann = gnomad_data[mt.locus, mt.alleles])

If we only cared about adding one new row field such as `AF` from
`gnomad_data`, we could do the following:

    >>> mt_new = mt.annotate_rows(gnomad_af = gnomad_data[mt.locus, mt.alleles]['AF'])

To add all fields as top-level row fields, the following syntax unpacks the
`gnomad_data` row as keyword arguments to :meth:`.MatrixTable.annotate_rows`:

    >>> mt_new = mt.annotate_rows(**gnomad_data[mt.locus, mt.alleles])

Interacting with Matrix Tables Locally
======================================

Some useful methods to interact with matrix tables locally are
:meth:`.MatrixTable.describe`, :meth:`.MatrixTable.head`, and
:meth:`.MatrixTable.sample_rows`. `describe` prints out the schema for all row
fields, column fields, entry fields, and global fields as well as the row keys
and column keys. `head` returns a new matrix table with only the first N rows.
`sample_rows` returns a new matrix table where the rows are randomly sampled
with frequency `p`.

To get the dimensions of the matrix table, use :meth:`.MatrixTable.count_rows`
and :meth:`.MatrixTable.count_cols`.
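
A short sketch of these methods used together is given below. It assumes a
working Hail installation and uses simulated data from
:func:`.balding_nichols_model` so that it runs without any input files; the
dimensions are therefore those of the simulation, not of the VCF used earlier.

.. code-block:: python

    import hail as hl

    # Simulated dataset: 100 rows (variants) and 10 columns (samples).
    mt = hl.balding_nichols_model(n_populations=3, n_samples=10, n_variants=100)

    # Print the schema of every field group along with the row and column keys.
    mt.describe()

    # Dimensions of the matrix table.
    n_rows = mt.count_rows()
    n_cols = mt.count_cols()

    # New matrix table with only the first 5 rows; all columns are retained.
    first_rows = mt.head(5)

    # New matrix table in which each row is sampled with frequency 0.1.
    # The resulting row count varies from run to run.
    sampled = mt.sample_rows(0.1)

Working on `first_rows` or `sampled` instead of the full matrix table can make
exploratory steps cheaper while a pipeline is still being developed.
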