MatrixTable

class hail.MatrixTable(mir)[source]

Hail’s distributed implementation of a structured matrix.

Use read_matrix_table() to read a matrix table that was written with MatrixTable.write().

Examples

Add annotations:

>>> dataset = dataset.annotate_globals(pli={'SCN1A': 0.999, 'SONIC': 0.014},
...                                    populations = ['AFR', 'EAS', 'EUR', 'SAS', 'AMR', 'HIS'])
>>> dataset = dataset.annotate_cols(pop = dataset.populations[hl.int(hl.rand_unif(0, 6))],
...                                 sample_gq = agg.mean(dataset.GQ),
...                                 sample_dp = agg.mean(dataset.DP))
>>> dataset = dataset.annotate_rows(variant_gq = agg.mean(dataset.GQ),
...                                 variant_dp = agg.mean(dataset.DP),
...                                 sas_hets = agg.count_where(dataset.GT.is_het()))
>>> dataset = dataset.annotate_entries(gq_by_dp = dataset.GQ / dataset.DP)

Filter:

>>> dataset = dataset.filter_cols(dataset.pop != 'EUR')
>>> dataset = dataset.filter_rows((dataset.variant_gq > 10) & (dataset.variant_dp > 5))
>>> dataset = dataset.filter_entries(dataset.gq_by_dp > 1)

Query:

>>> col_stats = dataset.aggregate_cols(hl.struct(pop_counts=agg.counter(dataset.pop),
...                                              high_quality=agg.fraction((dataset.sample_gq > 10) & (dataset.sample_dp > 5))))
>>> print(col_stats.pop_counts)
>>> print(col_stats.high_quality)
>>> het_dist = dataset.aggregate_rows(agg.stats(dataset.sas_hets))
>>> print(het_dist)
>>> entry_stats = dataset.aggregate_entries(hl.struct(call_rate=agg.fraction(hl.is_defined(dataset.GT)),
...                                                   global_gq_mean=agg.mean(dataset.GQ)))
>>> print(entry_stats.call_rate)
>>> print(entry_stats.global_gq_mean)

Attributes

col Returns a struct expression of all column-indexed fields, including keys.
col_key Column key struct.
col_value Returns a struct expression including all non-key column-indexed fields.
entry Returns a struct expression including all row-and-column-indexed fields.
globals Returns a struct expression including all global fields.
row Returns a struct expression of all row-indexed fields, including keys.
row_key Row key struct.
row_value Returns a struct expression including all non-key row-indexed fields.

Methods

__init__ Initialize self.
add_col_index Add the integer index of each column as a new column field.
add_row_index Add the integer index of each row as a new row field.
aggregate_cols Aggregate over columns to a local value.
aggregate_entries Aggregate over entries to a local value.
aggregate_rows Aggregate over rows to a local value.
annotate_cols Create new column-indexed fields by name.
annotate_entries Create new row-and-column-indexed fields by name.
annotate_globals Create new global fields by name.
annotate_rows Create new row-indexed fields by name.
cache Persist the dataset in memory.
choose_cols Choose a new set of columns from a list of old column indices.
collect_cols_by_key Collect values for each unique column key into arrays.
cols Returns a table with all column fields in the matrix.
count Count the number of rows and columns in the matrix.
count_cols Count the number of columns in the matrix.
count_rows Count the number of rows in the matrix.
describe Print information about the fields in the matrix.
distinct_by_col Remove columns with a duplicate column key.
distinct_by_row Remove rows with a duplicate row key.
drop Drop fields.
entries Returns a matrix in coordinate table form.
explode_cols Explodes a column field of type array or set, copying the entire column for each element.
explode_rows Explodes a row field of type array or set, copying the entire row for each element.
filter_cols Filter columns of the matrix.
filter_entries Filter entries of the matrix.
filter_rows Filter rows of the matrix.
from_rows_table Construct matrix table with no columns from a table.
globals_table Returns a table with a single row with the globals of the matrix table.
group_cols_by Group columns, used with GroupedMatrixTable.aggregate().
group_rows_by Group rows, used with GroupedMatrixTable.aggregate().
head Subset matrix to first n rows.
index_cols Expose the column values as if looked up in a dictionary, indexing with exprs.
index_entries Expose the entries as if looked up in a dictionary, indexing with exprs.
index_globals Return this matrix table’s global variables for use in another expression context.
index_rows Expose the row values as if looked up in a dictionary, indexing with exprs.
key_cols_by Key columns by a new set of fields.
key_rows_by Key rows by a new set of fields.
make_table Make a table from a matrix table with one field per sample.
n_partitions Number of partitions.
naive_coalesce Naively decrease the number of partitions.
persist Persist this table in memory or on disk.
rename Rename fields of a matrix table.
repartition Increase or decrease the number of partitions.
rows Returns a table with all row fields in the matrix.
sample_rows Downsample the matrix table by keeping each row with probability p.
select_cols Select existing column fields or create new fields by name, dropping the rest.
select_entries Select existing entry fields or create new fields by name, dropping the rest.
select_globals Select existing global fields or create new fields by name, dropping the rest.
select_rows Select existing row fields or create new fields by name, dropping all other non-key fields.
transmute_cols Similar to MatrixTable.annotate_cols(), but drops referenced fields.
transmute_entries Similar to MatrixTable.annotate_entries(), but drops referenced fields.
transmute_globals Similar to MatrixTable.annotate_globals(), but drops referenced fields.
transmute_rows Similar to MatrixTable.annotate_rows(), but drops referenced fields.
union_cols Take the union of dataset columns.
union_rows Take the union of dataset rows.
unpersist Unpersists this dataset from memory/disk.
write Write to disk.
add_col_index(name: str = 'col_idx') → MatrixTable[source]

Add the integer index of each column as a new column field.

Examples

>>> dataset_result = dataset.add_col_index()

Notes

The field added is type tint32.

The column index is 0-indexed; the values are found in the range [0, N), where N is the total number of columns.
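
For instance (a sketch), the index can drive positional filters:

>>> dataset_result = dataset.add_col_index()
>>> n = dataset_result.count_cols()
>>> first_half = dataset_result.filter_cols(dataset_result.col_idx < n // 2)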

Parameters:name (str) – Name for column index field.
Returns:MatrixTable – Dataset with new field.
add_row_index(name: str = 'row_idx') → MatrixTable[source]

Add the integer index of each row as a new row field.

Examples

>>> dataset_result = dataset.add_row_index()

Notes

The field added is type tint64.

The row index is 0-indexed; the values are found in the range [0, N), where N is the total number of rows.
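
For instance (a sketch), the index can be used to thin rows deterministically:

>>> dataset_result = dataset.add_row_index()
>>> every_other = dataset_result.filter_rows(dataset_result.row_idx % 2 == 0)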

Parameters:name (str) – Name for row index field.
Returns:MatrixTable – Dataset with new field.
aggregate_cols(expr) → Any[source]

Aggregate over columns to a local value.

Examples

Aggregate over columns:

>>> dataset.aggregate_cols(
...    hl.struct(fraction_female=agg.fraction(dataset.pheno.is_female),
...              case_ratio=agg.count_where(dataset.is_case) / agg.count()))
Struct(fraction_female=0.48, case_ratio=1.0)

Notes

Unlike most MatrixTable methods, this method does not support meaningful references to fields that are not global or indexed by column.

This method should be thought of as a more convenient alternative to the following:

>>> cols_table = dataset.cols()
>>> cols_table.aggregate(
...     hl.struct(fraction_female=agg.fraction(cols_table.pheno.is_female),
...               case_ratio=agg.count_where(cols_table.is_case) / agg.count()))

Note

This method supports (and expects!) aggregation over columns.

Parameters:expr (Expression) – Aggregation expression.
Returns:any – Aggregated value dependent on expr.
aggregate_entries(expr) → Any[source]

Aggregate over entries to a local value.

Examples

Aggregate over entries:

>>> dataset.aggregate_entries(hl.struct(global_gq_mean=agg.mean(dataset.GQ),
...                                     call_rate=agg.fraction(hl.is_defined(dataset.GT))))
Struct(global_gq_mean=64.01841473178543, call_rate=0.9607692307692308)

Notes

This method should be thought of as a more convenient alternative to the following:

>>> entries_table = dataset.entries()
>>> entries_table.aggregate(hl.struct(global_gq_mean=agg.mean(entries_table.GQ),
...                                   call_rate=agg.fraction(hl.is_defined(entries_table.GT))))

Note

This method supports (and expects!) aggregation over entries.

Parameters:expr (Expression) – Aggregation expressions.
Returns:any – Aggregated value dependent on expr.
aggregate_rows(expr) → Any[source]

Aggregate over rows to a local value.

Examples

Aggregate over rows:

>>> dataset.aggregate_rows(hl.struct(n_high_quality=agg.count_where(dataset.qual > 40),
...                                  mean_qual=agg.mean(dataset.qual)))
Struct(n_high_quality=13, mean_qual=544323.8915384616)

Notes

Unlike most MatrixTable methods, this method does not support meaningful references to fields that are not global or indexed by row.

This method should be thought of as a more convenient alternative to the following:

>>> rows_table = dataset.rows()
>>> rows_table.aggregate(hl.struct(n_high_quality=agg.count_where(rows_table.qual > 40),
...                                mean_qual=agg.mean(rows_table.qual)))

Note

This method supports (and expects!) aggregation over rows.

Parameters:expr (Expression) – Aggregation expression.
Returns:any – Aggregated value dependent on expr.
annotate_cols(**named_exprs) → hail.matrixtable.MatrixTable[source]

Create new column-indexed fields by name.

Examples

Compute statistics about the GQ distribution per sample:

>>> dataset_result = dataset.annotate_cols(sample_gq_stats = agg.stats(dataset.GQ))

Add sample metadata from a hail.Table.

>>> dataset_result = dataset.annotate_cols(population = s_metadata[dataset.s].pop)

Note

This method supports aggregation over rows. For instance, the usage:

>>> dataset_result = dataset.annotate_cols(mean_GQ = agg.mean(dataset.GQ))

will compute the mean per column.

Notes

This method creates new column fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create a column field foo. However, it would be possible to create a column field foo and later create another column field foo, overwriting the first.

The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
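
As a sketch of same-scope overwriting (assuming the pop column field created in the class-level examples above):

>>> dataset_result = dataset.annotate_cols(pop = dataset.pop.lower())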

Parameters:named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:MatrixTable – Matrix table with new column-indexed field(s).
annotate_entries(**named_exprs) → hail.matrixtable.MatrixTable[source]

Create new row-and-column-indexed fields by name.

Examples

Compute the allele dosage using the PL field:

>>> def get_dosage(pl):
...    # convert to linear scale
...    linear_scaled = pl.map(lambda x: 10 ** - (x / 10))
...
...    # normalize to sum to 1
...    ls_sum = hl.sum(linear_scaled)
...    linear_scaled = linear_scaled.map(lambda x: x / ls_sum)
...
...    # multiply by [0, 1, 2] and sum
...    return hl.sum(linear_scaled * [0, 1, 2])
>>>
>>> dataset_result = dataset.annotate_entries(dosage = get_dosage(dataset.PL))

Note

This method does not support aggregation.

Notes

This method creates new entry fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create an entry field foo. However, it would be possible to create an entry field foo and later create another entry field foo, overwriting the first.

The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.

Parameters:named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:MatrixTable – Matrix table with new row-and-column-indexed field(s).
annotate_globals(**named_exprs) → hail.matrixtable.MatrixTable[source]

Create new global fields by name.

Examples

Add two global fields:

>>> pops_1kg = {'EUR', 'AFR', 'EAS', 'SAS', 'AMR'}
>>> dataset_result = dataset.annotate_globals(pops_in_1kg = pops_1kg,
...                                           gene_list = ['SHH', 'SCN1A', 'SPTA1', 'DISC1'])

Add global fields from another table and matrix table:

>>> dataset_result = dataset.annotate_globals(thing1 = dataset2.index_globals().global_field,
...                                           thing2 = v_metadata.index_globals().global_field)

Note

This method does not support aggregation.

Notes

This method creates new global fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a row field foo and later create a global field foo. However, it would be possible to create a global field foo and later create another global field foo, overwriting the first.

The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.

Parameters:named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:MatrixTable – Matrix table with new global field(s).
annotate_rows(**named_exprs) → hail.matrixtable.MatrixTable[source]

Create new row-indexed fields by name.

Examples

Compute call statistics for high quality samples per variant:

>>> high_quality_calls = agg.filter(dataset.sample_qc.gq_stats.mean > 20,
...                                 agg.call_stats(dataset.GT, dataset.alleles))
>>> dataset_result = dataset.annotate_rows(call_stats = high_quality_calls)

Add functional annotations from a Table keyed by locus and alleles, and from another MatrixTable:

>>> dataset_result = dataset.annotate_rows(consequence = v_metadata[dataset.locus, dataset.alleles].consequence,
...                                        dataset2_AF = dataset2.index_rows(dataset.row_key).info.AF)

Note

This method supports aggregation over columns. For instance, the usage:

>>> dataset_result = dataset.annotate_rows(mean_GQ = agg.mean(dataset.GQ))

will compute the mean per row.

Notes

This method creates new row fields, but can also overwrite existing fields. Only non-key, same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create a row field foo. However, it would be possible to create a row field foo and later create another row field foo, overwriting the first, as long as foo is not a row key.

The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.

Parameters:named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:MatrixTable – Matrix table with new row-indexed field(s).
cache() → hail.matrixtable.MatrixTable[source]

Persist the dataset in memory.

Examples

Persist the dataset in memory:

>>> dataset = dataset.cache() 

Notes

This method is an alias for persist("MEMORY_ONLY").

Returns:MatrixTable – Cached dataset.
choose_cols(indices: List[int]) → MatrixTable[source]

Choose a new set of columns from a list of old column indices.

Examples

Randomly shuffle column order:

>>> import random
>>> indices = list(range(dataset.count_cols()))
>>> random.shuffle(indices)
>>> dataset_reordered = dataset.choose_cols(indices)

Take the first ten columns:

>>> dataset_result = dataset.choose_cols(list(range(10)))
Parameters:indices (list of int) – List of old column indices.
Returns:MatrixTable
col

Returns a struct expression of all column-indexed fields, including keys.

Examples

Get all column field names:

>>> list(dataset.col)  
['s', 'sample_qc', 'is_case', 'pheno', 'cov', 'cov1', 'cov2', 'cohorts', 'pop']
Returns:StructExpression – Struct of all column fields.
col_key

Column key struct.

Examples

Get the column key field names:

>>> list(dataset.col_key)
['s']
Returns:StructExpression
col_value

Returns a struct expression including all non-key column-indexed fields.

Examples

Get all non-key column field names:

>>> list(dataset.col_value)  
['sample_qc', 'is_case', 'pheno', 'cov', 'cov1', 'cov2', 'cohorts', 'pop']
Returns:StructExpression – Struct of all column fields, minus keys.
collect_cols_by_key() → hail.matrixtable.MatrixTable[source]

Collect values for each unique column key into arrays.

Examples

>>> mt = hl.utils.range_matrix_table(3, 3)
>>> col_dict = hl.literal({0: [1], 1: [2, 3], 2: [4, 5, 6]})
>>> mt = (mt.annotate_cols(foo = col_dict.get(mt.col_idx))
...     .explode_cols('foo'))
>>> mt = mt.annotate_entries(bar = mt.row_idx * mt.foo)
>>> mt.cols().show()
+---------+-------+
| col_idx |   foo |
+---------+-------+
|   int32 | int32 |
+---------+-------+
|       0 |     1 |
|       1 |     2 |
|       1 |     3 |
|       2 |     4 |
|       2 |     5 |
|       2 |     6 |
+---------+-------+
>>> mt.entries().show()
+---------+---------+-------+-------+
| row_idx | col_idx |   foo |   bar |
+---------+---------+-------+-------+
|   int32 |   int32 | int32 | int32 |
+---------+---------+-------+-------+
|       0 |       0 |     1 |     0 |
|       0 |       1 |     2 |     0 |
|       0 |       1 |     3 |     0 |
|       0 |       2 |     4 |     0 |
|       0 |       2 |     5 |     0 |
|       0 |       2 |     6 |     0 |
|       1 |       0 |     1 |     1 |
|       1 |       1 |     2 |     2 |
|       1 |       1 |     3 |     3 |
|       1 |       2 |     4 |     4 |
+---------+---------+-------+-------+
showing top 10 rows
>>> mt = mt.collect_cols_by_key()
>>> mt.cols().show()
+---------+--------------+
| col_idx | foo          |
+---------+--------------+
|   int32 | array<int32> |
+---------+--------------+
|       0 | [1]          |
|       1 | [2,3]        |
|       2 | [4,5,6]      |
+---------+--------------+
>>> mt.entries().show()
+---------+---------+--------------+--------------+
| row_idx | col_idx | foo          | bar          |
+---------+---------+--------------+--------------+
|   int32 |   int32 | array<int32> | array<int32> |
+---------+---------+--------------+--------------+
|       0 |       0 | [1]          | [0]          |
|       0 |       1 | [2,3]        | [0,0]        |
|       0 |       2 | [4,5,6]      | [0,0,0]      |
|       1 |       0 | [1]          | [1]          |
|       1 |       1 | [2,3]        | [2,3]        |
|       1 |       2 | [4,5,6]      | [4,5,6]      |
|       2 |       0 | [1]          | [2]          |
|       2 |       1 | [2,3]        | [4,6]        |
|       2 |       2 | [4,5,6]      | [8,10,12]    |
+---------+---------+--------------+--------------+

Notes

Each entry field and each non-key column field of type t is replaced by a field of type array<t>. The value of each such field is an array containing all values of that field sharing the corresponding column key. In each column, the newly collected arrays all have the same length, and the values of each pre-collection column are guaranteed to be located at the same index in their corresponding arrays.

Note

The order of the columns is not guaranteed.
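
Because pre-collection values land at the same index across the collected arrays, aligned fields can be combined safely. A minimal sketch using hl.zip, which pairs elements at the same index:

>>> mt_zipped = mt.annotate_entries(foo_bar_pairs = hl.zip(mt.foo, mt.bar))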

Returns:MatrixTable
cols() → hail.table.Table[source]

Returns a table with all column fields in the matrix.

Examples

Extract the column table:

>>> cols_table = dataset.cols()

Warning

Matrix table columns are typically sorted by the order at import, and not necessarily by column key. Since tables are always sorted by key, the table which results from this command will have its rows sorted by the column key (which becomes the table key). To preserve the original column order as the table row order, first unkey the columns using key_cols_by() with no arguments.
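
A sketch of that pattern, extracting the column table in the original import order:

>>> cols_in_import_order = dataset.key_cols_by().cols()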

Returns:Table – Table with all column fields from the matrix, with one row per column of the matrix.
count() → Tuple[int, int][source]

Count the number of rows and columns in the matrix.

Examples

>>> dataset.count()
Returns:int, int – Number of rows, number of cols.
count_cols() → int[source]

Count the number of columns in the matrix.

Examples

Count the number of columns:

>>> n_cols = dataset.count_cols()
Returns:int – Number of columns in the matrix.
count_rows() → int[source]

Count the number of rows in the matrix.

Examples

Count the number of rows:

>>> n_rows = dataset.count_rows()
Returns:int – Number of rows in the matrix.
describe(handler=<built-in function print>)[source]

Print information about the fields in the matrix.
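
Examples

>>> dataset.describe()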

distinct_by_col()[source]

Remove columns with a duplicate column key.

Returns:MatrixTable
distinct_by_row()[source]

Remove rows with a duplicate row key.

Returns:MatrixTable
drop(*exprs) → MatrixTable[source]

Drop fields.

Examples

Drop fields PL (an entry field), info (a row field), and pheno (a column field) using strings:

>>> dataset_result = dataset.drop('PL', 'info', 'pheno')

Drop fields PL (an entry field), info (a row field), and pheno (a column field) using field references:

>>> dataset_result = dataset.drop(dataset.PL, dataset.info, dataset.pheno)

Drop a list of fields:

>>> fields_to_drop = ['PL', 'info', 'pheno']
>>> dataset_result = dataset.drop(*fields_to_drop)

Notes

This method can be used to drop global, row-indexed, column-indexed, or row-and-column-indexed (entry) fields. The arguments can be either strings ('field'), or top-level field references (table.field or table['field']).

Key fields (belonging to either the row key or the column key) cannot be dropped using this method. In order to drop a key field, use key_rows_by() or key_cols_by() to remove the field from the key before dropping.

While many operations exist independently for rows, columns, entries, and globals, only one is needed for dropping due to the lack of any necessary contextual information.

Parameters:exprs (varargs of str or Expression) – Names of fields to drop or field reference expressions.
Returns:MatrixTable – Matrix table without specified fields.
entries() → hail.table.Table[source]

Returns a matrix in coordinate table form.

Examples

Extract the entry table:

>>> entries_table = dataset.entries()

Warning

The table returned by this method should be used for aggregation or queries, but never exported or written to disk without extensive filtering and field selection – the disk footprint of an entries_table could be 100x (or more!) larger than its parent matrix. This means that if you try to export the entries table of a 10 terabyte matrix, you could write a petabyte of data!

Warning

Matrix table columns are typically sorted by the order at import, and not necessarily by column key. Since tables are always sorted by key, the table which results from this command will have its rows sorted by the compound (row key, column key) which becomes the table key. To preserve the original row-major entry order as the table row order, first unkey the columns using key_cols_by() with no arguments.
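
A sketch of the recommended usage, filtering and selecting fields before any further processing (entry and key fields as in the class-level examples):

>>> entries_table = dataset.entries()
>>> entries_table = entries_table.filter(entries_table.GQ > 20)
>>> entries_table = entries_table.select('GT', 'GQ')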

Returns:Table – Table with all non-global fields from the matrix, with one row per entry of the matrix.
entry

Returns a struct expression including all row-and-column-indexed fields.

Examples

Get all entry field names:

>>> list(dataset.entry)
['GT', 'AD', 'DP', 'GQ', 'PL']
Returns:StructExpression – Struct of all entry fields.
explode_cols(field_expr) → MatrixTable[source]

Explodes a column field of type array or set, copying the entire column for each element.

Examples

Explode columns by annotated cohorts:

>>> dataset_result = dataset.explode_cols(dataset.cohorts)

Notes

The new matrix table will have N copies of each column, where N is the number of elements that column contains for the field denoted by field_expr. The field referenced in field_expr is replaced in the sequence of duplicated columns by the sequence of elements in the array or set. All other fields remain the same, including entry fields.

If the field referenced with field_expr is missing or empty, the column is removed entirely.

Parameters:field_expr (str or Expression) – Field name or (possibly nested) field reference expression.
Returns:MatrixTable – Matrix table exploded column-wise for each element of field_expr.
explode_rows(field_expr) → MatrixTable[source]

Explodes a row field of type array or set, copying the entire row for each element.

Examples

Explode rows by annotated genes:

>>> dataset_result = dataset.explode_rows(dataset.gene)

Notes

The new matrix table will have N copies of each row, where N is the number of elements that row contains for the field denoted by field_expr. The field referenced in field_expr is replaced in the sequence of duplicated rows by the sequence of elements in the array or set. All other fields remain the same, including entry fields.

If the field referenced with field_expr is missing or empty, the row is removed entirely.

Parameters:field_expr (str or Expression) – Field name or (possibly nested) field reference expression.
Returns:MatrixTable – Matrix table exploded row-wise for each element of field_expr.
filter_cols(expr, keep: bool = True) → MatrixTable[source]

Filter columns of the matrix.

Examples

Keep columns where pheno.is_case is True and pheno.age is larger than 50:

>>> dataset_result = dataset.filter_cols(dataset.pheno.is_case &
...                                      (dataset.pheno.age > 50),
...                                      keep=True)

Remove columns where sample_qc.gq_stats.mean is less than 20:

>>> dataset_result = dataset.filter_cols(dataset.sample_qc.gq_stats.mean < 20,
...                                      keep=False)

Remove columns where s is found in a Python set:

>>> samples_to_remove = {'NA12878', 'NA12891', 'NA12892'}
>>> set_to_remove = hl.literal(samples_to_remove)
>>> dataset_result = dataset.filter_cols(~set_to_remove.contains(dataset['s']))

Notes

The expression expr will be evaluated for every column of the table. If keep is True, then columns where expr evaluates to False will be removed (the filter keeps the columns where the predicate evaluates to True). If keep is False, then columns where expr evaluates to True will be removed (the filter removes the columns where the predicate evaluates to True).

Warning

When expr evaluates to missing, the column will be removed regardless of keep.

Note

This method supports aggregation over rows. For instance,

>>> dataset_result = dataset.filter_cols(agg.mean(dataset.GQ) > 20.0)

will remove columns where the mean GQ of all entries in the column is smaller than 20.

Parameters:
  • expr (bool or BooleanExpression) – Filter expression.
  • keep (bool) – Keep columns where expr is true.
Returns:

MatrixTable – Filtered matrix table.

filter_entries(expr, keep: bool = True) → MatrixTable[source]

Filter entries of the matrix.

Examples

Keep entries where the sum of AD is greater than 10 and GQ is greater than 20:

>>> dataset_result = dataset.filter_entries((hl.sum(dataset.AD) > 10) & (dataset.GQ > 20))

Notes

The expression expr will be evaluated for every entry of the table. If keep is True, then entries where expr evaluates to False will be removed (the filter keeps the entries where the predicate evaluates to True). If keep is False, then entries where expr evaluates to True will be removed (the filter removes the entries where the predicate evaluates to True).

Note

“Removal” of an entry constitutes setting all its fields to missing. There is some debate about what removing an entry of a matrix means semantically, given the representation of a MatrixTable as a whole workspace in Hail.
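
One way to observe this (a sketch): entries removed by the filter count as missing in downstream aggregations, so the call rate drops accordingly.

>>> dataset_result = dataset.filter_entries(dataset.GQ > 20)
>>> call_rate = dataset_result.aggregate_entries(
...     agg.fraction(hl.is_defined(dataset_result.GT)))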

Warning

When expr evaluates to missing, the entry will be removed regardless of keep.

Note

This method does not support aggregation.

Parameters:
  • expr (bool or BooleanExpression) – Filter expression.
  • keep (bool) – Keep entries where expr is true.
Returns:

MatrixTable – Filtered matrix table.

filter_rows(expr, keep: bool = True) → MatrixTable[source]

Filter rows of the matrix.

Examples

Keep rows where variant_qc.AF is below 1%:

>>> dataset_result = dataset.filter_rows(dataset.variant_qc.AF[1] < 0.01, keep=True)

Remove rows where filters is non-empty:

>>> dataset_result = dataset.filter_rows(dataset.filters.size() > 0, keep=False)

Notes

The expression expr will be evaluated for every row of the table. If keep is True, then rows where expr evaluates to False will be removed (the filter keeps the rows where the predicate evaluates to True). If keep is False, then rows where expr evaluates to True will be removed (the filter removes the rows where the predicate evaluates to True).

Warning

When expr evaluates to missing, the row will be removed regardless of keep.

Note

This method supports aggregation over columns. For instance,

>>> dataset_result = dataset.filter_rows(agg.mean(dataset.GQ) > 20.0)

will remove rows where the mean GQ of all entries in the row is smaller than 20.

Parameters:
  • expr (bool or BooleanExpression) – Filter expression.
  • keep (bool) – Keep rows where expr is true.
Returns:

MatrixTable – Filtered matrix table.

classmethod from_rows_table(table: hail.table.Table) → MatrixTable[source]

Construct matrix table with no columns from a table.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

Examples

Import a text table and construct a rows-only matrix table:

>>> table = hl.import_table('data/variant-lof.tsv')
>>> table = table.transmute(**hl.parse_variant(table['v'])).key_by('locus', 'alleles')
>>> sites_vds = hl.MatrixTable.from_rows_table(table)

Notes

All fields in the table become row-indexed fields in the result.

Parameters:table (Table) – The table to be converted.
Returns:MatrixTable
globals

Returns a struct expression including all global fields.

Returns:StructExpression
globals_table() → hail.table.Table[source]

Returns a table with a single row with the globals of the matrix table.

Examples

Extract the globals table:

>>> globals_table = dataset.globals_table()
Returns:Table – Table with the globals from the matrix, with a single row.
group_cols_by(*exprs, **named_exprs) → GroupedMatrixTable[source]

Group columns, used with GroupedMatrixTable.aggregate().

Examples

Aggregate to a matrix with cohort as column keys, computing the call rate as an entry field:

>>> dataset_result = (dataset.group_cols_by(dataset.cohort)
...                          .aggregate(call_rate = agg.fraction(hl.is_defined(dataset.GT))))

Notes

All complex expressions must be passed as named expressions.

Parameters:
  • exprs (args of str or Expression) – Column fields to group by.
  • named_exprs (keyword args of Expression) – Column-indexed expressions to group by.
Returns:

GroupedMatrixTable – Grouped matrix, can be used to call GroupedMatrixTable.aggregate().

group_rows_by(*exprs, **named_exprs) → GroupedMatrixTable[source]

Group rows, used with GroupedMatrixTable.aggregate().

Examples

Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:

>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))

Notes

All complex expressions must be passed as named expressions.

Parameters:
  • exprs (args of str or Expression) – Row fields to group by.
  • named_exprs (keyword args of Expression) – Row-indexed expressions to group by.
Returns:

GroupedMatrixTable – Grouped matrix. Can be used to call GroupedMatrixTable.aggregate().

head(n: int) → MatrixTable[source]

Subset matrix to first n rows.

Examples

Subset to the first three rows of the matrix:

>>> dataset_result = dataset.head(3)
>>> dataset_result.count_rows()
3

Notes

The number of partitions in the new matrix is equal to the number of partitions containing the first n rows.

Parameters:n (int) – Number of rows to include.
Returns:MatrixTable – Matrix including the first n rows.
index_cols(*exprs)[source]

Expose the column values as if looked up in a dictionary, indexing with exprs.

Examples

>>> dataset_result = dataset.annotate_cols(pheno = dataset2.index_cols(dataset.s).pheno)

Or equivalently:

>>> dataset_result = dataset.annotate_cols(pheno = dataset2.index_cols(dataset.col_key).pheno)

Parameters:exprs (variable-length args of Expression) – Index expressions.

Notes

index_cols(exprs) is equivalent to cols().index(exprs) or cols()[exprs].

The type of the resulting struct is the same as the type of col_value.

Returns:StructExpression
index_entries(row_exprs, col_exprs)[source]

Expose the entries as if looked up in a dictionary, indexing with exprs.

Examples

>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2.index_entries(dataset.row_key, dataset.col_key).GQ)

Or equivalently:

>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2[dataset.row_key, dataset.col_key].GQ)

Parameters:
  • row_exprs (tuple of Expression) – Row index expressions.
  • col_exprs (tuple of Expression) – Column index expressions.

Notes

The type of the resulting struct is the same as the type of entry.

Note

There is a shorthand syntax for MatrixTable.index_entries() using square brackets (the Python __getitem__ syntax). This syntax is preferred.

>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2[dataset.row_key, dataset.col_key].GQ)
Returns:StructExpression
index_globals() → hail.expr.expressions.base_expression.Expression[source]

Return this matrix table’s global variables for use in another expression context.

Examples

>>> dataset1 = dataset.annotate_globals(pli={'SCN1A': 0.999, 'SONIC': 0.014})
>>> pli_dict = dataset1.index_globals().pli
>>> dataset_result = dataset2.annotate_rows(gene_pli = dataset2.gene.map(lambda x: pli_dict.get(x)))
Returns:StructExpression
index_rows(*exprs)[source]

Expose the row values as if looked up in a dictionary, indexing with exprs.

Examples

>>> dataset_result = dataset.annotate_rows(qual = dataset2.index_rows(dataset.locus, dataset.alleles).qual)

Or equivalently:

>>> dataset_result = dataset.annotate_rows(qual = dataset2.index_rows(dataset.row_key).qual)

Parameters:exprs (variable-length args of Expression) – Index expressions.

Notes

index_rows(exprs) is equivalent to rows().index(exprs) or rows()[exprs].

The type of the resulting struct is the same as the type of row_value.

Returns:StructExpression
key_cols_by(*keys, **named_keys) → MatrixTable[source]

Key columns by a new set of fields.

See Table.key_by() for more information on defining a key.
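
Examples

A minimal sketch: re-key the columns by the existing s field, or remove the column key entirely (the unkeyed form is used in the cols() warning below):

>>> dataset_result = dataset.key_cols_by('s')
>>> dataset_unkeyed = dataset.key_cols_by()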

Parameters:
  • keys (varargs of str or Expression.) – Column fields to key by.
  • named_keys (keyword args of Expression.) – Column fields to key by.
Returns:

MatrixTable

key_rows_by(*keys, **named_keys) → MatrixTable[source]

Key rows by a new set of fields.

Examples

>>> dataset_result = dataset.key_rows_by('locus')
>>> dataset_result = dataset.key_rows_by(dataset['locus'])
>>> dataset_result = dataset.key_rows_by(**dataset.row_key.drop('alleles'))

All of these expressions key the dataset by the ‘locus’ field, dropping the ‘alleles’ field from the row key.

>>> dataset_result = dataset.key_rows_by(contig=dataset['locus'].contig,
...                                      position=dataset['locus'].position,
...                                      alleles=dataset['alleles'])

This keys the dataset by the newly defined fields, ‘contig’ and ‘position’, and the ‘alleles’ field. The old row key field, ‘locus’, is preserved as a non-key field.

Notes

See Table.key_by() for more information on defining a key.

Parameters:
  • keys (varargs of str or Expression.) – Row fields to key by.
  • named_keys (keyword args of Expression.) – Row fields to key by.
Returns:

MatrixTable

make_table(separator='.') → hail.table.Table[source]

Make a table from a matrix table with one field per sample.

Examples

Consider a matrix table with the following schema:

Global fields:
    'batch': str
Column fields:
    's': str
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
Entry fields:
    'GT': call
    'GQ': int32
Column key:
    's': str
Row key:
    'locus': locus<GRCh37>
    'alleles': array<str>

and three sample IDs: A, B and C. Then the result of make_table():

>>> ht = mt.make_table() 

has the original row fields along with 6 additional fields, one for each sample and entry field:

Global fields:
    'batch': str
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'A.GT': call
    'A.GQ': int32
    'B.GT': call
    'B.GQ': int32
    'C.GT': call
    'C.GQ': int32
Key:
    'locus': locus<GRCh37>
    'alleles': array<str>

Notes

The table has one row for each row of the input matrix. The per sample and entry fields are formed by concatenating the sample ID with the entry field name using separator. If the entry field name is empty, the separator is omitted.

The table inherits the globals from the matrix table.
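
For instance, a different separator changes the generated field names; with an underscore, the fields above would be named A_GT, A_GQ, and so on (a sketch using the same schema):

>>> ht = mt.make_table(separator='_') 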

Parameters:separator (str) – Separator between sample IDs and entry field names.
Returns:Table
n_partitions() → int[source]

Number of partitions.
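
Examples

Get the current number of partitions:

>>> n = dataset.n_partitions()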

Notes

The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. Partitions are a core concept of distributed computation in Spark; see the Spark documentation for details.

Returns:int – Number of partitions.
naive_coalesce(max_partitions: int) → MatrixTable[source]

Naively decrease the number of partitions.

Examples

Naively repartition to 10 partitions:

>>> dataset_result = dataset.naive_coalesce(10)

Warning

naive_coalesce() simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance, unlike repartition(), so it can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.

Parameters:max_partitions (int) – Desired number of partitions. If the current number of partitions is less than or equal to max_partitions, do nothing.
Returns:MatrixTable – Matrix table with at most max_partitions partitions.
persist(storage_level: str = 'MEMORY_AND_DISK') → MatrixTable[source]

Persist this table in memory or on disk.

Examples

Persist the dataset to both memory and disk:

>>> dataset = dataset.persist() 

Notes

The MatrixTable.persist() and MatrixTable.cache() methods store the current dataset on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitute for MatrixTable.write(), which stores a permanent file.

Most users should use the “MEMORY_AND_DISK” storage level. See the Spark documentation for a more in-depth discussion of persisting data.

Parameters:storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Returns:MatrixTable – Persisted dataset.
rename(fields: Dict[str, str]) → MatrixTable[source]

Rename fields of a matrix table.

Examples

Rename column key s to SampleID, still keying by SampleID.

>>> dataset_result = dataset.rename({'s': 'SampleID'})

You can rename a field to a field name that already exists, as long as that field also gets renamed (no name collisions). Here, we rename the column key s to info, and the row field info to vcf_info:

>>> dataset_result = dataset.rename({'s': 'info', 'info': 'vcf_info'})
Parameters:fields (dict from str to str) – Mapping from old field names to new field names.
Returns:MatrixTable – Matrix table with renamed fields.
repartition(n_partitions: int, shuffle: bool = True) → MatrixTable[source]

Increase or decrease the number of partitions.

Examples

Repartition to 500 partitions:

>>> dataset_result = dataset.repartition(500)

Notes

Check the current number of partitions with n_partitions().

The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. When a matrix with M rows is first imported, each of the k partitions will contain about M/k of the rows. Since each partition has some computational overhead, decreasing the number of partitions can improve performance after significant filtering. Since it's recommended to have at least 2 - 4 partitions per core, increasing the number of partitions can allow one to take advantage of more cores. Partitions are a core concept of distributed computation in Spark; see their documentation for details.

With shuffle=True, Hail does a full shuffle of the data and creates equal sized partitions. With shuffle=False, Hail combines existing partitions to avoid a full shuffle. These algorithms correspond to the repartition and coalesce commands in Spark, respectively. In particular, when shuffle=False, n_partitions cannot exceed the current number of partitions.
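
A sketch of the no-shuffle path, which combines existing partitions (and therefore can only decrease the count, per the note below):

>>> n = dataset.n_partitions()
>>> dataset_result = dataset.repartition(max(1, n // 2), shuffle=False)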

Note

If shuffle is False, the number of partitions may only be reduced, not increased.

Parameters:
  • n_partitions (int) – Desired number of partitions.
  • shuffle (bool) – If True, use full shuffle to repartition.
Returns:

MatrixTable – Repartitioned dataset.

row

Returns a struct expression of all row-indexed fields, including keys.

Examples

Get the first five row field names:

>>> list(dataset.row)[:5]
['locus', 'alleles', 'rsid', 'qual', 'filters']
Returns:StructExpression – Struct of all row fields.
row_key

Row key struct.

Examples

Get the row key field names:

>>> list(dataset.row_key)
['locus', 'alleles']
Returns:StructExpression
row_value

Returns a struct expression including all non-key row-indexed fields.

Examples

Get the first five non-key row field names:

>>> list(dataset.row_value)[:5]
['rsid', 'qual', 'filters', 'info', 'use_as_marker']
Returns:StructExpression – Struct of all row fields, minus keys.
rows() → hail.table.Table[source]

Returns a table with all row fields in the matrix.

Examples

Extract the row table:

>>> rows_table = dataset.rows()
Returns:Table – Table with all row fields from the matrix, with one row per row of the matrix.
sample_rows(p: float, seed=None) → MatrixTable[source]

Downsample the matrix table by keeping each row with probability p.

Examples

Downsample the dataset to approximately 1% of its rows.

>>> small_dataset = dataset.sample_rows(0.01)
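
For reproducible downsampling, a seed may be supplied:

>>> small_dataset = dataset.sample_rows(0.01, seed=42)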
Parameters:
  • p (float) – Probability of keeping each row.
  • seed (int) – Random seed.
Returns:

MatrixTable – Matrix table with approximately p * n_rows rows.

select_cols(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]

Select existing column fields or create new fields by name, dropping the rest.

Examples

Select existing fields and compute a new one:

>>> dataset_result = dataset.select_cols(
...     dataset.sample_qc,
...     dataset.pheno.age,
...     isCohort1 = dataset.pheno.cohort_name == 'Cohort1')

Notes

This method creates new column fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.

Note

See Table.select() for more information about using select methods.

Note

This method supports aggregation over rows. For instance, the usage:

>>> dataset_result = dataset.select_cols(mean_GQ = agg.mean(dataset.GQ))

will compute the mean per column.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

MatrixTable – MatrixTable with specified column fields.

select_entries(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]

Select existing entry fields or create new fields by name, dropping the rest.

Examples

Drop all entry fields aside from GT:

>>> dataset_result = dataset.select_entries(dataset.GT)

Notes

This method creates new entry fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.

Note

See Table.select() for more information about using select methods.

Note

This method does not support aggregation.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

MatrixTable – MatrixTable with specified entry fields.

select_globals(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]

Select existing global fields or create new fields by name, dropping the rest.

Examples

Select one existing field and compute a new one:

>>> dataset_result = dataset.select_globals(dataset.global_field_1,
...                                         another_global=['AFR', 'EUR', 'EAS', 'AMR', 'SAS'])

Notes

This method creates new global fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.

Note

See Table.select() for more information about using select methods.

Note

This method does not support aggregation.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

MatrixTable – MatrixTable with specified global fields.

select_rows(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]

Select existing row fields or create new fields by name, dropping all other non-key fields.

Examples

Select existing fields and compute a new one:

>>> dataset_result = dataset.select_rows(
...    dataset.variant_qc.gq_stats.mean,
...    high_quality_cases = agg.count_where((dataset.GQ > 20) &
...                                         dataset.is_case))

Notes

This method creates new row fields. If a created field shares its name with a differently-indexed field of the table, or with a row key, the method will fail.

Row keys are preserved. To drop or change a row key field, use MatrixTable.key_rows_by().

Note

See Table.select() for more information about using select methods.

Note

This method supports aggregation over columns. For instance, the usage:

>>> dataset_result = dataset.select_rows(mean_GQ = agg.mean(dataset.GQ))

will compute the mean per row.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

MatrixTable – MatrixTable with specified row fields.

transmute_cols(**named_exprs) → hail.matrixtable.MatrixTable[source]

Similar to MatrixTable.annotate_cols(), but drops referenced fields.

Notes

This method adds new column fields according to named_exprs, and drops all column fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
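
A sketch (assuming the cov1 and cov2 column fields listed under the col attribute): the new field is created and both referenced fields are dropped.

>>> dataset_result = dataset.transmute_cols(mean_cov = (dataset.cov1 + dataset.cov2) / 2)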

Note

transmute_cols() will not drop key fields.

Note

This method supports aggregation over rows.

Parameters:named_exprs (keyword args of Expression) – Annotation expressions.
Returns:MatrixTable
transmute_entries(**named_exprs)[source]

Similar to MatrixTable.annotate_entries(), but drops referenced fields.

Notes

This method adds new entry fields according to named_exprs, and drops all entry fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
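
A sketch that repackages two entry fields into a single struct; GQ and DP are dropped because they are referenced:

>>> dataset_result = dataset.transmute_entries(gq_dp = hl.struct(gq=dataset.GQ, dp=dataset.DP))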

Parameters:named_exprs (keyword args of Expression) – Annotation expressions.
Returns:MatrixTable
transmute_globals(**named_exprs) → hail.matrixtable.MatrixTable[source]

Similar to MatrixTable.annotate_globals(), but drops referenced fields.

Notes

This method adds new global fields according to named_exprs, and drops all global fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
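
A sketch (assuming the pli global dict from the class-level examples); pli is referenced, so it is dropped:

>>> dataset_result = dataset.transmute_globals(pli_genes = hl.set(dataset.pli.keys()))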

Parameters:named_exprs (keyword args of Expression) – Annotation expressions.
Returns:MatrixTable
transmute_rows(**named_exprs) → hail.matrixtable.MatrixTable[source]

Similar to MatrixTable.annotate_rows(), but drops referenced fields.

Notes

This method adds new row fields according to named_exprs, and drops all row fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
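
A sketch (assuming the rsid row field listed under the row attribute); rsid is referenced, so it is dropped:

>>> dataset_result = dataset.transmute_rows(rsid_upper = dataset.rsid.upper())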

Note

transmute_rows() will not drop key fields.

Note

This method supports aggregation over columns.

Parameters:named_exprs (keyword args of Expression) – Annotation expressions.
Returns:MatrixTable
union_cols(other: MatrixTable) → MatrixTable[source]

Take the union of dataset columns.

Examples

Union the columns of two datasets:

>>> dataset_result = dataset_to_union_1.union_cols(dataset_to_union_2)

Notes

In order to combine two datasets, three requirements must be met:

  • The row keys must match.
  • The column key schemas and column schemas must match.
  • The entry schemas must match.

The row fields in the resulting dataset are the row fields from the first dataset; the row schemas do not need to match.

This method performs an inner join on rows and concatenates entries from the two datasets for each row.

This method does not deduplicate; if a column key exists identically in two datasets, then it will be duplicated in the result.

Parameters:other (MatrixTable) – Dataset to concatenate.
Returns:MatrixTable – Dataset with columns from both datasets.
union_rows(*datasets) → MatrixTable[source]

Take the union of dataset rows.

Examples

Union the rows of two datasets:

>>> dataset_result = dataset_to_union_1.union_rows(dataset_to_union_2)

Given a list of datasets, take the union of all rows:

>>> all_datasets = [dataset_to_union_1, dataset_to_union_2]

The following three syntaxes are equivalent:

>>> dataset_result = dataset_to_union_1.union_rows(dataset_to_union_2)
>>> dataset_result = all_datasets[0].union_rows(*all_datasets[1:])
>>> dataset_result = hl.MatrixTable.union_rows(*all_datasets)

Notes

In order to combine two datasets, three requirements must be met:

  • The column keys must be identical in type, value, and ordering.
  • The row key schemas and row schemas must match.
  • The entry schemas must match.

The column fields in the resulting dataset are the column fields from the first dataset; the column schemas do not need to match.

This method does not deduplicate; if a row exists identically in two datasets, then it will be duplicated in the result.

Warning

This method can trigger a shuffle, if partitions from two datasets overlap.

Parameters:datasets (varargs of MatrixTable) – Datasets to combine.
Returns:MatrixTable – Dataset with rows from each member of datasets.
unpersist() → hail.matrixtable.MatrixTable[source]

Unpersists this dataset from memory/disk.

Notes

This function will have no effect on a dataset that was not previously persisted.

Returns:MatrixTable – Unpersisted dataset.
write(output: str, overwrite: bool = False, stage_locally: bool = False, _codec_spec: Union[str, NoneType] = None)[source]

Write to disk.

Examples

>>> dataset.write('output/dataset.mt')

Warning

Do not write to a path that is being read from in the same computation.

Parameters:
  • output (str) – Path at which to write.
  • overwrite (bool) – If True, overwrite an existing file at the destination.
  • stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.