MatrixTable
- class hail.MatrixTable[source]
Hail’s distributed implementation of a structured matrix.
Use read_matrix_table() to read a matrix table that was written with MatrixTable.write().
Examples
Add annotations:
>>> dataset = dataset.annotate_globals(pli = {'SCN1A': 0.999, 'SONIC': 0.014},
...                                    populations = ['AFR', 'EAS', 'EUR', 'SAS', 'AMR', 'HIS'])
>>> dataset = dataset.annotate_cols(pop = dataset.populations[hl.int(hl.rand_unif(0, 6))],
...                                 sample_gq = hl.agg.mean(dataset.GQ),
...                                 sample_dp = hl.agg.mean(dataset.DP))
>>> dataset = dataset.annotate_rows(variant_gq = hl.agg.mean(dataset.GQ),
...                                 variant_dp = hl.agg.mean(dataset.DP),
...                                 sas_hets = hl.agg.count_where(dataset.GT.is_het()))
>>> dataset = dataset.annotate_entries(gq_by_dp = dataset.GQ / dataset.DP)
Filter:
>>> dataset = dataset.filter_cols(dataset.pop != 'EUR')
>>> dataset = dataset.filter_rows((dataset.variant_gq > 10) & (dataset.variant_dp > 5))
>>> dataset = dataset.filter_entries(dataset.gq_by_dp > 1)
Query:
>>> col_stats = dataset.aggregate_cols(hl.struct(pop_counts=hl.agg.counter(dataset.pop),
...                                              high_quality=hl.agg.fraction((dataset.sample_gq > 10) & (dataset.sample_dp > 5))))
>>> print(col_stats.pop_counts)
>>> print(col_stats.high_quality)
>>> het_dist = dataset.aggregate_rows(hl.agg.stats(dataset.sas_hets))
>>> print(het_dist)
>>> entry_stats = dataset.aggregate_entries(hl.struct(call_rate=hl.agg.fraction(hl.is_defined(dataset.GT)),
...                                                   global_gq_mean=hl.agg.mean(dataset.GQ)))
>>> print(entry_stats.call_rate)
>>> print(entry_stats.global_gq_mean)
Attributes
Returns a struct expression of all column-indexed fields, including keys.
Column key struct.
Returns a struct expression including all non-key column-indexed fields.
Returns a struct expression including all row-and-column-indexed fields.
Returns a struct expression including all global fields.
Returns a struct expression of all row-indexed fields, including keys.
Row key struct.
Returns a struct expression including all non-key row-indexed fields.
Methods
Add the integer index of each column as a new column field.
Add the integer index of each row as a new row field.
Aggregate over columns to a local value.
Aggregate over entries to a local value.
Aggregate over rows to a local value.
Create new column-indexed fields by name.
Create new row-and-column-indexed fields by name.
Create new global fields by name.
Create new row-indexed fields by name.
Filters the table to columns whose key does not appear in other.
Filters the table to rows whose key does not appear in other.
Persist the dataset in memory.
Checkpoint the matrix table to disk by writing and reading using a fast, but less space-efficient codec.
Choose a new set of columns from a list of old column indices.
Collect values for each unique column key into arrays.
Returns a table with all column fields in the matrix.
Compute statistics about the number and fraction of filtered entries.
Count the number of rows and columns in the matrix.
Count the number of columns in the matrix.
Count the number of rows in the matrix.
Print information about the fields in the matrix table.
Remove columns with a duplicate column key, keeping exactly one column for each unique key.
Remove rows with a duplicate row key, keeping exactly one row for each unique key.
Drop fields.
Returns a matrix in coordinate table form.
Explodes a column field of type array or set, copying the entire column for each element.
Explodes a row field of type array or set, copying the entire row for each element.
Filter columns of the matrix.
Filter entries of the matrix.
Filter rows of the matrix.
Create a MatrixTable from its component parts.
Construct matrix table with no columns from a table.
Returns a table with a single row with the globals of the matrix table.
Group columns, used with GroupedMatrixTable.aggregate().
Group rows, used with GroupedMatrixTable.aggregate().
Subset matrix to first n_rows rows and n_cols cols.
Expose the column values as if looked up in a dictionary, indexing with exprs.
Expose the entries as if looked up in a dictionary, indexing with exprs.
Return this matrix table's global variables for use in another expression context.
Expose the row values as if looked up in a dictionary, indexing with exprs.
Key columns by a new set of fields.
Key rows by a new set of fields.
Convert the matrix table to a table with entries localized as an array of structs.
Make a table from a matrix table with one field per sample.
Number of partitions.
Naively decrease the number of partitions.
Persist this table in memory or on disk.
Rename fields of a matrix table.
Change the number of partitions.
Returns a table with all row fields in the matrix.
Downsample the matrix table by keeping each column with probability p.
Downsample the matrix table by keeping each row with probability p.
Select existing column fields or create new fields by name, dropping the rest.
Select existing entry fields or create new fields by name, dropping the rest.
Select existing global fields or create new fields by name, dropping the rest.
Select existing row fields or create new fields by name, dropping all other non-key fields.
Filters the matrix table to columns whose key appears in other.
Filters the matrix table to rows whose key appears in other.
Print the first few rows of the matrix table to the console.
Compute and print summary information about the fields in the matrix table.
Subset matrix to last n rows.
Similar to MatrixTable.annotate_cols(), but drops referenced fields.
Similar to MatrixTable.annotate_entries(), but drops referenced fields.
Similar to MatrixTable.annotate_globals(), but drops referenced fields.
Similar to MatrixTable.annotate_rows(), but drops referenced fields.
Unfilters filtered entries, populating fields with missing values.
Take the union of dataset columns.
Take the union of dataset rows.
Unpersists this dataset from memory/disk.
Write to disk.
- add_col_index(name='col_idx')[source]
Add the integer index of each column as a new column field.
Examples
>>> dataset_result = dataset.add_col_index()
Notes
The field added is of type tint32. The column index is 0-indexed; the values are found in the range [0, N), where N is the total number of columns.
- Parameters:
name (str) – Name for column index field.
- Returns:
MatrixTable
– Dataset with new field.
- add_row_index(name='row_idx')[source]
Add the integer index of each row as a new row field.
Examples
>>> dataset_result = dataset.add_row_index()
Notes
The field added is of type tint64. The row index is 0-indexed; the values are found in the range [0, N), where N is the total number of rows.
- Parameters:
name (str) – Name for row index field.
- Returns:
MatrixTable
– Dataset with new field.
- aggregate_cols(expr, _localize=True)[source]
Aggregate over columns to a local value.
Examples
Aggregate over columns:
>>> dataset.aggregate_cols(
...     hl.struct(fraction_female=hl.agg.fraction(dataset.pheno.is_female),
...               case_ratio=hl.agg.count_where(dataset.is_case) / hl.agg.count()))
Struct(fraction_female=0.44, case_ratio=1.0)
Notes
Unlike most MatrixTable methods, this method does not support meaningful references to fields that are not global or indexed by column.
This method should be thought of as a more convenient alternative to the following:
>>> cols_table = dataset.cols()
>>> cols_table.aggregate(
...     hl.struct(fraction_female=hl.agg.fraction(cols_table.pheno.is_female),
...               case_ratio=hl.agg.count_where(cols_table.is_case) / hl.agg.count()))
Note
This method supports (and expects!) aggregation over columns.
- Parameters:
expr (Expression) – Aggregation expression.
- Returns:
any – Aggregated value dependent on expr.
- aggregate_entries(expr, _localize=True)[source]
Aggregate over entries to a local value.
Examples
Aggregate over entries:
>>> dataset.aggregate_entries(hl.struct(global_gq_mean=hl.agg.mean(dataset.GQ),
...                                     call_rate=hl.agg.fraction(hl.is_defined(dataset.GT))))
Struct(global_gq_mean=69.60514541387025, call_rate=0.9933333333333333)
Notes
This method should be thought of as a more convenient alternative to the following:
>>> entries_table = dataset.entries()
>>> entries_table.aggregate(hl.struct(global_gq_mean=hl.agg.mean(entries_table.GQ),
...                                   call_rate=hl.agg.fraction(hl.is_defined(entries_table.GT))))
Note
This method supports (and expects!) aggregation over entries.
- Parameters:
expr (Expression) – Aggregation expressions.
- Returns:
any – Aggregated value dependent on expr.
- aggregate_rows(expr, _localize=True)[source]
Aggregate over rows to a local value.
Examples
Aggregate over rows:
>>> dataset.aggregate_rows(hl.struct(n_high_quality=hl.agg.count_where(dataset.qual > 40),
...                                  mean_qual=hl.agg.mean(dataset.qual)))
Struct(n_high_quality=9, mean_qual=140054.73333333334)
Notes
Unlike most MatrixTable methods, this method does not support meaningful references to fields that are not global or indexed by row.
This method should be thought of as a more convenient alternative to the following:
>>> rows_table = dataset.rows()
>>> rows_table.aggregate(hl.struct(n_high_quality=hl.agg.count_where(rows_table.qual > 40),
...                                mean_qual=hl.agg.mean(rows_table.qual)))
Note
This method supports (and expects!) aggregation over rows.
- Parameters:
expr (Expression) – Aggregation expression.
- Returns:
any – Aggregated value dependent on expr.
- annotate_cols(**named_exprs)[source]
Create new column-indexed fields by name.
Examples
Compute statistics about the GQ distribution per sample:
>>> dataset_result = dataset.annotate_cols(sample_gq_stats = hl.agg.stats(dataset.GQ))
Add sample metadata from a hail.Table.
>>> dataset_result = dataset.annotate_cols(population = s_metadata[dataset.s].pop)
Note
This method supports aggregation over rows. For instance, the usage:
>>> dataset_result = dataset.annotate_cols(mean_GQ = hl.agg.mean(dataset.GQ))
will compute the mean per column.
Notes
This method creates new column fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create a column field foo. However, it would be possible to create a column field foo and later create another column field foo, overwriting the first.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
- Parameters:
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– Matrix table with new column-indexed field(s).
- annotate_entries(**named_exprs)[source]
Create new row-and-column-indexed fields by name.
Examples
Compute the allele dosage using the PL field:
>>> def get_dosage(pl):
...     # convert to linear scale
...     linear_scaled = pl.map(lambda x: 10 ** -(x / 10))
...
...     # normalize to sum to 1
...     ls_sum = hl.sum(linear_scaled)
...     linear_scaled = linear_scaled.map(lambda x: x / ls_sum)
...
...     # multiply by [0, 1, 2] and sum
...     return hl.sum(linear_scaled * [0, 1, 2])
>>>
>>> dataset_result = dataset.annotate_entries(dosage = get_dosage(dataset.PL))
Note
This method does not support aggregation.
Notes
This method creates new entry fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create an entry field foo. However, it would be possible to create an entry field foo and later create another entry field foo, overwriting the first.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
- Parameters:
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– Matrix table with new row-and-column-indexed field(s).
- annotate_globals(**named_exprs)[source]
Create new global fields by name.
Examples
Add two global fields:
>>> pops_1kg = {'EUR', 'AFR', 'EAS', 'SAS', 'AMR'}
>>> dataset_result = dataset.annotate_globals(pops_in_1kg = pops_1kg,
...                                           gene_list = ['SHH', 'SCN1A', 'SPTA1', 'DISC1'])
Add global fields from another table and matrix table:
>>> dataset_result = dataset.annotate_globals(thing1 = dataset2.index_globals().global_field,
...                                           thing2 = v_metadata.index_globals().global_field)
Note
This method does not support aggregation.
Notes
This method creates new global fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a row field foo and later create a global field foo. However, it would be possible to create a global field foo and later create another global field foo, overwriting the first.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
- Parameters:
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– Matrix table with new global field(s).
- annotate_rows(**named_exprs)[source]
Create new row-indexed fields by name.
Examples
Compute call statistics for high quality samples per variant:
>>> high_quality_calls = hl.agg.filter(dataset.sample_qc.gq_stats.mean > 20,
...                                    hl.agg.call_stats(dataset.GT, dataset.alleles))
>>> dataset_result = dataset.annotate_rows(call_stats = high_quality_calls)
Add functional annotations from a Table, v_metadata, and a MatrixTable, dataset2, both keyed by locus and alleles.
>>> dataset_result = dataset.annotate_rows(consequence = v_metadata[dataset.locus, dataset.alleles].consequence,
...                                        dataset2_AF = dataset2.index_rows(dataset.row_key).info.AF)
Note
This method supports aggregation over columns. For instance, the usage:
>>> dataset_result = dataset.annotate_rows(mean_GQ = hl.agg.mean(dataset.GQ))
will compute the mean per row.
Notes
This method creates new row fields, but can also overwrite existing fields. Only non-key, same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create a row field foo. However, it would be possible to create a row field foo and later create another row field foo, overwriting the first, as long as foo is not a row key.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
- Parameters:
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– Matrix table with new row-indexed field(s).
- anti_join_cols(other)[source]
Filters the table to columns whose key does not appear in other.
- Parameters:
other (Table) – Table with compatible key field(s).
- Returns:
MatrixTable – Filtered matrix table.
Notes
The column key type of the matrix table must match the key type of other.
This method does not change the schema of the table; it is a method of filtering the matrix table to column keys not present in another table.
To restrict to columns whose key is present in other, use semi_join_cols().
Examples
>>> ds_result = ds.anti_join_cols(cols_to_remove)
It may be inconvenient to key the matrix table by the right-side key. In this case, it is possible to implement an anti-join using a non-key field as follows:
>>> ds_result = ds.filter_cols(hl.is_missing(cols_to_remove.index(ds['s'])))
See also
semi_join_cols()
- anti_join_rows(other)[source]
Filters the table to rows whose key does not appear in other.
- Parameters:
other (Table) – Table with compatible key field(s).
- Returns:
MatrixTable – Filtered matrix table.
Notes
The row key type of the matrix table must match the key type of other.
This method does not change the schema of the table; it is a method of filtering the matrix table to row keys not present in another table.
To restrict to rows whose key is present in other, use semi_join_rows().
Examples
>>> ds_result = ds.anti_join_rows(rows_to_remove)
It may be expensive to key the matrix table by the right-side key. In this case, it is possible to implement an anti-join using a non-key field as follows:
>>> ds_result = ds.filter_rows(hl.is_missing(rows_to_remove.index(ds['locus'], ds['alleles'])))
See also
semi_join_rows()
- cache()[source]
Persist the dataset in memory.
Examples
Persist the dataset in memory:
>>> dataset = dataset.cache()
Notes
This method is an alias for persist("MEMORY_ONLY").
- Returns:
MatrixTable
– Cached dataset.
- checkpoint(output, overwrite=False, stage_locally=False, _codec_spec=None, _read_if_exists=False, _intervals=None, _filter_intervals=False, _drop_cols=False, _drop_rows=False)[source]
Checkpoint the matrix table to disk by writing and reading using a fast, but less space-efficient codec.
- Parameters:
output (str) – Path at which to write.
stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.
overwrite (bool) – If True, overwrite an existing file at the destination.
- Returns:
MatrixTable – Checkpointed matrix table.
Danger
Do not write or checkpoint to a path that is already an input source for the query. This can cause data loss.
Notes
An alias for write() followed by read_matrix_table(). It is possible to read the file at this path later with read_matrix_table(). A faster, but less space-efficient, codec is used for writing the data, so the file will be larger than if one used write().
Examples
>>> dataset = dataset.checkpoint('output/dataset_checkpoint.mt')
- choose_cols(indices)[source]
Choose a new set of columns from a list of old column indices.
Examples
Randomly shuffle column order:
>>> import random
>>> indices = list(range(dataset.count_cols()))
>>> random.shuffle(indices)
>>> dataset_reordered = dataset.choose_cols(indices)
Take the first ten columns:
>>> dataset_result = dataset.choose_cols(list(range(10)))
- Parameters:
indices (list of int) – List of old column indices.
- Returns:
MatrixTable – Matrix table with reordered columns.
- property col
Returns a struct expression of all column-indexed fields, including keys.
Examples
Get all column field names:
>>> list(dataset.col)
['s', 'sample_qc', 'is_case', 'pheno', 'cov', 'cov1', 'cov2', 'cohorts', 'pop']
- Returns:
StructExpression
– Struct of all column fields.
- property col_key
Column key struct.
Examples
Get the column key field names:
>>> list(dataset.col_key)
['s']
- Returns:
StructExpression – Struct of all column key fields.
- property col_value
Returns a struct expression including all non-key column-indexed fields.
Examples
Get all non-key column field names:
>>> list(dataset.col_value)
['sample_qc', 'is_case', 'pheno', 'cov', 'cov1', 'cov2', 'cohorts', 'pop']
- Returns:
StructExpression
– Struct of all column fields, minus keys.
- collect_cols_by_key()[source]
Collect values for each unique column key into arrays.
Examples
>>> mt = hl.utils.range_matrix_table(3, 3)
>>> col_dict = hl.literal({0: [1], 1: [2, 3], 2: [4, 5, 6]})
>>> mt = (mt.annotate_cols(foo = col_dict.get(mt.col_idx))
...         .explode_cols('foo'))
>>> mt = mt.annotate_entries(bar = mt.row_idx * mt.foo)
>>> mt.cols().show()
+---------+-------+
| col_idx |   foo |
+---------+-------+
|   int32 | int32 |
+---------+-------+
|       0 |     1 |
|       1 |     2 |
|       1 |     3 |
|       2 |     4 |
|       2 |     5 |
|       2 |     6 |
+---------+-------+

>>> mt.entries().show()
+---------+---------+-------+-------+
| row_idx | col_idx |   foo |   bar |
+---------+---------+-------+-------+
|   int32 |   int32 | int32 | int32 |
+---------+---------+-------+-------+
|       0 |       0 |     1 |     0 |
|       0 |       1 |     2 |     0 |
|       0 |       1 |     3 |     0 |
|       0 |       2 |     4 |     0 |
|       0 |       2 |     5 |     0 |
|       0 |       2 |     6 |     0 |
|       1 |       0 |     1 |     1 |
|       1 |       1 |     2 |     2 |
|       1 |       1 |     3 |     3 |
|       1 |       2 |     4 |     4 |
+---------+---------+-------+-------+
showing top 10 rows

>>> mt = mt.collect_cols_by_key()
>>> mt.cols().show()
+---------+--------------+
| col_idx | foo          |
+---------+--------------+
|   int32 | array<int32> |
+---------+--------------+
|       0 | [1]          |
|       1 | [2,3]        |
|       2 | [4,5,6]      |
+---------+--------------+

>>> mt.entries().show()
+---------+---------+--------------+--------------+
| row_idx | col_idx | foo          | bar          |
+---------+---------+--------------+--------------+
|   int32 |   int32 | array<int32> | array<int32> |
+---------+---------+--------------+--------------+
|       0 |       0 | [1]          | [0]          |
|       0 |       1 | [2,3]        | [0,0]        |
|       0 |       2 | [4,5,6]      | [0,0,0]      |
|       1 |       0 | [1]          | [1]          |
|       1 |       1 | [2,3]        | [2,3]        |
|       1 |       2 | [4,5,6]      | [4,5,6]      |
|       2 |       0 | [1]          | [2]          |
|       2 |       1 | [2,3]        | [4,6]        |
|       2 |       2 | [4,5,6]      | [8,10,12]    |
+---------+---------+--------------+--------------+
Notes
Each entry field and each non-key column field of type t is replaced by a field of type array<t>. The value of each such field is an array containing all values of that field sharing the corresponding column key. In each column, the newly collected arrays all have the same length, and the values of each pre-collection column are guaranteed to be located at the same index in their corresponding arrays.
Note
The order of the columns is not guaranteed.
- Returns:
MatrixTable – Matrix table with columns collected by key.
- cols()[source]
Returns a table with all column fields in the matrix.
Examples
Extract the column table:
>>> cols_table = dataset.cols()
Warning
Matrix table columns are typically sorted by the order at import, and not necessarily by column key. Since tables are always sorted by key, the table which results from this command will have its rows sorted by the column key (which becomes the table key). To preserve the original column order as the table row order, first unkey the columns using key_cols_by() with no arguments.
- Returns:
Table
– Table with all column fields from the matrix, with one row per column of the matrix.
- compute_entry_filter_stats(row_field='entry_stats_row', col_field='entry_stats_col')[source]
Compute statistics about the number and fraction of filtered entries.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
- Parameters:
row_field (str) – Name for computed row field.
col_field (str) – Name for computed column field.
- Returns:
MatrixTable – Matrix table with new row and column fields.
Notes
Adds a new row field, row_field, and a new column field, col_field, each of which are structs with the following fields:
n_filtered (int64) – Number of filtered entries per row or column.
n_remaining (int64) – Number of entries not filtered per row or column.
fraction_filtered (float32) – Fraction of filtered entries per row or column.
See also
filter_entries(), unfilter_entries()
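Examples
A minimal usage sketch (the GQ threshold here is illustrative, and the default field names are assumed):
>>> mt = dataset.filter_entries(dataset.GQ > 20)
>>> mt = mt.compute_entry_filter_stats()
>>> mt.entry_stats_row.show(3)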
- count_cols(_localize=True)[source]
Count the number of columns in the matrix.
Examples
Count the number of columns:
>>> n_cols = dataset.count_cols()
- Returns:
int
– Number of columns in the matrix.
- count_rows(_localize=True)[source]
Count the number of rows in the matrix.
Examples
Count the number of rows:
>>> n_rows = dataset.count_rows()
- Returns:
int
– Number of rows in the matrix.
- describe(handler=<built-in function print>, *, widget=False)[source]
Print information about the fields in the matrix table.
Note
The widget argument is experimental.
- Parameters:
handler (Callable[[str], None]) – Handler function for returned string.
widget (bool) – Create an interactive IPython widget.
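Examples
Print the schema of the example dataset:
>>> dataset.describe()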
- distinct_by_col()[source]
Remove columns with a duplicate column key, keeping exactly one column for each unique key.
- Returns:
MatrixTable – Matrix table with one column for each unique column key.
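Examples
A short sketch; keep one column per unique column key:
>>> dataset_result = dataset.distinct_by_col()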
- distinct_by_row()[source]
Remove rows with a duplicate row key, keeping exactly one row for each unique key.
- Returns:
MatrixTable – Matrix table with one row for each unique row key.
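Examples
A short sketch; keep one row per unique row key, for example to remove duplicated variants:
>>> dataset_result = dataset.distinct_by_row()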
- drop(*exprs)[source]
Drop fields.
Examples
Drop fields PL (an entry field), info (a row field), and pheno (a column field) using strings:
>>> dataset_result = dataset.drop('PL', 'info', 'pheno')
Drop fields PL (an entry field), info (a row field), and pheno (a column field) using field references:
>>> dataset_result = dataset.drop(dataset.PL, dataset.info, dataset.pheno)
Drop a list of fields:
>>> fields_to_drop = ['PL', 'info', 'pheno']
>>> dataset_result = dataset.drop(*fields_to_drop)
Notes
This method can be used to drop global, row-indexed, column-indexed, or row-and-column-indexed (entry) fields. The arguments can be either strings ('field'), or top-level field references (table.field or table['field']).
Key fields (belonging to either the row key or the column key) cannot be dropped using this method. In order to drop a key field, use key_rows_by() or key_cols_by() to remove the field from the key before dropping.
While many operations exist independently for rows, columns, entries, and globals, only one is needed for dropping due to the lack of any necessary contextual information.
- Parameters:
exprs (varargs of str or Expression) – Names of fields to drop or field reference expressions.
- Returns:
MatrixTable
– Matrix table without specified fields.
- entries()[source]
Returns a matrix in coordinate table form.
Examples
Extract the entry table:
>>> entries_table = dataset.entries()
Notes
The coordinate table representation of the source matrix table contains one row for each non-filtered entry of the matrix – if a matrix table has no filtered entries and contains N rows and M columns, the table will contain M * N rows, which can be a very large number.
This representation can be useful for aggregating over both axes of a matrix table at the same time – it is not possible to aggregate over a matrix table using group_rows_by() and group_cols_by() at the same time (aggregating by population and chromosome from a variant-by-sample genetics representation, for instance). After moving to the coordinate representation with entries(), it is possible to group and aggregate the resulting table much more flexibly, albeit with potentially poorer computational performance.
Warning
The table returned by this method should be used for aggregation or queries, but never exported or written to disk without extensive filtering and field selection – the disk footprint of an entries_table could be 100x (or more!) larger than its parent matrix. This means that if you try to export the entries table of a 10 terabyte matrix, you could write a petabyte of data!
Warning
Matrix table columns are typically sorted by the order at import, and not necessarily by column key. Since tables are always sorted by key, the table which results from this command will have its rows sorted by the compound (row key, column key) which becomes the table key. To preserve the original row-major entry order as the table row order, first unkey the columns using key_cols_by() with no arguments.
Warning
If the matrix table has no row key, but has a column key, this operation may require a full shuffle to sort by the column key, depending on the pipeline.
- Returns:
Table
– Table with all non-global fields from the matrix, with one row per entry of the matrix.
- property entry
Returns a struct expression including all row-and-column-indexed fields.
Examples
Get all entry field names:
>>> list(dataset.entry) ['GT', 'AD', 'DP', 'GQ', 'PL']
- Returns:
StructExpression
– Struct of all entry fields.
- explode_cols(field_expr)[source]
Explodes a column field of type array or set, copying the entire column for each element.
Examples
Explode columns by annotated cohorts:
>>> dataset_result = dataset.explode_cols(dataset.cohorts)
Notes
The new matrix table will have N copies of each column, where N is the number of elements that column contains for the field denoted by field_expr. The field referenced in field_expr is replaced in the sequence of duplicated columns by the sequence of elements in the array or set. All other fields remain the same, including entry fields.
If the field referenced with field_expr is missing or empty, the column is removed entirely.
- Parameters:
field_expr (str or Expression) – Field name or (possibly nested) field reference expression.
- Returns:
MatrixTable
– Matrix table exploded column-wise for each element of field_expr.
- explode_rows(field_expr)[source]
Explodes a row field of type array or set, copying the entire row for each element.
Examples
Explode rows by annotated genes:
>>> dataset_result = dataset.explode_rows(dataset.gene)
Notes
The new matrix table will have N copies of each row, where N is the number of elements that row contains for the field denoted by field_expr. The field referenced in field_expr is replaced in the sequence of duplicated rows by the sequence of elements in the array or set. All other fields remain the same, including entry fields.
If the field referenced with field_expr is missing or empty, the row is removed entirely.
- Parameters:
field_expr (str or Expression) – Field name or (possibly nested) field reference expression.
- Returns:
MatrixTable
– Matrix table exploded row-wise for each element of field_expr.
- filter_cols(expr, keep=True)[source]
Filter columns of the matrix.
Examples
Keep columns where pheno.is_case is True and pheno.age is larger than 50:
>>> dataset_result = dataset.filter_cols(dataset.pheno.is_case &
...                                      (dataset.pheno.age > 50),
...                                      keep=True)
Remove columns where sample_qc.gq_stats.mean is less than 20:
>>> dataset_result = dataset.filter_cols(dataset.sample_qc.gq_stats.mean < 20,
...                                      keep=False)
Remove columns where s is found in a Python set:
>>> samples_to_remove = {'NA12878', 'NA12891', 'NA12892'}
>>> set_to_remove = hl.literal(samples_to_remove)
>>> dataset_result = dataset.filter_cols(~set_to_remove.contains(dataset['s']))
Notes
The expression expr will be evaluated for every column of the table. If keep is True, then columns where expr evaluates to True will be kept (the filter removes the columns where the predicate evaluates to False). If keep is False, then columns where expr evaluates to True will be removed (the filter keeps the columns where the predicate evaluates to False).
Warning
When expr evaluates to missing, the column will be removed regardless of keep.
Note
This method supports aggregation over rows. For instance,
>>> dataset_result = dataset.filter_cols(hl.agg.mean(dataset.GQ) > 20.0)
will remove columns where the mean GQ of all entries in the column is smaller than 20.
- Parameters:
expr (bool or BooleanExpression) – Filter expression.
keep (bool) – Keep columns where expr is true.
- Returns:
MatrixTable
– Filtered matrix table.
- filter_entries(expr, keep=True)[source]
Filter entries of the matrix.
- Parameters:
expr (bool or BooleanExpression) – Filter expression.
keep (bool) – Keep entries where expr is true.
- Returns:
MatrixTable
– Filtered matrix table.
Examples
Keep entries where the sum of AD is greater than 10 and GQ is greater than 20:
>>> dataset_result = dataset.filter_entries((hl.sum(dataset.AD) > 10) & (dataset.GQ > 20))
Warning
When expr evaluates to missing, the entry will be removed regardless of keep.
Note
This method does not support aggregation.
Notes
The expression expr will be evaluated for every entry of the table. If keep is True, then entries where expr evaluates to True will be kept (the filter removes the entries where the predicate evaluates to False). If keep is False, then entries where expr evaluates to True will be removed (the filter keeps the entries where the predicate evaluates to False).
Filtered entries are removed entirely from downstream operations. This means that the resulting matrix table has sparsity – that is, the number of entries is smaller than the product of count_rows() and count_cols(). To re-densify a filtered matrix table, use the unfilter_entries() method to restore filtered entries, populating all fields with missing values. Below are some properties of an entry-filtered matrix table.
1. Filtered entries are not included in the entries() table.
>>> mt_range = hl.utils.range_matrix_table(10, 10)
>>> mt_range = mt_range.annotate_entries(x = mt_range.row_idx + mt_range.col_idx)
>>> mt_range.count()
(10, 10)

>>> mt_range.entries().count()
100

>>> mt_filt = mt_range.filter_entries(mt_range.x % 2 == 0)
>>> mt_filt.count()
(10, 10)

>>> mt_filt.count_rows() * mt_filt.count_cols()
100

>>> mt_filt.entries().count()
50
2. Filtered entries are not included in aggregation.
>>> mt_filt.aggregate_entries(hl.agg.count())
50

>>> mt_filt = mt_filt.annotate_cols(col_n = hl.agg.count())
>>> mt_filt.col_n.take(5)
[5, 5, 5, 5, 5]

>>> mt_filt = mt_filt.annotate_rows(row_n = hl.agg.count())
>>> mt_filt.row_n.take(5)
[5, 5, 5, 5, 5]
3. Annotating a new entry field will not annotate filtered entries.
>>> mt_filt = mt_filt.annotate_entries(y = 1)
>>> mt_filt.aggregate_entries(hl.agg.sum(mt_filt.y))
50
4. If all the entries in a row or column of a matrix table are filtered, the row or column remains.
>>> mt_filt.filter_entries(False).count()
(10, 10)
- filter_rows(expr, keep=True)[source]
Filter rows of the matrix.
Examples
Keep rows where variant_qc.AF is below 1%:
>>> dataset_result = dataset.filter_rows(dataset.variant_qc.AF[1] < 0.01, keep=True)
Remove rows where filters is non-empty:
>>> dataset_result = dataset.filter_rows(dataset.filters.size() > 0, keep=False)
Notes
The expression expr will be evaluated for every row of the table. If keep is True, then rows where expr evaluates to True will be kept (the filter removes the rows where the predicate evaluates to False). If keep is False, then rows where expr evaluates to True will be removed (the filter keeps the rows where the predicate evaluates to False).
Warning
When expr evaluates to missing, the row will be removed regardless of keep.
Note
This method supports aggregation over columns. For instance,
>>> dataset_result = dataset.filter_rows(hl.agg.mean(dataset.GQ) > 20.0)
will remove rows where the mean GQ of all entries in the row is smaller than 20.
- Parameters:
expr (bool or BooleanExpression) – Filter expression.
keep (bool) – Keep rows where expr is true.
- Returns:
MatrixTable
– Filtered matrix table.
- static from_parts(globals=None, rows=None, cols=None, entries=None)[source]
Create a MatrixTable from its component parts.
Example
>>> mt = hl.MatrixTable.from_parts(
...     globals={'hello':'world'},
...     rows={'foo':[1, 2]},
...     cols={'bar':[3, 4]},
...     entries={'baz':[[1, 2],[3, 4]]}
... )
>>> mt.describe()
----------------------------------------
Global fields:
    'hello': str
----------------------------------------
Column fields:
    'col_idx': int32
    'bar': int32
----------------------------------------
Row fields:
    'row_idx': int32
    'foo': int32
----------------------------------------
Entry fields:
    'baz': int32
----------------------------------------
Column key: ['col_idx']
Row key: ['row_idx']
----------------------------------------
>>> mt.row.show()
+---------+-------+
| row_idx |   foo |
+---------+-------+
|   int32 | int32 |
+---------+-------+
|       0 |     1 |
|       1 |     2 |
+---------+-------+
>>> mt.col.show()
+---------+-------+
| col_idx |   bar |
+---------+-------+
|   int32 | int32 |
+---------+-------+
|       0 |     3 |
|       1 |     4 |
+---------+-------+
>>> mt.entry.show()
+---------+-------+-------+
| row_idx | 0.baz | 1.baz |
+---------+-------+-------+
|   int32 | int32 | int32 |
+---------+-------+-------+
|       0 |     1 |     2 |
|       1 |     3 |     4 |
+---------+-------+-------+
Notes
Matrix dimensions are inferred from input data.
You must provide row and column dimensions by specifying rows or entries (inclusive) and cols or entries (inclusive).
The respective dimensions of rows, cols and entries must match should you provide rows and entries or cols and entries (inclusive).
- Parameters:
globals (dict of str to any) – Global fields by name.
rows (dict of str to list) – Row fields by name.
cols (dict of str to list) – Column fields by name.
entries (dict of str to list of list) – Entry fields by name.
- Returns:
MatrixTable
– A MatrixTable assembled from inputs whose rows are keyed by row_idx and columns are keyed by col_idx.
- classmethod from_rows_table(table)[source]
Construct matrix table with no columns from a table.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Examples
Import a text table and construct a rows-only matrix table:
>>> table = hl.import_table('data/variant-lof.tsv')
>>> table = table.transmute(**hl.parse_variant(table['v'])).key_by('locus', 'alleles')
>>> sites_mt = hl.MatrixTable.from_rows_table(table)
Notes
All fields in the table become row-indexed fields in the result.
- Parameters:
table (Table) – The table to be converted.
- Returns:
MatrixTable – Matrix table with no columns.
- property globals
Returns a struct expression including all global fields.
- Returns:
StructExpression – Struct of all global fields.
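Examples
Get all global field names (a sketch; the fields present depend on the dataset):
>>> list(dataset.globals)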
- globals_table()[source]
Returns a table with a single row with the globals of the matrix table.
Examples
Extract the globals table:
>>> globals_table = dataset.globals_table()
- Returns:
Table
– Table with the globals from the matrix, with a single row.
- group_cols_by(*exprs, **named_exprs)[source]
Group columns, used with GroupedMatrixTable.aggregate().
Examples
Aggregate to a matrix with cohort as column keys, computing the call rate as an entry field:
>>> dataset_result = (dataset.group_cols_by(dataset.cohort)
...                          .aggregate(call_rate = hl.agg.fraction(hl.is_defined(dataset.GT))))
Notes
All complex expressions must be passed as named expressions.
- Parameters:
exprs (args of str or Expression) – Column fields to group by.
named_exprs (keyword args of Expression) – Column-indexed expressions to group by.
- Returns:
GroupedMatrixTable – Grouped matrix; can be used to call GroupedMatrixTable.aggregate().
- group_rows_by(*exprs, **named_exprs)[source]
Group rows, used with GroupedMatrixTable.aggregate().
Examples
Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:
>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = hl.agg.count_where(dataset.GT.is_non_ref())))
Notes
All complex expressions must be passed as named expressions.
- Parameters:
exprs (args of str or Expression) – Row fields to group by.
named_exprs (keyword args of Expression) – Row-indexed expressions to group by.
- Returns:
GroupedMatrixTable – Grouped matrix. Can be used to call GroupedMatrixTable.aggregate().
- head(n_rows, n_cols=None, *, n=None)[source]
Subset matrix to first n_rows rows and n_cols cols.
Examples
>>> mt_range = hl.utils.range_matrix_table(100, 100)
Passing only one argument will take the first n_rows rows:
>>> mt_range.head(10).count()
(10, 100)
Passing two arguments refers to rows and columns, respectively:
>>> mt_range.head(10, 20).count()
(10, 20)
Either argument may be None to indicate no filter.
First 10 rows, all columns:
>>> mt_range.head(10, None).count()
(10, 100)
All rows, first 10 columns:
>>> mt_range.head(None, 10).count()
(100, 10)
Notes
The number of partitions in the new matrix is equal to the number of partitions containing the first n_rows rows.
- Parameters:
n_rows (int) – Number of rows to include; all rows if None.
n_cols (int, optional) – Number of cols to include; all cols if None.
- Returns:
MatrixTable
– Matrix including the first n_rows rows and first n_cols cols.
- index_cols(*exprs, all_matches=False)[source]
Expose the column values as if looked up in a dictionary, indexing with exprs.
Examples
>>> dataset_result = dataset.annotate_cols(pheno = dataset2.index_cols(dataset.s).pheno)
Or equivalently:
>>> dataset_result = dataset.annotate_cols(pheno = dataset2.index_cols(dataset.col_key).pheno)
- Parameters:
exprs (variable-length args of Expression) – Index expressions.
all_matches (bool) – Experimental. If True, value of expression is array of all matches.
Notes
index_cols(exprs) is equivalent to cols().index(exprs) or cols()[exprs].
The type of the resulting struct is the same as the type of col_value().
- Returns:
StructExpression – Struct of column values, indexed by exprs.
- index_entries(row_exprs, col_exprs)[source]
Expose the entries as if looked up in a dictionary, indexing with exprs.
Examples
>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2.index_entries(dataset.row_key, dataset.col_key).GQ)
Or equivalently:
>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2[dataset.row_key, dataset.col_key].GQ)
- Parameters:
row_exprs (tuple of Expression) – Row index expressions.
col_exprs (tuple of Expression) – Column index expressions.
Notes
The type of the resulting struct is the same as the type of entry().
Note
There is a shorthand syntax for MatrixTable.index_entries() using square brackets (the Python __getitem__ syntax). This syntax is preferred.
>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2[dataset.row_key, dataset.col_key].GQ)
- Returns:
StructExpression – Struct of entry fields.
- index_globals()[source]
Return this matrix table’s global variables for use in another expression context.
Examples
>>> dataset1 = dataset.annotate_globals(pli={'SCN1A': 0.999, 'SONIC': 0.014})
>>> pli_dict = dataset1.index_globals().pli
>>> dataset_result = dataset2.annotate_rows(gene_pli = dataset2.gene.map(lambda x: pli_dict.get(x)))
- Returns:
StructExpression – Struct of global fields.
- index_rows(*exprs, all_matches=False)[source]
Expose the row values as if looked up in a dictionary, indexing with exprs.
Examples
>>> dataset_result = dataset.annotate_rows(qual = dataset2.index_rows(dataset.locus, dataset.alleles).qual)
Or equivalently:
>>> dataset_result = dataset.annotate_rows(qual = dataset2.index_rows(dataset.row_key).qual)
- Parameters:
exprs (variable-length args of Expression) – Index expressions.
all_matches (bool) – Experimental. If True, value of expression is array of all matches.
Notes
index_rows(exprs) is equivalent to rows().index(exprs) or rows()[exprs].
The type of the resulting struct is the same as the type of row_value().
- Returns:
StructExpression – Struct of row values, indexed by exprs.
- key_cols_by(*keys, **named_keys)[source]
Key columns by a new set of fields.
See Table.key_by() for more information on defining a key.
- Parameters:
keys (varargs of str or Expression) – Column fields to key by.
named_keys (keyword args of Expression) – Column fields to key by.
- Returns:
MatrixTable – Matrix table with a new column key.
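Examples
A short sketch using this page's dataset: unkey the columns, then re-key by the 's' field:
>>> dataset_result = dataset.key_cols_by()
>>> dataset_result = dataset_result.key_cols_by('s')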
- key_rows_by(*keys, **named_keys)[source]
Key rows by a new set of fields.
Examples
>>> dataset_result = dataset.key_rows_by('locus')
>>> dataset_result = dataset.key_rows_by(dataset['locus'])
>>> dataset_result = dataset.key_rows_by(**dataset.row_key.drop('alleles'))
All of these expressions key the dataset by the ‘locus’ field, dropping the ‘alleles’ field from the row key.
>>> dataset_result = dataset.key_rows_by(contig=dataset['locus'].contig, ... position=dataset['locus'].position, ... alleles=dataset['alleles'])
This keys the dataset by the newly defined fields, ‘contig’ and ‘position’, and the ‘alleles’ field. The old row key field, ‘locus’, is preserved as a non-key field.
Notes
See Table.key_by() for more information on defining a key.
- Parameters:
keys (varargs of str or Expression) – Row fields to key by.
named_keys (keyword args of Expression) – Row fields to key by.
- Returns:
MatrixTable – Matrix table with a new row key.
- localize_entries(entries_array_field_name=None, columns_array_field_name=None)[source]
Convert the matrix table to a table with entries localized as an array of structs.
Examples
Build a numpy ndarray from a small MatrixTable:
>>> mt = hl.utils.range_matrix_table(3,3)
>>> mt = mt.select_entries(x = mt.row_idx * mt.col_idx)
>>> mt.show()
+---------+-------+-------+-------+
| row_idx |   0.x |   1.x |   2.x |
+---------+-------+-------+-------+
|   int32 | int32 | int32 | int32 |
+---------+-------+-------+-------+
|       0 |     0 |     0 |     0 |
|       1 |     0 |     1 |     2 |
|       2 |     0 |     2 |     4 |
+---------+-------+-------+-------+
>>> t = mt.localize_entries('entry_structs', 'columns')
>>> t.describe()
----------------------------------------
Global fields:
    'columns': array<struct {
        col_idx: int32
    }>
----------------------------------------
Row fields:
    'row_idx': int32
    'entry_structs': array<struct {
        x: int32
    }>
----------------------------------------
Key: ['row_idx']
----------------------------------------
>>> t = t.select(entries = t.entry_structs.map(lambda entry: entry.x))
>>> import numpy as np
>>> np.array(t.entries.collect())
array([[0, 0, 0],
       [0, 1, 2],
       [0, 2, 4]])
Notes
Both of the added fields are arrays of length equal to mt.count_cols(). Missing entries are represented as missing structs in the entries array.
- Parameters:
entries_array_field_name (str, optional) – Name of the row field containing the array of entry structs.
columns_array_field_name (str, optional) – Name of the global field containing the array of column structs.
- Returns:
Table – A table whose fields are the row fields of this matrix table plus one field named entries_array_field_name. The global fields of this table are the global fields of this matrix table plus one field named columns_array_field_name.
- make_table(separator='.')[source]
Make a table from a matrix table with one field per sample.
Deprecated since version 0.2.129: use localize_entries() instead because it supports more columns.
See also
localize_entries()
Notes
The table has one row for each row of the input matrix. The per sample and entry fields are formed by concatenating the sample ID with the entry field name using separator. If the entry field name is empty, the separator is omitted.
The table inherits the globals from the matrix table.
Examples
Consider a matrix table with the following schema:
Global fields:
    'batch': str
Column fields:
    's': str
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
Entry fields:
    'GT': call
    'GQ': int32
Column key:
    's': str
Row key:
    'locus': locus<GRCh37>
    'alleles': array<str>
and three sample IDs: A, B and C. Then the result of make_table():
>>> ht = mt.make_table()
has the original row fields along with 6 additional fields, one for each sample and entry field:
Global fields:
    'batch': str
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'A.GT': call
    'A.GQ': int32
    'B.GT': call
    'B.GQ': int32
    'C.GT': call
    'C.GQ': int32
Key:
    'locus': locus<GRCh37>
    'alleles': array<str>
- n_partitions()[source]
Number of partitions.
Notes
The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. Partitions are a core concept of distributed computation in Spark, see here for details.
- Returns:
int – Number of partitions.
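Examples
Query the current partitioning:
>>> dataset.n_partitions()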
- naive_coalesce(max_partitions)[source]
Naively decrease the number of partitions.
Example
Naively repartition to 10 partitions:
>>> dataset_result = dataset.naive_coalesce(10)
Warning
naive_coalesce() simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance, unlike repartition(), so it can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.
- Parameters:
max_partitions (int) – Desired number of partitions. If the current number of partitions is less than or equal to max_partitions, do nothing.
- Returns:
MatrixTable
– Matrix table with at most max_partitions partitions.
- persist(storage_level='MEMORY_AND_DISK')[source]
Persist this table in memory or on disk.
Examples
Persist the dataset to both memory and disk:
>>> dataset = dataset.persist()
Notes
The MatrixTable.persist() and MatrixTable.cache() methods store the current dataset on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitute for MatrixTable.write(), which stores a permanent file.
Most users should use the "MEMORY_AND_DISK" storage level. See the Spark documentation for a more in-depth discussion of persisting data.
- Parameters:
storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
- Returns:
MatrixTable
– Persisted dataset.
- rename(fields)[source]
Rename fields of a matrix table.
Examples
Rename column key s to SampleID, still keying by SampleID.
>>> dataset_result = dataset.rename({'s': 'SampleID'})
You can rename a field to a field name that already exists, as long as that field also gets renamed (no name collisions). Here, we rename the column key s to info, and the row field info to vcf_info:
>>> dataset_result = dataset.rename({'s': 'info', 'info': 'vcf_info'})
- Parameters:
fields (dict from str to str) – Mapping from old field names to new field names.
- Returns:
MatrixTable
– Matrix table with renamed fields.
- repartition(n_partitions, shuffle=True)[source]
Change the number of partitions.
Examples
Repartition to 500 partitions:
>>> dataset_result = dataset.repartition(500)
Notes
Check the current number of partitions with n_partitions().
The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. When a matrix with \(M\) rows is first imported, each of the \(k\) partitions will contain about \(M/k\) of the rows. Since each partition has some computational overhead, decreasing the number of partitions can improve performance after significant filtering. Since it's recommended to have at least 2 - 4 partitions per core, increasing the number of partitions can allow one to take advantage of more cores. Partitions are a core concept of distributed computation in Spark, see their documentation for details.
When shuffle=True, Hail does a full shuffle of the data and creates equal sized partitions. When shuffle=False, Hail combines existing partitions to avoid a full shuffle. These algorithms correspond to the repartition and coalesce commands in Spark, respectively. In particular, when shuffle=False, n_partitions cannot exceed the current number of partitions.
- Parameters:
n_partitions (int) – Desired number of partitions.
shuffle (bool) – If True, use full shuffle to repartition.
- Returns:
MatrixTable
– Repartitioned dataset.
- property row
Returns a struct expression of all row-indexed fields, including keys.
Examples
Get the first five row field names:
>>> list(dataset.row)[:5]
['locus', 'alleles', 'rsid', 'qual', 'filters']
- Returns:
StructExpression
– Struct of all row fields.
- property row_key
Row key struct.
Examples
Get the row key field names:
>>> list(dataset.row_key)
['locus', 'alleles']
- Returns:
StructExpression – Struct of all row key fields.
- property row_value
Returns a struct expression including all non-key row-indexed fields.
Examples
Get the first five non-key row field names:
>>> list(dataset.row_value)[:5]
['rsid', 'qual', 'filters', 'info', 'use_as_marker']
- Returns:
StructExpression
– Struct of all row fields, minus keys.
- rows()[source]
Returns a table with all row fields in the matrix.
Examples
Extract the row table:
>>> rows_table = dataset.rows()
- Returns:
Table
– Table with all row fields from the matrix, with one row per row of the matrix.
- sample_cols(p, seed=None)[source]
Downsample the matrix table by keeping each column with probability p.
Examples
Downsample the dataset to approximately 1% of its columns.
>>> small_dataset = dataset.sample_cols(0.01)
- Parameters:
p (float) – Probability of keeping each column.
seed (int) – Random seed.
- Returns:
MatrixTable – Matrix table with approximately p * n_cols columns.
- sample_rows(p, seed=None)[source]
Downsample the matrix table by keeping each row with probability p.
Examples
Downsample the dataset to approximately 1% of its rows.
>>> small_dataset = dataset.sample_rows(0.01)
Notes
Although the MatrixTable returned by this method may be small, it requires a full pass over the rows of the sampled object.
- Parameters:
p (float) – Probability of keeping each row.
seed (int) – Random seed.
- Returns:
MatrixTable – Matrix table with approximately p * n_rows rows.
- select_cols(*exprs, **named_exprs)[source]
Select existing column fields or create new fields by name, dropping the rest.
Examples
Select existing fields and compute a new one:
>>> dataset_result = dataset.select_cols(
...     dataset.sample_qc,
...     dataset.pheno.age,
...     isCohort1 = dataset.pheno.cohort_name == 'Cohort1')
Notes
This method creates new column fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.
Note
This method supports aggregation over rows. For instance, the usage:
>>> dataset_result = dataset.select_cols(mean_GQ = hl.agg.mean(dataset.GQ))
will compute the mean per column.
- Parameters:
exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– MatrixTable with specified column fields.
- select_entries(*exprs, **named_exprs)[source]
Select existing entry fields or create new fields by name, dropping the rest.
Examples
Drop all entry fields aside from GT:
>>> dataset_result = dataset.select_entries(dataset.GT)
Notes
This method creates new entry fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.
Note
This method does not support aggregation.
- Parameters:
exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– MatrixTable with specified entry fields.
- select_globals(*exprs, **named_exprs)[source]
Select existing global fields or create new fields by name, dropping the rest.
Examples
Select one existing field and compute a new one:
>>> dataset_result = dataset.select_globals(dataset.global_field_1,
...                                         another_global=['AFR', 'EUR', 'EAS', 'AMR', 'SAS'])
Notes
This method creates new global fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.
Note
This method does not support aggregation.
- Parameters:
exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– MatrixTable with specified global fields.
- select_rows(*exprs, **named_exprs)[source]
Select existing row fields or create new fields by name, dropping all other non-key fields.
Examples
Select existing fields and compute a new one:
>>> dataset_result = dataset.select_rows(
...     dataset.variant_qc.gq_stats.mean,
...     high_quality_cases = hl.agg.count_where((dataset.GQ > 20) &
...                                             dataset.is_case))
Notes
This method creates new row fields. If a created field shares its name with a differently-indexed field of the table, or with a row key, the method will fail.
Row keys are preserved. To drop or change a row key field, use MatrixTable.key_rows_by().
Note
See Table.select() for more information about using select methods.
Note
This method supports aggregation over columns. For instance, the usage:
>>> dataset_result = dataset.select_rows(mean_GQ = hl.agg.mean(dataset.GQ))
will compute the mean per row.
- Parameters:
exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
- Returns:
MatrixTable
– MatrixTable with specified row fields.
- semi_join_cols(other)[source]
Filters the matrix table to columns whose key appears in other.
- Parameters:
other (Table) – Table with compatible key field(s).
- Returns:
MatrixTable – Filtered matrix table.
Notes
The column key type of the matrix table must match the key type of other.
This method does not change the schema of the matrix table; it is a method of filtering the matrix table to column keys present in another table.
To discard columns whose key is present in other, use anti_join_cols().
Examples
>>> ds_result = ds.semi_join_cols(cols_to_keep)
It may be inconvenient to key the matrix table by the right-side key. In this case, it is possible to implement a semi-join using a non-key field as follows:
>>> ds_result = ds.filter_cols(hl.is_defined(cols_to_keep.index(ds['s'])))
See also
anti_join_cols()
- semi_join_rows(other)[source]
Filters the matrix table to rows whose key appears in other.
- Parameters:
other (Table) – Table with compatible key field(s).
- Returns:
MatrixTable – Filtered matrix table.
Notes
The row key type of the matrix table must match the key type of other.
This method does not change the schema of the matrix table; it is a method of filtering the matrix table to row keys present in another table.
To discard rows whose key is present in other, use anti_join_rows().
Examples
>>> ds_result = ds.semi_join_rows(rows_to_keep)
It may be expensive to key the matrix table by the right-side key. In this case, it is possible to implement a semi-join using a non-key field as follows:
>>> ds_result = ds.filter_rows(hl.is_defined(rows_to_keep.index(ds['locus'], ds['alleles'])))
See also
anti_join_rows()
- show(n_rows=None, n_cols=None, include_row_fields=False, width=None, truncate=None, types=True, handler=None)[source]
Print the first few rows of the matrix table to the console.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Notes
The output can be piped to another output source using the handler argument:
>>> mt.show(handler=lambda x: logging.info(x))
- Parameters:
n_rows (int) – Maximum number of rows to show.
n_cols (int) – Maximum number of columns to show.
width (int) – Horizontal width at which to break fields.
truncate (int, optional) – Truncate each field to the given number of characters. If None, truncate fields to the given width.
types (bool) – Print an extra header line with the type of each field.
handler (Callable[[str], Any]) – Handler function for data string.
- summarize(*, rows=True, cols=True, entries=True, handler=None)[source]
Compute and print summary information about the fields in the matrix table.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
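Examples
A short sketch; summarize only the row and column fields, skipping entries:
>>> dataset.summarize(entries=False)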
- tail(n_rows, n_cols=None, *, n=None)[source]
Subset matrix to last n rows.
Examples
>>> mt_range = hl.utils.range_matrix_table(100, 100)
Passing only one argument will take the last n rows:
>>> mt_range.tail(10).count()
(10, 100)
Passing two arguments refers to rows and columns, respectively:
>>> mt_range.tail(10, 20).count()
(10, 20)
Either argument may be None to indicate no filter.
Last 10 rows, all columns:
>>> mt_range.tail(10, None).count()
(10, 100)
All rows, last 10 columns:
>>> mt_range.tail(None, 10).count()
(100, 10)
Notes
For backwards compatibility, the n parameter is not named n_rows, but the parameter refers to the number of rows to keep.
The number of partitions in the new matrix is equal to the number of partitions containing the last n rows.
- Parameters:
n_rows (int) – Number of rows to include; all rows if None.
n_cols (int, optional) – Number of cols to include; all cols if None.
- Returns:
MatrixTable
– Matrix including the last n rows and last n_cols cols.
- transmute_cols(**named_exprs)[source]
Similar to MatrixTable.annotate_cols(), but drops referenced fields.
Notes
This method adds new column fields according to named_exprs, and drops all column fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
Note
transmute_cols() will not drop key fields.
Note
This method supports aggregation over rows.
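Examples
A hedged sketch using the sample_dp and sample_gq fields annotated earlier on this page; both referenced fields are dropped and replaced by the new field:
>>> dataset_result = dataset.transmute_cols(dp_gq_ratio = dataset.sample_dp / dataset.sample_gq)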
- Parameters:
named_exprs (keyword args of Expression) – Annotation expressions.
- Returns:
MatrixTable – Matrix table with transmuted column fields.
- transmute_entries(**named_exprs)[source]
Similar to MatrixTable.annotate_entries(), but drops referenced fields.
Notes
This method adds new entry fields according to named_exprs, and drops all entry fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
- Parameters:
named_exprs (keyword args of Expression) – Annotation expressions.
- Returns:
MatrixTable – Matrix table with transmuted entry fields.
- transmute_globals(**named_exprs)[source]
Similar to MatrixTable.annotate_globals(), but drops referenced fields.
Notes
This method adds new global fields according to named_exprs, and drops all global fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
- Parameters:
named_exprs (keyword args of Expression) – Annotation expressions.
- Returns:
MatrixTable – Matrix table with transmuted global fields.
- transmute_rows(**named_exprs)[source]
Similar to MatrixTable.annotate_rows(), but drops referenced fields.
Notes
This method adds new row fields according to named_exprs, and drops all row fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
Note
transmute_rows() will not drop key fields.
Note
This method supports aggregation over columns.
- Parameters:
named_exprs (keyword args of Expression) – Annotation expressions.
- Returns:
MatrixTable – Matrix table with transmuted row fields.
- unfilter_entries()[source]
Unfilters filtered entries, populating fields with missing values.
- Returns:
MatrixTable – Matrix table with no filtered entries.
Notes
This method is used in the case that a pipeline downstream of filter_entries() requires a fully dense (no filtered entries) matrix table.
Generally, if this method is required in a pipeline, the upstream pipeline can be rewritten to use annotation instead of entry filtering.
See also
filter_entries(), compute_entry_filter_stats()
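Examples
Building on the mt_filt example from filter_entries() above, unfiltering restores the dense entry count:
>>> mt_filt.unfilter_entries().entries().count()
100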
- union_cols(other, row_join_type='inner', drop_right_row_fields=True)[source]
Take the union of dataset columns.
Warning
This method does not preserve the global fields from the other matrix table.
Examples
Union the columns of two datasets:
>>> dataset_result = dataset_to_union_1.union_cols(dataset_to_union_2)
Notes
In order to combine two datasets, three requirements must be met:
The row keys must match.
The column key schemas and column schemas must match.
The entry schemas must match.
The row fields in the resulting dataset are the row fields from the first dataset; the row schemas do not need to match.
This method creates a MatrixTable which contains all columns from both input datasets. The set of rows included in the result is determined by the row_join_type parameter.
With the default value of 'inner', an inner join is performed on rows, so that only rows whose row key exists in both input datasets are included. In this case, the entries for each row are the concatenation of all entries of the corresponding rows in the input datasets.
With row_join_type set to 'outer', an outer join is performed on rows, so that row keys which exist in only one input dataset are also included. For those rows, the entry fields for the columns coming from the other dataset will be missing.
Only distinct row keys from each dataset are included (equivalent to calling distinct_by_row() on each dataset first).
This method does not deduplicate; if a column key exists identically in two datasets, then it will be duplicated in the result.
- Parameters:
other (MatrixTable) – Dataset to concatenate.
row_join_type (str) – If 'outer', perform an outer join on rows; if 'inner', perform an inner join. Default 'inner'.
drop_right_row_fields (bool) – If true, non-key row fields of other are dropped. Otherwise, non-key row fields in the two datasets must have distinct names, and the result contains the union of the row fields.
- Returns:
MatrixTable
– Dataset with columns from both datasets.
- union_rows(*, _check_cols=True)[source]
Take the union of dataset rows.
Examples
Union the rows of two datasets:
>>> dataset_result = dataset_to_union_1.union_rows(dataset_to_union_2)
Given a list of datasets, take the union of all rows:
>>> all_datasets = [dataset_to_union_1, dataset_to_union_2]
The following three syntaxes are equivalent:
>>> dataset_result = dataset_to_union_1.union_rows(dataset_to_union_2)
>>> dataset_result = all_datasets[0].union_rows(*all_datasets[1:])
>>> dataset_result = hl.MatrixTable.union_rows(*all_datasets)
Notes
In order to combine two datasets, three requirements must be met:
The column keys must be identical in type, value, and ordering.
The row key schemas and row schemas must match.
The entry schemas must match.
The column fields in the resulting dataset are the column fields from the first dataset; the column schemas do not need to match.
This method does not deduplicate; if a row exists identically in two datasets, then it will be duplicated in the result.
Warning
This method can trigger a shuffle, if partitions from two datasets overlap.
- Parameters:
datasets (varargs of MatrixTable) – Datasets to combine.
- Returns:
MatrixTable
– Dataset with rows from each member of datasets.
- unpersist()[source]
Unpersists this dataset from memory/disk.
Notes
This function will have no effect on a dataset that was not previously persisted.
- Returns:
MatrixTable
– Unpersisted dataset.
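Examples
Free a previously persisted dataset:
>>> dataset = dataset.unpersist()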
- write(output, overwrite=False, stage_locally=False, _codec_spec=None, _partitions=None)[source]
Write to disk.
Examples
>>> dataset.write('output/dataset.mt')
Danger
Do not write or checkpoint to a path that is already an input source for the query. This can cause data loss.
See also
read_matrix_table()
- Parameters:
output (str) – Path at which to write.
stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.
overwrite (bool) – If True, overwrite an existing file at the destination.