Experimental

This module serves two functions: as a staging area for extensions of Hail not ready for inclusion in the main package, and as a library of lightly reviewed community submissions.

Contribution Guidelines

Submissions from the community are welcome! The criteria for inclusion in the experimental module are loose and subject to change:

  1. Function docstrings are required. Hail uses NumPy style docstrings.
  2. Tests are not required, but are encouraged. If you do include tests, they must run in no more than a few seconds. Place tests as a class method on Tests in python/tests/experimental/test_experimental.py.
  3. Code style is not strictly enforced, aside from egregious violations. We do recommend using autopep8 though!
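
As an illustration of guideline 1, a NumPy-style docstring could look like the sketch below. The function scale_values and its parameters are invented for this example and are not part of Hail.

```python
def scale_values(values, factor=1.0):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Values to scale.
    factor : float, optional
        Multiplier applied to every element of `values`.

    Returns
    -------
    list of float
        The scaled values, in the same order as the input.
    """
    # Hypothetical body; only the docstring layout matters here.
    return [v * factor for v in values]


scaled = scale_values([1.0, 2.0], factor=2.0)
```

The section names (Parameters, Returns) and the dashed underlines are the parts of the NumPy convention that documentation tooling relies on.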

Genetics Methods

ld_score(entry_expr, locus_expr, radius[, …]) Calculate LD scores.
ld_score_regression(weight_expr, …[, …]) Estimate SNP-heritability and level of confounding biases from GWAS summary statistics.
write_expression(expr, path[, overwrite]) Write an Expression.
read_expression(path) Read an Expression written with experimental.write_expression().
filtering_allele_frequency(ac, an, ci) Computes a filtering allele frequency (described below) for ac and an with confidence ci.
hail_metadata(t_path) Create a metadata plot for a Hail Table or MatrixTable.
plot_roc_curve(ht, scores[, tp_label, …]) Create ROC curve from Hail Table.
phase_by_transmission(locus, alleles, …) Phases genotype calls in a trio based on allele transmission.
phase_trio_matrix_by_transmission(tm, …) Adds a phased genotype entry to a trio MatrixTable based on allele transmission in the trio.
explode_trio_matrix(tm, col_keys, …) Splits a trio MatrixTable back into a sample MatrixTable.
load_dataset(name, version, reference_genome) Load a genetic dataset from Hail’s repository.
import_gtf(path[, reference_genome, …]) Import a GTF file.
get_gene_intervals([gene_symbols, gene_ids, …]) Get intervals of genes or transcripts.
export_entries_by_col(mt, path, batch_size, …) Export entries of the mt by column as separate text files.
sparse_split_multi(sparse_mt) Splits multiallelic variants on a sparse MatrixTable.
hail.experimental.ld_score(entry_expr, locus_expr, radius, coord_expr=None, annotation_exprs=None, block_size=None) → hail.table.Table[source]

Calculate LD scores.

Example

>>> # Load genetic data into MatrixTable
>>> mt = hl.import_plink(bed='data/ldsc.bed',
...                      bim='data/ldsc.bim',
...                      fam='data/ldsc.fam')
>>> # Create locus-keyed Table with numeric variant annotations
>>> ht = hl.import_table('data/ldsc.annot',
...                      types={'BP': hl.tint,
...                             'binary': hl.tfloat,
...                             'continuous': hl.tfloat})
>>> ht = ht.annotate(locus=hl.locus(ht.CHR, ht.BP))
>>> ht = ht.key_by('locus')
>>> # Annotate MatrixTable with external annotations
>>> mt = mt.annotate_rows(binary_annotation=ht[mt.locus].binary,
...                       continuous_annotation=ht[mt.locus].continuous)
>>> # Calculate LD scores using centimorgan coordinates
>>> ht_scores = hl.experimental.ld_score(entry_expr=mt.GT.n_alt_alleles(),
...                                      locus_expr=mt.locus,
...                                      radius=1.0,
...                                      coord_expr=mt.cm_position,
...                                      annotation_exprs=[mt.binary_annotation,
...                                                        mt.continuous_annotation])
>>> # Show results
>>> ht_scores.show(3)
+---------------+-------------------+-----------------------+-------------+
| locus         | binary_annotation | continuous_annotation |  univariate |
+---------------+-------------------+-----------------------+-------------+
| locus<GRCh37> |           float64 |               float64 |     float64 |
+---------------+-------------------+-----------------------+-------------+
| 20:82079      |       1.15183e+00 |           7.30145e+01 | 1.60117e+00 |
| 20:103517     |       2.04604e+00 |           2.75392e+02 | 4.69239e+00 |
| 20:108286     |       2.06585e+00 |           2.86453e+02 | 5.00124e+00 |
+---------------+-------------------+-----------------------+-------------+

Warning

ld_score() will fail if entry_expr results in any missing values. The special float value nan is not considered a missing value.

Further reading

For more in-depth discussion of LD scores, see:

Notes

entry_expr, locus_expr, coord_expr (if specified), and annotation_exprs (if specified) must come from the same MatrixTable.

Parameters:
  • entry_expr (NumericExpression) – Expression for entries of genotype matrix (e.g. mt.GT.n_alt_alleles()).
  • locus_expr (LocusExpression) – Row-indexed locus expression.
  • radius (int or float) – Radius of window for row values (in units of coord_expr if set, otherwise in units of basepairs).
  • coord_expr (Float64Expression, optional) – Row-indexed numeric expression for the row value used to window variants. By default, the row value is given by the locus position.
  • annotation_exprs (NumericExpression or list of NumericExpression, optional) – Annotation expression(s) to partition LD scores. Univariate annotation will always be included and does not need to be specified.
  • block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size().
Returns:

Table – Table keyed by locus_expr with LD scores for each variant and annotation_expr. The function will always return LD scores for the univariate (all SNPs) annotation.
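
To make the windowed sum concrete, here is a minimal pure-Python sketch of a univariate LD score: for each variant, it sums the squared Pearson correlation with every variant whose coordinate lies within radius (the variant itself included). The helper names pearson_r2 and naive_ld_scores are hypothetical, the toy data are made up, and the sketch assumes no missing values and no monomorphic variants. It is not Hail's implementation, which runs on a BlockMatrix and may differ in details such as bias adjustment of the correlation or window boundary handling.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def naive_ld_scores(genotypes, coords, radius):
    """Sum r^2 over all variants within `radius` of each variant.

    genotypes: one list of alternate-allele counts per variant.
    coords: one coordinate per variant (base pairs or centimorgans).
    """
    scores = []
    for j, g_j in enumerate(genotypes):
        score = 0.0
        for k, g_k in enumerate(genotypes):
            if abs(coords[j] - coords[k]) <= radius:
                score += pearson_r2(g_j, g_k)
        scores.append(score)
    return scores


# Toy data: variants 0 and 1 are in perfect LD and 50 bp apart;
# variant 2 lies far outside the window.
genotypes = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 0, 1, 0],
]
coords = [100, 150, 5000]
scores = naive_ld_scores(genotypes, coords, radius=100)
```

In this toy case the first two variants each score 2 (themselves plus one perfectly correlated neighbor), and the isolated variant scores 1.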

hail.experimental.ld_score_regression(weight_expr, ld_score_expr, chi_sq_exprs, n_samples_exprs, n_blocks=200, two_step_threshold=30, n_reference_panel_variants=None) → hail.table.Table[source]

Estimate SNP-heritability and level of confounding biases from GWAS summary statistics.

Given a set or multiple sets of genome-wide association study (GWAS) summary statistics, ld_score_regression() estimates the heritability of a trait or set of traits and the level of confounding biases present in the underlying studies by regressing chi-squared statistics on LD scores, leveraging the model:

\[\mathrm{E}[\chi_j^2] = 1 + Na + \frac{Nh_g^2}{M}l_j\]
  • \(\mathrm{E}[\chi_j^2]\) is the expected chi-squared statistic for variant \(j\) resulting from a test of association between variant \(j\) and a trait.
  • \(l_j = \sum_{k} r_{jk}^2\) is the LD score of variant \(j\), calculated as the sum of squared correlation coefficients between variant \(j\) and nearby variants. See ld_score() for further details.
  • \(a\) captures the contribution of confounding biases, such as cryptic relatedness and uncontrolled population structure, to the association test statistic.
  • \(h_g^2\) is the SNP-heritability, or the proportion of variation in the trait explained by the effects of variants included in the regression model above.
  • \(M\) is the number of variants used to estimate \(h_g^2\).
  • \(N\) is the number of samples in the underlying association study.
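
The model above is linear in the LD score, with intercept \(1 + Na\) and slope \(Nh_g^2/M\). The sketch below generates noise-free expected chi-squared values from made-up parameters and recovers them with ordinary least squares. The names expected_chi_sq and ols_fit and all numeric values are assumptions for illustration; the actual function additionally uses variant weights, a two-step procedure, and a block jackknife, none of which are modeled here.

```python
def expected_chi_sq(ld_score, n_samples, n_variants, h2, a):
    """E[chi^2_j] = 1 + N*a + (N * h2 / M) * l_j."""
    return 1.0 + n_samples * a + n_samples * h2 / n_variants * ld_score


def ols_fit(xs, ys):
    """Unweighted simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up study: N samples, M variants, true h2 and confounding term a.
N, M = 50_000, 1_000_000
true_h2, true_a = 0.4, 1e-6
ld_scores = [1.0, 5.0, 20.0, 80.0, 150.0]
chi_sq = [expected_chi_sq(l, N, M, true_h2, true_a) for l in ld_scores]

intercept, slope = ols_fit(ld_scores, chi_sq)
h2_estimate = slope * M / N  # invert slope = N * h2 / M
```

Because the inputs are noise-free, the fitted intercept equals \(1 + Na\) and the rescaled slope equals the heritability used to generate the data, up to floating-point error.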

For more details on the method implemented in this function, see:

Examples

Run the method on a matrix table of summary statistics, where the rows are variants and the columns are different phenotypes:

>>> mt_gwas = hl.read_matrix_table('data/ld_score_regression.sumstats.mt')
>>> ht_results = hl.experimental.ld_score_regression(
...     weight_expr=mt_gwas['ld_score'],
...     ld_score_expr=mt_gwas['ld_score'],
...     chi_sq_exprs=mt_gwas['chi_squared'],
...     n_samples_exprs=mt_gwas['n'])

Run the method on a table with summary statistics for a single phenotype:

>>> ht_gwas = hl.read_table('data/ld_score_regression.sumstats.ht')
>>> ht_results = hl.experimental.ld_score_regression(
...     weight_expr=ht_gwas['ld_score'],
...     ld_score_expr=ht_gwas['ld_score'],
...     chi_sq_exprs=ht_gwas['chi_squared_50_irnt'],
...     n_samples_exprs=ht_gwas['n_50_irnt'])

Run the method on a table with summary statistics for multiple phenotypes:

>>> ht_gwas = hl.read_table('data/ld_score_regression.sumstats.ht')
>>> ht_results = hl.experimental.ld_score_regression(
...     weight_expr=ht_gwas['ld_score'],
...     ld_score_expr=ht_gwas['ld_score'],
...     chi_sq_exprs=[ht_gwas['chi_squared_50_irnt'],
...                        ht_gwas['chi_squared_20160']],
...     n_samples_exprs=[ht_gwas['n_50_irnt'],
...                      ht_gwas['n_20160']])

Notes

The exprs provided as arguments to ld_score_regression() must all be from the same object, either a Table or a MatrixTable.

If the arguments originate from a table:

  • The table must be keyed by fields locus of type tlocus and alleles, a tarray of tstr elements.
  • weight_expr, ld_score_expr, chi_sq_exprs, and n_samples_exprs must be row-indexed fields.
  • The number of expressions passed to n_samples_exprs must be equal to one or the number of expressions passed to chi_sq_exprs. If just one expression is passed to n_samples_exprs, that sample size expression is assumed to apply to all sets of statistics passed to chi_sq_exprs. Otherwise, the expressions passed to chi_sq_exprs and n_samples_exprs are matched by index.
  • The phenotype field that keys the table returned by ld_score_regression() will have generic int values 0, 1, etc. corresponding to the 0th, 1st, etc. expressions passed to the chi_sq_exprs argument.
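
The matching rule for the table case can be sketched in plain Python. The helper pair_summary_statistics is hypothetical and uses strings as stand-ins for expressions; the error it raises is an assumption for illustration, not Hail's actual behavior on mismatched lengths.

```python
def pair_summary_statistics(chi_sq_exprs, n_samples_exprs):
    """Pair each chi-squared expression with a sample-size expression.

    Returns a dict keyed by the generic integer phenotype index.
    """
    if not isinstance(chi_sq_exprs, list):
        chi_sq_exprs = [chi_sq_exprs]
    if not isinstance(n_samples_exprs, list):
        n_samples_exprs = [n_samples_exprs]

    if len(n_samples_exprs) == 1:
        # A single sample size applies to every set of statistics.
        n_samples_exprs = n_samples_exprs * len(chi_sq_exprs)
    elif len(n_samples_exprs) != len(chi_sq_exprs):
        raise ValueError("n_samples_exprs must have length 1 or match chi_sq_exprs")

    # Otherwise the two lists are matched by index.
    return {i: pair for i, pair in enumerate(zip(chi_sq_exprs, n_samples_exprs))}


shared_n = pair_summary_statistics(["chi_sq_a", "chi_sq_b"], "n")
per_trait_n = pair_summary_statistics(["chi_sq_a", "chi_sq_b"], ["n_a", "n_b"])
```

The integer keys 0 and 1 play the role of the generic phenotype values described in the last bullet.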

If the arguments originate from a matrix table:

  • The dimensions of the matrix table must be variants (rows) by phenotypes (columns).
  • The rows of the matrix table must be keyed by fields locus of type tlocus and alleles, a tarray of tstr elements.
  • The columns of the matrix table must be keyed by a field of type tstr that uniquely identifies phenotypes represented in the matrix table. The column key must be a single expression; compound keys are not accepted.
  • weight_expr and ld_score_expr must be row-indexed fields.
  • chi_sq_exprs must be a single entry-indexed field (not a list of fields).
  • n_samples_exprs must be a single entry-indexed field (not a list of fields).
  • The phenotype field that keys the table returned by ld_score_regression() will have values corresponding to the column keys of the input matrix table.

This function returns a Table with one row per set of summary statistics passed to the chi_sq_exprs argument. The following row-indexed fields are included in the table:

  • phenotype (tstr) – The name of the phenotype. The returned table is keyed by this field. See the notes above for details on the possible values of this field.
  • mean_chi_sq (tfloat64) – The mean chi-squared test statistic for the given phenotype.
  • intercept (Struct) – Contains fields:
    • estimate (tfloat64) – A point estimate of the intercept \(1 + Na\).
    • standard_error (tfloat64) – An estimate of the standard error of this point estimate.
  • snp_heritability (Struct) – Contains fields:
    • estimate (tfloat64) – A point estimate of the SNP-heritability \(h_g^2\).
    • standard_error (tfloat64) – An estimate of the standard error of this point estimate.

Warning

ld_score_regression() considers only the rows for which both row fields weight_expr and ld_score_expr are defined. Rows with missing values in either field are removed prior to fitting the LD score regression model.

Parameters:
  • weight_expr (Float64Expression) – Row-indexed expression for the LD scores used to derive variant weights in the model.
  • ld_score_expr (Float64Expression) – Row-indexed expression for the LD scores used as covariates in the model.
  • chi_sq_exprs (Float64Expression or list of Float64Expression) – One or more row-indexed (if table) or entry-indexed (if matrix table) expressions for chi-squared statistics resulting from genome-wide association studies.
  • n_samples_exprs (NumericExpression or list of NumericExpression) – One or more row-indexed (if table) or entry-indexed (if matrix table) expressions indicating the number of samples used in the studies that generated the test statistics supplied to chi_sq_exprs.
  • n_blocks (int) – The number of blocks used in the jackknife approach to estimating standard errors.
  • two_step_threshold (int) – Variants with chi-squared statistics greater than this value are excluded in the first step of the two-step procedure used to fit the model.
  • n_reference_panel_variants (int, optional) – Number of variants used to estimate the SNP-heritability \(h_g^2\).
Returns:

Table – Table keyed by phenotype with intercept and heritability estimates for each phenotype passed to the function.

hail.experimental.write_expression(expr, path, overwrite=False)[source]

Write an Expression.

In the same vein as Python’s pickle, write out an expression that does not have a source (such as one that comes from Table.aggregate with _localize=False).

Example

>>> ht = hl.utils.range_table(100).annotate(x=hl.rand_norm())
>>> mean_norm = ht.aggregate(hl.agg.mean(ht.x), _localize=False)
>>> mean_norm
>>> hl.eval(mean_norm)
>>> hl.experimental.write_expression(mean_norm, 'output/expression.he')
Parameters:
  • expr (Expression) – Expression to write.
  • path (str) – Path to which to write expression. Suggested extension: .he (hail expression).
  • overwrite (bool) – If True, overwrite an existing file at the destination.
Returns:

None

hail.experimental.read_expression(path)[source]

Read an Expression written with experimental.write_expression().

Example

>>> hl.experimental.write_expression(hl.array([1, 2]), 'output/test_expression.he')
>>> expression = hl.experimental.read_expression('output/test_expression.he')
>>> hl.eval(expression)
Parameters: path (str) – File to read.
Returns: Expression
hail.experimental.hail_metadata(t_path)[source]

Create a metadata plot for a Hail Table or MatrixTable.

Parameters: t_path (str) – Path to the Hail Table or MatrixTable files.
Returns: bokeh.plotting.figure.Figure or bokeh.models.widgets.panels.Tabs or bokeh.models.layouts.Column
hail.experimental.plot_roc_curve(ht, scores, tp_label='tp', fp_label='fp', colors=None, title='ROC Curve', hover_mode='mouse')[source]

Create ROC curve from Hail Table.

One or more score fields must be provided, which are assessed against tp_label and fp_label as truth data.

High scores should correspond to true positives.

Parameters:
  • ht (Table) – Table with required data
  • scores (str or list of str) – Top-level location of scores in ht against which to generate ROC curves.
  • tp_label (str) – Top-level location of true positives in ht.
  • fp_label (str) – Top-level location of false positives in ht.
  • colors (dict of str) – Optional colors to use (score -> desired color).
  • title (str) – Title of plot.
  • hover_mode (str) – Hover mode; one of ‘mouse’ (default), ‘vline’ or ‘hline’.
Returns:

tuple of Figure and list of str – Figure, and list of AUCs corresponding to scores.
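
For readers who want to see what a ROC curve and its AUC amount to, here is a minimal pure-Python sketch. It assumes each scored item is either a true positive or a false positive, that both classes are present, and that scores are distinct (no tie handling). The names roc_points and auc are hypothetical; this is not the code plot_roc_curve() runs.

```python
def roc_points(scores, is_true_positive):
    """Return (false positive rate, true positive rate) points.

    Items are visited from highest to lowest score, so high scores
    should correspond to true positives.
    """
    ranked = sorted(zip(scores, is_true_positive),
                    key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label)
    n_neg = len(ranked) - n_pos

    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.3]
perfect = auc(roc_points(scores, [True, True, False, False]))
mixed = auc(roc_points(scores, [True, False, True, False]))
```

A score that ranks every true positive above every false positive gives an AUC of 1; interleaving them, as in the second call, lowers it.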

hail.experimental.filtering_allele_frequency(ac, an, ci) → hail.expr.expressions.typed_expressions.Float64Expression[source]

Computes a filtering allele frequency (described below) for ac and an with confidence ci.

The filtering allele frequency is the highest true population allele frequency for which the upper bound of the ci (confidence interval) of allele count under a Poisson distribution is still less than the variant’s observed ac (allele count) in the reference sample, given an an (allele number).

This function defines a “filtering AF” that represents the threshold disease-specific “maximum credible AF” at or below which the disease could not plausibly be caused by that variant. A variant with a filtering AF >= the maximum credible AF for the disease under consideration should be filtered, while a variant with a filtering AF below the maximum credible remains a candidate. This filtering AF is not disease-specific: it can be applied to any disease of interest by comparing with a user-defined disease-specific maximum credible AF.

For more details, see: Whiffin et al., 2017
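
Read literally, the definition above asks for the largest Poisson mean \(\lambda\) such that the ci quantile of the allele count is still below ac, which is equivalent to \(P(X \le ac - 1) \ge ci\) for \(X \sim \mathrm{Poisson}(\lambda)\); the filtering AF is then \(\lambda / an\). The sketch below solves this by bisection in pure Python. The names poisson_cdf and filtering_af_sketch are hypothetical, and this is an interpretation of the prose definition rather than Hail's implementation, so results may differ from the library in edge cases.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def filtering_af_sketch(ac, an, ci):
    """Largest AF whose Poisson upper bound on allele count stays below ac."""
    if ac <= 0 or an <= 0:
        return 0.0
    lo, hi = 0.0, float(ac)
    # Grow the bracket until the condition fails at hi.
    while poisson_cdf(ac - 1, hi) >= ci:
        hi *= 2
    # The CDF decreases in lam, so bisect on the boundary.
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(ac - 1, mid) >= ci:
            lo = mid
        else:
            hi = mid
    return lo / an


# For ac = 1 the condition reduces to exp(-lam) >= ci, so the
# boundary has the closed form lam = -ln(ci).
faf_single = filtering_af_sketch(ac=1, an=1000, ci=0.95)
closed_form = -math.log(0.95) / 1000
faf_ten = filtering_af_sketch(ac=10, an=1000, ci=0.95)
```

The ac = 1 case is a convenient sanity check because the boundary can be computed by hand.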

Parameters:
  • ac – Observed allele count.
  • an – Allele number.
  • ci – Confidence interval.
Returns:

Expression of type tfloat64

hail.experimental.phase_by_transmission(locus: hail.expr.expressions.typed_expressions.LocusExpression, alleles: hail.expr.expressions.typed_expressions.ArrayExpression, proband_call: hail.expr.expressions.typed_expressions.CallExpression, father_call: hail.expr.expressions.typed_expressions.CallExpression, mother_call: hail.expr.expressions.typed_expressions.CallExpression) → hail.expr.expressions.typed_expressions.ArrayExpression[source]

Phases genotype calls in a trio based on allele transmission.

Notes

In the phased calls returned, the order is as follows:

  • Proband: father_allele | mother_allele
  • Parents: transmitted_allele | untransmitted_allele

Phasing of sex chromosomes:

  • Sex chromosomes of male individuals should be haploid to be phased correctly.
  • If proband_call is diploid on non-PAR regions of the sex chromosomes, it is assumed to be female.

Returns NA when genotype calls cannot be phased. The following genotype call combinations cannot be phased by transmission:

  1. One of the calls in the trio is missing.
  2. The proband genotype cannot be obtained from the parents’ alleles (Mendelian violation).
  3. All individuals of the trio are heterozygous for the same two alleles.
  4. Father is diploid on the non-PAR region of X or Y.
  5. Proband is diploid on the non-PAR region of Y.

In addition, individual phased genotype calls are returned as missing in the following situations:

  1. All mother genotype calls on the non-PAR region of Y.
  2. Diploid father genotype calls on the non-PAR region of X for a male proband (proband and mother are still phased, as the father doesn’t participate in allele transmission).
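
The autosomal part of these rules can be sketched in a few lines of plain Python. Calls are represented as tuples of allele indices and None stands for a missing call. The function phase_autosomal_trio is a hypothetical illustration covering only diploid calls and the first three unphasable cases; sex-chromosome handling is not modeled, and this is not Hail's implementation.

```python
def phase_autosomal_trio(proband, father, mother):
    """Phase a diploid trio by transmission, or return None.

    Returns [proband, father, mother] where the proband is ordered
    father_allele | mother_allele and each parent is ordered
    transmitted_allele | untransmitted_allele.
    """
    if proband is None or father is None or mother is None:
        return None  # a call in the trio is missing

    a, b = proband
    candidates = set()
    if a in father and b in mother:
        candidates.add((a, b))
    if b in father and a in mother:
        candidates.add((b, a))
    # No candidate: Mendelian violation. Two candidates: ambiguous,
    # e.g. all three individuals heterozygous for the same alleles.
    if len(candidates) != 1:
        return None
    from_father, from_mother = next(iter(candidates))

    def order_parent(call, transmitted):
        first, second = call
        return (first, second) if first == transmitted else (second, first)

    return [(from_father, from_mother),
            order_parent(father, from_father),
            order_parent(mother, from_mother)]


phased = phase_autosomal_trio((0, 1), (0, 0), (0, 1))
all_het = phase_autosomal_trio((0, 1), (0, 1), (0, 1))
violation = phase_autosomal_trio((1, 1), (0, 0), (0, 1))
```

In the first call the father can only have transmitted allele 0, so the proband is phased as 0 | 1 and the mother as 1 | 0.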

Note

experimental.phase_trio_matrix_by_transmission() provides a convenience wrapper for phasing a trio matrix.

Parameters:
  • locus (LocusExpression) – Expression for the locus in the trio matrix
  • alleles (ArrayExpression) – Expression for the alleles in the trio matrix
  • proband_call (CallExpression) – Expression for the proband call in the trio matrix
  • father_call (CallExpression) – Expression for the father call in the trio matrix
  • mother_call (CallExpression) – Expression for the mother call in the trio matrix
Returns:

ArrayExpression – Array containing: [phased proband call, phased father call, phased mother call]

hail.experimental.phase_trio_matrix_by_transmission(tm: hail.matrixtable.MatrixTable, call_field: str = 'GT', phased_call_field: str = 'PBT_GT') → hail.matrixtable.MatrixTable[source]

Adds a phased genotype entry to a trio MatrixTable based on allele transmission in the trio.

Example

>>> # Create a trio matrix
>>> pedigree = hl.Pedigree.read('data/case_control_study.fam')
>>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)
>>> # Phase trios by transmission
>>> phased_trio_dataset = hl.experimental.phase_trio_matrix_by_transmission(trio_dataset)

Notes

Uses only a Call field to phase and only phases when all 3 members of the trio are present and have a call.

In the phased genotypes, the order is as follows:

  • Proband: father_allele | mother_allele
  • Parents: transmitted_allele | untransmitted_allele

Phasing of sex chromosomes:

  • Sex chromosomes of male individuals should be haploid to be phased correctly.
  • If a proband is diploid on non-PAR regions of the sex chromosomes, it is assumed to be female.

Genotypes that cannot be phased are set to NA. The following genotype call combinations cannot be phased by transmission (the phased calls of all trio members are set to missing):

  1. One of the calls in the trio is missing.
  2. The proband genotype cannot be obtained from the parents’ alleles (Mendelian violation).
  3. All individuals of the trio are heterozygous for the same two alleles.
  4. Father is diploid on the non-PAR region of X or Y.
  5. Proband is diploid on the non-PAR region of Y.

In addition, individual phased genotype calls are returned as missing in the following situations:

  1. All mother genotype calls on the non-PAR region of Y.
  2. Diploid father genotype calls on the non-PAR region of X for a male proband (proband and mother are still phased, as the father doesn’t participate in allele transmission).

Parameters:
  • tm (MatrixTable) – Trio MatrixTable (entries have to be a Struct with proband_entry, mother_entry and father_entry present)
  • call_field (str) – genotype field name in the matrix entries to use for phasing
  • phased_call_field (str) – name for the phased genotype field in the matrix entries
Returns:

MatrixTable – Trio MatrixTable entry with additional phased genotype field for each individual

hail.experimental.explode_trio_matrix(tm: hail.matrixtable.MatrixTable, col_keys: List[str] = ['s'], keep_trio_cols: bool = True, keep_trio_entries: bool = False) → hail.matrixtable.MatrixTable[source]

Splits a trio MatrixTable back into a sample MatrixTable.

Example

>>> # Create a trio matrix from a sample matrix
>>> pedigree = hl.Pedigree.read('data/case_control_study.fam')
>>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)
>>> # Explode trio matrix back into a sample matrix
>>> exploded_trio_dataset = hl.experimental.explode_trio_matrix(trio_dataset)

Notes

The resulting MatrixTable column schema is the same as the proband/father/mother schema, and the resulting entry schema is the same as the proband_entry/father_entry/mother_entry schema. If the keep_trio_cols option is set, then an additional source_trio column is added with the trio column data. If the keep_trio_entries option is set, then an additional source_trio_entry column is added with the trio entry data.

Note

This assumes that the input MatrixTable is a trio MatrixTable (similar to the result of methods.trio_matrix()). Its entry schema has to contain proband_entry, father_entry and mother_entry, all with the same type. Its column schema has to contain proband, father and mother, all with the same type.

Parameters:
  • tm (MatrixTable) – Trio MatrixTable (entries have to be a Struct with proband_entry, mother_entry and father_entry present)
  • col_keys (list of str) – Column key(s) for the resulting sample MatrixTable
  • keep_trio_cols (bool) – Whether to add a source_trio column with the trio column data (default True)
  • keep_trio_entries (bool) – Whether to add a source_trio_entries column with the trio entry data (default False)
Returns:

MatrixTable – Sample MatrixTable

hail.experimental.load_dataset(name, version, reference_genome, config_file='gs://hail-datasets/datasets.json')[source]

Load a genetic dataset from Hail’s repository.

Example

>>> # Load 1000 Genomes MatrixTable with GRCh38 coordinates
>>> mt_1kg = hl.experimental.load_dataset(name='1000_genomes',   # doctest: +SKIP
...                                       version='phase3',
...                                       reference_genome='GRCh38')
Parameters:
  • name (str) – Name of the dataset to load.
  • version (str) – Version of the named dataset to load (see available versions in documentation).
  • reference_genome (GRCh37 or GRCh38) – Reference genome build.
Returns:

Table or MatrixTable

hail.experimental.import_gtf(path, reference_genome=None, skip_invalid_contigs=False, min_partitions=None) → hail.table.Table[source]

Import a GTF file.

The GTF file format is identical to the GFF version 2 file format, and so this function can be used to import GFF version 2 files as well.

See https://www.ensembl.org/info/website/upload/gff.html for more details on the GTF/GFF2 file format.

The Table returned by this function will be keyed by the interval row field and will include the following row fields:

'source': str
'feature': str
'score': float64
'strand': str
'frame': int32
'interval': interval<>

There will also be corresponding fields for every tag found in the attribute field of the GTF file.

Note

This function will return an interval field of type tinterval constructed from the seqname, start, and end fields in the GTF file. This interval is inclusive of both the start and end positions in the GTF file.

If the reference_genome parameter is specified, the start and end points of the interval field will be of type tlocus. Otherwise, the start and end points of the interval field will be of type tstruct with fields seqname (type str) and position (type tint32).

Furthermore, if the reference_genome parameter is specified and skip_invalid_contigs is True, this import function will skip lines in the GTF where seqname is not consistent with the reference genome specified.
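
To show how the nine GTF columns map onto the fields described above, here is a minimal pure-Python sketch that parses a single line. The helper parse_gtf_line and the sample record (gene ID, gene name, coordinates) are made up for illustration; the sketch does not handle attribute values that contain semicolons and is not the parser import_gtf() uses.

```python
def parse_gtf_line(line):
    """Split one GTF line into its fixed columns and attribute tags."""
    (seqname, source, feature, start, end,
     score, strand, frame, attribute) = line.rstrip("\n").split("\t")

    # The attribute column is a list of `key "value";` pairs.
    tags = {}
    for item in attribute.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        tags[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        # "." marks an absent value in GTF.
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": None if frame == "." else int(frame),
        "tags": tags,
    }


# Hypothetical record; identifiers and coordinates are invented.
line = "\t".join([
    "20", "HAVANA", "exon", "100", "250", ".", "+", ".",
    'gene_id "GENE0001"; gene_name "EXAMPLE";',
])
record = parse_gtf_line(line)

# Both endpoints are included in the interval.
n_bases = record["end"] - record["start"] + 1
```

Because the interval includes both start and end, a feature spanning 100 to 250 covers 151 bases, not 150.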

Example

>>> ht = hl.experimental.import_gtf('data/test.gtf',
...                                 reference_genome='GRCh37',
...                                 skip_invalid_contigs=True)
>>> ht.describe()  # doctest: +NOTEST
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
    'source': str
    'feature': str
    'score': float64
    'strand': str
    'frame': int32
    'gene_type': str
    'exon_id': str
    'havana_transcript': str
    'level': str
    'transcript_name': str
    'gene_status': str
    'gene_id': str
    'transcript_type': str
    'tag': str
    'transcript_status': str
    'gene_name': str
    'transcript_id': str
    'exon_number': str
    'havana_gene': str
    'interval': interval<locus<GRCh37>>
----------------------------------------
Key: ['interval']
----------------------------------------
Parameters:
  • path (str) – File to import.
  • reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
  • skip_invalid_contigs (bool) – If True and reference_genome is not None, skip lines where seqname is not consistent with the reference genome.
  • min_partitions (int or None) – Minimum number of partitions (passed to import_table).
Returns:

Table

hail.experimental.get_gene_intervals(gene_symbols=None, gene_ids=None, transcript_ids=None, verbose=True, reference_genome=None, gtf_file=None)[source]

Get intervals of genes or transcripts.

Get the boundaries of genes or transcripts from a GTF file, for quick filtering of a Table or MatrixTable.

On Google Cloud Platform:

  • Gencode v19 (GRCh37) GTF available at: gs://hail-common/references/gencode/gencode.v19.annotation.gtf.bgz
  • Gencode v29 (GRCh38) GTF available at: gs://hail-common/references/gencode/gencode.v29.annotation.gtf.bgz

Example

>>> hl.filter_intervals(ht, hl.experimental.get_gene_intervals(gene_symbols=['PCSK9'], reference_genome='GRCh37'))  # doctest: +SKIP
Parameters:
  • gene_symbols (list of str, optional) – Gene symbols (e.g. PCSK9).
  • gene_ids (list of str, optional) – Gene IDs (e.g. ENSG00000223972).
  • transcript_ids (list of str, optional) – Transcript IDs (e.g. ENST00000456328).
  • verbose (bool) – If True, print which genes and transcripts were matched in the GTF file.
  • reference_genome (str or ReferenceGenome, optional) – Reference genome to use (passed along to import_gtf).
  • gtf_file (str) – GTF file to load. If none is provided, but reference_genome is one of GRCh37 or GRCh38, a default will be used (on Google Cloud Platform).
Returns:

list of Interval

hail.experimental.export_entries_by_col(mt: hail.matrixtable.MatrixTable, path: str, batch_size: int = 256, bgzip: bool = True)[source]

Export entries of the mt by column as separate text files.

Examples

>>> range_mt = hl.utils.range_matrix_table(10, 10)
>>> range_mt = range_mt.annotate_entries(x = hl.rand_unif(0, 1))
>>> hl.experimental.export_entries_by_col(range_mt, 'output/cols_files')

Notes

This function writes a directory with one file per column in mt. The files contain one tab-separated field (with header) for each row field and entry field in mt. The column fields of mt are written as JSON in the first line of each file, prefixed with a #.

The above will produce a directory at output/cols_files with the following files:

$ ls -l output/cols_files
total 80
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 index.tsv
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-00.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-01.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-02.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-03.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-04.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-05.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-06.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-07.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-08.tsv.bgz
-rw-r--r--  1 hail-dev  wheel  712 Jan 25 17:19 part-09.tsv.bgz

$ zcat output/cols_files/part-00.tsv.bgz
#{"col_idx":0}
row_idx  x
0        6.2501e-02
1        7.0083e-01
2        3.6452e-01
3        4.4170e-01
4        7.9177e-02
5        6.2392e-01
6        5.9920e-01
7        9.7540e-01
8        8.4848e-01
9        3.7423e-01

Due to overhead and file system limits related to having large numbers of open files, this function will iteratively export groups of columns. The batch_size parameter can control the size of these groups.
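
Given the file layout described in the notes above (a first line holding the column fields as JSON after a #, then a tab-separated header and rows), a decompressed per-column file can be read back with a few lines of plain Python. The helper parse_column_file is hypothetical and works on text that has already been decompressed; all values are kept as strings.

```python
import json


def parse_column_file(text):
    """Parse the decompressed contents of one exported column file.

    Returns (column fields as a dict, list of row dicts).
    """
    lines = text.splitlines()
    col_fields = json.loads(lines[0][1:])  # drop the leading '#'
    header = lines[1].split("\t")
    rows = [dict(zip(header, line.split("\t")))
            for line in lines[2:] if line]
    return col_fields, rows


# First rows of the example output shown above, with tab separators.
sample = (
    '#{"col_idx":0}\n'
    "row_idx\tx\n"
    "0\t6.2501e-02\n"
    "1\t7.0083e-01\n"
)
col_fields, rows = parse_column_file(sample)
first_x = float(rows[0]["x"])
```

Converting fields such as x to numbers is left to the caller, since the types depend on the row and entry schema of the exported MatrixTable.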

Parameters:
  • mt (MatrixTable)
  • path (str) – Path (directory) to write to.
  • batch_size (int) – Number of columns to write per iteration.
  • bgzip (bool) – BGZip output files.
hail.experimental.sparse_split_multi(sparse_mt)[source]

Splits multiallelic variants on a sparse MatrixTable.

Takes a dataset formatted like the output of vcf_combiner(). The splitting will add was_split and a_index fields, as split_multi() does. This function drops the LA (local alleles) field, as it re-computes entry fields based on the new, split global alleles.

Variants are split thus:

  • A row with only one (reference) or two (reference and alternate) alleles will not be split.
  • A row with multiple alternate alleles will be split, with one row for each alternate allele, and each row will contain two alleles: ref and alt. The reference and alternate allele will be minrepped using min_rep().

The split multi logic handles the following entry fields:

struct {
  LGT: call
  LAD: array<int32>
  DP: int32
  GQ: int32
  LPL: array<int32>
  RGQ: int32
  LPGT: call
  LA: array<int32>
  END: int32
}

All fields except for LA are optional, and only handled if they exist.

  • LA is used to find the corresponding local allele index for the desired global a_index, and then dropped from the resulting dataset. If LA does not contain the global a_index, the index for the <NON_REF> allele is used to process the entry fields.
  • LGT and LPGT are downcoded using the corresponding local a_index. They are renamed to GT and PGT respectively, as the resulting call is no longer local.
  • LAD is used to create an AD field consisting of the allele depths corresponding to the reference, global a_index allele, and <NON_REF> allele.
  • DP is preserved unchanged.
  • GQ is recalculated from the updated PL, if it exists, but otherwise preserved unchanged.
  • PL array elements are calculated from the minimum LPL value for all allele pairs that downcode to the desired one. (This logic is identical to the PL logic in split_multi_hts().) If a row has an alternate allele but it is not present in LA, the PL field is set to missing. The PL for ref/<NON_REF> in that case can be drawn from RGQ.
  • RGQ (the ref genotype quality) is preserved unchanged.
  • END is untouched.
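
The LA lookup and the downcoding of LGT can be illustrated with a small pure-Python sketch. Calls are tuples of local allele indices and la lists the global allele index for each local allele. The function downcode_local_call is hypothetical, assumes unphased diploid calls, and returns None when the global a_index is absent from LA; the <NON_REF> fallback described in the first bullet is not modeled.

```python
def downcode_local_call(lgt, la, a_index):
    """Downcode a local call to a biallelic call for one global allele.

    lgt: tuple of local allele indices, e.g. (1, 2).
    la: global allele index for each local allele, e.g. [0, 2, 5].
    a_index: global index of the alternate allele being split out.
    """
    if a_index not in la:
        return None  # fallback handling is not modeled in this sketch
    local_index = la.index(a_index)
    # Alleles equal to the target become 1; all others become 0 (ref).
    downcoded = [1 if allele == local_index else 0 for allele in lgt]
    return tuple(sorted(downcoded))  # unphased, so order is normalized


la = [0, 2, 5]                      # local allele 1 is global 2, local 2 is global 5
het_for_5 = downcode_local_call((1, 2), la, a_index=5)
het_for_2 = downcode_local_call((1, 2), la, a_index=2)
hom_for_5 = downcode_local_call((2, 2), la, a_index=5)
absent = downcode_local_call((1, 2), la, a_index=3)
```

A sample carrying global alleles 2 and 5 appears heterozygous in both split rows, because in each row the other alternate allele is folded into the reference.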

Notes

This version of split-multi doesn’t deal with duplicate loci, in which case the explode could possibly result in out-of-order rows (although the actual split_multi function also doesn’t handle that case).

It also checks that min-repping will not change the locus and will error if it does. (I believe the VCF combiner checks that this holds true, currently.)

Parameters: sparse_mt (MatrixTable) – Sparse MatrixTable to split.
Returns: MatrixTable – The split MatrixTable in sparse format.