Experimental
This module serves two functions: as a staging area for extensions of Hail not ready for inclusion in the main package, and as a library of lightly reviewed community submissions.
At present, the experimental module is organized into a few freestanding modules, linked immediately below, and many freestanding functions, documented on this page.
Warning
The functionality in this module may change or disappear entirely between different versions of Hail. If you critically depend on functionality in this module, please create an issue to request promotion of that functionality to non-experimental. Otherwise, that functionality may disappear!
Contribution Guidelines
Submissions from the community are welcome! The criteria for inclusion in the experimental module are loose and subject to change:
Function docstrings are required. Hail uses NumPy style docstrings.
Tests are not required, but are encouraged. If you do include tests, they must run in no more than a few seconds. Place tests as a class method on
Tests
inpython/tests/experimental/test_experimental.py
Code style is not strictly enforced, aside from egregious violations. We do recommend using autopep8 though!
Annotation Database
Classes
An annotation database instance. |
Genetics Methods
|
Load a genetic dataset from Hail's repository. |
|
Calculate LD scores. |
|
Estimate SNP-heritability and level of confounding biases from genome-wide association study (GWAS) summary statistics. |
|
Write an Expression. |
|
Read an |
|
Computes a filtering allele frequency (described below) for ac and an with confidence ci. |
|
Create a metadata plot for a Hail Table or MatrixTable. |
|
Create ROC curve from Hail Table. |
|
Phases genotype calls in a trio based allele transmission. |
|
Adds a phased genoype entry to a trio MatrixTable based allele transmission in the trio. |
|
Splits a trio MatrixTable back into a sample MatrixTable. |
|
Import a GTF file. |
|
Get intervals of genes or transcripts. |
|
Export entries of the mt by column as separate text files. |
|
Projects genotypes onto pre-computed PCs. |
dplyr-inspired Methods
|
Collapse fields into key-value pairs. |
|
Separate a field into multiple fields by splitting on a delimiter character or position. |
|
Spread a key-value pair of fields across multiple fields. |
Functions
- hail.experimental.load_dataset(name, version, reference_genome, region='us-central1', cloud='gcp')[source]
Load a genetic dataset from Hail’s repository.
Example
>>> # Load the gnomAD "HGDP + 1000 Genomes" dense MatrixTable with GRCh38 coordinates. >>> mt = hl.experimental.load_dataset(name='gnomad_hgdp_1kg_subset_dense', ... version='3.1.2', ... reference_genome='GRCh38', ... region='us-central1', ... cloud='gcp')
- Parameters:
name (
str
) – Name of the dataset to load.version (
str
, optional) – Version of the named dataset to load (see available versions in documentation). PossiblyNone
for some datasets.reference_genome (
str
, optional) – Reference genome build,'GRCh37'
or'GRCh38'
. PossiblyNone
for some datasets.region (
str
) – Specify region for bucket,'us'
,'us-central1'
, or'europe-west1'
, (default is'us-central1'
).cloud (
str
) – Specify if using Google Cloud Platform or Amazon Web Services,'gcp'
or'aws'
(default is'gcp'
).
Note
The
'aws'
cloud platform is currently only available for the'us'
region.- Returns:
Table
,MatrixTable
, orBlockMatrix
- hail.experimental.ld_score(entry_expr, locus_expr, radius, coord_expr=None, annotation_exprs=None, block_size=None)[source]
Calculate LD scores.
Example
>>> # Load genetic data into MatrixTable >>> mt = hl.import_plink(bed='data/ldsc.bed', ... bim='data/ldsc.bim', ... fam='data/ldsc.fam')
>>> # Create locus-keyed Table with numeric variant annotations >>> ht = hl.import_table('data/ldsc.annot', ... types={'BP': hl.tint, ... 'binary': hl.tfloat, ... 'continuous': hl.tfloat}) >>> ht = ht.annotate(locus=hl.locus(ht.CHR, ht.BP)) >>> ht = ht.key_by('locus')
>>> # Annotate MatrixTable with external annotations >>> mt = mt.annotate_rows(binary_annotation=ht[mt.locus].binary, ... continuous_annotation=ht[mt.locus].continuous)
>>> # Calculate LD scores using centimorgan coordinates >>> ht_scores = hl.experimental.ld_score(entry_expr=mt.GT.n_alt_alleles(), ... locus_expr=mt.locus, ... radius=1.0, ... coord_expr=mt.cm_position, ... annotation_exprs=[mt.binary_annotation, ... mt.continuous_annotation])
>>> # Show results >>> ht_scores.show(3)
+---------------+-------------------+-----------------------+-------------+ | locus | binary_annotation | continuous_annotation | univariate | +---------------+-------------------+-----------------------+-------------+ | locus<GRCh37> | float64 | float64 | float64 | +---------------+-------------------+-----------------------+-------------+ | 20:82079 | 1.15183e+00 | 7.30145e+01 | 1.60117e+00 | | 20:103517 | 2.04604e+00 | 2.75392e+02 | 4.69239e+00 | | 20:108286 | 2.06585e+00 | 2.86453e+02 | 5.00124e+00 | +---------------+-------------------+-----------------------+-------------+
Warning
ld_score()
will fail ifentry_expr
results in any missing values. The special float valuenan
is not considered a missing value.Further reading
For more in-depth discussion of LD scores, see:
Notes
entry_expr, locus_expr, coord_expr (if specified), and annotation_exprs (if specified) must come from the same MatrixTable.
- Parameters:
entry_expr (
NumericExpression
) – Expression for entries of genotype matrix (e.g.mt.GT.n_alt_alleles()
).locus_expr (
LocusExpression
) – Row-indexed locus expression.radius (
int
orfloat
) – Radius of window for row values (in units of coord_expr if set, otherwise in units of basepairs).coord_expr (
Float64Expression
, optional) – Row-indexed numeric expression for the row value used to window variants. By default, the row value is given by the locus position.annotation_exprs (
NumericExpression
or) –list
ofNumericExpression
, optional Annotation expression(s) to partition LD scores. Univariate annotation will always be included and does not need to be specified.block_size (
int
, optional) – Block size. Default given byBlockMatrix.default_block_size()
.
- Returns:
Table
– Table keyed by locus_expr with LD scores for each variant and annotation_expr. The function will always return LD scores for the univariate (all SNPs) annotation.
- hail.experimental.ld_score_regression(weight_expr, ld_score_expr, chi_sq_exprs, n_samples_exprs, n_blocks=200, two_step_threshold=30, n_reference_panel_variants=None)[source]
Estimate SNP-heritability and level of confounding biases from genome-wide association study (GWAS) summary statistics.
Given a set or multiple sets of GWAS summary statistics,
ld_score_regression()
estimates the heritability of a trait or set of traits and the level of confounding biases present in the underlying studies by regressing chi-squared statistics on LD scores, leveraging the model:\[\mathrm{E}[\chi_j^2] = 1 + Na + \frac{Nh_g^2}{M}l_j\]\(\mathrm{E}[\chi_j^2]\) is the expected chi-squared statistic for variant \(j\) resulting from a test of association between variant \(j\) and a trait.
\(l_j = \sum_{k} r_{jk}^2\) is the LD score of variant \(j\), calculated as the sum of squared correlation coefficients between variant \(j\) and nearby variants. See
ld_score()
for further details.\(a\) captures the contribution of confounding biases, such as cryptic relatedness and uncontrolled population structure, to the association test statistic.
\(h_g^2\) is the SNP-heritability, or the proportion of variation in the trait explained by the effects of variants included in the regression model above.
\(M\) is the number of variants used to estimate \(h_g^2\).
\(N\) is the number of samples in the underlying association study.
For more details on the method implemented in this function, see:
Examples
Run the method on a matrix table of summary statistics, where the rows are variants and the columns are different phenotypes:
>>> mt_gwas = ld_score_all_phenos_sumstats >>> ht_results = hl.experimental.ld_score_regression( ... weight_expr=mt_gwas['ld_score'], ... ld_score_expr=mt_gwas['ld_score'], ... chi_sq_exprs=mt_gwas['chi_squared'], ... n_samples_exprs=mt_gwas['n'])
Run the method on a table with summary statistics for a single phenotype:
>>> ht_gwas = ld_score_one_pheno_sumstats >>> ht_results = hl.experimental.ld_score_regression( ... weight_expr=ht_gwas['ld_score'], ... ld_score_expr=ht_gwas['ld_score'], ... chi_sq_exprs=ht_gwas['chi_squared_50_irnt'], ... n_samples_exprs=ht_gwas['n_50_irnt'])
Run the method on a table with summary statistics for multiple phenotypes:
>>> ht_gwas = ld_score_one_pheno_sumstats >>> ht_results = hl.experimental.ld_score_regression( ... weight_expr=ht_gwas['ld_score'], ... ld_score_expr=ht_gwas['ld_score'], ... chi_sq_exprs=[ht_gwas['chi_squared_50_irnt'], ... ht_gwas['chi_squared_20160']], ... n_samples_exprs=[ht_gwas['n_50_irnt'], ... ht_gwas['n_20160']])
Notes
The
exprs
provided as arguments told_score_regression()
must all be from the same object, either aTable
or aMatrixTable
.If the arguments originate from a table:
The table must be keyed by fields
locus
of typetlocus
andalleles
, atarray
oftstr
elements.weight_expr
,ld_score_expr
,chi_sq_exprs
, andn_samples_exprs
are must be row-indexed fields.The number of expressions passed to
n_samples_exprs
must be equal to one or the number of expressions passed tochi_sq_exprs
. If just one expression is passed ton_samples_exprs
, that sample size expression is assumed to apply to all sets of statistics passed tochi_sq_exprs
. Otherwise, the expressions passed tochi_sq_exprs
andn_samples_exprs
are matched by index.The
phenotype
field that keys the table returned byld_score_regression()
will have genericint
values0
,1
, etc. corresponding to the0th
,1st
, etc. expressions passed to thechi_sq_exprs
argument.
If the arguments originate from a matrix table:
The dimensions of the matrix table must be variants (rows) by phenotypes (columns).
The rows of the matrix table must be keyed by fields
locus
of typetlocus
andalleles
, atarray
oftstr
elements.The columns of the matrix table must be keyed by a field of type
tstr
that uniquely identifies phenotypes represented in the matrix table. The column key must be a single expression; compound keys are not accepted.weight_expr
andld_score_expr
must be row-indexed fields.chi_sq_exprs
must be a single entry-indexed field (not a list of fields).n_samples_exprs
must be a single entry-indexed field (not a list of fields).The
phenotype
field that keys the table returned byld_score_regression()
will have values corresponding to the column keys of the input matrix table.
This function returns a
Table
with one row per set of summary statistics passed to thechi_sq_exprs
argument. The following row-indexed fields are included in the table:phenotype (
tstr
) – The name of the phenotype. The returned table is keyed by this field. See the notes below for details on the possible values of this field.mean_chi_sq (
tfloat64
) – The mean chi-squared test statistic for the given phenotype.intercept (Struct) – Contains fields:
snp_heritability (Struct) – Contains fields:
Warning
ld_score_regression()
considers only the rows for which both row fieldsweight_expr
andld_score_expr
are defined. Rows with missing values in either field are removed prior to fitting the LD score regression model.- Parameters:
weight_expr (
Float64Expression
) – Row-indexed expression for the LD scores used to derive variant weights in the model.ld_score_expr (
Float64Expression
) – Row-indexed expression for the LD scores used as covariates in the model.chi_sq_exprs (
Float64Expression
orlist
of) –Float64Expression
One or more row-indexed (if table) or entry-indexed (if matrix table) expressions for chi-squared statistics resulting from genome-wide association studies (GWAS).n_samples_exprs (
NumericExpression
orlist
of) –NumericExpression
One or more row-indexed (if table) or entry-indexed (if matrix table) expressions indicating the number of samples used in the studies that generated the test statistics supplied tochi_sq_exprs
.n_blocks (
int
) – The number of blocks used in the jackknife approach to estimating standard errors.two_step_threshold (
int
) – Variants with chi-squared statistics greater than this value are excluded in the first step of the two-step procedure used to fit the model.n_reference_panel_variants (
int
, optional) – Number of variants used to estimate the SNP-heritability \(h_g^2\).
- Returns:
Table
– Table keyed byphenotype
with intercept and heritability estimates for each phenotype passed to the function.
- hail.experimental.write_expression(expr, path, overwrite=False)[source]
Write an Expression.
In the same vein as Python’s pickle, write out an expression that does not have a source (such as one that comes from Table.aggregate with _localize=False).
Example
>>> ht = hl.utils.range_table(100).annotate(x=hl.rand_norm()) >>> mean_norm = ht.aggregate(hl.agg.mean(ht.x), _localize=False) >>> mean_norm >>> hl.eval(mean_norm) >>> hl.experimental.write_expression(mean_norm, 'output/expression.he')
- Parameters:
expr (
Expression
) – Expression to write.path (
str
) – Path to which to write expression. Suggested extension: .he (hail expression).overwrite (
bool
) – IfTrue
, overwrite an existing file at the destination.
- Returns:
None
- hail.experimental.read_expression(path, _assert_type=None)[source]
Read an
Expression
written withexperimental.write_expression()
.Example
>>> hl.experimental.write_expression(hl.array([1, 2]), 'output/test_expression.he') >>> expression = hl.experimental.read_expression('output/test_expression.he') >>> hl.eval(expression)
- Parameters:
path (
str
) – File to read.- Returns:
- hail.experimental.hail_metadata(t_path)[source]
Create a metadata plot for a Hail Table or MatrixTable.
- Parameters:
t_path (str) – Path to the Hail Table or MatrixTable files.
- Returns:
- hail.experimental.plot_roc_curve(ht, scores, tp_label='tp', fp_label='fp', colors=None, title='ROC Curve', hover_mode='mouse')[source]
Create ROC curve from Hail Table.
One or more score fields must be provided, which are assessed against tp_label and fp_label as truth data.
High scores should correspond to true positives.
- Parameters:
ht (
Table
) – Table with required datascores (
str
orlist
ofstr
) – Top-level location of scores in ht against which to generate PR curves.tp_label (
str
) – Top-level location of true positives in ht.fp_label (
str
) – Top-level location of false positives in ht.colors (
dict
ofstr
) – Optional colors to use (score -> desired color).title (
str
) – Title of plot.hover_mode (
str
) – Hover mode; one of ‘mouse’ (default), ‘vline’ or ‘hline’
- Returns:
tuple
ofbokeh.plotting.figure
andlist
ofstr
– Figure, and list of AUCs corresponding to scores.
- hail.experimental.filtering_allele_frequency(ac, an, ci)[source]
Computes a filtering allele frequency (described below) for ac and an with confidence ci.
The filtering allele frequency is the highest true population allele frequency for which the upper bound of the ci (confidence interval) of allele count under a Poisson distribution is still less than the variant’s observed ac (allele count) in the reference sample, given an an (allele number).
This function defines a “filtering AF” that represents the threshold disease-specific “maximum credible AF” at or below which the disease could not plausibly be caused by that variant. A variant with a filtering AF >= the maximum credible AF for the disease under consideration should be filtered, while a variant with a filtering AF below the maximum credible remains a candidate. This filtering AF is not disease-specific: it can be applied to any disease of interest by comparing with a user-defined disease-specific maximum credible AF.
For more details, see: Whiffin et al., 2017
- Parameters:
ac (int or
Expression
of typetint32
)an (int or
Expression
of typetint32
)ci (float or
Expression
of typetfloat64
)
- Returns:
Expression
of typetfloat64
- hail.experimental.phase_by_transmission(locus, alleles, proband_call, father_call, mother_call)[source]
Phases genotype calls in a trio based allele transmission.
Notes
In the phased calls returned, the order is as follows: - Proband: father_allele | mother_allele - Parents: transmitted_allele | untransmitted_allele
Phasing of sex chromosomes: - Sex chromosomes of male individuals should be haploid to be phased correctly. - If proband_call is diploid on non-par regions of the sex chromosomes, it is assumed to be female.
Returns NA when genotype calls cannot be phased. The following genotype calls combinations cannot be phased by transmission: 1. One of the calls in the trio is missing 2. The proband genotype cannot be obtained from the parents alleles (Mendelian violation) 3. All individuals of the trio are heterozygous for the same two alleles 4. Father is diploid on non-PAR region of X or Y 5. Proband is diploid on non-PAR region of Y
In addition, individual phased genotype calls are returned as missing in the following situations: 1. All mother genotype calls non-PAR region of Y 2. Diploid father genotype calls on non-PAR region of X for a male proband (proband and mother are still phased as father doesn’t participate in allele transmission)
Note
phase_trio_matrix_by_transmission()
provides a convenience wrapper for phasing a trio matrix.- Parameters:
locus (
LocusExpression
) – Expression for the locus in the trio matrixalleles (
ArrayExpression
) – Expression for the alleles in the trio matrixproband_call (
CallExpression
) – Expression for the proband call in the trio matrixfather_call (
CallExpression
) – Expression for the father call in the trio matrixmother_call (
CallExpression
) – Expression for the mother call in the trio matrix
- Returns:
ArrayExpression
– Array containing: [phased proband call, phased father call, phased mother call]
- hail.experimental.phase_trio_matrix_by_transmission(tm, call_field='GT', phased_call_field='PBT_GT')[source]
Adds a phased genoype entry to a trio MatrixTable based allele transmission in the trio.
Example
>>> # Create a trio matrix >>> pedigree = hl.Pedigree.read('data/case_control_study.fam') >>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)
>>> # Phase trios by transmission >>> phased_trio_dataset = phase_trio_matrix_by_transmission(trio_dataset)
Notes
Uses only a Call field to phase and only phases when all 3 members of the trio are present and have a call.
In the phased genotypes, the order is as follows: - Proband: father_allele | mother_allele - Parents: transmitted_allele | untransmitted_allele
Phasing of sex chromosomes: - Sex chromosomes of male individuals should be haploid to be phased correctly. - If a proband is diploid on non-par regions of the sex chromosomes, it is assumed to be female.
Genotypes that cannot be phased are set to NA. The following genotype calls combinations cannot be phased by transmission (all trio members phased calls set to missing): 1. One of the calls in the trio is missing 2. The proband genotype cannot be obtained from the parents alleles (Mendelian violation) 3. All individuals of the trio are heterozygous for the same two alleles 4. Father is diploid on non-PAR region of X or Y 5. Proband is diploid on non-PAR region of Y
In addition, individual phased genotype calls are returned as missing in the following situations: 1. All mother genotype calls non-PAR region of Y 2. Diploid father genotype calls on non-PAR region of X for a male proband (proband and mother are still phased as father doesn’t participate in allele transmission)
- Parameters:
tm (
MatrixTable
) – Trio MatrixTable (entries have to be a Struct with proband_entry, mother_entry and father_entry present)call_field (str) – genotype field name in the matrix entries to use for phasing
phased_call_field (str) – name for the phased genotype field in the matrix entries
- Returns:
MatrixTable
– Trio MatrixTable entry with additional phased genotype field for each individual
- hail.experimental.explode_trio_matrix(tm, col_keys=['s'], keep_trio_cols=True, keep_trio_entries=False)[source]
Splits a trio MatrixTable back into a sample MatrixTable.
Example
>>> # Create a trio matrix from a sample matrix >>> pedigree = hl.Pedigree.read('data/case_control_study.fam') >>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)
>>> # Explode trio matrix back into a sample matrix >>> exploded_trio_dataset = explode_trio_matrix(trio_dataset)
Notes
The resulting MatrixTable column schema is the same as the proband/father/mother schema, and the resulting entry schema is the same as the proband_entry/father_entry/mother_entry schema. If the keep_trio_cols option is set, then an additional source_trio column is added with the trio column data. If the keep_trio_entries option is set, then an additional source_trio_entry column is added with the trio entry data.
Note
This assumes that the input MatrixTable is a trio MatrixTable (similar to the result of
trio_matrix()
) Its entry schema has to contain ‘proband_entry`, father_entry and mother_entry all with the same type. Its column schema has to contain ‘proband`, father and mother all with the same type.- Parameters:
tm (
MatrixTable
) – Trio MatrixTable (entries have to be a Struct with proband_entry, mother_entry and father_entry present)col_keys (
list
of str) – Column key(s) for the resulting sample MatrixTablekeep_trio_cols (bool) – Whether to add a source_trio column with the trio column data (default True)
keep_trio_entries (bool) – Whether to add a source_trio_entries column with the trio entry data (default False)
- Returns:
MatrixTable
– Sample MatrixTable
- hail.experimental.import_gtf(path, reference_genome=None, skip_invalid_contigs=False, min_partitions=None, force_bgz=False, force=False)[source]
Import a GTF file.
The GTF file format is identical to the GFF version 2 file format, and so this function can be used to import GFF version 2 files as well.
See https://www.ensembl.org/info/website/upload/gff.html for more details on the GTF/GFF2 file format.
The
Table
returned by this function will be keyed by theinterval
row field and will include the following row fields:'source': str 'feature': str 'score': float64 'strand': str 'frame': int32 'interval': interval<>
There will also be corresponding fields for every tag found in the attribute field of the GTF file.
Note
This function will return an
interval
field of typetinterval
constructed from theseqname
,start
, andend
fields in the GTF file. This interval is inclusive of both the start and end positions in the GTF file.If the
reference_genome
parameter is specified, the start and end points of theinterval
field will be of typetlocus
. Otherwise, the start and end points of theinterval
field will be of typetstruct
with fieldsseqname
(typestr
) andposition
(typetint32
).Furthermore, if the
reference_genome
parameter is specified andskip_invalid_contigs
isTrue
, this import function will skip lines in the GTF whereseqname
is not consistent with the reference genome specified.Example
>>> ht = hl.experimental.import_gtf('data/test.gtf', ... reference_genome='GRCh37', ... skip_invalid_contigs=True)
>>> ht.describe() ---------------------------------------- Global fields: None ---------------------------------------- Row fields: 'source': str 'feature': str 'score': float64 'strand': str 'frame': int32 'gene_type': str 'exon_id': str 'havana_transcript': str 'level': str 'transcript_name': str 'gene_status': str 'gene_id': str 'transcript_type': str 'tag': str 'transcript_status': str 'gene_name': str 'transcript_id': str 'exon_number': str 'havana_gene': str 'interval': interval<locus<GRCh37>> ---------------------------------------- Key: ['interval'] ----------------------------------------
- Parameters:
path (
str
) – File to import.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.skip_invalid_contigs (
bool
) – IfTrue
and reference_genome is notNone
, skip lines whereseqname
is not consistent with the reference genome.min_partitions (
int
orNone
) – Minimum number of partitions (passed to import_table).force_bgz (
bool
) – IfTrue
, load files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not'.bgz'
, but the file is blocked gzip, so that the file can be read in parallel and not on a single node.force (
bool
) – IfTrue
, load gzipped files serially on one core. This should be used only when absolutely necessary, as processing time will be increased due to lack of parallelism.
- Returns:
- hail.experimental.get_gene_intervals(gene_symbols=None, gene_ids=None, transcript_ids=None, verbose=True, reference_genome=None, gtf_file=None)[source]
Get intervals of genes or transcripts.
Get the boundaries of genes or transcripts from a GTF file, for quick filtering of a Table or MatrixTable.
On Google Cloud platform: Gencode v19 (GRCh37) GTF available at: gs://hail-common/references/gencode/gencode.v19.annotation.gtf.bgz Gencode v29 (GRCh38) GTF available at: gs://hail-common/references/gencode/gencode.v29.annotation.gtf.bgz
Example
>>> hl.filter_intervals(ht, get_gene_intervals(gene_symbols=['PCSK9'], reference_genome='GRCh37'))
- Parameters:
gene_symbols (
list
ofstr
, optional) – Gene symbols (e.g. PCSK9).gene_ids (
list
ofstr
, optional) – Gene IDs (e.g. ENSG00000223972).transcript_ids (
list
ofstr
, optional) – Transcript IDs (e.g. ENSG00000223972).verbose (
bool
) – IfTrue
, print which genes and transcripts were matched in the GTF file.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use (passed along to import_gtf).gtf_file (
str
) – GTF file to load. If none is provided, but reference_genome is one of GRCh37 or GRCh38, a default will be used (on Google Cloud Platform).
- Returns:
- hail.experimental.export_entries_by_col(mt, path, batch_size=256, bgzip=True, header_json_in_file=True, use_string_key_as_file_name=False)[source]
Export entries of the mt by column as separate text files.
Examples
>>> range_mt = hl.utils.range_matrix_table(10, 10) >>> range_mt = range_mt.annotate_entries(x = hl.rand_unif(0, 1)) >>> hl.experimental.export_entries_by_col(range_mt, 'output/cols_files')
Notes
This function writes a directory with one file per column in mt. The files contain one tab-separated field (with header) for each row field and entry field in mt. The column fields of mt are written as JSON in the first line of each file, prefixed with a
#
.The above will produce a directory at
output/cols_files
with the following files:$ ls -l output/cols_files total 80 -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 index.tsv -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-00.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-01.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-02.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-03.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-04.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-05.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-06.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-07.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-08.tsv.bgz -rw-r--r-- 1 hail-dev wheel 712 Jan 25 17:19 part-09.tsv.bgz $ zcat output/cols_files/part-00.tsv.bgz #{"col_idx":0} row_idx x 0 6.2501e-02 1 7.0083e-01 2 3.6452e-01 3 4.4170e-01 4 7.9177e-02 5 6.2392e-01 6 5.9920e-01 7 9.7540e-01 8 8.4848e-01 9 3.7423e-01
Due to overhead and file system limits related to having large numbers of open files, this function will iteratively export groups of columns. The batch_size parameter can control the size of these groups.
- Parameters:
mt (
MatrixTable
)path (
int
) – Path (directory to write to.batch_size (
int
) – Number of columns to write per iteration.bgzip (
bool
) – BGZip output files.header_json_in_file (
bool
) – Include JSON header in each component file (if False, only written to index.tsv)
- hail.experimental.gather(ht, key, value, *fields)[source]
Collapse fields into key-value pairs.
gather()
mimics the functionality of the gather() function found in R’stidyr
package. This is a way to turn “wide” format data into “long” format data.- Parameters:
- Returns:
Table
– Table with originalfields
gathered intokey
andvalue
fields.
- hail.experimental.separate(ht, field, into, delim)[source]
Separate a field into multiple fields by splitting on a delimiter character or position.
separate()
mimics the functionality of the separate() function in R’stidyr
package.This function will create a new table where
field
has been split into multiple new fields, whose names are given byinto
.If
delim
is astr
(including regular expression strings),field
will be separated into columns by that string. In this case, the length ofinto
must match the number of resulting fields.If
delim
is anint
,field
will be separated into two row fields, where the first field contains the firstdelim
characters offield
and the second field contains the remaining characters.- Parameters:
- Returns:
Table
– Table with originalfield
split into fields whose names are defined by into.
- hail.experimental.spread(ht, field, value, key=None)[source]
Spread a key-value pair of fields across multiple fields.
spread()
mimics the functionality of the spread() function in R’s tidyr package. This is a way to turn “long” format data into “wide” format data.Given a
field
,spread()
will create a new table by groupinght
by its row key and, optionally, any additional fields passed to thekey
argument.After collapsing
ht
by these keys,spread()
creates a new row field for each unique value offield
, where the row field values are given by the correspondingvalue
in the originalht
.- Parameters:
- Returns:
Table
– Table with originalkey
andvalue
fields spread across multiple columns.
- hail.experimental.full_outer_join_mt(left, right)[source]
Performs a full outer join on left and right.
Replaces row, column, and entry fields with the following:
left_row / right_row: structs of row fields from left and right.
left_col / right_col: structs of column fields from left and right.
left_entry / right_entry: structs of entry fields from left and right.
Examples
The following creates and joins two random datasets with disjoint sample ids but non-disjoint variant sets. We use
or_else()
to attempt to find a non-missing genotype. If neither genotype is non-missing, then the genotype is set to missing. In particular, note that Samples 2 and 3 have missing genotypes for loci 1:1 and 1:2 because those loci are not present in mt2 and these samples are not present in mt1>>> hl.reset_global_randomness() >>> mt1 = hl.balding_nichols_model(1, 2, 3) >>> mt2 = hl.balding_nichols_model(1, 2, 3) >>> mt2 = mt2.key_rows_by(locus=hl.locus(mt2.locus.contig, ... mt2.locus.position+2), ... alleles=mt2.alleles) >>> mt2 = mt2.key_cols_by(sample_idx=mt2.sample_idx+2) >>> mt1.show() +---------------+------------+------+------+ | locus | alleles | 0.GT | 1.GT | +---------------+------------+------+------+ | locus<GRCh37> | array<str> | call | call | +---------------+------------+------+------+ | 1:1 | ["A","C"] | 0/0 | 0/0 | | 1:2 | ["A","C"] | 0/1 | 0/1 | | 1:3 | ["A","C"] | 0/0 | 0/1 | +---------------+------------+------+------+ >>> mt2.show() +---------------+------------+------+------+ | locus | alleles | 2.GT | 3.GT | +---------------+------------+------+------+ | locus<GRCh37> | array<str> | call | call | +---------------+------------+------+------+ | 1:3 | ["A","C"] | 0/1 | 1/1 | | 1:4 | ["A","C"] | 1/1 | 0/1 | | 1:5 | ["A","C"] | 0/0 | 0/0 | +---------------+------------+------+------+ >>> mt3 = hl.experimental.full_outer_join_mt(mt1, mt2) >>> mt3 = mt3.select_entries(GT=hl.or_else(mt3.left_entry.GT, mt3.right_entry.GT)) >>> mt3.show() +---------------+------------+------+------+------+------+ | locus | alleles | 0.GT | 1.GT | 2.GT | 3.GT | +---------------+------------+------+------+------+------+ | locus<GRCh37> | array<str> | call | call | call | call | +---------------+------------+------+------+------+------+ | 1:1 | ["A","C"] | 0/0 | 0/0 | NA | NA | | 1:2 | ["A","C"] | 0/1 | 0/1 | NA | NA | | 1:3 | ["A","C"] | 0/0 | 0/1 | 0/1 | 1/1 | | 1:4 | ["A","C"] | NA | NA | 1/1 | 0/1 | | 1:5 | ["A","C"] | NA | NA | 0/0 | 0/0 | +---------------+------------+------+------+------+------+
- Parameters:
left (
MatrixTable
)right (
MatrixTable
)
- Returns:
- hail.experimental.strftime(format, time, zone_id)[source]
Convert Unix timestamp to a formatted datetime string.
Examples
>>> hl.eval(hl.experimental.strftime("%Y.%m.%d %H:%M:%S %z", 1562569201, "America/New_York")) '2019.07.08 03:00:01 -04:00'
>>> hl.eval(hl.experimental.strftime("%A, %B %e, %Y. %r", 876541523, "GMT+2")) 'Saturday, October 11, 1997. 05:45:23 AM'
>>> hl.eval(hl.experimental.strftime("%A, %B %e, %Y. %r", 876541523, "+08:00")) 'Saturday, October 11, 1997. 11:45:23 AM'
Notes
The following formatting characters are supported in format strings: A a B b D d e F H I j k l M m n p R r S s T t U u V v W Y y z See documentation here: https://linux.die.net/man/3/strftime
A zone id can take one of three forms. It can be an explicit offset, like “+01:00”, a relative offset, like “GMT+2”, or a IANA timezone database (TZDB) identifier, like “America/New_York”. Wikipedia maintains a list of TZDB identifiers here: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
Currently, the formatter implicitly uses the “en_US” locale.
- Parameters:
format (str or
Expression
of typetstr
) – The format string describing how to render the time.time (int of
Expression
of typetint64
) – A long representing the time as a Unix timestamp.zone_id (str or
Expression
of typetstr
) – An id representing the timezone. See notes above.
- Returns:
StringExpression
– A string of the specified format based on the requested time.
- hail.experimental.strptime(time, format, zone_id)[source]
Interpret a formatted datetime string as a Unix timestamp (number of seconds since 1970-01-01T00:00Z (ISO)).
Examples
>>> hl.eval(hl.experimental.strptime("07/08/19 3:00:01 AM", "%D %l:%M:%S %p", "America/New_York")) 1562569201
>>> hl.eval(hl.experimental.strptime("Saturday, October 11, 1997. 05:45:23 AM", "%A, %B %e, %Y. %r", "GMT+2")) 876541523
Notes
The following formatting characters are supported in format strings: A a B b D d e F H I j k l M m n p R r S s T t U u V v W Y y z See documentation here: https://linux.die.net/man/3/strftime
A zone id can take one of three forms. It can be an explicit offset, like “+01:00”, a relative offset, like “GMT+2”, or a IANA timezone database (TZDB) identifier, like “America/New_York”. Wikipedia maintains a list of TZDB identifiers here: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
Currently, the parser implicitly uses the “en_US” locale.
This function will fail if there is not enough information in the string to determine a particular timestamp. For example, if you have the string “07/08/09” and the format string “%Y.%m.%d”, this method will fail, since that’s not specific enough to determine seconds from. You can fix this by adding “00:00:00” to your date string and “%H:%M:%S” to your format string.
- Parameters:
time (str or
Expression
of typetstr
) – The string from which to parse the time.format (str or
Expression
of typetstr
) – The format string describing how to parse the time.zone_id (str or
Expression
of typetstr
) – An id representing the timezone. See notes above.
- Returns:
Int64Expression
– The Unix timestamp associated with the given time string.
- hail.experimental.pc_project(call_expr, loadings_expr, af_expr)[source]
Projects genotypes onto pre-computed PCs. Requires loadings and allele-frequency from a reference dataset (see example). Note that loadings_expr must have no missing data and reflect the rows from the original PCA run for this method to be accurate.
Example
>>> # Compute loadings and allele frequency for reference dataset >>> _, _, loadings_ht = hl.hwe_normalized_pca(mt.GT, k=10, compute_loadings=True) >>> mt = mt.annotate_rows(af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2) >>> loadings_ht = loadings_ht.annotate(af=mt.rows()[loadings_ht.key].af) >>> # Project new genotypes onto loadings >>> ht = pc_project(mt_to_project.GT, loadings_ht.loadings, loadings_ht.af)
- Parameters:
call_expr (
CallExpression
) – Entry-indexed call expression for genotypes to project onto loadings.loadings_expr (
ArrayNumericExpression
) – Location of expression for loadingsaf_expr (
Float64Expression
) – Location of expression for allele frequency
- Returns:
Table
– Table with scores calculated from loadings in column scores
- hail.experimental.loop(f, typ, *args)[source]
Define and call a tail-recursive function with given arguments.
Notes
The argument f must be a function where the first argument defines the recursive call, and the remaining arguments are the arguments to the recursive function, e.g. to define the recursive function
\[f(x, y) = \begin{cases} y & \textrm{if } x \equiv 0 \\ f(x - 1, y + x) & \textrm{otherwise} \end{cases}\]we would write: >>> f = lambda recur, x, y: hl.if_else(x == 0, y, recur(x - 1, y + x))
Full recursion is not supported, and any non-tail-recursive methods will throw an error when called.
This means that the result of any recursive call within the function must also be the result of the entire function, without modification. Let’s consider two different recursive definitions for the triangle function \(f(x) = 0 + 1 + \dots + x\):
>>> def triangle1(x): ... if x == 1: ... return x ... return x + triangle1(x - 1)
>>> def triangle2(x, total): ... if x == 0: ... return total ... return triangle2(x - 1, total + x)
The first function definition, triangle1, will call itself and then add x. This is an example of a non-tail recursive function, since triangle1(9) needs to modify the result of the inner recursive call to triangle1(8) by adding 9 to the result.
The second function is tail recursive: the result of triangle2(9, 0) is the same as the result of the inner recursive call, triangle2(8, 9).
Example
To find the sum of all the numbers from n=1…10: >>> triangle_f = lambda f, x, total: hl.if_else(x == 0, total, f(x - 1, total + x)) >>> x = hl.experimental.loop(triangle_f, hl.tint32, 10, 0) >>> hl.eval(x) 55
Let’s say we want to find the root of a polynomial equation: >>> def polynomial(x): … return 5 * x**3 - 2 * x - 1
We’ll use Newton’s method<https://en.wikipedia.org/wiki/Newton%27s_method> to find it, so we’ll also define the derivative:
>>> def derivative(x): ... return 15 * x**2 - 2
and starting at \(x_0 = 0\), we’ll compute the next step \(x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}\) until the difference between \(x_{i}\) and \(x_{i+1}\) falls below our convergence threshold:
>>> threshold = 0.005 >>> def find_root(f, guess, error): ... converged = hl.is_defined(error) & (error < threshold) ... new_guess = guess - (polynomial(guess) / derivative(guess)) ... new_error = hl.abs(new_guess - guess) ... return hl.if_else(converged, guess, f(new_guess, new_error)) >>> x = hl.experimental.loop(find_root, hl.tfloat, 0.0, hl.missing(hl.tfloat)) >>> hl.eval(x) 0.8052291984599675
Warning
Using arguments of a type other than numeric types and booleans can cause memory issues if if you expect the recursive call to happen many times.
- Parameters:
f (function ( (marker, *args) ->
Expression
) – Function of one callable marker, denoting where the recursive call (or calls) is located, and many args, the loop variables.args (variable-length args of
Expression
) – Expressions to initialize the loop values.
- Returns:
Expression
– Result of the loop with args as initial loop values.