# Genetics¶

 balding_nichols_model(n_populations, …[, …]) Generate a matrix table of variants, samples, and genotypes using the Balding-Nichols model. concordance(left, right) Calculate call concordance with another dataset. filter_intervals(ds, intervals[, keep]) Filter rows with a list of intervals. filter_alleles(mt, f) Filter alternate alleles. filter_alleles_hts(mt, f, subset) Filter alternate alleles and update standard GATK entry fields. hwe_normalized_pca(call_expr[, k, …]) Run principal component analysis (PCA) on the Hardy-Weinberg-normalized genotype call matrix. identity_by_descent(dataset[, maf, bounded, …]) Compute matrix of identity-by-descent estimates. genetic_relatedness_matrix(call_expr) Compute the genetic relatedness matrix (GRM). realized_relationship_matrix(call_expr) Computes the realized relationship matrix (RRM). impute_sex(call[, aaf_threshold, …]) Impute sex of samples by calculating inbreeding coefficient on the X chromosome. ld_matrix(entry_expr, locus_expr, radius[, …]) Computes the windowed correlation (linkage disequilibrium) matrix between variants. ld_prune(call_expr[, r2, bp_window_size, …]) Returns a maximal subset of variants that are nearly uncorrelated within each window. mendel_errors(call, pedigree) Find Mendel errors; count per variant, individual and nuclear family. de_novo(mt, pedigree, pop_frequency_prior, …) Call putative de novo events from trio data. nirvana(dataset, hail.table.Table], config) Annotate variants using Nirvana. pc_relate(call_expr, min_individual_maf, *) Compute relatedness estimates between individuals using a variant of the PC-Relate method. sample_qc(mt[, name]) Compute per-sample metrics useful for quality control. skat(key_expr, weight_expr, y, x, covariates) Test each keyed group of rows for association by linear or logistic SKAT test. split_multi(ds[, keep_star, left_aligned]) Split multiallelic variants. split_multi_hts(ds[, keep_star, …]) Split multiallelic variants for datasets that contain one or more fields from a standard high-throughput sequencing entry schema. summarize_variants(mt[, show]) Summarize the variants present in a dataset and print the results. transmission_disequilibrium_test(dataset, …) Performs the transmission disequilibrium test on trios. trio_matrix(dataset, pedigree[, complete_trios]) Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio. variant_qc(mt[, name]) Compute common variant statistics (quality control metrics). vep(dataset, hail.matrixtable.MatrixTable], …) Annotate variants with VEP. window_by_locus(mt, bp_window_size) Collect arrays of row and entry values from preceding loci.
hail.methods.balding_nichols_model(n_populations, n_samples, n_variants, n_partitions=None, pop_dist=None, fst=None, af_dist=<Float64Expression of type float64>, reference_genome='default', mixture=False) → hail.matrixtable.MatrixTable[source]

Generate a matrix table of variants, samples, and genotypes using the Balding-Nichols model.

Examples

Generate a matrix table of genotypes with 1000 variants and 100 samples across 3 populations:

>>> bn_ds = hl.balding_nichols_model(3, 100, 1000)

Generate a matrix table using 4 populations, 40 samples, 150 variants, 3 partitions, population distribution [0.1, 0.2, 0.3, 0.4], $$F_{ST}$$ values [.02, .06, .04, .12], ancestral allele frequencies drawn from a truncated beta distribution with a = 0.01 and b = 0.05 over the interval [0.05, 1], and random seed 1:

>>> hl.set_global_seed(1)
>>> bn_ds = hl.balding_nichols_model(4, 40, 150, 3,
...          pop_dist=[0.1, 0.2, 0.3, 0.4],
...          fst=[.02, .06, .04, .12],
...          af_dist=hl.rand_beta(a=0.01, b=2.0, lower=0.05, upper=1.0))

Note that in order to guarantee reproducibility, the hail global seed is set with set_global_seed() immediately prior to generating the dataset.

Notes

This method simulates a matrix table of variants, samples, and genotypes using the Balding-Nichols model, which we now define.

• $$K$$ populations are labeled by integers 0, 1, …, K - 1.
• $$N$$ samples are labeled by strings 0, 1, …, N - 1.
• $$M$$ variants are defined as 1:1:A:C, 1:2:A:C, …, 1:M:A:C.
• The default distribution for population assignment $$\pi$$ is uniform.
• The default ancestral frequency distribution $$P_0$$ is uniform on [0.1, 0.9]. All three classes are located in hail.stats.
• The default $$F_{ST}$$ values are all 0.1.

The Balding-Nichols model models genotypes of individuals from a structured population comprising $$K$$ homogeneous modern populations that have each diverged from a single ancestral population (a star phylogeny). Each sample is assigned a population by sampling from the categorical distribution $$\pi$$. Note that the actual size of each population is random.

Variants are modeled as biallelic and unlinked. Ancestral allele frequencies are drawn independently for each variant from a frequency spectrum $$P_0$$. The extent of genetic drift of each modern population from the ancestral population is defined by the corresponding $$F_{ST}$$ parameter $$F_k$$ (here and below, lowercase indices run over a range bounded by the corresponding uppercase parameter, e.g. $$k = 1, \ldots, K$$). For each variant and population, allele frequencies are drawn from a beta distribution whose parameters are determined by the ancestral allele frequency and $$F_{ST}$$ parameter. The beta distribution gives a continuous approximation of the effect of genetic drift. We denote sample population assignments by $$k_n$$, ancestral allele frequencies by $$p_m$$, population allele frequencies by $$p_{k, m}$$, and diploid, unphased genotype calls by $$g_{n, m}$$ (0, 1, and 2 correspond to homozygous reference, heterozygous, and homozygous variant, respectively).

The generative model is then given by:

\begin{align}\begin{aligned}k_n \,&\sim\, \pi\\p_m \,&\sim\, P_0\\p_{k,m} \mid p_m\,&\sim\, \mathrm{Beta}(\mu = p_m,\, \sigma^2 = F_k p_m (1 - p_m))\\g_{n,m} \mid k_n, p_{k, m} \,&\sim\, \mathrm{Binomial}(2, p_{k_n, m})\end{aligned}\end{align}

The beta distribution by its mean and variance above; the usual parameters are $$a = (1 - p) \frac{1 - F}{F}$$ and $$b = p \frac{1 - F}{F}$$ with $$F = F_k$$ and $$p = p_m$$.

The resulting dataset has the following fields.

Global fields:

• bn.n_populations (tint32) – Number of populations.
• bn.n_samples (tint32) – Number of samples.
• bn.n_variants (tint32) – Number of variants.
• bn.n_partitions (tint32) – Number of partitions.
• bn.pop_dist (tarray of tfloat64) – Population distribution indexed by population.
• bn.fst (tarray of tfloat64) – $$F_{ST}$$ values indexed by population.
• bn.seed (tint32) – Random seed.
• bn.mixture (tbool) – Value of mixture parameter.

Row fields:

• locus (tlocus) – Variant locus (key field).
• alleles (tarray of tstr) – Variant alleles (key field).
• ancestral_af (tfloat64) – Ancestral allele frequency.
• af (tarray of tfloat64) – Modern allele frequencies indexed by population.

Column fields:

• sample_idx (tint32) - Sample index (key field).
• pop (tint32) – Population of sample.

Entry fields:

• GT (tcall) – Genotype call (diploid, unphased).
Parameters: n_populations (int) – Number of modern populations. n_samples (int) – Total number of samples. n_variants (int) – Number of variants. n_partitions (int, optional) – Number of partitions. Default is 1 partition per million entries or 8, whichever is larger. pop_dist (list of float, optional) – Unnormalized population distribution, a list of length n_populations with non-negative values. Default is [1, ..., 1]. fst (list of float, optional) – $$F_{ST}$$ values, a list of length n_populations with values in (0, 1). Default is [0.1, ..., 0.1]. af_dist (Float64Expression representing a random function.) – Ancestral allele frequency distribution. Default is rand_unif() over the range [0.1, 0.9] with seed 0. reference_genome (str or ReferenceGenome) – Reference genome to use. mixture (bool) – Treat pop_dist as the parameters of a Dirichlet distribution, as in the Prichard-Stevens-Donnelly model. This feature is EXPERIMENTAL and currently undocumented and untested. If True, the type of pop is tarray of tfloat64 and the value is the mixture proportions. MatrixTable – Simulated matrix table of variants, samples, and genotypes.
hail.methods.concordance(left, right) → Tuple[[List[List[int]], hail.table.Table], hail.table.Table][source]

Calculate call concordance with another dataset.

Note

Requires the dataset to have a compound row key:

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note

Requires the dataset to contain only diploid and unphased genotype calls. Use call() to recode genotype calls or null() to set genotype calls to missing.

Examples

Compute concordance between two datasets and output the global concordance statistics and two tables with concordance computed per column key and per row key:

>>> global_conc, cols_conc, rows_conc = hl.concordance(dataset, dataset2)

Notes

This method computes the genotype call concordance (from the entry field GT) between two biallelic variant datasets. It requires unique sample IDs and performs an inner join on samples (only samples in both datasets will be considered). In addition, all genotype calls must be diploid and unphased.

It performs an ordered zip join of the variants. That means the variants of each dataset are sorted, with duplicate variants appearing in some random relative order, and then zipped together. When a variant appears a different number of times between the two datasets, the dataset with the fewer number of instances is padded with “no data”. For example, if a variant is only in one dataset, then each genotype is treated as “no data” in the other.

This method returns a tuple of three objects: a nested list of list of int with global concordance summary statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.

Using the global summary result

The global summary is a list of list of int (conceptually a 5 by 5 matrix), where the indices have special meaning:

1. No Data (missing variant)
2. No Call (missing genotype call)
3. Hom Ref
4. Heterozygous
5. Hom Var

The first index is the state in the left dataset and the second index is the state in the right dataset. Typical uses of the summary list are shown below.

>>> summary, samples, variants = hl.concordance(dataset, dataset2)
>>> left_homref_right_homvar = summary[2][4]
>>> left_het_right_missing = summary[3][1]
>>> left_het_right_something_else = sum(summary[3][:]) - summary[3][3]
>>> total_concordant = summary[2][2] + summary[3][3] + summary[4][4]
>>> total_discordant = sum([sum(s[2:]) for s in summary[2:]]) - total_concordant

Using the table results

Table 1: Concordance statistics by column

This table contains the column key field of left, and the following fields:

• n_discordant (tint64) – Count of discordant calls (see below for full definition).
• concordance (tarray of tarray of tint64) – Array of concordance per state on left and right, matching the structure of the global summary defined above.

Table 2: Concordance statistics by row

This table contains the row key fields of left, and the following fields:

• n_discordant (tfloat64) – Count of discordant calls (see below for full definition).
• concordance (tarray of tarray of tint64) – Array of concordance per state on left and right, matching the structure of the global summary defined above.

In these tables, the column n_discordant is provided as a convenience, because this is often one of the most useful concordance statistics. This value is the number of genotypes which were called (homozygous reference, heterozygous, or homozygous variant) in both datasets, but where the call did not match between the two.

The column concordance matches the structure of the global summmary, which is detailed above. Once again, the first index into this array is the state on the left, and the second index is the state on the right. For example, concordance[1][4] is the number of “no call” genotypes on the left that were called homozygous variant on the right.

Parameters: left (MatrixTable) – First dataset to compare. right (MatrixTable) – Second dataset to compare. (list of list of int, Table, Table) – The global concordance statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.
hail.methods.filter_intervals(ds, intervals, keep=True) → Union[hail.table.Table, hail.matrixtable.MatrixTable][source]

Filter rows with a list of intervals.

Examples

Filter to loci falling within one interval:

>>> ds_result = hl.filter_intervals(dataset, [hl.parse_locus_interval('17:38449840-38530994')])

Remove all loci within list of intervals:

>>> intervals = [hl.parse_locus_interval(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
>>> ds_result = hl.filter_intervals(dataset, intervals, keep=False)

Notes

Based on the keep argument, this method will either restrict to points in the supplied interval ranges, or remove all rows in those ranges.

When keep=True, partitions that don’t overlap any supplied interval will not be loaded at all. This enables filter_intervals() to be used for reasonably low-latency queries of small ranges of the dataset, even on large datasets.

Parameters: ds (MatrixTable or Table) – Dataset to filter. intervals (ArrayExpression of type tinterval) – Intervals to filter on. The point type of the interval must be a prefix of the key or equal to the first field of the key. keep (bool) – If True, keep only rows that fall within any interval in intervals. If False, keep only rows that fall outside all intervals in intervals.
hail.methods.filter_alleles(mt: hail.matrixtable.MatrixTable, f: Callable) → hail.matrixtable.MatrixTable[source]

Filter alternate alleles.

Note

Requires the dataset to have a compound row key:

Examples

Keep SNPs:

>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))

Keep alleles with AC > 0:

>>> ds_result = hl.filter_alleles(ds, lambda a, allele_index: ds.info.AC[allele_index - 1] > 0)

Update the AC field of the resulting dataset:

>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i-1]))
>>> ds_result = ds_result.annotate_rows(info = updated_info)

Notes

The following new fields are generated:

• old_locus (locus) – The old locus, before filtering and computing the minimal representation.
• old_alleles (array<str>) – The old alleles, before filtering and computing the minimal representation.
• old_to_new (array<int32>) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.
• new_to_old (array<int32>) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.

If all alternate alleles of a variant are filtered out, the variant itself is filtered out.

Using _f_

The f argument is a function or lambda evaluated per alternate allele to determine whether that allele is kept. If f evaluates to True, the allele is kept. If f evaluates to False or missing, the allele is removed.

f is a function that takes two arguments: the allele string (of type StringExpression) and the allele index (of type Int32Expression), and returns a boolean expression. This can be either a defined function or a lambda. For example, these two usages are equivalent:

(with a lambda)

>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))

(with a defined function)

>>> def filter_f(allele, allele_index):
...     return hl.is_snp(ds.alleles[0], allele)
>>> ds_result = hl.filter_alleles(ds, filter_f)

Warning

filter_alleles() does not update any fields other than locus and alleles. This means that row fields like allele count (AC) and entry fields like allele depth (AD) can become meaningless unless they are also updated. You can update them with annotate_rows() and annotate_entries().

Parameters: mt (MatrixTable) – Dataset. f (callable) – Function from (allele: StringExpression, allele_index: Int32Expression) to BooleanExpression MatrixTable
hail.methods.filter_alleles_hts(mt: hail.matrixtable.MatrixTable, f: Callable, subset: bool = False) → hail.matrixtable.MatrixTable[source]

Filter alternate alleles and update standard GATK entry fields.

Examples

Filter to SNP alleles using the subset strategy:

>>> ds_result = hl.filter_alleles_hts(
...     ds,
...     lambda allele, _: hl.is_snp(ds.alleles[0], allele),
...     subset=True)

Update the AC field of the resulting dataset:

>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i-1]))
>>> ds_result = ds_result.annotate_rows(info = updated_info)

Notes

For usage of the _f_ argument, see the filter_alleles() documentation.

filter_alleles_hts() requires the dataset have the GATK VCF schema, namely the following entry fields in this order:

GT: call
DP: int32
GQ: int32
PL: array<int32>

Use MatrixTable.select_entries() to rearrange these fields if necessary.

The following new fields are generated:

• old_locus (locus) – The old locus, before filtering and computing the minimal representation.
• old_alleles (array<str>) – The old alleles, before filtering and computing the minimal representation.
• old_to_new (array<int32>) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.
• new_to_old (array<int32>) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.

Downcode algorithm

We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.

GT: 1/2
GQ: 10

0 | 1000
1 | 1000   10
2 | 1000   0     20
+-----------------
0     1     2

The downcode algorithm recodes occurances of filtered alleles to occurances of the reference allele (e.g. 1 -> 0 in our example). So the depths of filtered alleles in the AD field are added to the depth of the reference allele. Where downcoding filtered alleles merges distinct genotypes, the minimum PL is used (since PL is on a log scale, this roughly corresponds to adding probabilities). The PLs are then re-normalized (shifted) so that the most likely genotype has a PL of 0, and GT is set to this genotype. If an allele is filtered, this algorithm acts similarly to split_multi_hts().

The downcode algorithm would produce the following:

GT: 0/1
GQ: 10

0 | 20
1 | 0    10
+-----------
0    1

In summary:

• GT: Downcode filtered alleles to reference.
• AD: Columns of filtered alleles are eliminated and their values are added to the reference column, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 40,20.
• DP: No change.
• PL: Downcode filtered alleles to reference, combine PLs using minimum for each overloaded genotype, and shift so the overall minimum PL is 0.
• GQ: The second-lowest PL (after shifting).

Subset algorithm

We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.

GT: 1/2
GQ: 10

0 | 1000
1 | 1000   10
2 | 1000   0     20
+-----------------
0     1     2

The subset algorithm subsets the AD and PL arrays (i.e. removes entries corresponding to filtered alleles) and then sets GT to the genotype with the minimum PL. Note that if the genotype changes (as in the example), the PLs are re-normalized (shifted) so that the most likely genotype has a PL of 0. Qualitatively, subsetting corresponds to the belief that the filtered alleles are not real so we should discard any probability mass associated with them.

The subset algorithm would produce the following:

GT: 1/1
GQ: 980

0 | 980
1 | 980    0
+-----------
0      1

In summary:

• GT: Set to most likely genotype based on the PLs ignoring the filtered allele(s).
• AD: The filtered alleles’ columns are eliminated, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 25,20.
• DP: Unchanged.
• PL: Columns involving filtered alleles are eliminated and the remaining columns’ values are shifted so the minimum value is 0.
• GQ: The second-lowest PL (after shifting).

Warning

filter_alleles_hts() does not update any row fields other than locus and alleles. This means that row fields like allele count (AC) can become meaningless unless they are also updated. You can update them with annotate_rows().

filter_alleles()

Parameters: mt (MatrixTable) f (callable) – Function from (allele: StringExpression, allele_index: Int32Expression) to BooleanExpression subset (bool) – Subset PL field if True, otherwise downcode PL field. The calculation of GT and GQ also depend on whether one subsets or downcodes the PL. MatrixTable

Run principal component analysis (PCA) on the Hardy-Weinberg-normalized genotype call matrix.

Examples

Notes

This method specializes pca() for the common use case of PCA in statistical genetics, that of projecting samples to a small number of ancestry coordinates. Variants that are all homozygous reference or all homozygous alternate are unnormalizable and removed before evaluation. See pca() for more details.

Users of PLINK/GCTA should be aware that Hail computes the GRM slightly differently with regard to missing data. In Hail, the $$ij$$ entry of the GRM $$MM^T$$ is simply the dot product of rows $$i$$ and $$j$$ of $$M$$; in terms of $$C$$ it is

$\frac{1}{m}\sum_{l\in\mathcal{C}_i\cap\mathcal{C}_j}\frac{(C_{il}-2p_l)(C_{jl} - 2p_l)}{2p_l(1-p_l)}$

where $$\mathcal{C}_i = \{l \mid C_{il} \text{ is non-missing}\}$$. In PLINK/GCTA the denominator $$m$$ is replaced with the number of terms in the sum $$\lvert\mathcal{C}_i\cap\mathcal{C}_j\rvert$$, i.e. the number of variants where both samples have non-missing genotypes. While this is arguably a better estimator of the true GRM (trading shrinkage for noise), it has the drawback that one loses the clean interpretation of the loadings and scores as features and projections

Separately, for the PCs PLINK/GCTA output the eigenvectors of the GRM, i.e. the left singular vectors $$U_k$$ instead of the component scores $$U_k S_k$$. The scores have the advantage of representing true projections of the data onto features with the variance of a score reflecting the variance explained by the corresponding feature. In PC bi-plots this amounts to a change in aspect ratio; for use of PCs as covariates in regression it is immaterial.

Parameters: call_expr (CallExpression) – Entry-indexed call expression. k (int) – Number of principal components. compute_loadings (bool) – If True, compute row loadings. (list of float, Table, Table) – List of eigenvalues, table with column scores, table with row loadings.
hail.methods.identity_by_descent(dataset, maf=None, bounded=True, min=None, max=None) → hail.table.Table[source]

Compute matrix of identity-by-descent estimates.

Note

Requires the dataset to have a compound row key:

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

To calculate a full IBD matrix, using minor allele frequencies computed from the dataset itself:

>>> hl.identity_by_descent(dataset)

To calculate an IBD matrix containing only pairs of samples with PI_HAT in $$[0.2, 0.9]$$, using minor allele frequencies stored in the row field panel_maf:

>>> hl.identity_by_descent(dataset, maf=dataset['panel_maf'], min=0.2, max=0.9)

Notes

The implementation is based on the IBD algorithm described in the PLINK paper.

identity_by_descent() requires the dataset to be biallelic and does not perform LD pruning. Linkage disequilibrium may bias the result so consider filtering variants first.

The resulting Table entries have the type: { i: String, j: String, ibd: { Z0: Double, Z1: Double, Z2: Double, PI_HAT: Double }, ibs0: Long, ibs1: Long, ibs2: Long }. The key list is: *i: String, j: String*.

Conceptually, the output is a symmetric, sample-by-sample matrix. The output table has the following form

i               j       ibd.Z0  ibd.Z1  ibd.Z2  ibd.PI_HAT ibs0 ibs1    ibs2
sample1 sample2 1.0000  0.0000  0.0000  0.0000 ...
sample1 sample3 1.0000  0.0000  0.0000  0.0000 ...
sample1 sample4 0.6807  0.0000  0.3193  0.3193 ...
sample1 sample5 0.1966  0.0000  0.8034  0.8034 ...
Parameters: dataset (MatrixTable) – Variant-keyed MatrixTable containing genotype information. maf (Float64Expression, optional) – Row-indexed expression for the minor allele frequency. bounded (bool) – Forces the estimations for Z0, Z1, Z2, and PI_HAT to take on biologically meaningful values (in the range [0,1]). min (float or None) – Sample pairs with a PI_HAT below this value will not be included in the output. Must be in $$[0,1]$$. max (float or None) – Sample pairs with a PI_HAT above this value will not be included in the output. Must be in $$[0,1]$$. Table
hail.methods.genetic_relatedness_matrix(call_expr) → hail.linalg.blockmatrix.BlockMatrix[source]

Compute the genetic relatedness matrix (GRM).

Examples

>>> grm = hl.genetic_relatedness_matrix(dataset.GT)

Notes

The genetic relationship matrix (GRM) $$G$$ encodes genetic correlation between each pair of samples. It is defined by $$G = MM^T$$ where $$M$$ is a standardized version of the genotype matrix, computed as follows. Let $$C$$ be the $$n \times m$$ matrix of raw genotypes in the variant dataset, with rows indexed by $$n$$ samples and columns indexed by $$m$$ bialellic autosomal variants; $$C_{ij}$$ is the number of alternate alleles of variant $$j$$ carried by sample $$i$$, which can be 0, 1, 2, or missing. For each variant $$j$$, the sample alternate allele frequency $$p_j$$ is computed as half the mean of the non-missing entries of column $$j$$. Entries of $$M$$ are then mean-centered and variance-normalized as

$M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{2p_j(1-p_j)m}},$

with $$M_{ij} = 0$$ for $$C_{ij}$$ missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value $$1/m$$ for variants in Hardy-Weinberg equilibrium and is further motivated in the paper Patterson, Price and Reich, 2006. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor $$1/m$$ gives each sample row approximately unit total variance (assuming linkage equilibrium) so that the diagonal entries of the GRM are approximately 1. Equivalently,

$G_{ik} = \frac{1}{m} \sum_{j=1}^m \frac{(C_{ij}-2p_j)(C_{kj}-2p_j)}{2 p_j (1-p_j)}$

This method drops variants with $$p_j = 0$$ or math:p_j = 1 before computing kinship.

Parameters: call_expr (CallExpression) – Entry-indexed call expression with columns corresponding to samples. BlockMatrix – Genetic relatedness matrix for all samples. Row and column indices correspond to matrix table column index.
hail.methods.realized_relationship_matrix(call_expr) → hail.linalg.blockmatrix.BlockMatrix[source]

Computes the realized relationship matrix (RRM).

Examples

>>> rrm = hl.realized_relationship_matrix(dataset.GT)

Notes

The realized relationship matrix (RRM) is defined as follows. Consider the $$n \times m$$ matrix $$C$$ of raw genotypes, with rows indexed by $$n$$ samples and columns indexed by the $$m$$ bialellic autosomal variants; $$C_{ij}$$ is the number of alternate alleles of variant $$j$$ carried by sample $$i$$, which can be 0, 1, 2, or missing. For each variant $$j$$, the sample alternate allele frequency $$p_j$$ is computed as half the mean of the non-missing entries of column $$j$$. Entries of $$M$$ are then mean-centered and variance-normalized as

$M_{ij} = \frac{C_{ij}-2p_j} {\sqrt{\frac{m}{n} \sum_{k=1}^n (C_{ij}-2p_j)^2}},$

with $$M_{ij} = 0$$ for $$C_{ij}$$ missing (i.e. mean genotype imputation). This scaling normalizes each variant column to have empirical variance $$1/m$$, which gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the $$n \times n$$ sample correlation or realized relationship matrix (RRM) $$K$$ as simply

$K = MM^T$

Note that the only difference between the realized relationship matrix and the genetic relatedness matrix (GRM) used in realized_relationship_matrix() is the variant (column) normalization: where RRM uses empirical variance, GRM uses expected variance under Hardy-Weinberg Equilibrium.

This method drops variants with zero variance before computing kinship.

Parameters: call_expr (CallExpression) – Entry-indexed call expression on matrix table with columns corresponding to samples. BlockMatrix – Realized relationship matrix for all samples. Row and column indices correspond to matrix table column index.
hail.methods.impute_sex(call, aaf_threshold=0.0, include_par=False, female_threshold=0.2, male_threshold=0.8, aaf=None) → hail.table.Table[source]

Impute sex of samples by calculating inbreeding coefficient on the X chromosome.

Note

Requires the dataset to have a compound row key:

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Remove samples where imputed sex does not equal reported sex:

>>> imputed_sex = hl.impute_sex(dataset.GT)
>>> dataset_result = dataset.filter_cols(imputed_sex[dataset.s].is_female != dataset.pheno.is_female)

Notes

We have used the same implementation as PLINK v1.7.

Let gr be the the reference genome of the type of the locus key (as given by TLocus.reference_genome())

1. Filter the dataset to loci on the X contig defined by gr.
2. Calculate alternate allele frequency (AAF) for each row from the dataset.
3. Filter to variants with AAF above aaf_threshold.
4. Remove loci in the pseudoautosomal region, as defined by gr, if and only if include_par is True (it defaults to False)
5. For each row and column with a non-missing genotype call, $$E$$, the expected number of homozygotes (from population AAF), is computed as $$1.0 - (2.0*maf*(1.0-maf))$$.
6. For each row and column with a non-missing genotype call, $$O$$, the observed number of homozygotes, is computed interpreting 0 as heterozygote and 1 as homozygote
7. For each row and column with a non-missing genotype call, $$N$$ is incremented by 1
8. For each column, $$E$$, $$O$$, and $$N$$ are combined across variants
9. For each column, $$F$$ is calculated by $$(O - E) / (N - E)$$
10. A sex is assigned to each sample with the following criteria: - Female when F < 0.2 - Male when F > 0.8 Use female_threshold and male_threshold to change this behavior.

Annotations

The returned column-key indexed Table has the following fields in addition to the matrix table’s column keys:

• is_female (tbool) – True if the imputed sex is female, false if male, missing if undetermined.
• f_stat (tfloat64) – Inbreeding coefficient.
• n_called (tint64) – Number of variants with a genotype call.
• expected_homs (tfloat64) – Expected number of homozygotes.
• observed_homs (tint64) – Observed number of homozygotes.
call : CallExpression
A genotype call for each row and column. The source dataset’s row keys must be [[locus], alleles] with types tlocus and ArrayStringExpression. Moreover, the alleles array must have exactly two elements (i.e. the variant must be biallelic).
aaf_threshold : float
Minimum alternate allele frequency threshold.
include_par : bool
Include pseudoautosomal regions.
female_threshold : float
Samples are called females if F < female_threshold.
male_threshold : float
Samples are called males if F > male_threshold.
aaf : str or None
A field defining the alternate allele frequency for each row. If None, AAF will be computed from call.
Returns: Table – Sex imputation statistics per sample.
hail.methods.ld_matrix(entry_expr, locus_expr, radius, coord_expr=None, block_size=None) → hail.linalg.blockmatrix.BlockMatrix[source]

Computes the windowed correlation (linkage disequilibrium) matrix between variants.

Examples

Consider the following dataset consisting of three variants with centimorgan coordinates and four samples:

>>> data = [{'v': '1:1:A:C',       'cm': 0.1, 's': 'a', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'd', 'GT': hl.Call([1, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'b', 'GT': hl.Call([1, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'd', 'GT': hl.Call([0, 0])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'c', 'GT': hl.Call([1, 1])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'd', 'GT': hl.null(hl.tcall)}]
>>> ht = hl.Table.parallelize(data, hl.dtype('struct{v: str, s: str, cm: float64, GT: call}'))
>>> ht = ht.transmute(**hl.parse_variant(ht.v))
>>> mt = ht.to_matrix_table(row_key=['locus', 'alleles'], col_key=['s'], row_fields=['cm'])

Compute linkage disequilibrium between all pairs of variants on the same contig and within two megabases:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=2e6)
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
[-0.85280287,  1.        ,  0.        ],
[ 0.        ,  0.        ,  1.        ]])

Within one megabases:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1e6)
>>> ld.to_numpy()
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])

Within one centimorgan:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1.0, coord_expr=mt.cm)
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
[-0.85280287,  1.        ,  0.        ],
[ 0.        ,  0.        ,  1.        ]])

Within one centimorgan, and only calculate the upper triangle:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1.0, coord_expr=mt.cm)
>>> ld = ld.sparsify_triangle()
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
[ 0.        ,  1.        ,  0.        ],
[ 0.        ,  0.        ,  1.        ]])

Notes

This method sparsifies the result of row_correlation() using linalg.utils.locus_windows() and BlockMatrix.sparsify_row_intervals() in order to only compute linkage disequilibrium between nearby variants. Use row_correlation() directly to calculate correlation without windowing.

More precisely, variants are 0-indexed by their order in the matrix table (see add_row_index()). Each variant is regarded as a vector of elements defined by entry_expr, typically the number of alternate alleles or genotype dosage. Missing values are mean-imputed within variant.

The method produces a symmetric block-sparse matrix supported in a neighborhood of the diagonal. If variants i and j are on the same contig and within radius base pairs (inclusive) then the (i, j) element is their Pearson correlation coefficient. Otherwise, the (i, j) element is 0.0.

Rows with a constant value (i.e., zero variance) will result in nan correlation values. To avoid this, first check that all variants vary or filter out constant variants (for example, with the help of aggregators.stats()).

If the global_position() on locus_expr is not in ascending order, this method will fail. Ascending order should hold for a matrix table keyed by locus or variant (and the associated row table), or for a table that’s been ordered by locus_expr.

Set coord_expr to use a value other than position to define the windows. This row-indexed numeric expression must be non-missing, non-nan, on the same source as locus_expr, and ascending with respect to locus position for each contig; otherwise the method will raise an error.

Warning

See the warnings in row_correlation(). In particular, for large matrices it may be preferable to run its stages separately.

entry_expr and locus_expr are implicitly aligned by row-index, though they need not be on the same source. If their sources differ in the number of rows, an error will be raised; otherwise, unintended misalignment may silently produce unexpected results.

Parameters: entry_expr (Float64Expression) – Entry-indexed numeric expression on matrix table. locus_expr (LocusExpression) – Row-indexed locus expression on a table or matrix table that is row-aligned with the matrix table of entry_expr. radius (int or float) – Radius of window for row values. coord_expr (Float64Expression, optional) – Row-indexed numeric expression for the row value on the same table or matrix table as locus_expr. By default, the row value is given by the locus position. block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size(). BlockMatrix – Windowed correlation matrix between variants. Row and column indices correspond to matrix table variant index.
hail.methods.ld_prune(call_expr, r2=0.2, bp_window_size=1000000, memory_per_core=256, keep_higher_maf=True, block_size=None)[source]

Returns a maximal subset of variants that are nearly uncorrelated within each window.

Note

Requires the dataset to contain only diploid genotype calls.

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note

Requires the dataset to have a compound row key:

Examples

Prune variants in linkage disequilibrium by filtering a dataset to those variants returned by ld_prune(). If the dataset contains multiallelic variants, the multiallelic variants must be filtered out or split before being passed to ld_prune().

>>> biallelic_dataset = dataset.filter_rows(hl.len(dataset.alleles) == 2)
>>> pruned_variant_table = hl.ld_prune(biallelic_dataset.GT, r2=0.2, bp_window_size=500000)
>>> filtered_ds = dataset.filter_rows(hl.is_defined(pruned_variant_table[dataset.row_key]))

Notes

This method finds a maximal subset of variants such that the squared Pearson correlation coefficient $$r^2$$ of any pair at most bp_window_size base pairs apart is strictly less than r2. Each variant is represented as a vector over samples with elements given by the (mean-imputed) number of alternate alleles. In particular, even if present, phase information is ignored. Variants that do not vary across samples are dropped.

The method prunes variants in linkage disequilibrium in three stages.

• The first, “local pruning” stage prunes correlated variants within each partition, using a local variant queue whose size is determined by memory_per_core. A larger queue may facilitate more local pruning in this stage. Minor allele frequency is not taken into account. The parallelism is the number of matrix table partitions.
• The second, “global correlation” stage uses block-sparse matrix multiplication to compute correlation between each pair of remaining variants within bp_window_size base pairs, and then forms a graph of correlated variants. The parallelism of writing the locally-pruned matrix table as a block matrix is n_locally_pruned_variants / block_size.
• The third, “global pruning” stage applies maximal_independent_set() to prune variants from this graph until no edges remain. This algorithm iteratively removes the variant with the highest vertex degree. If keep_higher_maf is true, then in the case of a tie for highest degree, the variant with lowest minor allele frequency is removed.

Warning

The locally-pruned matrix table and block matrix are stored as temporary files on persistent disk. See the warnings on BlockMatrix.from_entry_expr with regard to memory and Hadoop replication errors.

Parameters: call_expr (CallExpression) – Entry-indexed call expression on a matrix table with row-indexed variants and column-indexed samples. r2 (float) – Squared correlation threshold (exclusive upper bound). Must be in the range [0.0, 1.0]. bp_window_size (int) – Window size in base pairs (inclusive upper bound). memory_per_core (int) – Memory in MB per core for local pruning queue. keep_higher_maf (int) – If True, break ties at each step of the global pruning stage by preferring to keep variants with higher minor allele frequency. block_size (int, optional) – Block size for block matrices in the second stage. Default given by BlockMatrix.default_block_size(). Table – Table of a maximal independent set of variants.
hail.methods.mendel_errors(call, pedigree) → Tuple[[hail.table.Table, hail.table.Table, hail.table.Table], hail.table.Table][source]

Find Mendel errors; count per variant, individual and nuclear family.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key:

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Find all violations of Mendelian inheritance in each (dad, mom, kid) trio in a pedigree and return four tables (all errors, errors by family, errors by individual, errors by variant):

>>> all_errors, per_fam, per_sample, per_variant = hl.mendel_errors(dataset['GT'], ped)

Export all mendel errors to a text file:

>>> all_errors.export('output/all_mendel_errors.tsv')

Annotate columns with the number of Mendel errors:

>>> annotated_samples = dataset.annotate_cols(mendel=per_sample[dataset.s])

Annotate rows with the number of Mendel errors:

>>> annotated_variants = dataset.annotate_rows(mendel=per_variant[dataset.locus, dataset.alleles])

Notes

The example above returns four tables, which contain Mendelian violations grouped in various ways. These tables are modeled after the PLINK mendel formats, resembling the .mendel, .fmendel, .imendel, and .lmendel formats, respectively.

First table: all Mendel errors. This table contains one row per Mendel error, keyed by the variant and proband id.

• locus (tlocus) – Variant locus, key field.
• alleles (tarray of tstr) – Variant alleles, key field.
• (column key of dataset) (tstr) – Proband ID, key field.
• fam_id (tstr) – Family ID.
• mendel_code (tint32) – Mendel error code, see below.

Second table: errors per nuclear family. This table contains one row per nuclear family, keyed by the parents.

• pat_id (tstr) – Paternal ID. (key field)
• mat_id (tstr) – Maternal ID. (key field)
• fam_id (tstr) – Family ID.
• children (tint32) – Number of children in this nuclear family.
• errors (tint64) – Number of Mendel errors in this nuclear family.
• snp_errors (tint64) – Number of Mendel errors at SNPs in this nuclear family.

Third table: errors per individual. This table contains one row per individual. Each error is counted toward the proband, father, and mother according to the Implicated in the table below.

• (column key of dataset) (tstr) – Sample ID (key field).
• fam_id (tstr) – Family ID.
• errors (tint64) – Number of Mendel errors involving this individual.
• snp_errors (tint64) – Number of Mendel errors involving this individual at SNPs.

Fourth table: errors per variant.

• locus (tlocus) – Variant locus, key field.
• alleles (tarray of tstr) – Variant alleles, key field.
• errors (tint64) – Number of Mendel errors in this variant.

This method only considers complete trios (two parents and proband with defined sex). The code of each Mendel error is determined by the table below, extending the Plink classification.

In the table, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region (PAR) of X and Y defined by the reference genome and the autosome is defined by in_autosome().

• Auto – in autosome or in PAR or female child
• HemiX – in non-PAR of X and male child
• HemiY – in non-PAR of Y and male child

Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.

Code Dad Mom Kid Copy State | Implicated
1 HomVar HomVar Het Auto Dad, Mom, Kid
2 HomRef HomRef Het Auto Dad, Mom, Kid
3 HomRef ~HomRef HomVar Auto Dad, Kid
4 ~HomRef HomRef HomVar Auto Mom, Kid
5 HomRef HomRef HomVar Auto Kid
6 HomVar ~HomVar HomRef Auto Dad, Kid
7 ~HomVar HomVar HomRef Auto Mom, Kid
8 HomVar HomVar HomRef Auto Kid
9 Any HomVar HomRef HemiX Mom, Kid
10 Any HomRef HomVar HemiX Mom, Kid
11 HomVar Any HomRef HemiY Dad, Kid
12 HomRef Any HomVar HemiY Dad, Kid
Parameters: dataset (MatrixTable) pedigree (Pedigree) (Table, Table, Table, Table)
hail.methods.de_novo(mt: hail.matrixtable.MatrixTable, pedigree: hail.genetics.pedigree.Pedigree, pop_frequency_prior, *, min_gq: int = 20, min_p: float = 0.05, max_parent_ab: float = 0.05, min_child_ab: float = 0.2, min_dp_ratio: float = 0.1) → hail.table.Table[source]

Call putative de novo events from trio data.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key:

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Call de novo events:

>>> priors = priors.transmute(**hl.parse_variant(priors.Variant)).key_by('locus', 'alleles')
>>> de_novo_results = hl.de_novo(dataset, pedigree, pop_frequency_prior=priors[dataset.row_key].AF)

Notes

This method assumes the GATK high-throughput sequencing fields exist: GT, AD, DP, GQ, PL.

This method replicates the functionality of Kaitlin Samocha’s de novo caller. The version corresponding to git commit bde3e40 is implemented in Hail with her permission and assistance.

This method produces a Table with the following fields:

• locus (locus) – Variant locus.
• alleles (array<str>) – Variant alleles.
• id (str) – Proband sample ID.
• prior (float64) – Site frequency prior. It is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7.
• proband (struct) – Proband column fields from mt.
• father (struct) – Father column fields from mt.
• mother (struct) – Mother column fields from mt.
• proband_entry (struct) – Proband entry fields from mt.
• father_entry (struct) – Father entry fields from mt.
• proband_entry (struct) – Mother entry fields from mt.
• is_female (bool) – True if proband is female.
• p_de_novo (float64) – Unfiltered posterior probability that the event is de novo rather than a missed heterozygous event in a parent.
• confidence (str) Validation confidence. One of: 'HIGH', 'MEDIUM', 'LOW'.

The key of the table is ['locus', 'alleles', 'id'].

The model looks for de novo events in which both parents are homozygous reference and the proband is a heterozygous. The model makes the simplifying assumption that when this configuration x = (AA, AA, AB) of calls occurs, exactly one of the following is true:

• d: a de novo mutation occurred in the proband and all calls are accurate.
• m: at least one parental allele is actually heterozygous and the proband call is accurate.

We can then estimate the posterior probability of a de novo mutation as:

$\mathrm{P_{\text{de novo}}} = \frac{\mathrm{P}(d\,|\,x)}{\mathrm{P}(d\,|\,x) + \mathrm{P}(m\,|\,x)}$

Applying Bayes rule to the numerator and denominator yields

$\frac{\mathrm{P}(x\,|\,d)\,\mathrm{P}(d)}{\mathrm{P}(x\,|\,d)\,\mathrm{P}(d) + \mathrm{P}(x\,|\,m)\,\mathrm{P}(m)}$

The prior on de novo mutation is estimated from the rate in the literature:

$\mathrm{P}(d) = \frac{1 \text{mutation}}{30,000,000\, \text{bases}}$

The prior used for at least one alternate allele between the parents depends on the alternate allele frequency:

$\mathrm{P}(m) = 1 - (1 - AF)^4$

The likelihoods $$\mathrm{P}(x\,|\,d)$$ and $$\mathrm{P}(x\,|\,m)$$ are computed from the PL (genotype likelihood) fields using these factorizations:

$\begin{split}\mathrm{P}(x = (AA, AA, AB) \,|\,d) = \Big( &\mathrm{P}(x_{\mathrm{father}} = AA \,|\, \mathrm{father} = AA) \\ \cdot &\mathrm{P}(x_{\mathrm{mother}} = AA \,|\, \mathrm{mother} = AA) \\ \cdot &\mathrm{P}(x_{\mathrm{proband}} = AB \,|\, \mathrm{proband} = AB) \Big)\end{split}$
$\begin{split}\mathrm{P}(x = (AA, AA, AB) \,|\,m) = \Big( & \mathrm{P}(x_{\mathrm{father}} = AA \,|\, \mathrm{father} = AB) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \,|\, \mathrm{mother} = AA) \\ + \, &\mathrm{P}(x_{\mathrm{father}} = AA \,|\, \mathrm{father} = AA) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \,|\, \mathrm{mother} = AB) \Big) \\ \cdot \, &\mathrm{P}(x_{\mathrm{proband}} = AB \,|\, \mathrm{proband} = AB)\end{split}$

(Technically, the second factorization assumes there is exactly (rather than at least) one alternate allele among the parents, which may be justified on the grounds that it is typically the most likely case by far.)

While this posterior probability is a good metric for grouping putative de novo mutations by validation likelihood, there exist error modes in high-throughput sequencing data that are not appropriately accounted for by the phred-scaled genotype likelihoods. To this end, a number of hard filters are applied in order to assign validation likelihood.

These filters are different for SNPs and insertions/deletions. In the below rules, the following variables are used:

• DR refers to the ratio of the read depth in the proband to the combined read depth in the parents.
• AB refers to the read allele balance of the proband (number of alternate reads divided by total reads).
• AC refers to the count of alternate alleles across all individuals in the dataset at the site.
• p refers to $$\mathrm{P_{\text{de novo}}}$$.
• min_p refers to the min_p function parameter.

HIGH-quality SNV:

p > 0.99 && AB > 0.3 && DR > 0.2
or
p > 0.99 && AB > 0.3 && AC == 1

MEDIUM-quality SNV:

p > 0.5 && AB > 0.3
or
p > 0.5 && AB > 0.2 && AC == 1

LOW-quality SNV:

p > min_p && AB > 0.2

HIGH-quality indel:

p > 0.99 && AB > 0.3 && DR > 0.2
or
p > 0.99 && AB > 0.3 && AC == 1

MEDIUM-quality indel:

p > 0.5 && AB > 0.3
or
p > 0.5 && AB > 0.2 and AC == 1

LOW-quality indel:

p > min_p && AB > 0.2

Additionally, de novo candidates are not considered if the proband GQ is smaller than the min_gq parameter, if the proband allele balance is lower than the min_child_ab parameter, if the depth ratio between the proband and parents is smaller than the min_depth_ratio parameter, or if the allele balance in a parent is above the max_parent_ab parameter.

Parameters: mt (MatrixTable) – High-throughput sequencing dataset. pedigree (Pedigree) – Sample pedigree. pop_frequency_prior (Float64Expression) – Expression for population alternate allele frequency prior. min_gq – Minimum proband GQ to be considered for de novo calling. min_p – Minimum posterior probability to be considered for de novo calling. max_parent_ab – Maximum parent allele balance. min_child_ab – Minimum proband allele balance/ min_dp_ratio – Minimum ratio between proband read depth and parental read depth. Table
hail.methods.nirvana(dataset: Union[hail.matrixtable.MatrixTable, hail.table.Table], config, block_size=500000, name='nirvana')[source]

Annotate variants using Nirvana.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

Note

Requires the dataset to have a compound row key:

nirvana() runs Nirvana on the current dataset and adds a new row field in the location specified by name.

Examples

Add Nirvana annotations to the dataset:

>>> result = hl.nirvana(dataset, "data/nirvana.properties")

Configuration

nirvana() requires a configuration file. The format is a .properties file, where each line defines a property as a key-value pair of the form key = value. nirvana() supports the following properties:

• hail.nirvana.dotnet – Location of dotnet. Optional, default: dotnet.
• hail.nirvana.path – Value of the PATH environment variable when invoking Nirvana. Optional, by default PATH is not set.
• hail.nirvana.location – Location of Nirvana.dll. Required.
• hail.nirvana.reference – Location of reference genome. Required.
• hail.nirvana.cache – Location of cache. Required.
• hail.nirvana.supplementaryAnnotationDirectory – Location of Supplementary Database. Optional, no supplementary database by default.

Here is an example nirvana.properties configuration file:

hail.nirvana.location = /path/to/dotnet/netcoreapp2.0/Nirvana.dll
hail.nirvana.reference = /path/to/nirvana/References/Homo_sapiens.GRCh37.Nirvana.dat
hail.nirvana.cache = /path/to/nirvana/Cache/GRCh37/Ensembl
hail.nirvana.supplementaryAnnotationDirectory = /path/to/nirvana/SupplementaryDatabase/GRCh37

Annotations

A new row field is added in the location specified by name with the following schema:

struct {
chromosome: str,
refAllele: str,
position: int32,
altAlleles: array<str>,
cytogeneticBand: str,
quality: float64,
filters: array<str>,
jointSomaticNormalQuality: int32,
copyNumber: int32,
strandBias: float64,
recalibratedQuality: float64,
variants: array<struct {
altAllele: str,
refAllele: str,
chromosome: str,
begin: int32,
end: int32,
phylopScore: float64,
isReferenceMinor: bool,
variantType: str,
vid: str,
hgvsg: str,
isRecomposedVariant: bool,
isDecomposedVariant: bool,
regulatoryRegions: array<struct {
id: str,
type: str,
consequence: set<str>
}>,
clinvar: array<struct {
id: str,
reviewStatus: str,
isAlleleSpecific: bool,
alleleOrigins: array<str>,
refAllele: str,
altAllele: str,
phenotypes: array<str>,
medGenIds: array<str>,
omimIds: array<str>,
orphanetIds: array<str>,
significance: str,
lastUpdatedDate: str,
pubMedIds: array<str>
}>,
cosmic: array<struct {
id: str,
isAlleleSpecific: bool,
refAllele: str,
altAllele: str,
gene: str,
sampleCount: int32,
studies: array<struct {
id: int32,
histology: str,
primarySite: str
}>
}>,
dbsnp: struct {
ids: array<str>
},
globalAllele: struct {
globalMinorAllele: str,
globalMinorAlleleFrequency: float64
},
coverage: str,
allAf: float64,
allAc: int32,
allAn: int32,
allHc: int32,
afrAf: float64,
afrAc: int32,
afrAn: int32,
afrHc: int32,
amrAf: float64,
amrAc: int32,
amrAn: int32,
amrHc: int32,
easAf: float64,
easAc: int32,
easAn: int32,
easHc: int32,
finAf: float64,
finAc: int32,
finAn: int32,
finHc: int32,
nfeAf: float64,
nfeAc: int32,
nfeAn: int32,
nfeHc: int32,
othAf: float64,
othAc: int32,
othAn: int32,
othHc: int32,
asjAf: float64,
asjAc: int32,
asjAn: int32,
asjHc: int32,
failedFilter: bool
},
coverage: str,
allAf: float64,
allAc: int32,
allAn: int32,
allHc: int32,
afrAf: float64,
afrAc: int32,
afrAn: int32,
afrHc: int32,
amrAf: float64,
amrAc: int32,
amrAn: int32,
amrHc: int32,
easAf: float64,
easAc: int32,
easAn: int32,
easHc: int32,
finAf: float64,
finAc: int32,
finAn: int32,
finHc: int32,
nfeAf: float64,
nfeAc: int32,
nfeAn: int32,
nfeHc: int32,
othAf: float64,
othAc: int32,
othAn: int32,
othHc: int32,
asjAf: float64,
asjAc: int32,
asjAn: int32,
asjHc: int32,
sasAf: float64,
sasAc: int32,
sasAn: int32,
sasHc: int32,
failedFilter: bool
},
topmed: struct {
failedFilter: bool,
allAc: int32,
allAn: int32,
allAf: float64,
allHc: int32
},
oneKg: struct {
ancestralAllele: str,
allAf: float64,
allAc: int32,
allAn: int32,
afrAf: float64,
afrAc: int32,
afrAn: int32,
amrAf: float64,
amrAc: int32,
amrAn: int32,
easAf: float64,
easAc: int32,
easAn: int32,
eurAf: float64,
eurAc: int32,
eurAn: int32,
sasAf: float64,
sasAc: int32,
sasAn: int32
},
mitomap: array<struct {
refAllele: str,
altAllele: str,
diseases : array<str>,
hasHomoplasmy: bool,
hasHeteroplasmy: bool,
status: str,
clinicalSignificance: str,
scorePercentile: float64,
isAlleleSpecific: bool,
chromosome: str,
begin: int32,
end: int32,
variantType: str
}
transcripts: struct {
refSeq: array<struct {
transcript: str,
bioType: str,
aminoAcids: str,
cdnaPos: str,
codons: str,
cdsPos: str,
exons: str,
introns: str,
geneId: str,
hgnc: str,
consequence: array<str>,
hgvsc: str,
hgvsp: str,
isCanonical: bool,
polyPhenScore: float64,
polyPhenPrediction: str,
proteinId: str,
proteinPos: str,
siftScore: float64,
siftPrediction: str
}>,
ensembl: array<struct {
transcript: str,
bioType: str,
aminoAcids: str,
cdnaPos: str,
codons: str,
cdsPos: str,
exons: str,
introns: str,
geneId: str,
hgnc: str,
consequence: array<str>,
hgvsc: str,
hgvsp: str,
isCanonical: bool,
polyPhenScore: float64,
polyPhenPrediction: str,
proteinId: str,
proteinPos: str,
siftScore: float64,
siftPrediction: str
}>
},
overlappingGenes: array<str>
}>
genes: array<struct {
name: str,
omim: array<struct {
mimNumber: int32,
hgnc: str,
description: str,
phenotypes: array<struct {
mimNumber: int32,
phenotype: str,
mapping: str,
inheritance: array<str>,
}>
}>
exac: struct {
pLi: float64,
pRec: float64,
pNull: float64
}
}>
}
Parameters: dataset (MatrixTable or Table) – Dataset. config (str) – Path to Nirvana configuration file. block_size (int) – Number of rows to process per Nirvana invocation. name (str) – Name for resulting row field. MatrixTable or Table – Dataset with new row-indexed field name containing Nirvana annotations.
hail.methods.pc_relate(call_expr, min_individual_maf, *, k=None, scores_expr=None, min_kinship=-inf, statistics='all', block_size=None) → hail.table.Table[source]

Compute relatedness estimates between individuals using a variant of the PC-Relate method.

Note

Requires the dataset to contain only diploid genotype calls.

Examples

Estimate kinship, identity-by-descent two, identity-by-descent one, and identity-by-descent zero for every pair of samples, using a minimum minor allele frequency filter of 0.01 and 10 principal components to control for population structure.

>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10)

Only compute the kinship statistic. This is more efficient than computing all statistics.

>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10, statistics='kin')

Compute all statistics, excluding sample-pairs with kinship less than 0.1. This is more efficient than producing the full table and then filtering using Table.filter().

>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10, min_kinship=0.1)

One can also pass in pre-computed principal component scores. To produce the same results as in the previous example:

>>> _, scores_table, _ = hl.hwe_normalized_pca(dataset.GT,
...                                      k=10,
>>> rel = hl.pc_relate(dataset.GT,
...                    0.01,
...                    scores_expr=scores_table[dataset.col_key].scores,
...                    min_kinship=0.1)

Notes

The traditional estimator for kinship between a pair of individuals $$i$$ and $$j$$, sharing the set $$S_{ij}$$ of single-nucleotide variants, from a population with allele frequencies $$p_s$$, is given by:

$\widehat{\phi_{ij}} := \frac{1}{|S_{ij}|} \sum_{s \in S_{ij}} \frac{(g_{is} - 2 p_s) (g_{js} - 2 p_s)} {4 \sum_{s \in S_{ij} p_s (1 - p_s)}}$

This estimator is true under the model that the sharing of common (relative to the population) alleles is not very informative to relatedness (because they’re common) and the sharing of rare alleles suggests a recent common ancestor from which the allele was inherited by descent.

When multiple ancestry groups are mixed in a sample, this model breaks down. Alleles that are rare in all but one ancestry group are treated as very informative to relatedness. However, these alleles are simply markers of the ancestry group. The PC-Relate method corrects for this situation and the related situation of admixed individuals.

PC-Relate slightly modifies the usual estimator for relatedness: occurrences of population allele frequency are replaced with an “individual-specific allele frequency”. This modification allows the method to correctly weight an allele according to an individual’s unique ancestry profile.

The “individual-specific allele frequency” at a given genetic locus is modeled by PC-Relate as a linear function of a sample’s first k principal component coordinates. As such, the efficacy of this method rests on two assumptions:

• an individual’s first k principal component coordinates fully describe their allele-frequency-relevant ancestry, and
• the relationship between ancestry (as described by principal component coordinates) and population allele frequency is linear

The estimators for kinship, and identity-by-descent zero, one, and two follow. Let:

• $$S_{ij}$$ be the set of genetic loci at which both individuals $$i$$ and $$j$$ have a defined genotype
• $$g_{is} \in {0, 1, 2}$$ be the number of alternate alleles that individual $$i$$ has at genetic locus $$s$$
• $$\widehat{\mu_{is}} \in [0, 1]$$ be the individual-specific allele frequency for individual $$i$$ at genetic locus $$s$$
• $${\widehat{\sigma^2_{is}}} := \widehat{\mu_{is}} (1 - \widehat{\mu_{is}})$$, the binomial variance of $$\widehat{\mu_{is}}$$
• $$\widehat{\sigma_{is}} := \sqrt{\widehat{\sigma^2_{is}}}$$, the binomial standard deviation of $$\widehat{\mu_{is}}$$
• $$\text{IBS}^{(0)}_{ij} := \sum_{s \in S_{ij}} \mathbb{1}_{||g_{is} - g_{js} = 2||}$$, the number of genetic loci at which individuals $$i$$ and $$j$$ share no alleles
• $$\widehat{f_i} := 2 \widehat{\phi_{ii}} - 1$$, the inbreeding coefficient for individual $$i$$
• $$g^D_{is}$$ be a dominance encoding of the genotype matrix, and $$X_{is}$$ be a normalized dominance-coded genotype matrix
\begin{align}\begin{aligned}\begin{split}g^D_{is} := \begin{cases} \widehat{\mu_{is}} & g_{is} = 0 \\ 0 & g_{is} = 1 \\ 1 - \widehat{\mu_{is}} & g_{is} = 2 \end{cases}\end{split}\\X_{is} := g^D_{is} - \widehat{\sigma^2_{is}} (1 - \widehat{f_i})\end{aligned}\end{align}

The estimator for kinship is given by:

$\widehat{\phi_{ij}} := \frac{\sum_{s \in S_{ij}}(g - 2 \mu)_{is} (g - 2 \mu)_{js}} {4 * \sum_{s \in S_{ij}} \widehat{\sigma_{is}} \widehat{\sigma_{js}}}$

The estimator for identity-by-descent two is given by:

$\widehat{k^{(2)}_{ij}} := \frac{\sum_{s \in S_{ij}}X_{is} X_{js}}{\sum_{s \in S_{ij}} \widehat{\sigma^2_{is}} \widehat{\sigma^2_{js}}}$

The estimator for identity-by-descent zero is given by:

$\begin{split}\widehat{k^{(0)}_{ij}} := \begin{cases} \frac{\text{IBS}^{(0)}_{ij}} {\sum_{s \in S_{ij}} \widehat{\mu_{is}}^2(1 - \widehat{\mu_{js}})^2 + (1 - \widehat{\mu_{is}})^2\widehat{\mu_{js}}^2} & \widehat{\phi_{ij}} > 2^{-5/2} \\ 1 - 4 \widehat{\phi_{ij}} + k^{(2)}_{ij} & \widehat{\phi_{ij}} \le 2^{-5/2} \end{cases}\end{split}$

The estimator for identity-by-descent one is given by:

$\widehat{k^{(1)}_{ij}} := 1 - \widehat{k^{(2)}_{ij}} - \widehat{k^{(0)}_{ij}}$

Note that, even if present, phase information is ignored by this method.

The PC-Relate method is described in “Model-free Estimation of Recent Genetic Relatedness”. Conomos MP, Reiner AP, Weir BS, Thornton TA. in American Journal of Human Genetics. 2016 Jan 7. The reference implementation is available in the GENESIS Bioconductor package .

pc_relate() differs from the reference implementation in a few ways:

• if k is supplied, samples scores are computed via PCA on all samples, not a specified subset of genetically unrelated samples. The latter can be achieved by filtering samples, computing PCA variant loadings, and using these loadings to compute and pass in scores for all samples.
• the estimators do not perform small sample correction
• the algorithm does not provide an option to use population-wide allele frequency estimates
• the algorithm does not provide an option to not use “overall standardization” (see R pcrelate documentation)

Under the PC-Relate model, kinship, $$\phi_{ij}$$, ranges from 0 to 0.5, and is precisely half of the fraction-of-genetic-material-shared. Listed below are the statistics for a few pairings:

• Monozygotic twins share all their genetic material so their kinship statistic is 0.5 in expection.
• Parent-child and sibling pairs both have kinship 0.25 in expectation and are separated by the identity-by-descent-zero, $$k^{(2)}_{ij}$$, statistic which is zero for parent-child pairs and 0.25 for sibling pairs.
• Avuncular pairs and grand-parent/-child pairs both have kinship 0.125 in expectation and both have identity-by-descent-zero 0.5 in expectation
• “Third degree relatives” are those pairs sharing $$2^{-3} = 12.5 %$$ of their genetic material, the results of PCRelate are often too noisy to reliably distinguish these pairs from higher-degree-relative-pairs or unrelated pairs.

Note that $$g_{is}$$ is the number of alternate alleles. Hence, for multi-allelic variants, a value of 2 may indicate two distinct alternative alleles rather than a homozygous variant genotype. To enforce the latter, either filter or split multi-allelic variants first.

The resulting table has the first 3, 4, 5, or 6 fields below, depending on the statistics parameter:

• i (col_key.dtype) – First sample. (key field)
• j (col_key.dtype) – Second sample. (key field)
• kin (tfloat64) – Kinship estimate, $$\widehat{\phi_{ij}}$$.
• ibd2 (tfloat64) – IBD2 estimate, $$\widehat{k^{(2)}_{ij}}$$.
• ibd0 (tfloat64) – IBD0 estimate, $$\widehat{k^{(0)}_{ij}}$$.
• ibd1 (tfloat64) – IBD1 estimate, $$\widehat{k^{(1)}_{ij}}$$.

Here col_key refers to the column key of the source matrix table, and col_key.dtype is a struct containing the column key fields.

There is one row for each pair of distinct samples (columns), where i corresponds to the column of smaller column index. In particular, if the same column key value exists for $$n$$ columns, then the resulting table will have $$\binom{n-1}{2}$$ rows with both key fields equal to that column key value. This may result in unexpected behavior in downstream processing.

Parameters: call_expr (CallExpression) – Entry-indexed call expression. min_individual_maf (float) – The minimum individual-specific minor allele frequency. If either individual-specific minor allele frequency for a pair of individuals is below this threshold, then the variant will not be used to estimate relatedness for the pair. k (int, optional) – If set, k principal component scores are computed and used. Exactly one of k and scores_expr must be specified. scores_expr (ArrayNumericExpression, optional) – Column-indexed expression of principal component scores, with the same source as call_expr. All array values must have the same positive length, corresponding to the number of principal components, and all scores must be non-missing. Exactly one of k and scores_expr must be specified. min_kinship (float) – Pairs of samples with kinship lower than min_kinship are excluded from the results. statistics (str) – Set of statistics to compute. If 'kin', only estimate the kinship statistic. If 'kin2', estimate the above and IBD2. If 'kin20', estimate the above and IBD0. If 'all', estimate the above and IBD1. block_size (int, optional) – Block size of block matrices used in the algorithm. Default given by BlockMatrix.default_block_size(). Table – A Table mapping pairs of samples to their pair-wise statistics.
hail.methods.sample_qc(mt, name='sample_qc') → hail.matrixtable.MatrixTable[source]

Compute per-sample metrics useful for quality control.

Note

Requires the dataset to have a compound row key:

Examples

Compute sample QC metrics and remove low-quality samples:

>>> dataset = hl.sample_qc(dataset, name='sample_qc')
>>> filtered_dataset = dataset.filter_cols((dataset.sample_qc.dp_stats.mean > 20) & (dataset.sample_qc.r_ti_tv > 1.5))

Notes

This method computes summary statistics per sample from a genetic matrix and stores the results as a new column-indexed struct field in the matrix, named based on the name parameter.

If mt contains an entry field DP of type tint32, then the field dp_stats is computed. If mt contains an entry field GQ of type tint32, then the field gq_stats is computed. Both dp_stats and gq_stats are structs with with four fields:

• mean (float64) – Mean value.
• stdev (float64) – Standard deviation (zero degrees of freedom).
• min (int32) – Minimum value.
• max (int32) – Maximum value.

If the dataset does not contain an entry field GT of type tcall, then an error is raised. The following fields are always computed from GT:

• call_rate (float64) – Fraction of calls non-missing.
• n_called (int64) – Number of non-missing calls.
• n_not_called (int64) – Number of missing calls.
• n_hom_ref (int64) – Number of homozygous reference calls.
• n_het (int64) – Number of heterozygous calls.
• n_hom_var (int64) – Number of homozygous alternate calls.
• n_non_ref (int64) – Sum of n_het and n_hom_var.
• n_snp (int64) – Number of SNP alternate alleles.
• n_insertion (int64) – Number of insertion alternate alleles.
• n_deletion (int64) – Number of deletion alternate alleles.
• n_singleton (int64) – Number of private alleles.
• n_transition (int64) – Number of transition (A-G, C-T) alternate alleles.
• n_transversion (int64) – Number of transversion alternate alleles.
• n_star (int64) – Number of star (upstream deletion) alleles.
• r_ti_tv (float64) – Transition/Transversion ratio.
• r_het_hom_var (float64) – Het/HomVar call ratio.
• r_insertion_deletion (float64) – Insertion/Deletion allele ratio.

Missing values NA may result from division by zero.

Parameters: mt (MatrixTable) – Dataset. name (str) – Name for resulting field. MatrixTable – Dataset with a new column-indexed field name.
hail.methods.skat(key_expr, weight_expr, y, x, covariates, logistic=False, max_size=46340, accuracy=1e-06, iterations=10000) → hail.table.Table[source]

Test each keyed group of rows for association by linear or logistic SKAT test.

Examples

Test each gene for association using the linear sequence kernel association test:

>>> skat_table = hl.skat(key_expr=burden_ds.gene,
...                      weight_expr=burden_ds.weight,
...                      y=burden_ds.burden.pheno,
...                      x=burden_ds.GT.n_alt_alleles(),
...                      covariates=[1, burden_ds.burden.cov1, burden_ds.burden.cov2])

Caution

By default, the Davies algorithm iterates up to 10k times until an accuracy of 1e-6 is achieved. Hence a reported p-value of zero with no issues may truly be as large as 1e-6. The accuracy and maximum number of iterations may be controlled by the corresponding function parameters. In general, higher accuracy requires more iterations.

Caution

To process a group with $$m$$ rows, several copies of an $$m \times m$$ matrix of doubles must fit in worker memory. Groups with tens of thousands of rows may exhaust worker memory causing the entire job to fail. In this case, use the max_size parameter to skip groups larger than max_size.

Warning

skat() considers the same set of columns (i.e., samples, points) for every group, namely those columns for which all covariates are defined. For each row, missing values of x are mean-imputed over these columns. As in the example, the intercept covariate 1 must be included explicitly if desired.

Notes

This method provides a scalable implementation of the score-based variance-component test originally described in Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test.

Row weights must be non-negative. Rows with missing weights are ignored. In the R package skat—which assumes rows are variants—default weights are given by evaluating the Beta(1, 25) density at the minor allele frequency. To replicate these weights in Hail using alternate allele frequencies stored in a row-indexed field AF, one can use the expression:

>>> hl.dbeta(hl.min(ds2.AF), 1.0, 25.0) ** 2

In the logistic case, the response y must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.

The resulting Table provides the group’s key (id), thenumber of rows in the group (size), the variance component score q_stat, the SKAT p-value, and a fault flag. For the toy example above, the table has the form:

id size q_stat p_value fault
geneA 2 4.136 0.205 0
geneB 1 5.659 0.195 0
geneC 3 4.122 0.192 0

Groups larger than max_size appear with missing q_stat, p_value, and fault. The hard limit on the number of rows in a group is 46340.

Note that the variance component score q_stat agrees with Q in the R package skat, but both differ from $$Q$$ in the paper by the factor $$\frac{1}{2\sigma^2}$$ in the linear case and $$\frac{1}{2}$$ in the logistic case, where $$\sigma^2$$ is the unbiased estimator of residual variance for the linear null model. The R package also applies a “small-sample adjustment” to the null distribution in the logistic case when the sample size is less than 2000. Hail does not apply this adjustment.

The fault flag is an integer indicating whether any issues occurred when running the Davies algorithm to compute the p-value as the right tail of a weighted sum of $$\chi^2(1)$$ distributions.

fault value Description
0 no issues
1 accuracy NOT achieved
2 round-off error possibly significant
3 invalid parameters
4 unable to locate integration parameters
5 out of memory
Parameters: key_expr (Expression) – Row-indexed expression for key associated to each row. weight_expr (Float64Expression) – Row-indexed expression for row weights. y (Float64Expression) – Column-indexed response expression. If logistic is True, all non-missing values must evaluate to 0 or 1. Note that a BooleanExpression will be implicitly converted to a Float64Expression with this property. x (Float64Expression) – Entry-indexed expression for input variable. covariates (list of Float64Expression) – List of column-indexed covariate expressions. logistic (bool) – If true, use the logistic test rather than the linear test. max_size (int) – Maximum size of group on which to run the test. accuracy (float) – Accuracy achieved by the Davies algorithm if fault value is zero. iterations (int) – Maximum number of iterations attempted by the Davies algorithm. Table – Table of SKAT results.
hail.methods.split_multi(ds, keep_star=False, left_aligned=False)[source]

Split multiallelic variants.

The resulting dataset will be keyed by the split locus and alleles.

• was_split (bool) – True if this variant was originally multiallelic, otherwise False.
• a_index (int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with a_index = 1 and 1:100:A:C with a_index = 2.
• old_locus (locus) – The original, unsplit locus.
• old_alleles (array<str>) – The original, unsplit alleles.

All other fields are left unchanged.

Example

split_multi_hts(), which splits multiallelic variants for the HTS genotype schema and updates the entry fields by downcoding the genotype, is implemented as:

>>> sm = hl.split_multi(ds)
>>> pl = hl.or_missing(
...      hl.is_defined(sm.PL),
...      (hl.range(0, 3).map(lambda i: hl.min(hl.range(0, hl.len(sm.PL))
...                     .filter(lambda j: hl.downcode(hl.unphased_diploid_gt_index_call(j), sm.a_index) == hl.unphased_diploid_gt_index_call(i))
...                     .map(lambda j: sm.PL[j])))))
>>> split_ds = sm.annotate_entries(
...     GT=hl.downcode(sm.GT, sm.a_index),
...     DP=sm.DP,
...     PL=pl,
...     GQ=hl.gq_from_pl(pl)).drop('old_locus', 'old_alleles')

Warning

In order to support a wide variety of data types, this function splits only the variants on a MatrixTable, but not the genotypes. Use split_multi_hts() if possible, or split the genotypes yourself using one of the entry modification methods: MatrixTable.annotate_entries(), MatrixTable.select_entries(), MatrixTable.transmute_entries().

Parameters: ds (MatrixTable or Table) – An unsplit dataset. keep_star (bool) – Do not filter out * alleles. left_aligned (bool) – If True, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated.
hail.methods.split_multi_hts(ds, keep_star=False, left_aligned=False, vep_root='vep')[source]

Split multiallelic variants for datasets that contain one or more fields from a standard high-throughput sequencing entry schema.

struct {
GT: call,
DP: int32,
GQ: int32,
PL: array<int32>,
PGT: call,
PID: str
}

For other entry fields, write your own splitting logic using MatrixTable.annotate_entries().

Examples

>>> hl.split_multi_hts(dataset).write('output/split.vds')

Notes

We will explain by example. Consider a hypothetical 3-allelic variant:

A   C,T 0/2:7,2,6:15:45:99,50,99,0,45,99

split_multi_hts() will create two biallelic variants (one for each alternate allele) at the same position

A   C   0/0:13,2:15:45:0,45,99
A   T   0/1:9,6:15:50:50,0,99

Each multiallelic GT or PGT field is downcoded once for each alternate allele. A call for an alternate allele maps to 1 in the biallelic variant corresponding to itself and 0 otherwise. For example, in the example above, 0/2 maps to 0/0 and 0/1. The genotype 1/2 maps to 0/1 and 0/1.

The biallelic alt AD entry is just the multiallelic AD entry corresponding to the alternate allele. The ref AD entry is the sum of the other multiallelic entries.

The biallelic DP is the same as the multiallelic DP.

The biallelic PL entry for a genotype g is the minimum over PL entries for multiallelic genotypes that downcode to g. For example, the PL for (A, T) at 0/1 is the minimum of the PLs for 0/1 (50) and 1/2 (45), and thus 45.

Fixing an alternate allele and biallelic variant, downcoding gives a map from multiallelic to biallelic alleles and genotypes. The biallelic AD entry for an allele is just the sum of the multiallelic AD entries for alleles that map to that allele. Similarly, the biallelic PL entry for a genotype is the minimum over multiallelic PL entries for genotypes that map to that genotype.

GQ is recomputed from PL if PL is provided. If not, it is copied from the original GQ.

Here is a second example for a het non-ref

A   C,T 1/2:2,8,6:16:45:99,50,99,45,0,99

splits as

A   C   0/1:8,8:16:45:45,0,99
A   T   0/1:10,6:16:50:50,0,99

VCF Info Fields

Hail does not split fields in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index field a_index can be used to select the value corresponding to the split allele’s position:

>>> split_ds = hl.split_multi_hts(dataset)
>>> split_ds = split_ds.filter_rows(split_ds.info.AC[split_ds.a_index - 1] < 10,
...                                 keep = False)

VCFs split by Hail and exported to new VCFs may be incompatible with other tools, if action is not taken first. Since the “Number” of the arrays in split multiallelic sites no longer matches the structure on import (“A” for 1 per allele, for example), Hail will export these fields with number “.”.

If the desired output is one value per site, then it is possible to use annotate_variants_expr to remap these values. Here is an example:

>>> split_ds = hl.split_multi_hts(dataset)
>>> split_ds = split_ds.annotate_rows(info = Struct(AC=split_ds.info.AC[split_ds.a_index - 1],
...                                   **split_ds.info))
>>> hl.export_vcf(split_ds, 'output/export.vcf')

The info field AC in data/export.vcf will have Number=1.

New Fields

• was_split (bool) – True if this variant was originally multiallelic, otherwise False.
• a_index (int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with a_index = 1 and 1:100:A:C with a_index = 2.

split_multi()

Parameters: ds (MatrixTable or Table) – An unsplit dataset. keep_star (bool) – Do not filter out * alleles. left_aligned (bool) – If True, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated. vep_root (str) – Top-level location of vep data. All variable-length VEP fields (intergenic_consequences, motif_feature_consequences, regulatory_feature_consequences, and transcript_consequences) will be split properly (i.e. a_index corresponding to the VEP allele_num). MatrixTable or Table – A biallelic variant dataset.
hail.methods.summarize_variants(mt: hail.matrixtable.MatrixTable, show=True)[source]

Summarize the variants present in a dataset and print the results.

Examples

>>> hl.summarize_variants(dataset)
==============================
Number of variants: 346
==============================
Alleles per variant
-------------------
2 alleles: 346 variants
==============================
Variants per contig
-------------------
20: 346 variants
==============================
Allele type distribution
------------------------
SNP: 301 alleles
Deletion: 27 alleles
Insertion: 18 alleles
==============================
Parameters: mt (MatrixTable) – Matrix table with a variant (locus / alleles) row key. show (bool) – If True, print results instead of returning them.

Notes

The result returned if show is False is a Struct with four fields:

• n_variants (int): Number of variants present in the matrix table.
• allele_types (Dict[str, int]): Number of alternate alleles in each allele allele category.
• contigs (Dict[str, int]): Number of variants on each contig.
• allele_counts (Dict[int, int]): Number of variants broken down by number of alleles (biallelic is 2, for example).
Returns: None or Struct – Returns None if show is True, or returns results as a struct.
hail.methods.transmission_disequilibrium_test(dataset, pedigree) → hail.table.Table[source]

Performs the transmission disequilibrium test on trios.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key:

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Compute TDT association statistics and show the first two results:

>>> tdt_table = hl.transmission_disequilibrium_test(tdt_dataset, pedigree)
>>> tdt_table.show(2)
+---------------+------------+-------+-------+-------------+-------------+
| locus         | alleles    |     t |     u |      chi_sq |     p_value |
+---------------+------------+-------+-------+-------------+-------------+
| locus<GRCh37> | array<str> | int32 | int32 |     float64 |     float64 |
+---------------+------------+-------+-------+-------------+-------------+
| 1:246714629   | ["C","A"]  |     0 |     4 | 4.00000e+00 | 4.55003e-02 |
| 2:167262169   | ["T","C"]  |    NA |    NA |          NA |          NA |
+---------------+------------+-------+-------+-------------+-------------+

Export variants with p-values below 0.001:

>>> tdt_table = tdt_table.filter(tdt_table.p_value < 0.001)
>>> tdt_table.export("output/tdt_results.tsv")

Notes

The transmission disequilibrium test compares the number of times the alternate allele is transmitted (t) versus not transmitted (u) from a heterozgyous parent to an affected child. The null hypothesis holds that each case is equally likely. The TDT statistic is given by

$(t - u)^2 \over (t + u)$

and asymptotically follows a chi-squared distribution with one degree of freedom under the null hypothesis.

transmission_disequilibrium_test() only considers complete trios (two parents and a proband with defined sex) and only returns results for the autosome, as defined by in_autosome(), and chromosome X. Transmissions and non-transmissions are counted only for the configurations of genotypes and copy state in the table below, in order to filter out Mendel errors and configurations where transmission is guaranteed. The copy state of a locus with respect to a trio is defined as follows:

• Auto – in autosome or in PAR of X or female child
• HemiX – in non-PAR of X and male child

Here PAR is the pseudoautosomal region of X and Y defined by ReferenceGenome, which many variant callers map to chromosome X.

Kid Dad Mom Copy State t u
HomRef Het Het Auto 0 2
HomRef HomRef Het Auto 0 1
HomRef Het HomRef Auto 0 1
Het Het Het Auto 1 1
Het HomRef Het Auto 1 0
Het Het HomRef Auto 1 0
Het HomVar Het Auto 0 1
Het Het HomVar Auto 0 1
HomVar Het Het Auto 2 0
HomVar Het HomVar Auto 1 0
HomVar HomVar Het Auto 1 0
HomRef HomRef Het HemiX 0 1
HomRef HomVar Het HemiX 0 1
HomVar HomRef Het HemiX 1 0
HomVar HomVar Het HemiX 1 0

tdt() produces a table with the following columns:

• locus (tlocus) – Locus.
• alleles (tarray of tstr) – Alleles.
• t (tint32) – Number of transmitted alternate alleles.
• u (tint32) – Number of untransmitted alternate alleles.
• chi_sq (tfloat64) – TDT statistic.
• p_value (tfloat64) – p-value.
Parameters: dataset (MatrixTable) – Dataset. pedigree (Pedigree) – Sample pedigree. Table – Table of TDT results.
hail.methods.trio_matrix(dataset, pedigree, complete_trios=False) → hail.matrixtable.MatrixTable[source]

Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio.

Note

Requires the column key to be one field of type tstr

Examples

Create a trio matrix:

>>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)

Notes

This method builds a new matrix table with one column per trio. If complete_trios is True, then only trios that satisfy Trio.is_complete() are included. In this new dataset, the column identifiers are the sample IDs of the trio probands. The column fields and entries of the matrix are changed in the following ways:

The new column fields consist of three structs (proband, father, mother), a Boolean field, and a string field:

• proband (tstruct) - Column fields on the proband.
• father (tstruct) - Column fields on the father.
• mother (tstruct) - Column fields on the mother.
• id (tstr) - Column key for the proband.
• is_female (tbool) - Proband is female. True for female, False for male, missing if unknown.
• fam_id (tstr) - Family ID.

The new entry fields are:

• proband_entry (tstruct) - Proband entry fields.
• father_entry (tstruct) - Father entry fields.
• mother_entry (tstruct) - Mother entry fields.
Parameters: pedigree (Pedigree) MatrixTable
hail.methods.variant_qc(mt, name='variant_qc') → hail.matrixtable.MatrixTable[source]

Compute common variant statistics (quality control metrics).

Note

Requires the dataset to have a compound row key:

Examples

>>> dataset_result = hl.variant_qc(dataset)

Notes

This method computes variant statistics from the genotype data, returning a new struct field name with the following metrics based on the fields present in the entry schema.

If mt contains an entry field DP of type tint32, then the field dp_stats is computed. If mt contains an entry field GQ of type tint32, then the field gq_stats is computed. Both dp_stats and gq_stats are structs with with four fields:

• mean (float64) – Mean value.
• stdev (float64) – Standard deviation (zero degrees of freedom).
• min (int32) – Minimum value.
• max (int32) – Maximum value.

If the dataset does not contain an entry field GT of type tcall, then an error is raised. The following fields are always computed from GT:

• AF (array<float64>) – Calculated allele frequency, one element per allele, including the reference. Sums to one. Equivalent to AC / AN.
• AC (array<int32>) – Calculated allele count, one element per allele, including the reference. Sums to AN.
• AN (int32) – Total number of called alleles.
• homozygote_count (array<int32>) – Number of homozygotes per allele. One element per allele, including the reference.
• n_called (int64) – Number of samples with a defined GT.
• n_not_called (int64) – Number of samples with a missing GT.
• call_rate (float32) – Fraction of samples with a defined GT. Equivalent to n_called / count_cols().
• n_het (int64) – Number of heterozygous samples.
• n_non_ref (int64) – Number of samples with at least one called non-reference allele.
• het_freq_hwe (float64) – Expected frequency of heterozygous samples under Hardy-Weinberg equilibrium. See functions.hardy_weinberg_test() for details.
• p_value_hwe (float64) – p-value from test of Hardy-Weinberg equilibrium. See functions.hardy_weinberg_test() for details.

Warning

het_freq_hwe and p_value_hwe are calculated as in functions.hardy_weinberg_test(), with non-diploid calls (ploidy != 2) ignored in the counts. As this test is only statistically rigorous in the biallelic setting, variant_qc() sets both fields to missing for multiallelic variants. Consider using split_multi() to split multi-allelic variants beforehand.

Parameters: mt (MatrixTable) – Dataset. name (str) – Name for resulting field. MatrixTable
hail.methods.vep(dataset: Union[hail.table.Table, hail.matrixtable.MatrixTable], config, block_size=1000, name='vep', csq=False)[source]

Annotate variants with VEP.

Note

Requires the dataset to have a compound row key:

vep() runs Variant Effect Predictor on the current dataset and adds the result as a row field.

Examples

Add VEP annotations to the dataset:

>>> result = hl.vep(dataset, "data/vep-configuration.json")

Notes

Configuration

vep() needs a configuration file to tell it how to run VEP. The format of the configuration file is JSON, and vep() expects a JSON object with three fields:

• command (array of string) – The VEP command line to run. The string literal __OUTPUT_FORMAT_FLAG__ is replaced with –json or –vcf depending on csq.
• env (object) – A map of environment variables to values to add to the environment when invoking the command. The value of each object member must be a string.
• vep_json_schema (string): The type of the VEP JSON schema (as produced by the VEP when invoked with the –json option). Note: This is the old-style ‘parseable’ Hail type syntax. This will change.

Here is an example configuration file for invoking VEP release 85 installed in /vep with the Loftee plugin:

{
"command": [
"/vep",
"--format", "vcf",
"__OUTPUT_FORMAT_FLAG__",
"--everything",
"--allele_number",
"--no_stats",
"--cache", "--offline",
"--minimal",
"--assembly", "GRCh37",
"--plugin", "LoF,human_ancestor_fa:/root/.vep/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/root/.vep/loftee_data/phylocsf_gerp.sql,gerp_file:/root/.vep/loftee_data/GERP_scores.final.sorted.txt.gz",
"-o", "STDOUT"
],
"env": {
"PERL5LIB": "/vep_data/loftee"
},
}

Annotations

A new row field is added in the location specified by name with type given by the type given by the json_vep_schema (if csq is False) or tstr (if csq is True).

If csq is True, then the CSQ header string is also added as a global field with name name + '_csq_header'.

Parameters: dataset (MatrixTable or Table) – Dataset. config (str) – Path to VEP configuration file. block_size (int) – Number of rows to process per VEP invocation. name (str) – Name for resulting row field. csq (bool) – If True, annotates with the VCF CSQ field as a tstr. If False, annotates as the vep_json_schema. MatrixTable or Table – Dataset with new row-indexed field name containing VEP annotations.
hail.methods.window_by_locus(mt: hail.matrixtable.MatrixTable, bp_window_size: int) → hail.matrixtable.MatrixTable[source]

Collect arrays of row and entry values from preceding loci.

Note

Requires that the row key starts with:

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

Examples

>>> ds_result = hl.window_by_locus(ds, 3)

Notes

This method groups each row (variant) with the previous rows in a window of bp_window_size base pairs, putting the row values from the previous variants into prev_rows (row field of type array<struct>) and entry values from those variants into prev_entries (entry field of type array<struct>).

The bp_window_size argument is inclusive; if base_pairs is 2 and the loci are

1:100
1:100
1:102
1:102
1:103
2:100
2:101

then the size of prev_rows is 0, 1, 2, 3, 2, 0, and 1, respectively (and same for the size of prev_entries).

Parameters: mt (MatrixTable) – Input dataset. bp_window_size (int) – Base pairs to include in the backwards window (inclusive). MatrixTable