Genetics¶
balding_nichols_model (n_populations, …[, …]) 
Generate a matrix table of variants, samples, and genotypes using the Balding-Nichols model. 
concordance (left, right) 
Calculate call concordance with another dataset. 
filter_intervals (ds, intervals[, keep]) 
Filter rows with a list of intervals. 
filter_alleles (mt, f) 
Filter alternate alleles. 
filter_alleles_hts (mt, f, subset) 
Filter alternate alleles and update standard GATK entry fields. 
hwe_normalized_pca (call_expr[, k, …]) 
Run principal component analysis (PCA) on the Hardy-Weinberg-normalized genotype call matrix. 
identity_by_descent (dataset[, maf, bounded, …]) 
Compute matrix of identity-by-descent estimates. 
genetic_relatedness_matrix (call_expr) 
Compute the genetic relatedness matrix (GRM). 
realized_relationship_matrix (call_expr) 
Computes the realized relationship matrix (RRM). 
impute_sex (call[, aaf_threshold, …]) 
Impute sex of samples by calculating inbreeding coefficient on the X chromosome. 
ld_prune (call_expr[, r2, bp_window_size, …]) 
Returns a maximal subset of nearly-uncorrelated variants. 
mendel_errors (call, pedigree) 
Find Mendel errors; count per variant, individual and nuclear family. 
de_novo (mt, pedigree, pop_frequency_prior, …) 
Call putative de novo events from trio data. 
nirvana (dataset, config[, block_size, name]) 
Annotate variants using Nirvana. 
pc_relate (call_expr, min_individual_maf, *) 
Compute relatedness estimates between individuals using a variant of the PCRelate method. 
sample_qc (dataset[, name]) 
Compute per-sample metrics useful for quality control. 
skat (key_expr, weight_expr, y, x[, …]) 
Test each keyed group of rows for association by linear or logistic SKAT test. 
split_multi_hts (ds[, keep_star, left_aligned]) 
Split multiallelic variants for datasets with a standard high-throughput sequencing entry schema. 
SplitMulti (ds[, keep_star, left_aligned]) 
Split multiallelic variants. 
transmission_disequilibrium_test (dataset, …) 
Performs the transmission disequilibrium test on trios. 
trio_matrix (dataset, pedigree[, complete_trios]) 
Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio. 
variant_qc (dataset[, name]) 
Compute common variant statistics (quality control metrics). 
vep (dataset, config[, block_size, name, csq]) 
Annotate variants with VEP. 
window_by_locus (mt, bp_window_size) 
Collect arrays of row and entry values from preceding loci. 

hail.methods.
balding_nichols_model
(n_populations, n_samples, n_variants, n_partitions=None, pop_dist=None, fst=None, af_dist=<hail.stats.distributions.UniformDist object>, seed=0, reference_genome='default', mixture=False) → hail.matrixtable.MatrixTable[source]¶ Generate a matrix table of variants, samples, and genotypes using the Balding-Nichols model.
Examples
Generate a matrix table of genotypes with 1000 variants and 100 samples across 3 populations:
>>> bn_ds = hl.balding_nichols_model(3, 100, 1000)
Generate a matrix table using 4 populations, 40 samples, 150 variants, 3 partitions, population distribution [0.1, 0.2, 0.3, 0.4], \(F_{ST}\) values [.02, .06, .04, .12], ancestral allele frequencies drawn from a truncated beta distribution with a = 0.01 and b = 2.0 over the interval [0.05, 1], and random seed 1:

>>> from hail.stats import TruncatedBetaDist
>>>
>>> bn_ds = hl.balding_nichols_model(4, 40, 150, 3,
...          pop_dist=[0.1, 0.2, 0.3, 0.4],
...          fst=[.02, .06, .04, .12],
...          af_dist=TruncatedBetaDist(a=0.01, b=2.0, min=0.05, max=1.0),
...          seed=1)
Notes
This method simulates a matrix table of variants, samples, and genotypes using the Balding-Nichols model, which we now define.

- \(K\) populations are labeled by integers 0, 1, ..., K - 1.
- \(N\) samples are labeled by strings 0, 1, ..., N - 1.
- \(M\) variants are defined as 1:1:A:C, 1:2:A:C, ..., 1:M:A:C.
- The default distribution for population assignment \(\pi\) is uniform.
- The default ancestral frequency distribution \(P_0\) is uniform on [0.1, 0.9]. Other options are UniformDist, BetaDist, and TruncatedBetaDist. All three classes are located in hail.stats.
- The default \(F_{ST}\) values are all 0.1.
The Balding-Nichols model models genotypes of individuals from a structured population comprising \(K\) homogeneous modern populations that have each diverged from a single ancestral population (a star phylogeny). Each sample is assigned a population by sampling from the categorical distribution \(\pi\). Note that the actual size of each population is random.
Variants are modeled as biallelic and unlinked. Ancestral allele frequencies are drawn independently for each variant from a frequency spectrum \(P_0\). The extent of genetic drift of each modern population from the ancestral population is defined by the corresponding \(F_{ST}\) parameter \(F_k\) (here and below, lowercase indices run over a range bounded by the corresponding uppercase parameter, e.g. \(k = 1, \ldots, K\)). For each variant and population, allele frequencies are drawn from a beta distribution whose parameters are determined by the ancestral allele frequency and \(F_{ST}\) parameter. The beta distribution gives a continuous approximation of the effect of genetic drift. We denote sample population assignments by \(k_n\), ancestral allele frequencies by \(p_m\), population allele frequencies by \(p_{k, m}\), and diploid, unphased genotype calls by \(g_{n, m}\) (0, 1, and 2 correspond to homozygous reference, heterozygous, and homozygous variant, respectively).
The generative model is then given by:
\[\begin{aligned}k_n \,&\sim\, \pi\\p_m \,&\sim\, P_0\\p_{k,m} \mid p_m \,&\sim\, \mathrm{Beta}(\mu = p_m,\, \sigma^2 = F_k p_m (1 - p_m))\\g_{n,m} \mid k_n, p_{k,m} \,&\sim\, \mathrm{Binomial}(2, p_{k_n, m})\end{aligned}\]

The beta distribution is parameterized by its mean and variance above; the usual parameters are \(a = p \frac{1 - F}{F}\) and \(b = (1 - p) \frac{1 - F}{F}\) with \(F = F_k\) and \(p = p_m\).
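The generative model above can be sketched directly in NumPy. This is an illustration of the model, not Hail's implementation; the function name and the plain-array representation are invented for this example:

```python
import numpy as np

def balding_nichols_sketch(K, N, M, pop_dist, fst, rng):
    """Toy simulation of the Balding-Nichols generative model above."""
    pi = np.asarray(pop_dist, dtype=float)
    pi = pi / pi.sum()                       # normalize population distribution
    k_n = rng.choice(K, size=N, p=pi)        # population assignment per sample
    p_m = rng.uniform(0.1, 0.9, size=M)      # ancestral allele frequencies
    geno = np.empty((N, M), dtype=int)
    for k in range(K):
        F = fst[k]
        # Beta with mean p_m and variance F * p_m * (1 - p_m):
        a = p_m * (1 - F) / F
        b = (1 - p_m) * (1 - F) / F
        p_km = rng.beta(a, b)                # population allele frequencies
        members = k_n == k
        geno[members] = rng.binomial(2, p_km, size=(int(members.sum()), M))
    return k_n, p_m, geno

rng = np.random.default_rng(0)
pops, anc_af, gt = balding_nichols_sketch(
    3, 100, 1000, pop_dist=[1, 1, 1], fst=[0.1, 0.1, 0.1], rng=rng)
```

The resulting `gt` array plays the role of the GT entry field: one diploid, unphased genotype count per sample and variant.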
The resulting dataset has the following fields.
Global fields:
- n_populations (tint32) – Number of populations.
- n_samples (tint32) – Number of samples.
- n_variants (tint32) – Number of variants.
- pop_dist (tarray of tfloat64) – Population distribution indexed by population.
- fst (tarray of tfloat64) – \(F_{ST}\) values indexed by population.
- ancestral_af_dist (tstruct) – Description of the ancestral allele frequency distribution.
- seed (tint32) – Random seed.
- mixture (tbool) – Value of mixture parameter.
Row fields:
- locus (tlocus) – Variant locus (key field).
- alleles (tarray of tstr) – Variant alleles (key field).
- ancestral_af (tfloat64) – Ancestral allele frequency.
- af (tarray of tfloat64) – Modern allele frequencies indexed by population.
Column fields:
Entry fields:
- GT (tcall) – Genotype call (diploid, unphased).
Parameters:
- n_populations (int) – Number of modern populations.
- n_samples (int) – Total number of samples.
- n_variants (int) – Number of variants.
- n_partitions (int, optional) – Number of partitions. Default is 1 partition per million entries or 8, whichever is larger.
- pop_dist (list of float, optional) – Unnormalized population distribution, a list of length n_populations with non-negative values. Default is [1, ..., 1].
- fst (list of float, optional) – \(F_{ST}\) values, a list of length n_populations with values in (0, 1). Default is [0.1, ..., 0.1].
- af_dist (UniformDist or BetaDist or TruncatedBetaDist) – Ancestral allele frequency distribution. Default is UniformDist(0.1, 0.9).
- seed (int) – Random seed.
- reference_genome (str or ReferenceGenome) – Reference genome to use.
- mixture (bool) – Treat pop_dist as the parameters of a Dirichlet distribution, as in the Pritchard-Stephens-Donnelly model. This feature is EXPERIMENTAL and currently undocumented and untested. If True, the type of pop is tarray of tfloat64 and the value is the mixture proportions.

Returns: MatrixTable – Simulated matrix table of variants, samples, and genotypes.

hail.methods.
concordance
(left, right) → Tuple[List[List[int]], hail.table.Table, hail.table.Table][source]¶ Calculate call concordance with another dataset.
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Note
Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note

Requires the dataset to contain only diploid and unphased genotype calls. Use call() to recode genotype calls or null() to set genotype calls to missing.

Examples
Compute concordance between two datasets and output the global concordance statistics and two tables with concordance computed per column key and per row key:
>>> global_conc, cols_conc, rows_conc = hl.concordance(dataset, dataset2)
Notes
This method computes the genotype call concordance (from the entry field GT) between two biallelic variant datasets. It requires unique sample IDs and performs an inner join on samples (only samples in both datasets will be considered). In addition, all genotype calls must be diploid and unphased.
It performs an ordered zip join of the variants. That means the variants of each dataset are sorted, with duplicate variants appearing in some random relative order, and then zipped together. When a variant appears a different number of times between the two datasets, the dataset with the fewer number of instances is padded with “no data”. For example, if a variant is only in one dataset, then each genotype is treated as “no data” in the other.
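The padding behavior can be illustrated with a toy zip join over sorted variant keys. This is illustration only; `zip_join` and the string keys are invented for this sketch, and Hail's actual join is distributed:

```python
# Sketch of the ordered zip join with "no data" padding: variants present in
# only one dataset are paired with None on the other side.
def zip_join(left_keys, right_keys):
    left, right = sorted(left_keys), sorted(right_keys)
    i = j = 0
    pairs = []
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and left[i] < right[j]):
            pairs.append((left[i], None))      # right side padded ("no data")
            i += 1
        elif i >= len(left) or right[j] < left[i]:
            pairs.append((None, right[j]))     # left side padded ("no data")
            j += 1
        else:
            pairs.append((left[i], right[j]))  # variant present in both
            i += 1
            j += 1
    return pairs

pairs = zip_join(['1:100:A:C', '1:200:G:T'], ['1:200:G:T', '2:50:C:A'])
# '1:100:A:C' appears only on the left, '2:50:C:A' only on the right
```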
This method returns a tuple of three objects: a nested list of list of int with global concordance summary statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.
Using the global summary result
The global summary is a list of list of int (conceptually a 5 by 5 matrix), where the indices have special meaning:

- 0: No Data (missing variant)
- 1: No Call (missing genotype call)
- 2: Hom Ref
- 3: Heterozygous
- 4: Hom Var

The first index is the state in the left dataset and the second index is the state in the right dataset. Typical uses of the summary list are shown below.
>>> summary, samples, variants = hl.concordance(dataset, dataset2)
>>> left_homref_right_homvar = summary[2][4]
>>> left_het_right_missing = summary[3][1]
>>> left_het_right_something_else = sum(summary[3][:]) - summary[3][3]
>>> total_concordant = summary[2][2] + summary[3][3] + summary[4][4]
>>> total_discordant = sum([sum(s[2:]) for s in summary[2:]]) - total_concordant
Using the table results
Table 1: Concordance statistics by column
This table contains the column key field of left, and the following fields:

- n_discordant (tint64) – Number of discordant calls.
- concordance (tarray of tarray of tint64) – Concordance statistics matrix.
Table 2: Concordance statistics by row
This table contains the row key fields of left, and the following fields:

- n_discordant (tint64) – Number of discordant calls.
- concordance (tarray of tarray of tint64) – Concordance statistics matrix.
In these tables, the column n_discordant is provided as a convenience, because this is often one of the most useful concordance statistics. This value is the number of genotypes which were called (homozygous reference, heterozygous, or homozygous variant) in both datasets, but where the call did not match between the two.
The column concordance matches the structure of the global summary, which is detailed above. Once again, the first index into this array is the state on the left, and the second index is the state on the right. For example, concordance[1][4] is the number of "no call" genotypes on the left that were called homozygous variant on the right.

Parameters:
- left (MatrixTable) – First dataset to compare.
- right (MatrixTable) – Second dataset to compare.

Returns: (list of list of int, Table, Table) – The global concordance statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.

hail.methods.
filter_intervals
(ds, intervals, keep=True) → hail.matrixtable.MatrixTable[source]¶ Filter rows with a list of intervals.
Examples
Filter to loci falling within one interval:
>>> ds_result = hl.filter_intervals(dataset, [hl.parse_locus_interval('17:38449840-38530994')])
Remove all loci within list of intervals:
>>> intervals = [hl.parse_locus_interval(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
>>> ds_result = hl.filter_intervals(dataset, intervals)
Notes
Based on the keep argument, this method will either restrict to points in the supplied interval ranges, or remove all rows in those ranges.

When keep=True, partitions that don't overlap any supplied interval will not be loaded at all. This enables filter_intervals() to be used for reasonably low-latency queries of small ranges of the dataset, even on large datasets.

Parameters:
- ds (MatrixTable) – Dataset.
- intervals (ArrayExpression of type tinterval) – Intervals to filter on. If there is only one row partition key, the point type of the interval can be the type of the first partition key. Otherwise, the interval point type must be a Struct matching the row partition key schema.
- keep (bool) – If True, keep only rows that fall within any interval in intervals. If False, keep only rows that fall outside all intervals in intervals.

Returns: MatrixTable – Filtered dataset.

hail.methods.
filter_alleles
(mt: hail.matrixtable.MatrixTable, f: Callable) → hail.matrixtable.MatrixTable[source]¶ Filter alternate alleles.
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Examples
Keep SNPs:
>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))
Keep alleles with AC > 0:
>>> ds_result = hl.filter_alleles(ds, lambda a, allele_index: ds.info.AC[allele_index - 1] > 0)
Update the AC field of the resulting dataset:
>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i - 1]))
>>> ds_result = ds_result.annotate_rows(info = updated_info)
Notes
The following new fields are generated:
- old_locus (locus) – The old locus, before filtering and computing the minimal representation.
- old_alleles (array<str>) – The old alleles, before filtering and computing the minimal representation.
- old_to_new (array<int32>) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.
- new_to_old (array<int32>) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.
If all alternate alleles of a variant are filtered out, the variant itself is filtered out.
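The index maps can be illustrated with a small helper that mimics how old_to_new and new_to_old relate to a keep predicate. The helper is hypothetical, not part of Hail, and uses None to stand in for a missing entry:

```python
# Illustration of the old_to_new / new_to_old index maps produced when
# alternate alleles are filtered. Index 0 is the reference allele and is
# always kept.
def allele_index_maps(alleles, keep):
    """keep: predicate over (alt_allele, old_index) for old_index >= 1."""
    kept = [0] + [i for i in range(1, len(alleles)) if keep(alleles[i], i)]
    new_to_old = kept
    old_to_new = [None] * len(alleles)          # None models a missing entry
    for new_i, old_i in enumerate(kept):
        old_to_new[old_i] = new_i
    return old_to_new, new_to_old

# Keep only single-base (SNP-length) alts at a site with alleles A (ref), C, AT:
old_to_new, new_to_old = allele_index_maps(
    ['A', 'C', 'AT'], lambda a, i: len(a) == 1)
# old_to_new == [0, 1, None]; new_to_old == [0, 1]
```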
Using f

The f argument is a function or lambda evaluated per alternate allele to determine whether that allele is kept. If f evaluates to True, the allele is kept. If f evaluates to False or missing, the allele is removed.

f is a function that takes two arguments: the allele string (of type StringExpression) and the allele index (of type Int32Expression), and returns a boolean expression. This can be either a defined function or a lambda. For example, these two usages are equivalent:

(with a lambda)
>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))
(with a defined function)
>>> def filter_f(allele, allele_index):
...     return hl.is_snp(ds.alleles[0], allele)
>>> ds_result = hl.filter_alleles(ds, filter_f)
Warning
filter_alleles() does not update any fields other than locus and alleles. This means that row fields like allele count (AC) and entry fields like allele depth (AD) can become meaningless unless they are also updated. You can update them with annotate_rows() and annotate_entries().

See also

filter_alleles_hts()

Parameters:
- mt (MatrixTable) – Dataset.
- f (callable) – Function from (allele: StringExpression, allele_index: Int32Expression) to BooleanExpression.

Returns: MatrixTable – Filtered dataset.

hail.methods.
filter_alleles_hts
(mt: hail.matrixtable.MatrixTable, f: Callable, subset: bool = False) → hail.matrixtable.MatrixTable[source]¶ Filter alternate alleles and update standard GATK entry fields.
Examples
Filter to SNP alleles using the subset strategy:
>>> ds_result = hl.filter_alleles_hts(
...     ds,
...     lambda allele, _: hl.is_snp(ds.alleles[0], allele),
...     subset=True)
Update the AC field of the resulting dataset:
>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i - 1]))
>>> ds_result = ds_result.annotate_rows(info = updated_info)
Notes
For usage of the f argument, see the filter_alleles() documentation.

filter_alleles_hts() requires the dataset to have the GATK VCF schema, namely the following entry fields in this order:

GT: call
AD: array<int32>
DP: int32
GQ: int32
PL: array<int32>

Use MatrixTable.select_entries() to rearrange these fields if necessary.

The following new fields are generated:
- old_locus (locus) – The old locus, before filtering and computing the minimal representation.
- old_alleles (array<str>) – The old alleles, before filtering and computing the minimal representation.
- old_to_new (array<int32>) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.
- new_to_old (array<int32>) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.
Downcode algorithm
We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.
GT: 1/2
GQ: 10
AD: 0,50,35

0 | 1000
1 | 1000    10
2 | 1000     0    20
  +------------------
       0     1     2
The downcode algorithm recodes occurrences of filtered alleles to occurrences of the reference allele (e.g. 1 -> 0 in our example). So the depths of filtered alleles in the AD field are added to the depth of the reference allele. Where downcoding filtered alleles merges distinct genotypes, the minimum PL is used (since PL is on a log scale, this roughly corresponds to adding probabilities). The PLs are then renormalized (shifted) so that the most likely genotype has a PL of 0, and GT is set to this genotype. If an allele is filtered, this algorithm acts similarly to split_multi_hts().

The downcode algorithm would produce the following:

GT: 0/1
GQ: 10
AD: 35,50

0 | 20
1 |  0    10
  +----------
     0     1
In summary:
- GT: Downcode filtered alleles to reference.
- AD: Columns of filtered alleles are eliminated and their values are added to the reference column, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 40,20.
- DP: No change.
- PL: Downcode filtered alleles to reference, combine PLs using minimum for each overloaded genotype, and shift so the overall minimum PL is 0.
- GQ: The second-lowest PL (after shifting).
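The downcode arithmetic can be sketched in plain Python. This is an illustration of the rule, not Hail's implementation; `gt_index` follows the standard VCF ordering of unphased genotypes, and the helper names are invented:

```python
from itertools import combinations_with_replacement

def gt_index(j, k):
    """VCF genotype ordering: index of unphased genotype j/k with j <= k."""
    return k * (k + 1) // 2 + j

def downcode(ad, pl, n_alleles, kept_alts):
    """Filtered alts are recoded as reference: their AD is added to the
    reference depth, and PLs of genotypes that merge take the minimum."""
    old_to_new = {0: 0}
    new_i = 0
    for a in range(1, n_alleles):
        if a in kept_alts:
            new_i += 1
            old_to_new[a] = new_i
        else:
            old_to_new[a] = 0                # downcode to reference
    n_new = new_i + 1
    new_ad = [0] * n_new
    for a in range(n_alleles):
        new_ad[old_to_new[a]] += ad[a]
    new_pl = [None] * (gt_index(n_new - 1, n_new - 1) + 1)
    for j, k in combinations_with_replacement(range(n_alleles), 2):
        nj, nk = sorted((old_to_new[j], old_to_new[k]))
        i = gt_index(nj, nk)
        old = pl[gt_index(j, k)]
        new_pl[i] = old if new_pl[i] is None else min(new_pl[i], old)
    shift = min(new_pl)
    new_pl = [p - shift for p in new_pl]     # renormalize so the minimum is 0
    gt = min(combinations_with_replacement(range(n_new), 2),
             key=lambda g: new_pl[gt_index(*g)])
    gq = sorted(new_pl)[1]                   # second-lowest PL
    return gt, gq, new_ad, new_pl

# The worked example above: triallelic site, keep alt allele 1, filter allele 2.
gt, gq, ad, pl = downcode(ad=[0, 50, 35],
                          pl=[1000, 1000, 10, 1000, 0, 20],
                          n_alleles=3, kept_alts={1})
# gt == (0, 1), gq == 10, ad == [35, 50], pl == [20, 0, 10]
```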
Subset algorithm
We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.
GT: 1/2
GQ: 10
AD: 0,50,35

0 | 1000
1 | 1000    10
2 | 1000     0    20
  +------------------
       0     1     2
The subset algorithm subsets the AD and PL arrays (i.e. removes entries corresponding to filtered alleles) and then sets GT to the genotype with the minimum PL. Note that if the genotype changes (as in the example), the PLs are renormalized (shifted) so that the most likely genotype has a PL of 0. Qualitatively, subsetting corresponds to the belief that the filtered alleles are not real so we should discard any probability mass associated with them.
The subset algorithm would produce the following:
GT: 1/1
GQ: 980
AD: 0,50

0 | 980
1 | 980     0
  +-----------
      0     1
In summary:
 GT: Set to most likely genotype based on the PLs ignoring the filtered allele(s).
- AD: The filtered alleles' columns are eliminated, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 25,20.
- DP: Unchanged.
- PL: Columns involving filtered alleles are eliminated and the remaining columns' values are shifted so the minimum value is 0.
- GQ: The second-lowest PL (after shifting).
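The subset arithmetic can likewise be sketched in plain Python. Again an illustration, not Hail's implementation; `gt_index` follows the standard VCF genotype ordering, and the helper names are invented:

```python
def gt_index(j, k):
    """VCF genotype ordering: index of unphased genotype j/k with j <= k."""
    return k * (k + 1) // 2 + j

def subset_alleles(ad, pl, kept_alts):
    """AD and PL entries involving filtered alleles are dropped; PLs are
    shifted so the minimum is 0 and GT is reset to the most likely genotype."""
    kept = [0] + sorted(kept_alts)           # reference allele is always kept
    n_new = len(kept)
    new_ad = [ad[a] for a in kept]
    new_pl = [0] * (gt_index(n_new - 1, n_new - 1) + 1)
    for nk, old_k in enumerate(kept):
        for nj in range(nk + 1):
            new_pl[gt_index(nj, nk)] = pl[gt_index(kept[nj], old_k)]
    shift = min(new_pl)
    new_pl = [p - shift for p in new_pl]     # renormalize so the minimum is 0
    gt = min(((j, k) for k in range(n_new) for j in range(k + 1)),
             key=lambda g: new_pl[gt_index(*g)])
    gq = sorted(new_pl)[1]                   # second-lowest PL
    return gt, gq, new_ad, new_pl

# The worked example above: triallelic site, keep alt allele 1, filter allele 2.
gt, gq, ad, pl = subset_alleles(ad=[0, 50, 35],
                                pl=[1000, 1000, 10, 1000, 0, 20],
                                kept_alts={1})
# gt == (1, 1), ad == [0, 50]
```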
Warning
filter_alleles_hts() does not update any row fields other than locus and alleles. This means that row fields like allele count (AC) can become meaningless unless they are also updated. You can update them with annotate_rows().

See also

filter_alleles()

Parameters:
- mt (MatrixTable) – Dataset.
- f (callable) – Function from (allele: StringExpression, allele_index: Int32Expression) to BooleanExpression.
- subset (bool) – Subset PL field if True, otherwise downcode PL field. The calculation of GT and GQ also depends on whether one subsets or downcodes the PL.

Returns: MatrixTable – Filtered dataset.

hail.methods.
hwe_normalized_pca
(call_expr, k=10, compute_loadings=False) → Tuple[List[float], hail.table.Table, hail.table.Table][source]¶ Run principal component analysis (PCA) on the Hardy-Weinberg-normalized genotype call matrix.
Examples
>>> eigenvalues, scores, loadings = hl.hwe_normalized_pca(dataset.GT, k=5)
Notes
This method specializes
pca()
for the common use case of PCA in statistical genetics, that of projecting samples to a small number of ancestry coordinates. Variants that are all homozygous reference or all homozygous alternate are unnormalizable and removed before evaluation. Seepca()
for more details.Users of PLINK/GCTA should be aware that Hail computes the GRM slightly differently with regard to missing data. In Hail, the \(ij\) entry of the GRM \(MM^T\) is simply the dot product of rows \(i\) and \(j\) of \(M\); in terms of \(C\) it is
\[\frac{1}{m}\sum_{l\in\mathcal{C}_i\cap\mathcal{C}_j}\frac{(C_{il}2p_l)(C_{jl}  2p_l)}{2p_l(1p_l)}\]where \(\mathcal{C}_i = \{l \mid C_{il} \text{ is nonmissing}\}\). In PLINK/GCTA the denominator \(m\) is replaced with the number of terms in the sum \(\lvert\mathcal{C}_i\cap\mathcal{C}_j\rvert\), i.e. the number of variants where both samples have nonmissing genotypes. While this is arguably a better estimator of the true GRM (trading shrinkage for noise), it has the drawback that one loses the clean interpretation of the loadings and scores as features and projections
Separately, for the PCs PLINK/GCTA output the eigenvectors of the GRM, i.e. the left singular vectors \(U_k\) instead of the component scores \(U_k S_k\). The scores have the advantage of representing true projections of the data onto features with the variance of a score reflecting the variance explained by the corresponding feature. In PC biplots this amounts to a change in aspect ratio; for use of PCs as covariates in regression it is immaterial.
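The distinction between GRM eigenvectors and component scores can be seen with a small NumPy SVD sketch (illustration only; the random matrix stands in for the normalized genotype matrix \(M\)):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 10))                 # 6 samples by 10 variants

# For M = U S V^T, PLINK/GCTA report columns of U (eigenvectors of M M^T),
# while the scores are the projections U S.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
eigenvectors = U[:, :k]                      # unit-norm directions
scores = U[:, :k] * s[:k]                    # sample projections onto top PCs

# The total squared score along each PC equals the squared singular value,
# i.e. the variance explained by that component (up to a 1/n factor).
```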
Parameters:
- call_expr (CallExpression) – Entry-indexed call expression.
- k (int) – Number of principal components.
- compute_loadings (bool) – If True, compute row loadings.

Returns: (list of float, Table, Table) – List of eigenvalues, table with column scores, table with row loadings.

hail.methods.
identity_by_descent
(dataset, maf=None, bounded=True, min=None, max=None) → hail.table.Table[source]¶ Compute matrix of identitybydescent estimates.
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Note
Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples
To calculate a full IBD matrix, using minor allele frequencies computed from the dataset itself:
>>> hl.identity_by_descent(dataset)
To calculate an IBD matrix containing only pairs of samples with PI_HAT in \([0.2, 0.9]\), using minor allele frequencies stored in the row field panel_maf:

>>> hl.identity_by_descent(dataset, maf=dataset['panel_maf'], min=0.2, max=0.9)
Notes
The implementation is based on the IBD algorithm described in the PLINK paper.
identity_by_descent() requires the dataset to be biallelic and does not perform LD pruning. Linkage disequilibrium may bias the result, so consider filtering variants first.

The resulting Table entries have the type: { i: String, j: String, ibd: { Z0: Double, Z1: Double, Z2: Double, PI_HAT: Double }, ibs0: Long, ibs1: Long, ibs2: Long }. The key list is: i: String, j: String.

Conceptually, the output is a symmetric, sample-by-sample matrix. The output table has the following form:
i        j        ibd.Z0  ibd.Z1  ibd.Z2  ibd.PI_HAT  ibs0  ibs1  ibs2
sample1  sample2  1.0000  0.0000  0.0000  0.0000      ...
sample1  sample3  1.0000  0.0000  0.0000  0.0000      ...
sample1  sample4  0.6807  0.0000  0.3193  0.3193      ...
sample1  sample5  0.1966  0.0000  0.8034  0.8034      ...
Parameters:
- dataset (MatrixTable) – Variant-keyed MatrixTable containing genotype information.
- maf (Float64Expression, optional) – Row-indexed expression for the minor allele frequency.
- bounded (bool) – Forces the estimations for Z0, Z1, Z2, and PI_HAT to take on biologically meaningful values (in the range [0,1]).
- min (float or None) – Sample pairs with a PI_HAT below this value will not be included in the output. Must be in \([0,1]\).
- max (float or None) – Sample pairs with a PI_HAT above this value will not be included in the output. Must be in \([0,1]\).

Returns: Table – Table of IBD statistics for each pair of samples.

hail.methods.
genetic_relatedness_matrix
(call_expr) → hail.genetics.kinshipMatrix.KinshipMatrix[source]¶ Compute the genetic relatedness matrix (GRM).
Examples
>>> grm = hl.genetic_relatedness_matrix(dataset.GT)
Notes
The genetic relationship matrix (GRM) \(G\) encodes genetic correlation between each pair of samples. It is defined by \(G = MM^T\) where \(M\) is a standardized version of the genotype matrix, computed as follows. Let \(C\) be the \(n \times m\) matrix of raw genotypes in the variant dataset, with rows indexed by \(n\) samples and columns indexed by \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as
\[M_{ij} = \frac{C_{ij} - 2p_j}{\sqrt{2p_j(1 - p_j)m}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in Hardy-Weinberg equilibrium and is further motivated in the paper Patterson, Price and Reich, 2006. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) so that the diagonal entries of the GRM are approximately 1. Equivalently,
\[G_{ik} = \frac{1}{m} \sum_{j=1}^m \frac{(C_{ij} - 2p_j)(C_{kj} - 2p_j)}{2 p_j (1 - p_j)}\]

Note that variants for which the alternate allele frequency is zero or one are not normalizable, and are therefore removed prior to calculating the GRM.
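The formulas above can be written as a direct in-memory NumPy sketch, in contrast to Hail's distributed implementation; `grm_sketch` and the toy genotype matrix are invented for illustration:

```python
import numpy as np

def grm_sketch(C):
    """C: n x m genotype matrix with entries 0/1/2 and np.nan for missing."""
    p = np.nanmean(C, axis=0) / 2                # alternate allele frequencies
    keep = (p > 0) & (p < 1)                     # drop unnormalizable variants
    C, p = C[:, keep], p[keep]
    m = C.shape[1]
    M = (C - 2 * p) / np.sqrt(2 * p * (1 - p) * m)
    M = np.nan_to_num(M)                         # mean imputation: 0 after centering
    return M @ M.T                               # G = M M^T

C = np.array([[0., 1., 2., 1.],
              [1., 2., 0., np.nan],
              [2., 0., 1., 0.]])
G = grm_sketch(C)                                # 3 x 3 symmetric matrix
```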
Parameters: call_expr (CallExpression) – Entry-indexed call expression.

Returns: genetics.KinshipMatrix – Genetic relatedness matrix for all samples.

hail.methods.
realized_relationship_matrix
(call_expr) → hail.genetics.kinshipMatrix.KinshipMatrix[source]¶ Computes the realized relationship matrix (RRM).
Examples
>>> rrm = hl.realized_relationship_matrix(dataset.GT)
Notes
The realized relationship matrix (RRM) is defined as follows. Consider the \(n \times m\) matrix \(C\) of raw genotypes, with rows indexed by \(n\) samples and columns indexed by the \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as
\[M_{ij} = \frac{C_{ij} - 2p_j}{\sqrt{\frac{m}{n} \sum_{k=1}^n (C_{kj} - 2p_j)^2}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes each variant column to have empirical variance \(1/m\), which gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the \(n \times n\) sample correlation or realized relationship matrix (RRM) \(K\) as simply
\[K = MM^T\]

Note that the only difference between the realized relationship matrix and the genetic relatedness matrix (GRM) computed by genetic_relatedness_matrix() is the variant (column) normalization: where RRM uses empirical variance, GRM uses expected variance under Hardy-Weinberg equilibrium.

Parameters: call_expr (CallExpression) – Entry-indexed call expression.

Returns: genetics.KinshipMatrix – Realized relationship matrix for all samples.
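The per-column empirical-variance scaling that distinguishes the RRM can be sketched in NumPy (illustration only, not Hail's implementation; `rrm_sketch` and the toy matrix are invented):

```python
import numpy as np

def rrm_sketch(C):
    """C: n x m genotype matrix with entries 0/1/2 and np.nan for missing."""
    n, m = C.shape
    p = np.nanmean(C, axis=0) / 2
    D = np.nan_to_num(C - 2 * p)                 # center; impute missing as mean
    denom = np.sqrt((m / n) * (D ** 2).sum(axis=0))
    keep = denom > 0                             # drop zero-variance variants
    M = D[:, keep] / denom[keep]
    return M @ M.T                               # K = M M^T

C = np.array([[0., 1., 2.],
              [1., 2., 0.],
              [2., 0., 1.]])
K = rrm_sketch(C)
# Each kept column of M has empirical variance 1/m, so diag(K) is near 1.
```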

hail.methods.
impute_sex
(call, aaf_threshold=0.0, include_par=False, female_threshold=0.2, male_threshold=0.8, aaf=None) → hail.table.Table[source]¶ Impute sex of samples by calculating inbreeding coefficient on the X chromosome.
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Note
Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples
Remove samples where imputed sex does not equal reported sex:
>>> imputed_sex = hl.impute_sex(dataset.GT)
>>> dataset_result = dataset.filter_cols(imputed_sex[dataset.s].is_female != dataset.pheno.is_female)
Notes
We have used the same implementation as PLINK v1.7.

Let gr be the reference genome of the type of the locus key (as given by TLocus.reference_genome()).

1. Filter the dataset to loci on the X contig defined by gr.
2. Calculate alternate allele frequency (AAF) for each row from the dataset.
3. Filter to variants with AAF above aaf_threshold.
4. Remove loci in the pseudoautosomal region, as defined by gr, if and only if include_par is True (it defaults to False).
5. For each row and column with a non-missing genotype call, \(E\), the expected number of homozygotes (from population AAF), is computed as \(1.0 - (2.0 \cdot maf \cdot (1.0 - maf))\).
6. For each row and column with a non-missing genotype call, \(O\), the observed number of homozygotes, is computed interpreting 0 as heterozygote and 1 as homozygote.
7. For each row and column with a non-missing genotype call, \(N\) is incremented by 1.
8. For each column, \(E\), \(O\), and \(N\) are combined across variants.
9. For each column, \(F\) is calculated by \((O - E) / (N - E)\).
10. A sex is assigned to each sample with the following criteria: female when F < 0.2, male when F > 0.8. Use female_threshold and male_threshold to change this behavior.
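The F-statistic computation described above can be sketched per sample in plain Python (illustration only; the helper names are invented):

```python
def f_stat(genotypes, aafs):
    """genotypes: list of 0/1/2/None calls for one sample, one per variant;
    aafs: alternate allele frequency per variant."""
    E = O = N = 0.0
    for g, p in zip(genotypes, aafs):
        if g is None:
            continue                       # skip missing genotype calls
        E += 1.0 - 2.0 * p * (1.0 - p)     # expected homozygosity under HWE
        O += 1 if g in (0, 2) else 0       # observed homozygote
        N += 1
    return (O - E) / (N - E)               # inbreeding coefficient F

def impute(f, female_threshold=0.2, male_threshold=0.8):
    if f < female_threshold:
        return 'female'
    if f > male_threshold:
        return 'male'
    return None                            # undetermined

f = f_stat([0, 2, 2, 0, None], [0.3, 0.4, 0.2, 0.5, 0.1])
# all called genotypes homozygous, so O == N and F == 1.0 -> imputed male
```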
Annotations
The returned column-key indexed Table has the following fields in addition to the matrix table's column keys:

- is_female (tbool) – True if the imputed sex is female, false if male, missing if undetermined.
- f_stat (tfloat64) – Inbreeding coefficient.
- n_called (tint64) – Number of variants with a genotype call.
- expected_homs (tfloat64) – Expected number of homozygotes.
- observed_homs (tint64) – Observed number of homozygotes.
Parameters:
- call (CallExpression) – A genotype call for each row and column. The source dataset's row keys must be [locus, alleles] with types tlocus and ArrayStringExpression. Moreover, the alleles array must have exactly two elements (i.e. the variant must be biallelic).
- aaf_threshold (float) – Minimum alternate allele frequency threshold.
- include_par (bool) – Include pseudoautosomal regions.
- female_threshold (float) – Samples are called females if F < female_threshold.
- male_threshold (float) – Samples are called males if F > male_threshold.
- aaf (str or None) – A field defining the alternate allele frequency for each row. If None, AAF will be computed from call.

Returns: Table – Sex imputation statistics per sample.

hail.methods.
ld_prune
(call_expr, r2=0.2, bp_window_size=1000000, memory_per_core=256)[source]¶ Returns a maximal subset of nearlyuncorrelated variants.
Note
Requires the dataset to contain only diploid genotype calls.
Note
Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Notes
This method finds a maximal subset of variants such that the squared Pearson correlation coefficient \(r^2\) of any pair at most bp_window_size base pairs apart is strictly less than r2. Each variant is represented as a vector over samples with elements given by the (mean-imputed) number of alternate alleles. In particular, even if present, phase information is ignored.
The method prunes variants in linkage disequilibrium in two stages.
The first (local) stage prunes correlated variants within each partition, using a local variant queue whose size is determined by memory_per_core. A larger queue may facilitate more local pruning in this stage. The parallelism is the number of partitions in the dataset.
The second (global) stage computes the correlation matrix across all remaining variants. Correlated variants within bp_window_size base pairs form the edges of a graph to which
maximal_independent_set()
is applied.
Parameters:  call_expr (
CallExpression
) – Entry-indexed call expression on a matrix table with row-indexed variants and column-indexed samples.  r2 (
float
) – Squared correlation threshold (exclusive upper bound).  bp_window_size (
int
) – Window size in base pairs (inclusive upper bound).  memory_per_core (
int
) – Memory in MB per core for local pruning queue.
Returns: Table
– Table of a maximal independent set of variants.

hail.methods.
mendel_errors
(call, pedigree) → Tuple[hail.table.Table, hail.table.Table, hail.table.Table, hail.table.Table][source]¶ Find Mendel errors; count per variant, individual and nuclear family.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Note
Requires the dataset to contain no multiallelic variants. Use
SplitMulti
or split_multi_hts()
to split multiallelic sites, or MatrixTable.filter_rows()
to remove them.
Examples
Find all violations of Mendelian inheritance in each (dad, mom, kid) trio in a pedigree and return four tables (all errors, errors by family, errors by individual, errors by variant):
>>> ped = hl.Pedigree.read('data/trios.fam')
>>> all_errors, per_fam, per_sample, per_variant = hl.mendel_errors(dataset['GT'], ped)
Export all Mendel errors to a text file:
>>> all_errors.export('output/all_mendel_errors.tsv')
Annotate columns with the number of Mendel errors:
>>> annotated_samples = dataset.annotate_cols(mendel=per_sample[dataset.s])
Annotate rows with the number of Mendel errors:
>>> annotated_variants = dataset.annotate_rows(mendel=per_variant[dataset.locus, dataset.alleles])
Notes
The example above returns four tables, which contain Mendelian violations grouped in various ways. These tables are modeled after the PLINK mendel formats, resembling the
.mendel
, .fmendel
, .imendel
, and .lmendel
formats, respectively.
First table: all Mendel errors. This table contains one row per Mendel error, keyed by the variant and proband id.
Second table: errors per nuclear family. This table contains one row per nuclear family, keyed by the parents.
 pat_id (
tstr
) – Paternal ID. (key field)  mat_id (
tstr
) – Maternal ID. (key field)  fam_id (
tstr
) – Family ID.  children (
tint32
) – Number of children in this nuclear family.  errors (
tint64
) – Number of Mendel errors in this nuclear family.  snp_errors (
tint64
) – Number of Mendel errors at SNPs in this nuclear family.
Third table: errors per individual. This table contains one row per individual. Each error is counted toward the proband, father, and mother according to the Implicated column in the table below.
Fourth table: errors per variant.
This method only considers complete trios (two parents and proband with defined sex). The code of each Mendel error is determined by the table below, extending the Plink classification.
In the table, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region (PAR) of X and Y defined by the reference genome and the autosome is defined by
in_autosome()
.
 Auto – in autosome or in PAR or female child
 HemiX – in non-PAR of X and male child
 HemiY – in non-PAR of Y and male child
Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.
Code  Dad      Mom      Kid     Copy State  Implicated
1     HomVar   HomVar   Het     Auto        Dad, Mom, Kid
2     HomRef   HomRef   Het     Auto        Dad, Mom, Kid
3     HomRef   ~HomRef  HomVar  Auto        Dad, Kid
4     ~HomRef  HomRef   HomVar  Auto        Mom, Kid
5     HomRef   HomRef   HomVar  Auto        Kid
6     HomVar   ~HomVar  HomRef  Auto        Dad, Kid
7     ~HomVar  HomVar   HomRef  Auto        Mom, Kid
8     HomVar   HomVar   HomRef  Auto        Kid
9     Any      HomVar   HomRef  HemiX       Mom, Kid
10    Any      HomRef   HomVar  HemiX       Mom, Kid
11    HomVar   Any      HomRef  HemiY       Dad, Kid
12    HomRef   Any      HomVar  HemiY       Dad, Kid
See also
Parameters:  dataset (
MatrixTable
)  pedigree (
Pedigree
)
Returns: (Table, Table, Table, Table) – All Mendel errors, errors per nuclear family, errors per individual, and errors per variant.
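The autosomal rows of the Mendel code table can be sketched as a plain lookup (a sketch for illustration; genotype labels are strings here, not Hail calls):

```python
# Sketch (assumption, not Hail's implementation): classify an autosomal
# trio call against the code table above. Genotypes are 'HomRef', 'Het',
# or 'HomVar'; returns a Mendel error code, or None if consistent.
def autosomal_mendel_code(dad, mom, kid):
    if kid == 'Het':
        if dad == 'HomVar' and mom == 'HomVar':
            return 1
        if dad == 'HomRef' and mom == 'HomRef':
            return 2
        return None
    if kid == 'HomVar':
        if dad == 'HomRef' and mom == 'HomRef':
            return 5                      # code 5 before codes 3/4
        if dad == 'HomRef':
            return 3
        if mom == 'HomRef':
            return 4
        return None
    if kid == 'HomRef':
        if dad == 'HomVar' and mom == 'HomVar':
            return 8                      # code 8 before codes 6/7
        if dad == 'HomVar':
            return 6
        if mom == 'HomVar':
            return 7
        return None
    return None
```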

hail.methods.
de_novo
(mt: hail.matrixtable.MatrixTable, pedigree: hail.genetics.pedigree.Pedigree, pop_frequency_prior, *, min_gq: int = 20, min_p: float = 0.05, max_parent_ab: float = 0.05, min_child_ab: float = 0.2, min_dp_ratio: float = 0.1) → hail.table.Table[source]¶ Call putative de novo events from trio data.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Note
Requires the dataset to contain no multiallelic variants. Use
SplitMulti
or split_multi_hts()
to split multiallelic sites, or MatrixTable.filter_rows()
to remove them.
Examples
Call de novo events:
>>> pedigree = hl.Pedigree.read('data/trios.fam')
>>> priors = hl.import_table('data/gnomadFreq.tsv', impute=True)
>>> priors = priors.transmute(**hl.parse_variant(priors.Variant)).key_by('locus', 'alleles')
>>> de_novo_results = hl.de_novo(dataset, pedigree, pop_frequency_prior=priors[dataset.row_key].AF)
Notes
This method assumes the GATK high-throughput sequencing fields exist: GT, AD, DP, GQ, PL.
This method replicates the functionality of Kaitlin Samocha’s de novo caller. The version corresponding to git commit
bde3e40
is implemented in Hail with her permission and assistance.
This method produces a
Table
with the following fields: locus (
locus
) – Variant locus.  alleles (
array<str>
) – Variant alleles.  prior (
float64
) – Site frequency prior. It is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1/3e7
.  proband (
struct
) – Proband column fields from mt.  father (
struct
) – Father column fields from mt.  mother (
struct
) – Mother column fields from mt.  proband_entry (
struct
) – Proband entry fields from mt.  father_entry (
struct
) – Father entry fields from mt.  mother_entry (
struct
) – Mother entry fields from mt.  is_female (
bool
) – True
if proband is female.  p_de_novo (
float64
) – Unfiltered posterior probability that the event is de novo rather than a missed heterozygous event in a parent.  confidence (
str
) – Validation confidence. One of: 'HIGH'
, 'MEDIUM'
, 'LOW'
.
The model looks for de novo events in which both parents are homozygous reference and the proband is heterozygous. The model makes the simplifying assumption that when this configuration
x = (AA, AA, AB)
of calls occurs, exactly one of the following is true:
 d : a de novo mutation occurred in the proband and all calls are accurate.
 m : at least one parental allele is actually heterozygous and the proband call is accurate.
We can then estimate the posterior probability of a de novo mutation as:
\[\mathrm{P_{\text{de novo}}} = \frac{\mathrm{P}(d \mid x)}{\mathrm{P}(d \mid x) + \mathrm{P}(m \mid x)}\]Applying Bayes rule to the numerator and denominator yields
\[\frac{\mathrm{P}(x \mid d)\,\mathrm{P}(d)}{\mathrm{P}(x \mid d)\,\mathrm{P}(d) + \mathrm{P}(x \mid m)\,\mathrm{P}(m)}\]The prior on de novo mutation is estimated from the rate in the literature:
\[\mathrm{P}(d) = \frac{1\, \text{mutation}}{30{,}000{,}000\, \text{bases}}\]The prior used for at least one alternate allele between the parents depends on the alternate allele frequency:
\[\mathrm{P}(m) = 1 - (1 - AF)^4\]The likelihoods \(\mathrm{P}(x \mid d)\) and \(\mathrm{P}(x \mid m)\) are computed from the PL (genotype likelihood) fields using these factorizations:
\[\begin{split}\mathrm{P}(x = (AA, AA, AB) \mid d) = \Big( &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AA) \\ \cdot &\mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AA) \\ \cdot &\mathrm{P}(x_{\mathrm{proband}} = AB \mid \mathrm{proband} = AB) \Big)\end{split}\]\[\begin{split}\mathrm{P}(x = (AA, AA, AB) \mid m) = \Big( & \mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AB) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AA) \\ + \, &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AA) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AB) \Big) \\ \cdot \, &\mathrm{P}(x_{\mathrm{proband}} = AB \mid \mathrm{proband} = AB)\end{split}\](Technically, the second factorization assumes there is exactly (rather than at least) one alternate allele among the parents, which may be justified on the grounds that it is typically the most likely case by far.)
While this posterior probability is a good metric for grouping putative de novo mutations by validation likelihood, there exist error modes in high-throughput sequencing data that are not appropriately accounted for by the phred-scaled genotype likelihoods. To this end, a number of hard filters are applied in order to assign validation likelihood.
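The posterior above can be sketched directly from the two priors and the PL-derived likelihoods (a sketch for illustration, not the caller itself; the likelihood inputs are assumed already factored as shown):

```python
# Sketch (assumption): the de novo posterior above, combining the
# likelihoods P(x|d) and P(x|m) with the two priors.
def p_de_novo(lik_given_dn: float, lik_given_missed_het: float, af: float) -> float:
    p_d = 1.0 / 3e7                  # prior on a de novo mutation
    p_m = 1.0 - (1.0 - af) ** 4      # prior on >= 1 parental alt allele
    num = lik_given_dn * p_d
    return num / (num + lik_given_missed_het * p_m)
```

Note how the tiny 1/3e7 prior means the likelihood of a missed parental het must be extremely small before the posterior becomes large.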
These filters are different for SNPs and insertions/deletions. In the below rules, the following variables are used:
DR
refers to the ratio of the read depth in the proband to the combined read depth in the parents.  AB
refers to the read allele balance of the proband (number of alternate reads divided by total reads).  AC
refers to the count of alternate alleles across all individuals in the dataset at the site.  p
refers to \(\mathrm{P_{\text{de novo}}}\).  min_p
refers to the min_p
function parameter.
HIGH-quality SNV:
p > 0.99 && AB > 0.3 && DR > 0.2 or p > 0.99 && AB > 0.3 && AC == 1
MEDIUM-quality SNV:
p > 0.5 && AB > 0.3 or p > 0.5 && AB > 0.2 && AC == 1
LOW-quality SNV:
p > min_p && AB > 0.2
HIGH-quality indel:
p > 0.99 && AB > 0.3 && DR > 0.2 or p > 0.99 && AB > 0.3 && AC == 1
MEDIUM-quality indel:
p > 0.5 && AB > 0.3 or p > 0.5 && AB > 0.2 && AC == 1
LOW-quality indel:
p > min_p && AB > 0.2
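These tiers can be sketched as a single predicate (a sketch for illustration, using the SNV rules; the listed indel rules have the same shape):

```python
# Sketch (assumption): assign a confidence tier from the hard-filter
# rules above. Returns None when no tier applies.
def snv_confidence(p: float, ab: float, dr: float, ac: int, min_p: float = 0.05):
    if p > 0.99 and ab > 0.3 and (dr > 0.2 or ac == 1):
        return 'HIGH'
    if p > 0.5 and (ab > 0.3 or (ab > 0.2 and ac == 1)):
        return 'MEDIUM'
    if p > min_p and ab > 0.2:
        return 'LOW'
    return None
```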
Additionally, de novo candidates are not considered if the proband GQ is smaller than the
min_gq
parameter, if the proband allele balance is lower than the min_child_ab
parameter, if the depth ratio between the proband and parents is smaller than the min_dp_ratio
parameter, or if the allele balance in a parent is above the max_parent_ab
parameter.
Parameters:  mt (
MatrixTable
) – High-throughput sequencing dataset.  pedigree (
Pedigree
) – Sample pedigree.  pop_frequency_prior (
Float64Expression
) – Expression for population alternate allele frequency prior.  min_gq – Minimum proband GQ to be considered for de novo calling.
 min_p – Minimum posterior probability to be considered for de novo calling.
 max_parent_ab – Maximum parent allele balance.
 min_child_ab – Minimum proband allele balance.
 min_dp_ratio – Minimum ratio between proband read depth and parental read depth.
Returns: Table – Table of putative de novo mutation events.

hail.methods.
nirvana
(dataset, config, block_size=500000, name='nirvana') → hail.matrixtable.MatrixTable[source]¶ Annotate variants using Nirvana.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
nirvana()
runs Nirvana on the current dataset and adds a new row field in the location specified by name.Examples
Add Nirvana annotations to the dataset:
>>> result = hl.nirvana(dataset, "data/nirvana.properties")
Configuration
nirvana()
requires a configuration file. The format is a .properties file, where each line defines a property as a key-value pair of the form key = value
.nirvana()
supports the following properties: hail.nirvana.dotnet – Location of dotnet. Optional, default: dotnet.
 hail.nirvana.path – Value of the PATH environment variable when invoking Nirvana. Optional, by default PATH is not set.
 hail.nirvana.location – Location of Nirvana.dll. Required.
 hail.nirvana.reference – Location of reference genome. Required.
 hail.nirvana.cache – Location of cache. Required.
 hail.nirvana.supplementaryAnnotationDirectory – Location of Supplementary Database. Optional, no supplementary database by default.
Here is an example
nirvana.properties
configuration file:
hail.nirvana.location = /path/to/dotnet/netcoreapp2.0/Nirvana.dll
hail.nirvana.reference = /path/to/nirvana/References/Homo_sapiens.GRCh37.Nirvana.dat
hail.nirvana.cache = /path/to/nirvana/Cache/GRCh37/Ensembl
hail.nirvana.supplementaryAnnotationDirectory = /path/to/nirvana/SupplementaryDatabase/GRCh37
Annotations
A new row field is added in the location specified by name with the following schema:
struct { chromosome: str, refAllele: str, position: int32, altAlleles: array<str>, cytogeneticBand: str, quality: float64, filters: array<str>, jointSomaticNormalQuality: int32, copyNumber: int32, strandBias: float64, recalibratedQuality: float64, variants: array<struct { altAllele: str, refAllele: str, chromosome: str, begin: int32, end: int32, phylopScore: float64, isReferenceMinor: bool, variantType: str, vid: str, hgvsg: str, isRecomposedVariant: bool, isDecomposedVariant: bool, regulatoryRegions: array<struct { id: str, type: str, consequence: set<str> }>, clinvar: array<struct { id: str, reviewStatus: str, isAlleleSpecific: bool, alleleOrigins: array<str>, refAllele: str, altAllele: str, phenotypes: array<str>, medGenIds: array<str>, omimIds: array<str>, orphanetIds: array<str>, significance: str, lastUpdatedDate: str, pubMedIds: array<str> }>, cosmic: array<struct { id: str, isAlleleSpecific: bool, refAllele: str, altAllele: str, gene: str, sampleCount: int32, studies: array<struct { id: int32, histology: str, primarySite: str }> }>, dbsnp: struct { ids: array<str> }, globalAllele: struct { globalMinorAllele: str, globalMinorAlleleFrequency: float64 }, gnomad: struct { coverage: str, allAf: float64, allAc: int32, allAn: int32, allHc: int32, afrAf: float64, afrAc: int32, afrAn: int32, afrHc: int32, amrAf: float64, amrAc: int32, amrAn: int32, amrHc: int32, easAf: float64, easAc: int32, easAn: int32, easHc: int32, finAf: float64, finAc: int32, finAn: int32, finHc: int32, nfeAf: float64, nfeAc: int32, nfeAn: int32, nfeHc: int32, othAf: float64, othAc: int32, othAn: int32, othHc: int32, asjAf: float64, asjAc: int32, asjAn: int32, asjHc: int32, failedFilter: bool }, gnomadExome: struct { coverage: str, allAf: float64, allAc: int32, allAn: int32, allHc: int32, afrAf: float64, afrAc: int32, afrAn: int32, afrHc: int32, amrAf: float64, amrAc: int32, amrAn: int32, amrHc: int32, easAf: float64, easAc: int32, easAn: int32, easHc: int32, finAf: float64, finAc: int32, 
finAn: int32, finHc: int32, nfeAf: float64, nfeAc: int32, nfeAn: int32, nfeHc: int32, othAf: float64, othAc: int32, othAn: int32, othHc: int32, asjAf: float64, asjAc: int32, asjAn: int32, asjHc: int32, sasAf: float64, sasAc: int32, sasAn: int32, sasHc: int32, failedFilter: bool }, topmed: struct { failedFilter: bool, allAc: int32, allAn: int32, allAf: float64, allHc: int32 }, oneKg: struct { ancestralAllele: str, allAf: float64, allAc: int32, allAn: int32, afrAf: float64, afrAc: int32, afrAn: int32, amrAf: float64, amrAc: int32, amrAn: int32, easAf: float64, easAc: int32, easAn: int32, eurAf: float64, eurAc: int32, eurAn: int32, sasAf: float64, sasAc: int32, sasAn: int32 }, mitomap: array<struct { refAllele: str, altAllele: str, diseases : array<str>, hasHomoplasmy: bool, hasHeteroplasmy: bool, status: str, clinicalSignificance: str, scorePercentile: float64, isAlleleSpecific: bool, chromosome: str, begin: int32, end: int32, variantType: str } transcripts: struct { refSeq: array<struct { transcript: str, bioType: str, aminoAcids: str, cdnaPos: str, codons: str, cdsPos: str, exons: str, introns: str, geneId: str, hgnc: str, consequence: array<str>, hgvsc: str, hgvsp: str, isCanonical: bool, polyPhenScore: float64, polyPhenPrediction: str, proteinId: str, proteinPos: str, siftScore: float64, siftPrediction: str }>, ensembl: array<struct { transcript: str, bioType: str, aminoAcids: str, cdnaPos: str, codons: str, cdsPos: str, exons: str, introns: str, geneId: str, hgnc: str, consequence: array<str>, hgvsc: str, hgvsp: str, isCanonical: bool, polyPhenScore: float64, polyPhenPrediction: str, proteinId: str, proteinPos: str, siftScore: float64, siftPrediction: str }> }, overlappingGenes: array<str> }> genes: array<struct { name: str, omim: array<struct { mimNumber: int32, hgnc: str, description: str, phenotypes: array<struct { mimNumber: int32, phenotype: str, mapping: str, inheritance: array<str>, comments: str }> }> exac: struct { pLi: float64, pRec: float64, pNull: 
float64 } }> }
Parameters:  dataset (
MatrixTable
) – Dataset.  config (
str
) – Path to Nirvana configuration file.  block_size (
int
) – Number of rows to process per Nirvana invocation.  name (
str
) – Name for resulting row field.
Returns: MatrixTable
– Dataset with new row-indexed field name containing Nirvana annotations.

hail.methods.
pc_relate
(call_expr, min_individual_maf, *, k=None, scores_expr=None, min_kinship=inf, statistics='all', block_size=None) → hail.table.Table[source]¶ Compute relatedness estimates between individuals using a variant of the PCRelate method.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Note
Requires the dataset to contain only diploid genotype calls.
Examples
Estimate kinship, identity-by-descent two, identity-by-descent one, and identity-by-descent zero for every pair of samples, using a minimum minor allele frequency filter of 0.01 and 10 principal components to control for population structure.
>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10)
Only compute the kinship statistic. This is more efficient than computing all statistics.
>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10, statistics='kin')
Compute all statistics, excluding sample pairs with kinship less than 0.1. This is more efficient than producing the full table and then filtering using
Table.filter()
.
>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10, min_kinship=0.1)
One can also pass in precomputed principal component scores. To produce the same results as in the previous example:
>>> _, scores_table, _ = hl.hwe_normalized_pca(dataset.GT, k=10, compute_loadings=False)
>>> rel = hl.pc_relate(dataset.GT, 0.01, scores_expr=scores_table[dataset.col_key].scores, min_kinship=0.1)
Notes
The traditional estimator for kinship between a pair of individuals \(i\) and \(j\), sharing the set \(S_{ij}\) of single-nucleotide variants, from a population with allele frequencies \(p_s\), is given by:
\[\widehat{\phi_{ij}} := \frac{1}{|S_{ij}|} \sum_{s \in S_{ij}} \frac{(g_{is} - 2 p_s) (g_{js} - 2 p_s)}{4 \sum_{s \in S_{ij}} p_s (1 - p_s)}\]This estimator is true under the model that the sharing of common (relative to the population) alleles is not very informative to relatedness (because they’re common) and the sharing of rare alleles suggests a recent common ancestor from which the allele was inherited by descent.
When multiple ancestry groups are mixed in a sample, this model breaks down. Alleles that are rare in all but one ancestry group are treated as very informative to relatedness. However, these alleles are simply markers of the ancestry group. The PC-Relate method corrects for this situation and the related situation of admixed individuals.
PC-Relate slightly modifies the usual estimator for relatedness: occurrences of population allele frequency are replaced with an “individual-specific allele frequency”. This modification allows the method to correctly weight an allele according to an individual’s unique ancestry profile.
The “individual-specific allele frequency” at a given genetic locus is modeled by PC-Relate as a linear function of a sample’s first k principal component coordinates. As such, the efficacy of this method rests on two assumptions:
 an individual’s first k principal component coordinates fully describe their allele-frequency-relevant ancestry, and
 the relationship between ancestry (as described by principal component coordinates) and population allele frequency is linear
The estimators for kinship and identity-by-descent zero, one, and two follow. Let:
 \(S_{ij}\) be the set of genetic loci at which both individuals \(i\) and \(j\) have a defined genotype
 \(g_{is} \in \{0, 1, 2\}\) be the number of alternate alleles that individual \(i\) has at genetic locus \(s\)
 \(\widehat{\mu_{is}} \in [0, 1]\) be the individual-specific allele frequency for individual \(i\) at genetic locus \(s\)
 \({\widehat{\sigma^2_{is}}} := \widehat{\mu_{is}} (1 - \widehat{\mu_{is}})\), the binomial variance of \(\widehat{\mu_{is}}\)
 \(\widehat{\sigma_{is}} := \sqrt{\widehat{\sigma^2_{is}}}\), the binomial standard deviation of \(\widehat{\mu_{is}}\)
 \(\text{IBS}^{(0)}_{ij} := \sum_{s \in S_{ij}} \mathbb{1}_{|g_{is} - g_{js}| = 2}\), the number of genetic loci at which individuals \(i\) and \(j\) share no alleles
 \(\widehat{f_i} := 2 \widehat{\phi_{ii}} - 1\), the inbreeding coefficient for individual \(i\)
 \(g^D_{is}\) be a dominance encoding of the genotype matrix, and \(X_{is}\) be a normalized dominance-coded genotype matrix
\[ \begin{align}\begin{aligned}\begin{split}g^D_{is} := \begin{cases} \widehat{\mu_{is}} & g_{is} = 0 \\ 0 & g_{is} = 1 \\ 1 - \widehat{\mu_{is}} & g_{is} = 2 \end{cases}\end{split}\\X_{is} := g^D_{is} - \widehat{\sigma^2_{is}} (1 - \widehat{f_i})\end{aligned}\end{align} \]The estimator for kinship is given by:
\[\widehat{\phi_{ij}} := \frac{\sum_{s \in S_{ij}} (g - 2 \mu)_{is} (g - 2 \mu)_{js}}{4 \sum_{s \in S_{ij}} \widehat{\sigma_{is}} \widehat{\sigma_{js}}}\]The estimator for identity-by-descent two is given by:
\[\widehat{k^{(2)}_{ij}} := \frac{\sum_{s \in S_{ij}} X_{is} X_{js}}{\sum_{s \in S_{ij}} \widehat{\sigma^2_{is}} \widehat{\sigma^2_{js}}}\]The estimator for identity-by-descent zero is given by:
\[\begin{split}\widehat{k^{(0)}_{ij}} := \begin{cases} \frac{\text{IBS}^{(0)}_{ij}}{\sum_{s \in S_{ij}} \widehat{\mu_{is}}^2 (1 - \widehat{\mu_{js}})^2 + (1 - \widehat{\mu_{is}})^2 \widehat{\mu_{js}}^2} & \widehat{\phi_{ij}} > 2^{-5/2} \\ 1 - 4 \widehat{\phi_{ij}} + \widehat{k^{(2)}_{ij}} & \widehat{\phi_{ij}} \le 2^{-5/2} \end{cases}\end{split}\]The estimator for identity-by-descent one is given by:
\[\widehat{k^{(1)}_{ij}} := 1 - \widehat{k^{(2)}_{ij}} - \widehat{k^{(0)}_{ij}}\]Note that, even if present, phase information is ignored by this method.
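The kinship estimator can be sketched for a single pair in NumPy (a sketch under the assumption that the alt-allele counts and individual-specific allele frequencies over the shared loci are already computed):

```python
import numpy as np

# Sketch (assumption, not Hail's implementation): the kinship estimator
# above for one pair, given alt-allele counts g_i, g_j and
# individual-specific allele frequencies mu_i, mu_j over the loci S_ij.
def kinship_estimate(g_i, g_j, mu_i, mu_j):
    g_i, g_j = np.asarray(g_i, float), np.asarray(g_j, float)
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    sd_i = np.sqrt(mu_i * (1 - mu_i))
    sd_j = np.sqrt(mu_j * (1 - mu_j))
    num = np.sum((g_i - 2 * mu_i) * (g_j - 2 * mu_j))
    return num / (4 * np.sum(sd_i * sd_j))
```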
The PC-Relate method is described in “Model-free Estimation of Recent Genetic Relatedness”. Conomos MP, Reiner AP, Weir BS, Thornton TA. American Journal of Human Genetics. 2016 Jan 7. The reference implementation is available in the GENESIS Bioconductor package.
pc_relate()
differs from the reference implementation in a few ways: if k is supplied, sample scores are computed via PCA on all samples, not a specified subset of genetically unrelated samples. The latter can be achieved by filtering samples, computing PCA variant loadings, and using these loadings to compute and pass in scores for all samples.
 the estimators do not perform small-sample correction
 the algorithm does not provide an option to use population-wide allele frequency estimates
 the algorithm does not provide an option to not use “overall standardization” (see R pcrelate documentation)
Under the PC-Relate model, kinship, \(\phi_{ij}\), ranges from 0 to 0.5, and is precisely half of the fraction of genetic material shared. Listed below are the statistics for a few pairings:
 Monozygotic twins share all their genetic material, so their kinship statistic is 0.5 in expectation.
 Parent-child and sibling pairs both have kinship 0.25 in expectation and are separated by the identity-by-descent-zero, \(k^{(0)}_{ij}\), statistic, which is zero for parent-child pairs and 0.25 for sibling pairs.
 Avuncular pairs and grandparent/child pairs both have kinship 0.125 in expectation and both have identity-by-descent-zero 0.5 in expectation.
 “Third degree relatives” are those pairs sharing \(2^{-3} = 12.5\%\) of their genetic material; the results of PC-Relate are often too noisy to reliably distinguish these pairs from higher-degree-relative pairs or unrelated pairs.
Note that \(g_{is}\) is the number of alternate alleles. Hence, for multiallelic variants, a value of 2 may indicate two distinct alternate alleles rather than a homozygous variant genotype. To enforce the latter, either filter or split multiallelic variants first.
The resulting table has the first 3, 4, 5, or 6 fields below, depending on the statistics parameter:
 i (
col_key.dtype
) – First sample. (key field)  j (
col_key.dtype
) – Second sample. (key field)  kin (
tfloat64
) – Kinship estimate, \(\widehat{\phi_{ij}}\).  ibd2 (
tfloat64
) – IBD2 estimate, \(\widehat{k^{(2)}_{ij}}\).  ibd0 (
tfloat64
) – IBD0 estimate, \(\widehat{k^{(0)}_{ij}}\).  ibd1 (
tfloat64
) – IBD1 estimate, \(\widehat{k^{(1)}_{ij}}\).
Here
col_key
refers to the column key of the source matrix table, and col_key.dtype
is a struct containing the column key fields. There is one row for each pair of distinct samples (columns), where i corresponds to the column of smaller column index. In particular, if the same column key value exists for \(n\) columns, then the resulting table will have \(\binom{n}{2}\) rows with both key fields equal to that column key value. This may result in unexpected behavior in downstream processing.
Parameters:  call_expr (
CallExpression
) – Entry-indexed call expression.  min_individual_maf (
float
) – The minimum individualspecific minor allele frequency. If either individualspecific minor allele frequency for a pair of individuals is below this threshold, then the variant will not be used to estimate relatedness for the pair.  k (
int
, optional) – If set, k principal component scores are computed and used. Exactly one of k and scores_expr must be specified.  scores_expr (
ArrayNumericExpression
, optional) – Column-indexed expression of principal component scores, with the same source as call_expr. All array values must have the same positive length, corresponding to the number of principal components, and all scores must be non-missing. Exactly one of k and scores_expr must be specified.  min_kinship (
float
) – Pairs of samples with kinship lower than min_kinship
are excluded from the results.  statistics (
str
) – Set of statistics to compute. If 'kin'
, only estimate the kinship statistic. If 'kin2'
, estimate the above and IBD2. If 'kin20'
, estimate the above and IBD0. If 'all'
, estimate the above and IBD1.  block_size (
int
, optional) – Block size of block matrices used in the algorithm. Default given by BlockMatrix.default_block_size()
.
Returns: Table
– A Table
mapping pairs of samples to their pairwise statistics.

hail.methods.
sample_qc
(dataset, name='sample_qc') → hail.matrixtable.MatrixTable[source]¶ Compute per-sample metrics useful for quality control.
Note
Requires the dataset to have a compound row key: locus and alleles.
Also requires that locus is the partition key.
Examples
Compute sample QC metrics and remove lowquality samples:
>>> dataset = hl.sample_qc(dataset, name='sample_qc')
>>> filtered_dataset = dataset.filter_cols((dataset.sample_qc.dp_mean > 20) & (dataset.sample_qc.r_ti_tv > 1.5))
Notes
This method computes summary statistics per sample from a genetic matrix and stores the results as a new column-indexed field in the matrix, named based on the name parameter.
Name                  Type     Description
call_rate             float64  Fraction of calls non-missing
n_hom_ref             int64    Number of homozygous reference calls
n_het                 int64    Number of heterozygous calls
n_hom_var             int64    Number of homozygous alternate calls
n_called              int64    Sum of n_hom_ref + n_het + n_hom_var
n_not_called          int64    Number of missing calls
n_snp                 int64    Number of SNP alternate alleles
n_insertion           int64    Number of insertion alternate alleles
n_deletion            int64    Number of deletion alternate alleles
n_singleton           int64    Number of private alleles
n_transition          int64    Number of transition (A-G, C-T) alternate alleles
n_transversion        int64    Number of transversion alternate alleles
n_star                int64    Number of star (upstream deletion) alleles
n_non_ref             int64    Sum of n_het and n_hom_var
r_ti_tv               float64  Transition/transversion ratio
r_het_hom_var         float64  Het/HomVar call ratio
r_insertion_deletion  float64  Insertion/deletion allele ratio
dp_mean               float64  Depth mean across all calls
dp_stdev              float64  Depth standard deviation across all calls
gq_mean               float64  The average genotype quality across all calls
gq_stdev              float64  Genotype quality standard deviation across all calls
Missing values
NA
may result from division by zero. The empirical standard deviation is computed with zero degrees of freedom.
Parameters:  dataset (
MatrixTable
) – Dataset.  name (
str
) – Name for resulting field.
Returns: MatrixTable
– Dataset with a new column-indexed field name.
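The derived metrics in the table follow directly from the count fields; as a sketch (an assumption about how the ratios are defined, consistent with the table descriptions):

```python
# Sketch (assumption): the derived sample_qc metrics computed from the
# per-sample count fields described above.
def derived_sample_metrics(counts: dict) -> dict:
    return {
        'call_rate': counts['n_called'] / (counts['n_called'] + counts['n_not_called']),
        'n_non_ref': counts['n_het'] + counts['n_hom_var'],
        'r_ti_tv': counts['n_transition'] / counts['n_transversion'],
        'r_het_hom_var': counts['n_het'] / counts['n_hom_var'],
    }
```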

hail.methods.
skat
(key_expr, weight_expr, y, x, covariates=[], logistic=False, max_size=46340, accuracy=1e06, iterations=10000) → hail.table.Table[source]¶ Test each keyed group of rows for association by linear or logistic SKAT test.
Examples
Test each gene for association using the linear sequence kernel association test:
>>> burden_ds = hl.read_matrix_table('data/example_burden.vds')
>>> skat_table = hl.skat(key_expr=burden_ds.gene,
...                      weight_expr=burden_ds.weight,
...                      y=burden_ds.burden.pheno,
...                      x=burden_ds.GT.n_alt_alleles(),
...                      covariates=[burden_ds.burden.cov1, burden_ds.burden.cov2])
Caution
By default, the Davies algorithm iterates up to 10,000 times until an accuracy of 1e-6 is achieved. Hence a reported p-value of zero with no issues may truly be as large as 1e-6. The accuracy and maximum number of iterations may be controlled by the corresponding function parameters. In general, higher accuracy requires more iterations.
Caution
To process a group with \(m\) rows, several copies of an \(m \times m\) matrix of doubles must fit in worker memory. Groups with tens of thousands of rows may exhaust worker memory causing the entire job to fail. In this case, use the max_size parameter to skip groups larger than max_size.
Notes
This method provides a scalable implementation of the score-based variance-component test originally described in Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test.
The test is run on columns with y and all covariates nonmissing. For each row, missing input (x) values are imputed as the mean of all nonmissing input values.
Row weights must be non-negative. Rows with missing weights are ignored. In the R package
skat
(which assumes rows are variants), default weights are given by evaluating the Beta(1, 25) density at the minor allele frequency. To replicate these weights in Hail using alternate allele frequencies stored in a row-indexed field AF, one can use the expression:
>>> hl.dbeta(hl.min(ds2.AF, 1 - ds2.AF), 1.0, 25.0) ** 2
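The same weighting can be sketched in plain Python: the Beta(1, 25) density at minor allele frequency q is 25 * (1 - q)**24, and the weight is its square, matching the squared hl.dbeta expression (a sketch for illustration):

```python
# Sketch: the squared Beta(1, 25)-density weight described above.
# Beta(1, 25) density at q is 25 * (1 - q)**24.
def skat_weight(af: float) -> float:
    maf = min(af, 1 - af)
    return (25 * (1 - maf) ** 24) ** 2
```

Rarer variants receive larger weights; the weight is maximal (625) at a minor allele frequency of zero.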
In the logistic case, the response y must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.
The resulting
Table
provides the group’s key, the size (number of rows) in the group, the variance component score q_stat, the SKAT p-value, and a fault flag. For the toy example above, the table has the form:
key    size  q_stat  p_value  fault
geneA  2     4.136   0.205    0
geneB  1     5.659   0.195    0
geneC  3     4.122   0.192    0
Groups larger than max_size appear with missing q_stat, p_value, and fault. The hard limit on the number of rows in a group is 46340.
Note that the variance component score q_stat agrees with
Q
in the R packageskat
, but both differ from \(Q\) in the paper by the factor \(\frac{1}{2\sigma^2}\) in the linear case and \(\frac{1}{2}\) in the logistic case, where \(\sigma^2\) is the unbiased estimator of residual variance for the linear null model. The R package also applies a “smallsample adjustment” to the null distribution in the logistic case when the sample size is less than 2000. Hail does not apply this adjustment.The fault flag is an integer indicating whether any issues occurred when running the Davies algorithm to compute the pvalue as the right tail of a weighted sum of \(\chi^2(1)\) distributions.
fault value  Description
0            no issues
1            accuracy NOT achieved
2            round-off error possibly significant
3            invalid parameters
4            unable to locate integration parameters
5            out of memory

Parameters:
- key_expr (Expression) – Row-indexed expression for key associated to each row.
- weight_expr (Float64Expression) – Row-indexed expression for row weights.
- y (Float64Expression) – Column-indexed response expression.
- x (Float64Expression) – Entry-indexed expression for input variable.
- covariates (list of Float64Expression) – List of column-indexed covariate expressions.
- logistic (bool) – If true, use the logistic test rather than the linear test.
- max_size (int) – Maximum size of group on which to run the test.
- accuracy (float) – Accuracy achieved by the Davies algorithm if fault value is zero.
- iterations (int) – Maximum number of iterations attempted by the Davies algorithm.

Returns: Table – Table of SKAT results.

hail.methods.
split_multi_hts
(ds, keep_star=False, left_aligned=False) → hail.matrixtable.MatrixTable[source]¶ Split multiallelic variants for datasets with a standard high-throughput sequencing entry schema.
struct { GT: call, AD: array<int32>, DP: int32, GQ: int32, PL: array<int32> }
For other entry fields, use SplitMulti.

Examples
>>> hl.split_multi_hts(dataset).write('output/split.vds')
Notes
We will explain by example. Consider a hypothetical 3allelic variant:
A C,T 0/2:7,2,6:15:45:99,50,99,0,45,99
split_multi_hts()
will create two biallelic variants (one for each alternate allele) at the same position:

A C 0/0:13,2:15:45:0,45,99
A T 0/1:9,6:15:50:50,0,99
Each multiallelic GT field is downcoded once for each alternate allele. A call for an alternate allele maps to 1 in the biallelic variant corresponding to itself and 0 otherwise. In the example above, 0/2 maps to 0/0 and 0/1. The genotype 1/2 maps to 0/1 and 0/1.
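The downcoding rule can be sketched in plain Python (an illustrative helper, independent of Hail): in the biallelic variant for a given alternate allele, every allele equal to that alternate maps to 1 and every other allele maps to 0.

```python
def downcode(genotype, a_index):
    """Map a multiallelic unphased genotype (tuple of allele indices)
    to the biallelic genotype for alternate allele a_index."""
    return tuple(sorted(1 if allele == a_index else 0 for allele in genotype))

# 0/2 maps to 0/0 for allele 1 (C) and to 0/1 for allele 2 (T):
downcode((0, 2), 1)  # (0, 0)
downcode((0, 2), 2)  # (0, 1)
# 1/2 maps to 0/1 for both alternate alleles:
downcode((1, 2), 1)  # (0, 1)
downcode((1, 2), 2)  # (0, 1)
```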
The biallelic alt AD entry is just the multiallelic AD entry corresponding to the alternate allele. The ref AD entry is the sum of the other multiallelic entries.
The biallelic DP is the same as the multiallelic DP.
The biallelic PL entry for a genotype g is the minimum over PL entries for multiallelic genotypes that downcode to g. For example, the PL for (A, T) at 0/1 is the minimum of the PLs for 0/1 (50) and 1/2 (45), and thus 45.
Fixing an alternate allele and biallelic variant, downcoding gives a map from multiallelic to biallelic alleles and genotypes. The biallelic AD entry for an allele is just the sum of the multiallelic AD entries for alleles that map to that allele. Similarly, the biallelic PL entry for a genotype is the minimum over multiallelic PL entries for genotypes that map to that genotype.
GQ is recomputed from PL.
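The AD, PL, and GQ rules above can be modeled in pure Python (illustrative helpers, not Hail's implementation), using the standard VCF ordering of unphased diploid genotypes in which genotype j/k with j ≤ k has index k*(k+1)//2 + j:

```python
def gt_pairs(n_alleles):
    """Unphased diploid genotypes in VCF order: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ..."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]

def downcode(gt, a_index):
    """Biallelic genotype for alternate allele a_index."""
    return tuple(sorted(1 if a == a_index else 0 for a in gt))

def split_entry(ad, pl, a_index):
    """Biallelic AD, PL, and GQ for alternate allele a_index."""
    # alt AD is the multiallelic entry for the allele; ref AD is the rest
    new_ad = [sum(ad) - ad[a_index], ad[a_index]]
    pairs = gt_pairs(len(ad))
    # biallelic PL for genotype g is the min over multiallelic PLs that downcode to g
    new_pl = [min(pl[i] for i, g in enumerate(pairs) if downcode(g, a_index) == target)
              for target in [(0, 0), (0, 1), (1, 1)]]
    gq = sorted(new_pl)[1]  # GQ recomputed from PL: the second-smallest value
    return new_ad, new_pl, gq

# The 0/2:7,2,6:15:45:99,50,99,0,45,99 example from above:
ad, pl = [7, 2, 6], [99, 50, 99, 0, 45, 99]
split_entry(ad, pl, 1)  # ([13, 2], [0, 45, 99], 45) -> the A/C variant
split_entry(ad, pl, 2)  # ([9, 6], [50, 0, 99], 50)  -> the A/T variant
```

Both results match the two biallelic entries shown in the worked example.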
Here is a second example, for a het non-ref call:
A C,T 1/2:2,8,6:16:45:99,50,99,45,0,99
splits as
A C 0/1:8,8:16:45:45,0,99 A T 0/1:10,6:16:50:50,0,99
VCF Info Fields
Hail does not split fields in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index field a_index can be used to select the value corresponding to the split allele's position:

>>> split_ds = hl.split_multi_hts(dataset)
>>> split_ds = split_ds.filter_rows(split_ds.info.AC[split_ds.a_index - 1] < 10,
...                                 keep=False)
VCFs split by Hail and exported to new VCFs may be incompatible with other tools, if action is not taken first. Since the “Number” of the arrays in split multiallelic sites no longer matches the structure on import (“A” for 1 per allele, for example), Hail will export these fields with number “.”.
If the desired output is one value per site, then it is possible to use MatrixTable.annotate_rows() to remap these values. Here is an example:

>>> split_ds = hl.split_multi_hts(dataset)
>>> split_ds = split_ds.annotate_rows(
...     info=split_ds.info.annotate(AC=split_ds.info.AC[split_ds.a_index - 1]))
>>> hl.export_vcf(split_ds, 'output/export.vcf')
The info field AC in output/export.vcf will have Number=1.

New Fields

split_multi_hts() adds the following fields:

- was_split (Boolean) – True if this variant was originally multiallelic, otherwise False.
- a_index (Int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with a_index = 1 and 1:100:A:C with a_index = 2.
Parameters:
- keep_star (bool) – Do not filter out * alleles.
- left_aligned (bool) – If True, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated.

Returns: MatrixTable – A biallelic variant dataset.

class
hail.methods.
SplitMulti
(ds, keep_star=False, left_aligned=False)[source]¶ Split multiallelic variants.
Example
split_multi_hts()
, which splits multiallelic variants for the HTS genotype schema and updates the genotype annotations by downcoding the genotype, is implemented as:

>>> sm = hl.SplitMulti(ds)
>>> pl = hl.or_missing(
...     hl.is_defined(ds.PL),
...     (hl.range(0, 3).map(lambda i: hl.min(hl.range(0, hl.len(ds.PL))
...         .filter(lambda j: hl.downcode(hl.unphased_diploid_gt_index_call(j), sm.a_index())
...                           == hl.unphased_diploid_gt_index_call(i))
...         .map(lambda j: ds.PL[j])))))
>>> sm.update_rows(a_index=sm.a_index(), was_split=sm.was_split())
>>> sm.update_entries(
...     GT=hl.downcode(ds.GT, sm.a_index()),
...     AD=hl.or_missing(hl.is_defined(ds.AD),
...                      [hl.sum(ds.AD) - ds.AD[sm.a_index()], ds.AD[sm.a_index()]]),
...     DP=ds.DP,
...     PL=pl,
...     GQ=hl.gq_from_pl(pl))
>>> split_ds = sm.result()
Warning
Any entry and row fields that are not updated will be copied (unchanged) for each split variant.

a_index
()[source]¶ The index of the input allele to the output variant.
Returns: Expression
of typetint32

new_locus
()[source]¶ The new, split variant locus.
Returns: LocusExpression

result
()[source]¶ Split the dataset.
Returns: MatrixTable
– A split dataset.

update_entries
(**kwargs)[source]¶ Set the entry field updates for this SplitMulti object.
Note
May only be called once.

update_rows
(**kwargs)[source]¶ Set the row field updates for this SplitMulti object.
Note
May only be called once.

was_split
()[source]¶ True if the original variant was multiallelic.

Returns: BooleanExpression


hail.methods.
transmission_disequilibrium_test
(dataset, pedigree) → hail.table.Table[source]¶ Performs the transmission disequilibrium test on trios.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key:

- locus (tlocus)
- alleles (array<str>)

Also requires that locus is the partition key.
Note
Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples
Compute TDT association statistics and show the first two results:
>>> pedigree = hl.Pedigree.read('data/tdt_trios.fam')
>>> tdt_table = hl.transmission_disequilibrium_test(tdt_dataset, pedigree)
>>> tdt_table.show(2)
+---------------+------------+-------+-------+-------------+-------------+
| locus         | alleles    |     t |     u |        chi2 |     p_value |
+---------------+------------+-------+-------+-------------+-------------+
| locus<GRCh37> | array<str> | int32 | int32 |     float64 |     float64 |
+---------------+------------+-------+-------+-------------+-------------+
| 1:246714629   | ["C","A"]  |     0 |     4 | 4.00000e+00 | 4.55003e-02 |
| 2:167262169   | ["T","C"]  |    NA |    NA |          NA |          NA |
+---------------+------------+-------+-------+-------------+-------------+
Export variants with pvalues below 0.001:
>>> tdt_table = tdt_table.filter(tdt_table.p_value < 0.001)
>>> tdt_table.export("output/tdt_results.tsv")
Notes
The transmission disequilibrium test compares the number of times the alternate allele is transmitted (t) versus not transmitted (u) from a heterozygous parent to an affected child. The null hypothesis holds that each case is equally likely. The TDT statistic is given by
\[\frac{(t - u)^2}{t + u}\]

and asymptotically follows a chi-squared distribution with one degree of freedom under the null hypothesis.
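The statistic and its p-value can be computed directly in plain Python (a standalone sketch, not Hail's implementation); for one degree of freedom, the chi-squared survival function reduces to erfc(sqrt(x/2)):

```python
from math import erfc, sqrt

def tdt_stat(t, u):
    """TDT chi-squared statistic and p-value for transmission counts t, u."""
    chi2 = (t - u) ** 2 / (t + u)
    p_value = erfc(sqrt(chi2 / 2))  # survival function of chi^2 with 1 df
    return chi2, p_value

# First row of the example output below: t=0, u=4
tdt_stat(0, 4)  # (4.0, ~0.0455), matching chi2 and p_value in the table
```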
transmission_disequilibrium_test() only considers complete trios (two parents and a proband with defined sex) and only returns results for the autosome, as defined by in_autosome(), and chromosome X. Transmissions and non-transmissions are counted only for the configurations of genotypes and copy state in the table below, in order to filter out Mendel errors and configurations where transmission is guaranteed. The copy state of a locus with respect to a trio is defined as follows:

- Auto – in autosome or in PAR of X or female child
- HemiX – in non-PAR of X and male child

Here PAR is the pseudoautosomal region of X and Y defined by ReferenceGenome, which many variant callers map to chromosome X.

Kid     Dad     Mom     Copy State  t  u
HomRef  Het     Het     Auto        0  2
HomRef  HomRef  Het     Auto        0  1
HomRef  Het     HomRef  Auto        0  1
Het     Het     Het     Auto        1  1
Het     HomRef  Het     Auto        1  0
Het     Het     HomRef  Auto        1  0
Het     HomVar  Het     Auto        0  1
Het     Het     HomVar  Auto        0  1
HomVar  Het     Het     Auto        2  0
HomVar  Het     HomVar  Auto        1  0
HomVar  HomVar  Het     Auto        1  0
HomRef  HomRef  Het     HemiX       0  1
HomRef  HomVar  Het     HemiX       0  1
HomVar  HomRef  Het     HemiX       1  0
HomVar  HomVar  Het     HemiX       1  0

transmission_disequilibrium_test() produces a table with the columns shown in the example output above: the row key (locus, alleles), the counts t and u, the statistic chi2, and p_value.

Parameters:
- dataset (MatrixTable) – Dataset.
- pedigree (Pedigree) – Sample pedigree.

Returns: Table – Table of TDT results.

hail.methods.
trio_matrix
(dataset, pedigree, complete_trios=False) → hail.matrixtable.MatrixTable[source]¶ Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio.
Note
Requires the column key to be one field of type
tstr
Examples
Create a trio matrix:
>>> pedigree = hl.Pedigree.read('data/case_control_study.fam') >>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)
Notes
This method builds a new matrix table with one column per trio. If complete_trios is
True
, then only trios that satisfyTrio.is_complete()
are included. In this new dataset, the column identifiers are the sample IDs of the trio probands. The column fields and entries of the matrix are changed in the following ways:The new column fields consist of three structs (proband, father, mother), a Boolean field, and a string field:
 proband (
tstruct
)  Column fields on the proband.  father (
tstruct
)  Column fields on the father.  mother (
tstruct
)  Column fields on the mother.  id (
tstr
)  Column key for the proband.  is_female (
tbool
)  Proband is female.True
for female,False
for male, missing if unknown.  fam_id (
tstr
)  Family ID.
The new entry fields are:

- proband_entry (tstruct) – Proband entry fields.
- father_entry (tstruct) – Father entry fields.
- mother_entry (tstruct) – Mother entry fields.
Parameters: pedigree (Pedigree) – Sample pedigree.

Returns: MatrixTable

hail.methods.
variant_qc
(dataset, name='variant_qc') → hail.matrixtable.MatrixTable[source]¶ Compute common variant statistics (quality control metrics).
Note
Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note
Requires the dataset to have a compound row key:

- locus (tlocus)
- alleles (array<str>)

Also requires that locus is the partition key.
Note
Requires the dataset to contain only diploid and unphased genotype calls. Use call() to recode genotype calls or null() to set genotype calls to missing.

Examples
>>> dataset_result = hl.variant_qc(dataset)
Notes
This method computes variant statistics from the genotype data, returning a new struct field name with the following metrics:

Name                 Type     Description
call_rate            float64  Fraction of samples with called genotypes
AF                   float64  Calculated alternate allele frequency (q)
AC                   int32    Count of alternate alleles
r_heterozygosity     float64  Proportion of heterozygotes
r_het_hom_var        float64  Ratio of heterozygotes to homozygous alternates
r_expected_het_freq  float64  Expected r_heterozygosity based on HWE
p_hwe                float64  p-value from Hardy-Weinberg equilibrium null model
n_hom_ref            int32    Number of homozygous reference samples
n_het                int32    Number of heterozygous samples
n_hom_var            int32    Number of homozygous alternate samples
n_called             int32    Sum of n_hom_ref, n_het, and n_hom_var
n_not_called         int32    Number of uncalled samples
n_non_ref            int32    Sum of n_het and n_hom_var
dp_mean              float64  Depth mean across all samples
dp_stdev             float64  Depth standard deviation across all samples
gq_mean              float64  The average genotype quality across all samples
gq_stdev             float64  Genotype quality standard deviation across all samples

Missing values NA may result from division by zero. The empirical standard deviation is computed with zero degrees of freedom.

Parameters:
- dataset (MatrixTable) – Dataset.
- name (str) – Name for resulting field.

Returns: MatrixTable – Dataset with a new row-indexed field name.
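Several of the count-based metrics follow directly from the per-variant genotype counts; a pure-Python sketch (illustrative only, not Hail's implementation):

```python
def basic_variant_stats(n_hom_ref, n_het, n_hom_var, n_samples):
    """Compute a few count-based variant QC metrics for one biallelic variant."""
    n_called = n_hom_ref + n_het + n_hom_var
    n_not_called = n_samples - n_called
    n_non_ref = n_het + n_hom_var
    call_rate = n_called / n_samples
    ac = n_het + 2 * n_hom_var            # each hom-var sample carries two alt alleles
    af = ac / (2 * n_called)              # each called sample contributes two alleles
    return dict(call_rate=call_rate, AF=af, AC=ac, n_called=n_called,
                n_not_called=n_not_called, n_non_ref=n_non_ref)

# 2 hom-ref, 1 het, 1 hom-var among 5 samples:
basic_variant_stats(2, 1, 1, 5)
# -> call_rate 0.8, AF 0.375, AC 3, n_called 4, n_not_called 1, n_non_ref 2
```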

hail.methods.
vep
(dataset, config, block_size=1000, name='vep', csq=False) → hail.matrixtable.MatrixTable[source]¶ Annotate variants with VEP.
Note
Requires the dataset to have a compound row key:

- locus (tlocus)
- alleles (array<str>)

Also requires that locus is the partition key.
vep()
runs Variant Effect Predictor with the LOFTEE plugin on the current dataset and adds the result as a row field.Examples
Add VEP annotations to the dataset:
>>> result = hl.vep(dataset, "data/vep.properties")
Notes
Configuration
vep()
needs a configuration file to tell it how to run VEP. The format is a .properties file. Roughly, each line defines a property as a key-value pair of the form key = value. vep() supports the following properties:

- hail.vep.perl – Location of Perl. Optional, default: perl.
 hail.vep.perl5lib – Value for the PERL5LIB environment variable when invoking VEP. Optional, by default PERL5LIB is not set.
 hail.vep.path – Value of the PATH environment variable when invoking VEP. Optional, by default PATH is not set.
 hail.vep.location – Location of the VEP Perl script. Required.
- hail.vep.cache_dir – Location of the VEP cache dir, passed to VEP with the --dir option. Required.
- hail.vep.fasta – Location of the FASTA file to use to look up the reference sequence, passed to VEP with the --fasta option. Required.
 hail.vep.assembly – Genome assembly version to use. Optional, default: GRCh37
- hail.vep.plugin – VEP plugin, passed to VEP with the --plugin option. Optional. Overrides hail.vep.lof.human_ancestor and hail.vep.lof.conservation_file.
 hail.vep.lof.human_ancestor – Location of the human ancestor file for the LOFTEE plugin. Ignored if hail.vep.plugin is set. Required otherwise.
 hail.vep.lof.conservation_file – Location of the conservation file for the LOFTEE plugin. Ignored if hail.vep.plugin is set. Required otherwise.
Here is an example vep.properties configuration file:

hail.vep.perl = /usr/bin/perl
hail.vep.path = /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
hail.vep.location = /path/to/vep/ensembl-tools-release-81/scripts/variant_effect_predictor/variant_effect_predictor.pl
hail.vep.cache_dir = /path/to/vep
hail.vep.lof.human_ancestor = /path/to/loftee_data/human_ancestor.fa.gz
hail.vep.lof.conservation_file = /path/to/loftee_data/phylocsf.sql
VEP Invocation
<hail.vep.perl> <hail.vep.location> --format vcf --json --everything --allele_number --no_stats --cache --offline --dir <hail.vep.cache_dir> --fasta <hail.vep.fasta> --minimal --assembly <hail.vep.assembly> --plugin LoF,human_ancestor_fa:<hail.vep.lof.human_ancestor>,filter_position:0.05,min_intron_size:15,conservation_file:<hail.vep.lof.conservation_file> -o STDOUT
Annotations
A new row field is added in the location specified by name with the following schema:
struct { assembly_name: str, allele_string: str, ancestral: str, colocated_variants: array<struct { aa_allele: str, aa_maf: float64, afr_allele: str, afr_maf: float64, allele_string: str, amr_allele: str, amr_maf: float64, clin_sig: array<str>, end: int32, eas_allele: str, eas_maf: float64, ea_allele: str, ea_maf: float64, eur_allele: str, eur_maf: float64, exac_adj_allele: str, exac_adj_maf: float64, exac_allele: str, exac_afr_allele: str, exac_afr_maf: float64, exac_amr_allele: str, exac_amr_maf: float64, exac_eas_allele: str, exac_eas_maf: float64, exac_fin_allele: str, exac_fin_maf: float64, exac_maf: float64, exac_nfe_allele: str, exac_nfe_maf: float64, exac_oth_allele: str, exac_oth_maf: float64, exac_sas_allele: str, exac_sas_maf: float64, id: str, minor_allele: str, minor_allele_freq: float64, phenotype_or_disease: int32, pubmed: array<int32>, sas_allele: str, sas_maf: float64, somatic: int32, start: int32, strand: int32 }>, context: str, end: int32, id: str, input: str, intergenic_consequences: array<struct { allele_num: int32, consequence_terms: array<str>, impact: str, minimised: int32, variant_allele: str }>, most_severe_consequence: str, motif_feature_consequences: array<struct { allele_num: int32, consequence_terms: array<str>, high_inf_pos: str, impact: str, minimised: int32, motif_feature_id: str, motif_name: str, motif_pos: int32, motif_score_change: float64, strand: int32, variant_allele: str }>, regulatory_feature_consequences: array<struct { allele_num: int32, biotype: str, consequence_terms: array<str>, impact: str, minimised: int32, regulatory_feature_id: str, variant_allele: str }>, seq_region_name: str, start: int32, strand: int32, transcript_consequences: array<struct { allele_num: int32, amino_acids: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct { db: str, name: str }>, exon: str, 
gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, uniparc: str, variant_allele: str }>, variant_class: str }
Parameters:
- dataset (MatrixTable) – Dataset.
- config (str) – Path to VEP configuration file.
- block_size (int) – Number of rows to process per VEP invocation.
- name (str) – Name for resulting row field.
- csq (bool) – If True, annotates VCF CSQ field as a tstr. If False, annotates with the full nested struct schema.

Returns: MatrixTable – Dataset with new row-indexed field name containing VEP annotations.

hail.methods.
window_by_locus
(mt: hail.matrixtable.MatrixTable, bp_window_size: int) → hail.matrixtable.MatrixTable[source]¶ Collect arrays of row and entry values from preceding loci.
Note
Requires that the row key starts with:
 locus (type
tlocus
)
Also requires that locus is the partition key.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Examples
>>> ds_result = hl.window_by_locus(ds, 3)
Notes
This method groups each row (variant) with the previous rows in a window of bp_window_size base pairs, putting the row values from the previous variants into prev_rows (row field of type array<struct>) and entry values from those variants into prev_entries (entry field of type array<struct>).

The bp_window_size argument is inclusive; if bp_window_size is 2 and the loci are
1:100 1:100 1:102 1:102 1:103 2:100 2:101
then the size of prev_rows is 0, 1, 2, 3, 2, 0, and 1, respectively (and same for the size of prev_entries).
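These window semantics can be checked with a small pure-Python model (illustrative, not Hail's implementation): a row's window contains the preceding rows on the same contig whose position is within bp_window_size base pairs.

```python
def prev_window_sizes(loci, bp_window_size):
    """For each locus, count preceding loci on the same contig within the window."""
    sizes = []
    for i, (contig, pos) in enumerate(loci):
        sizes.append(sum(1 for (c, p) in loci[:i]
                         if c == contig and pos - p <= bp_window_size))
    return sizes

# The loci from the example above, as (contig, position) pairs:
loci = [("1", 100), ("1", 100), ("1", 102), ("1", 102), ("1", 103),
        ("2", 100), ("2", 101)]
prev_window_sizes(loci, 2)  # [0, 1, 2, 3, 2, 0, 1]
```

The result matches the prev_rows sizes listed in the text.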
Parameters:
- mt (MatrixTable) – Input dataset.
- bp_window_size (int) – Base pairs to include in the backwards window (inclusive).

Returns: MatrixTable – Dataset with prev_rows and prev_entries fields.