Genetics¶

Generate a matrix table of variants, samples, and genotypes using the BaldingNichols or PritchardStephensDonnelly model. 

Calculate call concordance with another dataset. 

Filter rows with a list of intervals. 

Filter alternate alleles. 

Filter alternate alleles and update standard GATK entry fields. 

Run principal component analysis (PCA) on the HardyWeinbergnormalized genotype call matrix. 

Compute the genetic relatedness matrix (GRM). 

Computes the realized relationship matrix (RRM). 

Impute sex of samples by calculating inbreeding coefficient on the X chromosome. 

Computes the windowed correlation (linkage disequilibrium) matrix between variants. 

Returns a maximal subset of variants that are nearly uncorrelated within each window. 

Find Mendel errors; count per variant, individual and nuclear family. 

Call putative de novo events from trio data. 

Annotate variants using Nirvana. 

Compute persample metrics useful for quality control. 

Test each keyed group of rows for association by linear or logistic SKAT test. 

Compute genomic inflation factor (lambda GC) from an Expression of pvalues. 

Split multiallelic variants. 

Split multiallelic variants for datasets that contain one or more fields from a standard highthroughput sequencing entry schema. 

Summarize the variants present in a dataset and print the results. 

Performs the transmission disequilibrium test on trios. 

Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio. 

Compute common variant statistics (quality control metrics). 

Annotate variants with VEP. 

hail.methods.
balding_nichols_model
(n_populations, n_samples, n_variants, n_partitions=None, pop_dist=None, fst=None, af_dist=None, reference_genome='default', mixture=False)[source]¶ Generate a matrix table of variants, samples, and genotypes using the BaldingNichols or PritchardStephensDonnelly model.
Examples
Generate a matrix table of genotypes with 1000 variants and 100 samples across 3 populations:
>>> bn_ds = hl.balding_nichols_model(3, 100, 1000, reference_genome='GRCh37')
Generate a matrix table using 4 populations, 40 samples, 150 variants, 3 partitions, population distribution
[0.1, 0.2, 0.3, 0.4]
, \(F_{ST}\) values[.02, .06, .04, .12]
, ancestral allele frequencies drawn from a truncated beta distribution witha = 0.01
andb = 0.05
over the interval[0.05, 1]
, and random seed 1:>>> hl.set_global_seed(1) >>> bn_ds = hl.balding_nichols_model(4, 40, 150, 3, ... pop_dist=[0.1, 0.2, 0.3, 0.4], ... fst=[.02, .06, .04, .12], ... af_dist=hl.rand_beta(a=0.01, b=2.0, lower=0.05, upper=1.0))
To guarantee reproducibility, we set the Hail global seed with
set_global_seed()
immediately prior to generating the dataset.Notes
This method simulates a matrix table of variants, samples, and genotypes using the BaldingNichols model, which we now define.
\(K\) populations are labeled by integers \(0, 1, \dots, K  1\).
\(N\) samples are labeled by strings \(0, 1, \dots, N  1\).
\(M\) variants are defined as
1:1:A:C
,1:2:A:C
, …,1:M:A:C
.The default distribution for population assignment \(\pi\) is uniform.
The default ancestral frequency distribution \(P_0\) is uniform on \([0.1, 0.9]\). All three classes are located in
hail.stats
.The default \(F_{ST}\) values are all \(0.1\).
The BaldingNichols model models genotypes of individuals from a structured population comprising \(K\) homogeneous modern populations that have each diverged from a single ancestral population (a star phylogeny). Each sample is assigned a population by sampling from the categorical distribution \(\pi\). Note that the actual size of each population is random.
Variants are modeled as biallelic and unlinked. Ancestral allele frequencies are drawn independently for each variant from a frequency spectrum \(P_0\). The extent of genetic drift of each modern population from the ancestral population is defined by the corresponding \(F_{ST}\) parameter \(F_k\) (here and below, lowercase indices run over a range bounded by the corresponding uppercase parameter, e.g. \(k = 1, \ldots, K\)). For each variant and population, allele frequencies are drawn from a beta distribution whose parameters are determined by the ancestral allele frequency and \(F_{ST}\) parameter. The beta distribution gives a continuous approximation of the effect of genetic drift. We denote sample population assignments by \(k_n\), ancestral allele frequencies by \(p_m\), population allele frequencies by \(p_{k, m}\), and diploid, unphased genotype calls by \(g_{n, m}\) (0, 1, and 2 correspond to homozygous reference, heterozygous, and homozygous variant, respectively).
The generative model is then given by:
\[\begin{aligned} k_n \,&\sim\, \pi \\ p_m \,&\sim\, P_0 \\ p_{k,m} \mid p_m\,&\sim\, \mathrm{Beta}(\mu = p_m,\, \sigma^2 = F_k p_m (1  p_m)) \\ g_{n,m} \mid k_n, p_{k, m} \,&\sim\, \mathrm{Binomial}(2, p_{k_n, m}) \end{aligned} \]The beta distribution by its mean and variance above; the usual parameters are \(a = (1  p) \frac{1  F}{F}\) and \(b = p \frac{1  F}{F}\) with \(F = F_k\) and \(p = p_m\).
The resulting dataset has the following fields.
Global fields:
bn.n_populations (
tint32
) – Number of populations.bn.n_samples (
tint32
) – Number of samples.bn.n_variants (
tint32
) – Number of variants.bn.n_partitions (
tint32
) – Number of partitions.bn.pop_dist (
tarray
oftfloat64
) – Population distribution indexed by population.bn.fst (
tarray
oftfloat64
) – \(F_{ST}\) values indexed by population.bn.seed (
tint32
) – Random seed.bn.mixture (
tbool
) – Value of mixture parameter.
Row fields:
locus (
tlocus
) – Variant locus (key field).ancestral_af (
tfloat64
) – Ancestral allele frequency.af (
tarray
oftfloat64
) – Modern allele frequencies indexed by population.
Column fields:
Entry fields:
GT (
tcall
) – Genotype call (diploid, unphased).
For the PritchardStephensDonnelly model, set the mixture to true to treat pop_dist as the parameters of the Dirichlet distribution describing admixture between the modern populations. In this case, the type of pop is
tarray
oftfloat64
and the value is the mixture proportions. Parameters
n_populations (
int
) – Number of modern populations.n_samples (
int
) – Total number of samples.n_variants (
int
) – Number of variants.n_partitions (
int
, optional) – Number of partitions. Default is 1 partition per million entries or 8, whichever is larger.pop_dist (
list
offloat
, optional) – Unnormalized population distribution, a list of length n_populations with nonnegative values. Default is[1, ..., 1]
.fst (
list
offloat
, optional) – \(F_{ST}\) values, a list of length n_populations with values in (0, 1). Default is[0.1, ..., 0.1]
.af_dist (
Float64Expression
, optional) – Representing a random function. Ancestral allele frequency distribution. Default isrand_unif()
over the range [0.1, 0.9] with seed 0.reference_genome (
str
orReferenceGenome
) – Reference genome to use.mixture (
bool
) – Treat pop_dist as the parameters of a Dirichlet distribution, as in the PrichardStevensDonnelly model.
 Returns
MatrixTable
– Simulated matrix table of variants, samples, and genotypes.

hail.methods.
concordance
(left, right, *, _localize_global_statistics=True)[source]¶ Calculate call concordance with another dataset.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Note
Requires the dataset to contain only diploid and unphased genotype calls. Use
call()
to recode genotype calls ornull()
to set genotype calls to missing.Examples
Compute concordance between two datasets and output the global concordance statistics and two tables with concordance computed per column key and per row key:
>>> global_conc, cols_conc, rows_conc = hl.concordance(dataset, dataset2)
Notes
This method computes the genotype call concordance (from the entry field GT) between two biallelic variant datasets. It requires unique sample IDs and performs an inner join on samples (only samples in both datasets will be considered). In addition, all genotype calls must be diploid and unphased.
It performs an ordered zip join of the variants. That means the variants of each dataset are sorted, with duplicate variants appearing in some random relative order, and then zipped together. When a variant appears a different number of times between the two datasets, the dataset with the fewer number of instances is padded with “no data”. For example, if a variant is only in one dataset, then each genotype is treated as “no data” in the other.
This method returns a tuple of three objects: a nested list of list of int with global concordance summary statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.
Using the global summary result
The global summary is a list of list of int (conceptually a 5 by 5 matrix), where the indices have special meaning:
No Data (missing variant)
No Call (missing genotype call)
Hom Ref
Heterozygous
Hom Var
The first index is the state in the left dataset and the second index is the state in the right dataset. Typical uses of the summary list are shown below.
>>> summary, samples, variants = hl.concordance(dataset, dataset2) >>> left_homref_right_homvar = summary[2][4] >>> left_het_right_missing = summary[3][1] >>> left_het_right_something_else = sum(summary[3][:])  summary[3][3] >>> total_concordant = summary[2][2] + summary[3][3] + summary[4][4] >>> total_discordant = sum([sum(s[2:]) for s in summary[2:]])  total_concordant
Using the table results
Table 1: Concordance statistics by column
This table contains the column key field of left, and the following fields:
Table 2: Concordance statistics by row
This table contains the row key fields of left, and the following fields:
In these tables, the column n_discordant is provided as a convenience, because this is often one of the most useful concordance statistics. This value is the number of genotypes which were called (homozygous reference, heterozygous, or homozygous variant) in both datasets, but where the call did not match between the two.
The column concordance matches the structure of the global summmary, which is detailed above. Once again, the first index into this array is the state on the left, and the second index is the state on the right. For example,
concordance[1][4]
is the number of “no call” genotypes on the left that were called homozygous variant on the right. Parameters
left (
MatrixTable
) – First dataset to compare.right (
MatrixTable
) – Second dataset to compare.
 Returns
(list of list of int,
Table
,Table
) – The global concordance statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.

hail.methods.
filter_intervals
(ds, intervals, keep=True)[source]¶ Filter rows with a list of intervals.
Examples
Filter to loci falling within one interval:
>>> ds_result = hl.filter_intervals(dataset, [hl.parse_locus_interval('17:3844984038530994')])
Remove all loci within list of intervals:
>>> intervals = [hl.parse_locus_interval(x) for x in ['1:50M75M', '2:START400000', '322']] >>> ds_result = hl.filter_intervals(dataset, intervals, keep=False)
Notes
Based on the keep argument, this method will either restrict to points in the supplied interval ranges, or remove all rows in those ranges.
When
keep=True
, partitions that don’t overlap any supplied interval will not be loaded at all. This enablesfilter_intervals()
to be used for reasonably lowlatency queries of small ranges of the dataset, even on large datasets. Parameters
ds (
MatrixTable
orTable
) – Dataset to filter.intervals (
ArrayExpression
of typetinterval
) – Intervals to filter on. The point type of the interval must be a prefix of the key or equal to the first field of the key.keep (
bool
) – IfTrue
, keep only rows that fall within any interval in intervals. IfFalse
, keep only rows that fall outside all intervals in intervals.
 Returns
MatrixTable
orTable

hail.methods.
filter_alleles
(mt, f)[source]¶ Filter alternate alleles.
Note
Requires the dataset to have a compound row key:
Examples
Keep SNPs:
>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))
Keep alleles with AC > 0:
>>> ds_result = hl.filter_alleles(ds, lambda a, allele_index: ds.info.AC[allele_index  1] > 0)
Update the AC field of the resulting dataset:
>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i1])) >>> ds_result = ds_result.annotate_rows(info = updated_info)
Notes
The following new fields are generated:
old_locus (
locus
) – The old locus, before filtering and computing the minimal representation.old_alleles (
array<str>
) – The old alleles, before filtering and computing the minimal representation.old_to_new (
array<int32>
) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.new_to_old (
array<int32>
) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.
If all alternate alleles of a variant are filtered out, the variant itself is filtered out.
Using f
The f argument is a function or lambda evaluated per alternate allele to determine whether that allele is kept. If f evaluates to
True
, the allele is kept. If f evaluates toFalse
or missing, the allele is removed.f is a function that takes two arguments: the allele string (of type
StringExpression
) and the allele index (of typeInt32Expression
), and returns a boolean expression. This can be either a defined function or a lambda. For example, these two usages are equivalent:(with a lambda)
>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))
(with a defined function)
>>> def filter_f(allele, allele_index): ... return hl.is_snp(ds.alleles[0], allele) >>> ds_result = hl.filter_alleles(ds, filter_f)
Warning
filter_alleles()
does not update any fields other than locus and alleles. This means that row fields like allele count (AC) and entry fields like allele depth (AD) can become meaningless unless they are also updated. You can update them withannotate_rows()
andannotate_entries()
.See also
 Parameters
mt (
MatrixTable
) – Dataset.f (callable) – Function from (allele:
StringExpression
, allele_index:Int32Expression
) toBooleanExpression
 Returns

hail.methods.
filter_alleles_hts
(mt, f, subset=False)[source]¶ Filter alternate alleles and update standard GATK entry fields.
Examples
Filter to SNP alleles using the subset strategy:
>>> ds_result = hl.filter_alleles_hts( ... ds, ... lambda allele, _: hl.is_snp(ds.alleles[0], allele), ... subset=True)
Update the AC field of the resulting dataset:
>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i1])) >>> ds_result = ds_result.annotate_rows(info = updated_info)
Notes
For usage of the f argument, see the
filter_alleles()
documentation.filter_alleles_hts()
requires the dataset have the GATK VCF schema, namely the following entry fields in this order:GT: call AD: array<int32> DP: int32 GQ: int32 PL: array<int32>
Use
MatrixTable.select_entries()
to rearrange these fields if necessary.The following new fields are generated:
old_locus (
locus
) – The old locus, before filtering and computing the minimal representation.old_alleles (
array<str>
) – The old alleles, before filtering and computing the minimal representation.old_to_new (
array<int32>
) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.new_to_old (
array<int32>
) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.
Downcode algorithm
We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.
GT: 1/2 GQ: 10 AD: 0,50,35 0  1000 1  1000 10 2  1000 0 20 + 0 1 2
The downcode algorithm recodes occurances of filtered alleles to occurances of the reference allele (e.g. 1 > 0 in our example). So the depths of filtered alleles in the AD field are added to the depth of the reference allele. Where downcoding filtered alleles merges distinct genotypes, the minimum PL is used (since PL is on a log scale, this roughly corresponds to adding probabilities). The PLs are then renormalized (shifted) so that the most likely genotype has a PL of 0, and GT is set to this genotype. If an allele is filtered, this algorithm acts similarly to
split_multi_hts()
.The downcode algorithm would produce the following:
GT: 0/1 GQ: 10 AD: 35,50 0  20 1  0 10 + 0 1
In summary:
GT: Downcode filtered alleles to reference.
AD: Columns of filtered alleles are eliminated and their values are added to the reference column, e.g., filtering alleles 1 and 2 transforms
25,5,10,20
to40,20
.DP: No change.
PL: Downcode filtered alleles to reference, combine PLs using minimum for each overloaded genotype, and shift so the overall minimum PL is 0.
GQ: The secondlowest PL (after shifting).
Subset algorithm
We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.
GT: 1/2 GQ: 10 AD: 0,50,35 0  1000 1  1000 10 2  1000 0 20 + 0 1 2
The subset algorithm subsets the AD and PL arrays (i.e. removes entries corresponding to filtered alleles) and then sets GT to the genotype with the minimum PL. Note that if the genotype changes (as in the example), the PLs are renormalized (shifted) so that the most likely genotype has a PL of 0. Qualitatively, subsetting corresponds to the belief that the filtered alleles are not real so we should discard any probability mass associated with them.
The subset algorithm would produce the following:
GT: 1/1 GQ: 980 AD: 0,50 0  980 1  980 0 + 0 1
In summary:
GT: Set to most likely genotype based on the PLs ignoring the filtered allele(s).
AD: The filtered alleles’ columns are eliminated, e.g., filtering alleles 1 and 2 transforms
25,5,10,20
to25,20
.DP: Unchanged.
PL: Columns involving filtered alleles are eliminated and the remaining columns’ values are shifted so the minimum value is 0.
GQ: The secondlowest PL (after shifting).
Warning
filter_alleles_hts()
does not update any row fields other than locus and alleles. This means that row fields like allele count (AC) can become meaningless unless they are also updated. You can update them withannotate_rows()
.See also
 Parameters
mt (
MatrixTable
)f (callable) – Function from (allele:
StringExpression
, allele_index:Int32Expression
) toBooleanExpression
subset (
bool
) – Subset PL field ifTrue
, otherwise downcode PL field. The calculation of GT and GQ also depend on whether one subsets or downcodes the PL.
 Returns

hail.methods.
hwe_normalized_pca
(call_expr, k=10, compute_loadings=False)[source]¶ Run principal component analysis (PCA) on the HardyWeinbergnormalized genotype call matrix.
Examples
>>> eigenvalues, scores, loadings = hl.hwe_normalized_pca(dataset.GT, k=5)
Notes
This method specializes
pca()
for the common use case of PCA in statistical genetics, that of projecting samples to a small number of ancestry coordinates. Variants that are all homozygous reference or all homozygous alternate are unnormalizable and removed before evaluation. Seepca()
for more details.Users of PLINK/GCTA should be aware that Hail computes the GRM slightly differently with regard to missing data. In Hail, the \(ij\) entry of the GRM \(MM^T\) is simply the dot product of rows \(i\) and \(j\) of \(M\); in terms of \(C\) it is
\[\frac{1}{m}\sum_{l\in\mathcal{C}_i\cap\mathcal{C}_j}\frac{(C_{il}2p_l)(C_{jl}  2p_l)}{2p_l(1p_l)}\]where \(\mathcal{C}_i = \{l \mid C_{il} \text{ is nonmissing}\}\). In PLINK/GCTA the denominator \(m\) is replaced with the number of terms in the sum \(\lvert\mathcal{C}_i\cap\mathcal{C}_j\rvert\), i.e. the number of variants where both samples have nonmissing genotypes. While this is arguably a better estimator of the true GRM (trading shrinkage for noise), it has the drawback that one loses the clean interpretation of the loadings and scores as features and projections
Separately, for the PCs PLINK/GCTA output the eigenvectors of the GRM, i.e. the left singular vectors \(U_k\) instead of the component scores \(U_k S_k\). The scores have the advantage of representing true projections of the data onto features with the variance of a score reflecting the variance explained by the corresponding feature. In PC biplots this amounts to a change in aspect ratio; for use of PCs as covariates in regression it is immaterial.
Compute the genetic relatedness matrix (GRM).
Examples
>>> grm = hl.genetic_relatedness_matrix(dataset.GT)
Notes
The genetic relationship matrix (GRM) \(G\) encodes genetic correlation between each pair of samples. It is defined by \(G = MM^T\) where \(M\) is a standardized version of the genotype matrix, computed as follows. Let \(C\) be the \(n \times m\) matrix of raw genotypes in the variant dataset, with rows indexed by \(n\) samples and columns indexed by \(m\) bialellic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the nonmissing entries of column \(j\). Entries of \(M\) are then meancentered and variancenormalized as
\[M_{ij} = \frac{C_{ij}2p_j}{\sqrt{2p_j(1p_j)m}},\]with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in HardyWeinberg equilibrium and is further motivated in the paper Patterson, Price and Reich, 2006. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) so that the diagonal entries of the GRM are approximately 1. Equivalently,
\[G_{ik} = \frac{1}{m} \sum_{j=1}^m \frac{(C_{ij}2p_j)(C_{kj}2p_j)}{2 p_j (1p_j)}\]This method drops variants with \(p_j = 0\) or \(p_j = 1\) before computing kinship.
 Parameters
call_expr (
CallExpression
) – Entryindexed call expression with columns corresponding to samples. Returns
BlockMatrix
– Genetic relatedness matrix for all samples. Row and column indices correspond to matrix table column index.

hail.methods.
realized_relationship_matrix
(call_expr)[source]¶ Computes the realized relationship matrix (RRM).
Examples
>>> rrm = hl.realized_relationship_matrix(dataset.GT)
Notes
The realized relationship matrix (RRM) is defined as follows. Consider the \(n \times m\) matrix \(C\) of raw genotypes, with rows indexed by \(n\) samples and columns indexed by the \(m\) bialellic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the nonmissing entries of column \(j\). Entries of \(M\) are then meancentered and variancenormalized as
\[M_{ij} = \frac{C_{ij}2p_j} {\sqrt{\frac{m}{n} \sum_{k=1}^n (C_{ij}2p_j)^2}},\]with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes each variant column to have empirical variance \(1/m\), which gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the \(n \times n\) sample correlation or realized relationship matrix (RRM) \(K\) as simply
\[K = MM^T\]Note that the only difference between the realized relationship matrix and the genetic relatedness matrix (GRM) used in
realized_relationship_matrix()
is the variant (column) normalization: where RRM uses empirical variance, GRM uses expected variance under HardyWeinberg Equilibrium.This method drops variants with zero variance before computing kinship.
 Parameters
call_expr (
CallExpression
) – Entryindexed call expression on matrix table with columns corresponding to samples. Returns
BlockMatrix
– Realized relationship matrix for all samples. Row and column indices correspond to matrix table column index.

hail.methods.
impute_sex
(call, aaf_threshold=0.0, include_par=False, female_threshold=0.2, male_threshold=0.8, aaf=None)[source]¶ Impute sex of samples by calculating inbreeding coefficient on the X chromosome.
Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Examples
Remove samples where imputed sex does not equal reported sex:
>>> imputed_sex = hl.impute_sex(dataset.GT) >>> dataset_result = dataset.filter_cols(imputed_sex[dataset.s].is_female != dataset.pheno.is_female, ... keep=False)
Notes
We have used the same implementation as PLINK v1.7.
Let gr be the the reference genome of the type of the locus key (as given by
tlocus.reference_genome
)Filter the dataset to loci on the X contig defined by gr.
Calculate alternate allele frequency (AAF) for each row from the dataset.
Filter to variants with AAF above aaf_threshold.
Remove loci in the pseudoautosomal region, as defined by gr, unless include_par is
True
(it defaults toFalse
)For each row and column with a nonmissing genotype call, \(E\), the expected number of homozygotes (from population AAF), is computed as \(1.0  (2.0*\mathrm{maf}*(1.0\mathrm{maf}))\).
For each row and column with a nonmissing genotype call, \(O\), the observed number of homozygotes, is computed interpreting
0
as heterozygote and1
as homozygote`For each row and column with a nonmissing genotype call, \(N\) is incremented by 1
For each column, \(E\), \(O\), and \(N\) are combined across variants
For each column, \(F\) is calculated by \((O  E) / (N  E)\)
A sex is assigned to each sample with the following criteria:  Female when
F < 0.2
 Male whenF > 0.8
Use female_threshold and male_threshold to change this behavior.
Annotations
The returned columnkey indexed
Table
has the following fields in addition to the matrix table’s column keys:is_female (
tbool
) – True if the imputed sex is female, false if male, missing if undetermined.f_stat (
tfloat64
) – Inbreeding coefficient.n_called (
tint64
) – Number of variants with a genotype call.expected_homs (
tfloat64
) – Expected number of homozygotes.observed_homs (
tint64
) – Observed number of homozygotes.
 call
CallExpression
A genotype call for each row and column. The source dataset’s row keys must be [[locus], alleles] with types
tlocus
andtarray
oftstr
. Moreover, the alleles array must have exactly two elements (i.e. the variant must be biallelic). aaf_threshold
float
Minimum alternate allele frequency threshold.
 include_par
bool
Include pseudoautosomal regions.
 female_threshold
float
Samples are called females if F < female_threshold.
 male_threshold
float
Samples are called males if F > male_threshold.
 aaf
str
orNone
A field defining the alternate allele frequency for each row. If
None
, AAF will be computed from call.
 Returns
Table
– Sex imputation statistics per sample.

hail.methods.
ld_matrix
(entry_expr, locus_expr, radius, coord_expr=None, block_size=None)[source]¶ Computes the windowed correlation (linkage disequilibrium) matrix between variants.
Examples
Consider the following dataset consisting of three variants with centimorgan coordinates and four samples:
>>> data = [{'v': '1:1:A:C', 'cm': 0.1, 's': 'a', 'GT': hl.Call([0, 0])}, ... {'v': '1:1:A:C', 'cm': 0.1, 's': 'b', 'GT': hl.Call([0, 0])}, ... {'v': '1:1:A:C', 'cm': 0.1, 's': 'c', 'GT': hl.Call([0, 1])}, ... {'v': '1:1:A:C', 'cm': 0.1, 's': 'd', 'GT': hl.Call([1, 1])}, ... {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'a', 'GT': hl.Call([0, 1])}, ... {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'b', 'GT': hl.Call([1, 1])}, ... {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'c', 'GT': hl.Call([0, 1])}, ... {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'd', 'GT': hl.Call([0, 0])}, ... {'v': '2:1:C:G', 'cm': 0.2, 's': 'a', 'GT': hl.Call([0, 1])}, ... {'v': '2:1:C:G', 'cm': 0.2, 's': 'b', 'GT': hl.Call([0, 0])}, ... {'v': '2:1:C:G', 'cm': 0.2, 's': 'c', 'GT': hl.Call([1, 1])}, ... {'v': '2:1:C:G', 'cm': 0.2, 's': 'd', 'GT': hl.null(hl.tcall)}] >>> ht = hl.Table.parallelize(data, hl.dtype('struct{v: str, s: str, cm: float64, GT: call}')) >>> ht = ht.transmute(**hl.parse_variant(ht.v)) >>> mt = ht.to_matrix_table(row_key=['locus', 'alleles'], col_key=['s'], row_fields=['cm'])
Compute linkage disequilibrium between all pairs of variants on the same contig and within two megabases:
>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=2e6) >>> ld.to_numpy() array([[ 1. , 0.85280287, 0. ], [0.85280287, 1. , 0. ], [ 0. , 0. , 1. ]])
Within one megabases:
>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1e6) >>> ld.to_numpy() array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
Within one centimorgan:
>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1.0, coord_expr=mt.cm) >>> ld.to_numpy() array([[ 1. , 0.85280287, 0. ], [0.85280287, 1. , 0. ], [ 0. , 0. , 1. ]])
Within one centimorgan, and only calculate the upper triangle:
>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1.0, coord_expr=mt.cm) >>> ld = ld.sparsify_triangle() >>> ld.to_numpy() array([[ 1. , 0.85280287, 0. ], [ 0. , 1. , 0. ], [ 0. , 0. , 1. ]])
Notes
This method sparsifies the result of
row_correlation()
usinglinalg.utils.locus_windows()
andBlockMatrix.sparsify_row_intervals()
in order to only compute linkage disequilibrium between nearby variants. Userow_correlation()
directly to calculate correlation without windowing.More precisely, variants are 0indexed by their order in the matrix table (see
add_row_index()
). Each variant is regarded as a vector of elements defined by entry_expr, typically the number of alternate alleles or genotype dosage. Missing values are meanimputed within variant.The method produces a symmetric blocksparse matrix supported in a neighborhood of the diagonal. If variants \(i\) and \(j\) are on the same contig and within radius base pairs (inclusive) then the \((i, j)\) element is their Pearson correlation coefficient. Otherwise, the \((i, j)\) element is
0.0
.Rows with a constant value (i.e., zero variance) will result in
nan
correlation values. To avoid this, first check that all variants vary or filter out constant variants (for example, with the help ofaggregators.stats()
).If the
global_position()
on locus_expr is not in ascending order, this method will fail. Ascending order should hold for a matrix table keyed by locus or variant (and the associated row table), or for a table that’s been ordered by locus_expr.Set coord_expr to use a value other than position to define the windows. This rowindexed numeric expression must be nonmissing, non
nan
, on the same source as locus_expr, and ascending with respect to locus position for each contig; otherwise the method will raise an error.Warning
See the warnings in
row_correlation()
. In particular, for large matrices it may be preferable to run its stages separately.entry_expr and locus_expr are implicitly aligned by rowindex, though they need not be on the same source. If their sources differ in the number of rows, an error will be raised; otherwise, unintended misalignment may silently produce unexpected results.
 Parameters
entry_expr (
Float64Expression
) – Entryindexed numeric expression on matrix table.locus_expr (
LocusExpression
) – Rowindexed locus expression on a table or matrix table that is rowaligned with the matrix table of entry_expr.coord_expr (
Float64Expression
, optional) – Rowindexed numeric expression for the row value on the same table or matrix table as locus_expr. By default, the row value is given by the locus position.block_size (
int
, optional) – Block size. Default given byBlockMatrix.default_block_size()
.
 Returns
BlockMatrix
– Windowed correlation matrix between variants. Row and column indices correspond to matrix table variant index.

hail.methods.
ld_prune
(call_expr, r2=0.2, bp_window_size=1000000, memory_per_core=256, keep_higher_maf=True, block_size=None)[source]¶ Returns a maximal subset of variants that are nearly uncorrelated within each window.
Note
Requires the dataset to contain only diploid genotype calls.
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Note
Requires the dataset to have a compound row key:
Examples
Prune variants in linkage disequilibrium by filtering a dataset to those variants returned by
ld_prune()
. If the dataset contains multiallelic variants, the multiallelic variants must be filtered out or split before being passed told_prune()
.>>> biallelic_dataset = dataset.filter_rows(hl.len(dataset.alleles) == 2) >>> pruned_variant_table = hl.ld_prune(biallelic_dataset.GT, r2=0.2, bp_window_size=500000) >>> filtered_ds = dataset.filter_rows(hl.is_defined(pruned_variant_table[dataset.row_key]))
Notes
This method finds a maximal subset of variants such that the squared Pearson correlation coefficient \(r^2\) of any pair at most bp_window_size base pairs apart is strictly less than r2. Each variant is represented as a vector over samples with elements given by the (meanimputed) number of alternate alleles. In particular, even if present, phase information is ignored. Variants that do not vary across samples are dropped.
The method prunes variants in linkage disequilibrium in three stages.
The first, “local pruning” stage prunes correlated variants within each partition, using a local variant queue whose size is determined by memory_per_core. A larger queue may facilitate more local pruning in this stage. Minor allele frequency is not taken into account. The parallelism is the number of matrix table partitions.
The second, “global correlation” stage uses blocksparse matrix multiplication to compute correlation between each pair of remaining variants within bp_window_size base pairs, and then forms a graph of correlated variants. The parallelism of writing the locallypruned matrix table as a block matrix is
n_locally_pruned_variants / block_size
.The third, “global pruning” stage applies
maximal_independent_set()
to prune variants from this graph until no edges remain. This algorithm iteratively removes the variant with the highest vertex degree. If keep_higher_maf is true, then in the case of a tie for highest degree, the variant with lowest minor allele frequency is removed.
Warning
The locallypruned matrix table and block matrix are stored as temporary files on persistent disk. See the warnings on BlockMatrix.from_entry_expr with regard to memory and Hadoop replication errors.
 Parameters
call_expr (
CallExpression
) – Entryindexed call expression on a matrix table with rowindexed variants and columnindexed samples.r2 (
float
) – Squared correlation threshold (exclusive upper bound). Must be in the range [0.0, 1.0].bp_window_size (
int
) – Window size in base pairs (inclusive upper bound).memory_per_core (
int
) – Memory in MB per core for local pruning queue.keep_higher_maf (
int
) – IfTrue
, break ties at each step of the global pruning stage by preferring to keep variants with higher minor allele frequency.block_size (
int
, optional) – Block size for block matrices in the second stage. Default given byBlockMatrix.default_block_size()
.
 Returns
Table
– Table of a maximal independent set of variants.

hail.methods.
mendel_errors
(call, pedigree)[source]¶ Find Mendel errors; count per variant, individual and nuclear family.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Examples
Find all violations of Mendelian inheritance in each (dad, mom, kid) trio in a pedigree and return four tables (all errors, errors by family, errors by individual, errors by variant):
>>> ped = hl.Pedigree.read('data/trios.fam') >>> all_errors, per_fam, per_sample, per_variant = hl.mendel_errors(dataset['GT'], ped)
Export all mendel errors to a text file:
>>> all_errors.export('output/all_mendel_errors.tsv')
Annotate columns with the number of Mendel errors:
>>> annotated_samples = dataset.annotate_cols(mendel=per_sample[dataset.s])
Annotate rows with the number of Mendel errors:
>>> annotated_variants = dataset.annotate_rows(mendel=per_variant[dataset.locus, dataset.alleles])
Notes
The example above returns four tables, which contain Mendelian violations grouped in various ways. These tables are modeled after the PLINK mendel formats, resembling the
.mendel
,.fmendel
,.imendel
, and.lmendel
formats, respectively.First table: all Mendel errors. This table contains one row per Mendel error, keyed by the variant and proband id.
Second table: errors per nuclear family. This table contains one row per nuclear family, keyed by the parents.
pat_id (
tstr
) – Paternal ID. (key field)mat_id (
tstr
) – Maternal ID. (key field)fam_id (
tstr
) – Family ID.children (
tint32
) – Number of children in this nuclear family.errors (
tint64
) – Number of Mendel errors in this nuclear family.snp_errors (
tint64
) – Number of Mendel errors at SNPs in this nuclear family.
Third table: errors per individual. This table contains one row per individual. Each error is counted toward the proband, father, and mother according to the Implicated in the table below.
Fourth table: errors per variant.
This method only considers complete trios (two parents and proband with defined sex). The code of each Mendel error is determined by the table below, extending the Plink classification.
In the table, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region (PAR) of X and Y defined by the reference genome and the autosome is defined by
in_autosome()
.Auto – in autosome or in PAR or female child
HemiX – in nonPAR of X and male child
HemiY – in nonPAR of Y and male child
Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.
Code
Dad
Mom
Kid
Copy State  Implicated
1
HomVar
HomVar
Het
Auto
Dad, Mom, Kid
2
HomRef
HomRef
Het
Auto
Dad, Mom, Kid
3
HomRef
~HomRef
HomVar
Auto
Dad, Kid
4
~HomRef
HomRef
HomVar
Auto
Mom, Kid
5
HomRef
HomRef
HomVar
Auto
Kid
6
HomVar
~HomVar
HomRef
Auto
Dad, Kid
7
~HomVar
HomVar
HomRef
Auto
Mom, Kid
8
HomVar
HomVar
HomRef
Auto
Kid
9
Any
HomVar
HomRef
HemiX
Mom, Kid
10
Any
HomRef
HomVar
HemiX
Mom, Kid
11
HomVar
Any
HomRef
HemiY
Dad, Kid
12
HomRef
Any
HomVar
HemiY
Dad, Kid
See also

hail.methods.
de_novo
(mt, pedigree, pop_frequency_prior, *, min_gq=20, min_p=0.05, max_parent_ab=0.05, min_child_ab=0.2, min_dp_ratio=0.1, ignore_in_sample_allele_frequency=False)[source]¶ Call putative de novo events from trio data.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Examples
Call de novo events:
>>> pedigree = hl.Pedigree.read('data/trios.fam') >>> priors = hl.import_table('data/gnomadFreq.tsv', impute=True) >>> priors = priors.transmute(**hl.parse_variant(priors.Variant)).key_by('locus', 'alleles') >>> de_novo_results = hl.de_novo(dataset, pedigree, pop_frequency_prior=priors[dataset.row_key].AF)
Notes
This method assumes the GATK highthroughput sequencing fields exist: GT, AD, DP, GQ, PL.
This method replicates the functionality of Kaitlin Samocha’s de novo caller. The version corresponding to git commit
bde3e40
is implemented in Hail with her permission and assistance.This method produces a
Table
with the following fields:locus (
locus
) – Variant locus.alleles (
array<str>
) – Variant alleles.id (
str
) – Proband sample ID.prior (
float64
) – Site frequency prior. It is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior1 / 3e7
. If the ignore_in_sample_allele_frequency parameter isTrue
, then the computed allele frequency is not included in the calculation, and the prior is the maximum of the pop_frequency_prior and1 / 3e7
.proband (
struct
) – Proband column fields from mt.father (
struct
) – Father column fields from mt.mother (
struct
) – Mother column fields from mt.proband_entry (
struct
) – Proband entry fields from mt.father_entry (
struct
) – Father entry fields from mt.proband_entry (
struct
) – Mother entry fields from mt.is_female (
bool
) –True
if proband is female.p_de_novo (
float64
) – Unfiltered posterior probability that the event is de novo rather than a missed heterozygous event in a parent.confidence (
str
) Validation confidence. One of:'HIGH'
,'MEDIUM'
,'LOW'
.
The key of the table is
['locus', 'alleles', 'id']
.The model looks for de novo events in which both parents are homozygous reference and the proband is a heterozygous. The model makes the simplifying assumption that when this configuration
x = (AA, AA, AB)
of calls occurs, exactly one of the following is true:d
: a de novo mutation occurred in the proband and all calls are accurate.m
: at least one parental allele is actually heterozygous and the proband call is accurate.
We can then estimate the posterior probability of a de novo mutation as:
\[\mathrm{P_{\text{de novo}}} = \frac{\mathrm{P}(d \mid x)}{\mathrm{P}(d \mid x) + \mathrm{P}(m \mid x)}\]Applying Bayes rule to the numerator and denominator yields
\[\frac{\mathrm{P}(x \mid d)\,\mathrm{P}(d)}{\mathrm{P}(x \mid d)\,\mathrm{P}(d) + \mathrm{P}(x \mid m)\,\mathrm{P}(m)}\]The prior on de novo mutation is estimated from the rate in the literature:
\[\mathrm{P}(d) = \frac{1 \, \text{mutation}}{30{,}000{,}000 \, \text{bases}}\]The prior used for at least one alternate allele between the parents depends on the alternate allele frequency:
\[\mathrm{P}(m) = 1  (1  AF)^4\]The likelihoods \(\mathrm{P}(x \mid d)\) and \(\mathrm{P}(x \mid m)\) are computed from the PL (genotype likelihood) fields using these factorizations:
\[\mathrm{P}(x = (AA, AA, AB) \mid d) = \left( \begin{aligned} &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AA) \\ {} \cdot {} &\mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AA) \\ {} \cdot {} &\mathrm{P}(x_{\mathrm{proband}} = AB \mid \mathrm{proband} = AB) \end{aligned} \right) \]\[\begin{aligned} \mathrm{P}(x = (AA, AA, AB) \mid m) = &\left( \begin{aligned} &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AB) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AA) \\ {} + {} &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AA) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AB) \end{aligned} \right) \\ &{} \cdot \mathrm{P}(x_{\mathrm{proband}} = AB \mid \mathrm{proband} = AB) \end{aligned} \](Technically, the second factorization assumes there is exactly (rather than at least) one alternate allele among the parents, which may be justified on the grounds that it is typically the most likely case by far.)
While this posterior probability is a good metric for grouping putative de novo mutations by validation likelihood, there exist error modes in highthroughput sequencing data that are not appropriately accounted for by the phredscaled genotype likelihoods. To this end, a number of hard filters are applied in order to assign validation likelihood.
These filters are different for SNPs and insertions/deletions. In the below rules, the following variables are used:
DR
refers to the ratio of the read depth in the proband to the combined read depth in the parents.DP
refers to the read depth (DP field) of the proband.AB
refers to the read allele balance of the proband (number of alternate reads divided by total reads).AC
refers to the count of alternate alleles across all individuals in the dataset at the site.p
refers to \(\mathrm{P_{\text{de novo}}}\).min_p
refers to the min_p function parameter.
HIGHquality SNV:
(p > 0.99) AND (AB > 0.3) AND (AC == 1) OR (p > 0.99) AND (AB > 0.3) AND (DR > 0.2) OR (p > 0.5) AND (AB > 0.3) AND (AC < 10) AND (DP > 10)
MEDIUMquality SNV:
(p > 0.5) AND (AB > 0.3) OR (AC == 1)
LOWquality SNV:
(AB > 0.2)
HIGHquality indel:
(p > 0.99) AND (AB > 0.3) AND (AC == 1)
MEDIUMquality indel:
(p > 0.5) AND (AB > 0.3) AND (AC < 10)
LOWquality indel:
(AB > 0.2)
Additionally, de novo candidates are not considered if the proband GQ is smaller than the min_gq parameter, if the proband allele balance is lower than the min_child_ab parameter, if the depth ratio between the proband and parents is smaller than the min_depth_ratio parameter, if the allele balance in a parent is above the max_parent_ab parameter, or if the posterior probability p is smaller than the min_p parameter.
 Parameters
mt (
MatrixTable
) – Highthroughput sequencing dataset.pedigree (
Pedigree
) – Sample pedigree.pop_frequency_prior (
Float64Expression
) – Expression for population alternate allele frequency prior.min_gq – Minimum proband GQ to be considered for de novo calling.
min_p – Minimum posterior probability to be considered for de novo calling.
max_parent_ab – Maximum parent allele balance.
min_child_ab – Minimum proband allele balance/
min_dp_ratio – Minimum ratio between proband read depth and parental read depth.
ignore_in_sample_allele_frequency – Ignore insample allele frequency in computing site prior. Experimental.
 Returns

hail.methods.
nirvana
(dataset, config, block_size=500000, name='nirvana')[source]¶ Annotate variants using Nirvana.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Note
Requires the dataset to have a compound row key:
nirvana()
runs Nirvana on the current dataset and adds a new row field in the location specified by name.Examples
Add Nirvana annotations to the dataset:
>>> result = hl.nirvana(dataset, "data/nirvana.properties")
Configuration
nirvana()
requires a configuration file. The format is a .properties file, where each line defines a property as a keyvalue pair of the formkey = value
.nirvana()
supports the following properties:hail.nirvana.dotnet – Location of dotnet. Optional, default: dotnet.
hail.nirvana.path – Value of the PATH environment variable when invoking Nirvana. Optional, by default PATH is not set.
hail.nirvana.location – Location of Nirvana.dll. Required.
hail.nirvana.reference – Location of reference genome. Required.
hail.nirvana.cache – Location of cache. Required.
hail.nirvana.supplementaryAnnotationDirectory – Location of Supplementary Database. Optional, no supplementary database by default.
Here is an example
nirvana.properties
configuration file:hail.nirvana.location = /path/to/dotnet/netcoreapp2.0/Nirvana.dll hail.nirvana.reference = /path/to/nirvana/References/Homo_sapiens.GRCh37.Nirvana.dat hail.nirvana.cache = /path/to/nirvana/Cache/GRCh37/Ensembl hail.nirvana.supplementaryAnnotationDirectory = /path/to/nirvana/SupplementaryDatabase/GRCh37
Annotations
A new row field is added in the location specified by name with the following schema:
struct { chromosome: str, refAllele: str, position: int32, altAlleles: array<str>, cytogeneticBand: str, quality: float64, filters: array<str>, jointSomaticNormalQuality: int32, copyNumber: int32, strandBias: float64, recalibratedQuality: float64, variants: array<struct { altAllele: str, refAllele: str, chromosome: str, begin: int32, end: int32, phylopScore: float64, isReferenceMinor: bool, variantType: str, vid: str, hgvsg: str, isRecomposedVariant: bool, isDecomposedVariant: bool, regulatoryRegions: array<struct { id: str, type: str, consequence: set<str> }>, clinvar: array<struct { id: str, reviewStatus: str, isAlleleSpecific: bool, alleleOrigins: array<str>, refAllele: str, altAllele: str, phenotypes: array<str>, medGenIds: array<str>, omimIds: array<str>, orphanetIds: array<str>, significance: str, lastUpdatedDate: str, pubMedIds: array<str> }>, cosmic: array<struct { id: str, isAlleleSpecific: bool, refAllele: str, altAllele: str, gene: str, sampleCount: int32, studies: array<struct { id: int32, histology: str, primarySite: str }> }>, dbsnp: struct { ids: array<str> }, globalAllele: struct { globalMinorAllele: str, globalMinorAlleleFrequency: float64 }, gnomad: struct { coverage: str, allAf: float64, allAc: int32, allAn: int32, allHc: int32, afrAf: float64, afrAc: int32, afrAn: int32, afrHc: int32, amrAf: float64, amrAc: int32, amrAn: int32, amrHc: int32, easAf: float64, easAc: int32, easAn: int32, easHc: int32, finAf: float64, finAc: int32, finAn: int32, finHc: int32, nfeAf: float64, nfeAc: int32, nfeAn: int32, nfeHc: int32, othAf: float64, othAc: int32, othAn: int32, othHc: int32, asjAf: float64, asjAc: int32, asjAn: int32, asjHc: int32, failedFilter: bool }, gnomadExome: struct { coverage: str, allAf: float64, allAc: int32, allAn: int32, allHc: int32, afrAf: float64, afrAc: int32, afrAn: int32, afrHc: int32, amrAf: float64, amrAc: int32, amrAn: int32, amrHc: int32, easAf: float64, easAc: int32, easAn: int32, easHc: int32, finAf: float64, finAc: int32, finAn: int32, finHc: int32, nfeAf: float64, nfeAc: int32, nfeAn: int32, nfeHc: int32, othAf: float64, othAc: int32, othAn: int32, othHc: int32, asjAf: float64, asjAc: int32, asjAn: int32, asjHc: int32, sasAf: float64, sasAc: int32, sasAn: int32, sasHc: int32, failedFilter: bool }, topmed: struct { failedFilter: bool, allAc: int32, allAn: int32, allAf: float64, allHc: int32 }, oneKg: struct { ancestralAllele: str, allAf: float64, allAc: int32, allAn: int32, afrAf: float64, afrAc: int32, afrAn: int32, amrAf: float64, amrAc: int32, amrAn: int32, easAf: float64, easAc: int32, easAn: int32, eurAf: float64, eurAc: int32, eurAn: int32, sasAf: float64, sasAc: int32, sasAn: int32 }, mitomap: array<struct { refAllele: str, altAllele: str, diseases : array<str>, hasHomoplasmy: bool, hasHeteroplasmy: bool, status: str, clinicalSignificance: str, scorePercentile: float64, isAlleleSpecific: bool, chromosome: str, begin: int32, end: int32, variantType: str } transcripts: struct { refSeq: array<struct { transcript: str, bioType: str, aminoAcids: str, cdnaPos: str, codons: str, cdsPos: str, exons: str, introns: str, geneId: str, hgnc: str, consequence: array<str>, hgvsc: str, hgvsp: str, isCanonical: bool, polyPhenScore: float64, polyPhenPrediction: str, proteinId: str, proteinPos: str, siftScore: float64, siftPrediction: str }>, ensembl: array<struct { transcript: str, bioType: str, aminoAcids: str, cdnaPos: str, codons: str, cdsPos: str, exons: str, introns: str, geneId: str, hgnc: str, consequence: array<str>, hgvsc: str, hgvsp: str, isCanonical: bool, polyPhenScore: float64, polyPhenPrediction: str, proteinId: str, proteinPos: str, siftScore: float64, siftPrediction: str }> }, overlappingGenes: array<str> }> genes: array<struct { name: str, omim: array<struct { mimNumber: int32, hgnc: str, description: str, phenotypes: array<struct { mimNumber: int32, phenotype: str, mapping: str, inheritance: array<str>, comments: str }> }> exac: struct { pLi: float64, pRec: float64, pNull: float64 } }> }
 Parameters
dataset (
MatrixTable
orTable
) – Dataset.config (
str
) – Path to Nirvana configuration file.block_size (
int
) – Number of rows to process per Nirvana invocation.name (
str
) – Name for resulting row field.
 Returns
MatrixTable
orTable
– Dataset with new rowindexed field name containing Nirvana annotations.

hail.methods.
sample_qc
(mt, name='sample_qc')[source]¶ Compute persample metrics useful for quality control.
Note
Requires the dataset to have a compound row key:
Examples
Compute sample QC metrics and remove lowquality samples:
>>> dataset = hl.sample_qc(dataset, name='sample_qc') >>> filtered_dataset = dataset.filter_cols((dataset.sample_qc.dp_stats.mean > 20) & (dataset.sample_qc.r_ti_tv > 1.5))
Notes
This method computes summary statistics per sample from a genetic matrix and stores the results as a new columnindexed struct field in the matrix, named based on the name parameter.
If mt contains an entry field DP of type
tint32
, then the field dp_stats is computed. If mt contains an entry field GQ of typetint32
, then the field gq_stats is computed. Both dp_stats and gq_stats are structs with with four fields:mean (
float64
) – Mean value.stdev (
float64
) – Standard deviation (zero degrees of freedom).min (
int32
) – Minimum value.max (
int32
) – Maximum value.
If the dataset does not contain an entry field GT of type
tcall
, then an error is raised. The following fields are always computed from GT:call_rate (
float64
) – Fraction of calls not missing or filtered. Equivalent to n_called divided bycount_rows()
.n_called (
int64
) – Number of nonmissing calls.n_not_called (
int64
) – Number of missing calls.n_filtered (
int64
) – Number of filtered entries.n_hom_ref (
int64
) – Number of homozygous reference calls.n_het (
int64
) – Number of heterozygous calls.n_hom_var (
int64
) – Number of homozygous alternate calls.n_non_ref (
int64
) – Sum of n_het and n_hom_var.n_snp (
int64
) – Number of SNP alternate alleles.n_insertion (
int64
) – Number of insertion alternate alleles.n_deletion (
int64
) – Number of deletion alternate alleles.n_singleton (
int64
) – Number of private alleles.n_transition (
int64
) – Number of transition (AG, CT) alternate alleles.n_transversion (
int64
) – Number of transversion alternate alleles.n_star (
int64
) – Number of star (upstream deletion) alleles.r_ti_tv (
float64
) – Transition/Transversion ratio.r_het_hom_var (
float64
) – Het/HomVar call ratio.r_insertion_deletion (
float64
) – Insertion/Deletion allele ratio.
Missing values
NA
may result from division by zero. Parameters
mt (
MatrixTable
) – Dataset.name (
str
) – Name for resulting field.
 Returns
MatrixTable
– Dataset with a new columnindexed field name.

hail.methods.
skat
(key_expr, weight_expr, y, x, covariates, logistic=False, max_size=46340, accuracy=1e06, iterations=10000)[source]¶ Test each keyed group of rows for association by linear or logistic SKAT test.
Examples
Test each gene for association using the linear sequence kernel association test:
>>> skat_table = hl.skat(key_expr=burden_ds.gene, ... weight_expr=burden_ds.weight, ... y=burden_ds.burden.pheno, ... x=burden_ds.GT.n_alt_alleles(), ... covariates=[1, burden_ds.burden.cov1, burden_ds.burden.cov2])
Caution
By default, the Davies algorithm iterates up to 10k times until an accuracy of 1e6 is achieved. Hence a reported pvalue of zero with no issues may truly be as large as 1e6. The accuracy and maximum number of iterations may be controlled by the corresponding function parameters. In general, higher accuracy requires more iterations.
Caution
To process a group with \(m\) rows, several copies of an \(m \times m\) matrix of doubles must fit in worker memory. Groups with tens of thousands of rows may exhaust worker memory causing the entire job to fail. In this case, use the max_size parameter to skip groups larger than max_size.
Warning
skat()
considers the same set of columns (i.e., samples, points) for every group, namely those columns for which all covariates are defined. For each row, missing values of x are meanimputed over these columns. As in the example, the intercept covariate1
must be included explicitly if desired.Notes
This method provides a scalable implementation of the scorebased variancecomponent test originally described in RareVariant Association Testing for Sequencing Data with the Sequence Kernel Association Test.
Row weights must be nonnegative. Rows with missing weights are ignored. In the R package
skat
—which assumes rows are variants—default weights are given by evaluating the Beta(1, 25) density at the minor allele frequency. To replicate these weights in Hail using alternate allele frequencies stored in a rowindexed field AF, one can use the expression:>>> hl.dbeta(hl.min(ds2.AF), 1.0, 25.0) ** 2
In the logistic case, the response y must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.
The resulting
Table
provides the group’s key (id), thenumber of rows in the group (size), the variance component score q_stat, the SKAT pvalue, and a fault flag. For the toy example above, the table has the form:id
size
q_stat
p_value
fault
geneA
2
4.136
0.205
0
geneB
1
5.659
0.195
0
geneC
3
4.122
0.192
0
Groups larger than max_size appear with missing q_stat, p_value, and fault. The hard limit on the number of rows in a group is 46340.
Note that the variance component score q_stat agrees with
Q
in the R packageskat
, but both differ from \(Q\) in the paper by the factor \(\frac{1}{2\sigma^2}\) in the linear case and \(\frac{1}{2}\) in the logistic case, where \(\sigma^2\) is the unbiased estimator of residual variance for the linear null model. The R package also applies a “smallsample adjustment” to the null distribution in the logistic case when the sample size is less than 2000. Hail does not apply this adjustment.The fault flag is an integer indicating whether any issues occurred when running the Davies algorithm to compute the pvalue as the right tail of a weighted sum of \(\chi^2(1)\) distributions.
fault value
Description
0
no issues
1
accuracy NOT achieved
2
roundoff error possibly significant
3
invalid parameters
4
unable to locate integration parameters
5
out of memory
 Parameters
key_expr (
Expression
) – Rowindexed expression for key associated to each row.weight_expr (
Float64Expression
) – Rowindexed expression for row weights.y (
Float64Expression
) – Columnindexed response expression. If logistic isTrue
, all nonmissing values must evaluate to 0 or 1. Note that aBooleanExpression
will be implicitly converted to aFloat64Expression
with this property.x (
Float64Expression
) – Entryindexed expression for input variable.covariates (
list
ofFloat64Expression
) – List of columnindexed covariate expressions.logistic (
bool
) – If true, use the logistic test rather than the linear test.max_size (
int
) – Maximum size of group on which to run the test.accuracy (
float
) – Accuracy achieved by the Davies algorithm if fault value is zero.iterations (
int
) – Maximum number of iterations attempted by the Davies algorithm.
 Returns
Table
– Table of SKAT results.

hail.methods.
lambda_gc
(p_value, approximate=True)[source]¶ Compute genomic inflation factor (lambda GC) from an Expression of pvalues.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
 Parameters
p_value (
NumericExpression
) – Rowindexed numeric expression of pvalues.approximate (
bool
) – If False, computes exact lambda GC (slower and uses more memory).
 Returns
float
– Genomic inflation factor (lambda genomic control).

hail.methods.
split_multi
(ds, keep_star=False, left_aligned=False, *, permit_shuffle=False)[source]¶ Split multiallelic variants.
Warning
In order to support a wide variety of data types, this function splits only the variants on a
MatrixTable
, but not the genotypes. Usesplit_multi_hts()
if possible, or split the genotypes yourself using one of the entry modification methods:MatrixTable.annotate_entries()
,MatrixTable.select_entries()
,MatrixTable.transmute_entries()
.The resulting dataset will be keyed by the split locus and alleles.
split_multi()
adds the following fields:was_split (bool) –
True
if this variant was originally multiallelic, otherwiseFalse
.a_index (int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with
a_index = 1
and 1:100:A:C witha_index = 2
.old_locus (locus) – The original, unsplit locus.
old_alleles (array<str>) – The original, unsplit alleles.
All other fields are left unchanged.
Example
split_multi_hts()
, which splits multiallelic variants for the HTS genotype schema and updates the entry fields by downcoding the genotype, is implemented as:>>> sm = hl.split_multi(ds) >>> pl = hl.or_missing( ... hl.is_defined(sm.PL), ... (hl.range(0, 3).map(lambda i: hl.min(hl.range(0, hl.len(sm.PL)) ... .filter(lambda j: hl.downcode(hl.unphased_diploid_gt_index_call(j), sm.a_index) == hl.unphased_diploid_gt_index_call(i)) ... .map(lambda j: sm.PL[j]))))) >>> split_ds = sm.annotate_entries( ... GT=hl.downcode(sm.GT, sm.a_index), ... AD=hl.or_missing(hl.is_defined(sm.AD), ... [hl.sum(sm.AD)  sm.AD[sm.a_index], sm.AD[sm.a_index]]), ... DP=sm.DP, ... PL=pl, ... GQ=hl.gq_from_pl(pl)).drop('old_locus', 'old_alleles')
See also
 Parameters
ds (
MatrixTable
orTable
) – An unsplit dataset.keep_star (
bool
) – Do not filter out * alleles.left_aligned (
bool
) – IfTrue
, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated.permit_shuffle (
bool
) – IfTrue
, permit a data shuffle to sort outoforder split results. This will only be required if input data has duplicate loci, one of which contains more than one alternate allele.
 Returns
MatrixTable
orTable

hail.methods.
split_multi_hts
(ds, keep_star=False, left_aligned=False, vep_root='vep', *, permit_shuffle=False)[source]¶ Split multiallelic variants for datasets that contain one or more fields from a standard highthroughput sequencing entry schema.
struct { GT: call, AD: array<int32>, DP: int32, GQ: int32, PL: array<int32>, PGT: call, PID: str }
For other entry fields, write your own splitting logic using
MatrixTable.annotate_entries()
.Examples
>>> hl.split_multi_hts(dataset).write('output/split.vds')
Notes
We will explain by example. Consider a hypothetical 3allelic variant:
A C,T 0/2:7,2,6:15:45:99,50,99,0,45,99
split_multi_hts()
will create two biallelic variants (one for each alternate allele) at the same positionA C 0/0:13,2:15:45:0,45,99 A T 0/1:9,6:15:50:50,0,99
Each multiallelic GT or PGT field is downcoded once for each alternate allele. A call for an alternate allele maps to 1 in the biallelic variant corresponding to itself and 0 otherwise. For example, in the example above, 0/2 maps to 0/0 and 0/1. The genotype 1/2 maps to 0/1 and 0/1.
The biallelic alt AD entry is just the multiallelic AD entry corresponding to the alternate allele. The ref AD entry is the sum of the other multiallelic entries.
The biallelic DP is the same as the multiallelic DP.
The biallelic PL entry for a genotype g is the minimum over PL entries for multiallelic genotypes that downcode to g. For example, the PL for (A, T) at 0/1 is the minimum of the PLs for 0/1 (50) and 1/2 (45), and thus 45.
Fixing an alternate allele and biallelic variant, downcoding gives a map from multiallelic to biallelic alleles and genotypes. The biallelic AD entry for an allele is just the sum of the multiallelic AD entries for alleles that map to that allele. Similarly, the biallelic PL entry for a genotype is the minimum over multiallelic PL entries for genotypes that map to that genotype.
GQ is recomputed from PL if PL is provided and is not missing. If not, it is copied from the original GQ.
Here is a second example for a het nonref
A C,T 1/2:2,8,6:16:45:99,50,99,45,0,99
splits as
A C 0/1:8,8:16:45:45,0,99 A T 0/1:10,6:16:50:50,0,99
VCF Info Fields
Hail does not split fields in the info field. This means that if a multiallelic site with info.AC value
[10, 2]
is split, each split site will contain the same array[10, 2]
. The provided allele index field a_index can be used to select the value corresponding to the split allele’s position:>>> split_ds = hl.split_multi_hts(dataset) >>> split_ds = split_ds.filter_rows(split_ds.info.AC[split_ds.a_index  1] < 10, ... keep = False)
VCFs split by Hail and exported to new VCFs may be incompatible with other tools, if action is not taken first. Since the “Number” of the arrays in split multiallelic sites no longer matches the structure on import (“A” for 1 per allele, for example), Hail will export these fields with number “.”.
If the desired output is one value per site, then it is possible to use annotate_variants_expr to remap these values. Here is an example:
>>> split_ds = hl.split_multi_hts(dataset) >>> split_ds = split_ds.annotate_rows(info = hl.struct(AC=split_ds.info.AC[split_ds.a_index  1], ... **split_ds.info)) >>> hl.export_vcf(split_ds, 'output/export.vcf')
The info field AC in data/export.vcf will have
Number=1
.New Fields
split_multi_hts()
adds the following fields:was_split (bool) –
True
if this variant was originally multiallelic, otherwiseFalse
.a_index (int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with
a_index = 1
and 1:100:A:C witha_index = 2
.
See also
 Parameters
ds (
MatrixTable
orTable
) – An unsplit dataset.keep_star (
bool
) – Do not filter out * alleles.left_aligned (
bool
) – IfTrue
, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated.vep_root (
str
) – Toplevel location of vep data. All variablelength VEP fields (intergenic_consequences, motif_feature_consequences, regulatory_feature_consequences, and transcript_consequences) will be split properly (i.e. a_index corresponding to the VEP allele_num).permit_shuffle (
bool
) – IfTrue
, permit a data shuffle to sort outoforder split results. This will only be required if input data has duplicate loci, one of which contains more than one alternate allele.
 Returns
MatrixTable
orTable
– A biallelic variant dataset.

hail.methods.
summarize_variants
(mt, show=True, *, handler=None)[source]¶ Summarize the variants present in a dataset and print the results.
Examples
>>> hl.summarize_variants(dataset) ============================== Number of variants: 346 ============================== Alleles per variant  2 alleles: 346 variants ============================== Variants per contig  20: 346 variants ============================== Allele type distribution  SNP: 301 alleles Deletion: 27 alleles Insertion: 18 alleles ==============================
 Parameters
mt (
MatrixTable
orTable
) – Matrix table with a variant (locus / alleles) row key.show (
bool
) – IfTrue
, print results instead of returning them.handler
Notes
The result returned if show is
False
is aStruct
with five fields:n_variants (
int
): Number of variants present in the matrix table.allele_types (
dict
[str
,int
]): Number of alternate alleles in each allele allele category.contigs (
dict
[str
,int
]): Number of variants on each contig.allele_counts (
dict
[int
,int
]): Number of variants broken down by number of alleles (biallelic is 2, for example).r_ti_tv (
float
): Ratio of transition alternate alleles to transversion alternate alleles.

hail.methods.
transmission_disequilibrium_test
(dataset, pedigree)[source]¶ Performs the transmission disequilibrium test on trios.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Examples
Compute TDT association statistics and show the first two results:
>>> pedigree = hl.Pedigree.read('data/tdt_trios.fam') >>> tdt_table = hl.transmission_disequilibrium_test(tdt_dataset, pedigree) >>> tdt_table.show(2) +++++++  locus  alleles  t  u  chi_sq  p_value  +++++++  locus<GRCh37>  array<str>  int64  int64  float64  float64  +++++++  1:246714629  ["C","A"]  0  4  4.00e+00  4.55e02   2:167262169  ["T","C"]  NA  NA  NA  NA  +++++++
Export variants with pvalues below 0.001:
>>> tdt_table = tdt_table.filter(tdt_table.p_value < 0.001) >>> tdt_table.export(f"output/tdt_results.tsv")
Notes
The transmission disequilibrium test compares the number of times the alternate allele is transmitted (t) versus not transmitted (u) from a heterozgyous parent to an affected child. The null hypothesis holds that each case is equally likely. The TDT statistic is given by
\[(t  u)^2 \over (t + u)\]and asymptotically follows a chisquared distribution with one degree of freedom under the null hypothesis.
transmission_disequilibrium_test()
only considers complete trios (two parents and a proband with defined sex) and only returns results for the autosome, as defined byin_autosome()
, and chromosome X. Transmissions and nontransmissions are counted only for the configurations of genotypes and copy state in the table below, in order to filter out Mendel errors and configurations where transmission is guaranteed. The copy state of a locus with respect to a trio is defined as follows:Auto – in autosome or in PAR of X or female child
HemiX – in nonPAR of X and male child
Here PAR is the pseudoautosomal region of X and Y defined by
ReferenceGenome
, which many variant callers map to chromosome X.Kid
Dad
Mom
Copy State
t
u
HomRef
Het
Het
Auto
0
2
HomRef
HomRef
Het
Auto
0
1
HomRef
Het
HomRef
Auto
0
1
Het
Het
Het
Auto
1
1
Het
HomRef
Het
Auto
1
0
Het
Het
HomRef
Auto
1
0
Het
HomVar
Het
Auto
0
1
Het
Het
HomVar
Auto
0
1
HomVar
Het
Het
Auto
2
0
HomVar
Het
HomVar
Auto
1
0
HomVar
HomVar
Het
Auto
1
0
HomRef
HomRef
Het
HemiX
0
1
HomRef
HomVar
Het
HemiX
0
1
HomVar
HomRef
Het
HemiX
1
0
HomVar
HomVar
Het
HemiX
1
0
transmission_disequilibrium_test()
produces a table with the following columns: Parameters
dataset (
MatrixTable
) – Dataset.pedigree (
Pedigree
) – Sample pedigree.
 Returns
Table
– Table of TDT results.

hail.methods.
trio_matrix
(dataset, pedigree, complete_trios=False)[source]¶ Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio.
Note
Requires the column key to be one field of type
tstr
Examples
Create a trio matrix:
>>> pedigree = hl.Pedigree.read('data/case_control_study.fam') >>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)
Notes
This method builds a new matrix table with one column per trio. If complete_trios is
True
, then only trios that satisfyTrio.is_complete()
are included. In this new dataset, the column identifiers are the sample IDs of the trio probands. The column fields and entries of the matrix are changed in the following ways:The new column fields consist of three structs (proband, father, mother), a Boolean field, and a string field:
proband (
tstruct
)  Column fields on the proband.father (
tstruct
)  Column fields on the father.mother (
tstruct
)  Column fields on the mother.id (
tstr
)  Column key for the proband.is_female (
tbool
)  Proband is female.True
for female,False
for male, missing if unknown.fam_id (
tstr
)  Family ID.
The new entry fields are:
proband_entry (
tstruct
)  Proband entry fields.father_entry (
tstruct
)  Father entry fields.mother_entry (
tstruct
)  Mother entry fields.
 Parameters
pedigree (
Pedigree
) Returns

hail.methods.
variant_qc
(mt, name='variant_qc')[source]¶ Compute common variant statistics (quality control metrics).
Note
Requires the dataset to have a compound row key:
Examples
>>> dataset_result = hl.variant_qc(dataset)
Notes
This method computes variant statistics from the genotype data, returning a new struct field name with the following metrics based on the fields present in the entry schema.
If mt contains an entry field DP of type
tint32
, then the field dp_stats is computed. If mt contains an entry field GQ of typetint32
, then the field gq_stats is computed. Both dp_stats and gq_stats are structs with with four fields:mean (
float64
) – Mean value.stdev (
float64
) – Standard deviation (zero degrees of freedom).min (
int32
) – Minimum value.max (
int32
) – Maximum value.
If the dataset does not contain an entry field GT of type
tcall
, then an error is raised. The following fields are always computed from GT:AF (
array<float64>
) – Calculated allele frequency, one element per allele, including the reference. Sums to one. Equivalent to AC / AN.AC (
array<int32>
) – Calculated allele count, one element per allele, including the reference. Sums to AN.AN (
int32
) – Total number of called alleles.homozygote_count (
array<int32>
) – Number of homozygotes per allele. One element per allele, including the reference.call_rate (
float64
) – Fraction of calls neither missing nor filtered. Equivalent to n_called /count_cols()
.n_called (
int64
) – Number of samples with a defined GT.n_not_called (
int64
) – Number of samples with a missing GT.n_filtered (
int64
) – Number of filtered entries.n_het (
int64
) – Number of heterozygous samples.n_non_ref (
int64
) – Number of samples with at least one called nonreference allele.het_freq_hwe (
float64
) – Expected frequency of heterozygous samples under HardyWeinberg equilibrium. Seefunctions.hardy_weinberg_test()
for details.p_value_hwe (
float64
) – pvalue from test of HardyWeinberg equilibrium. Seefunctions.hardy_weinberg_test()
for details.
Warning
het_freq_hwe and p_value_hwe are calculated as in
functions.hardy_weinberg_test()
, with nondiploid calls (ploidy != 2
) ignored in the counts. As this test is only statistically rigorous in the biallelic setting,variant_qc()
sets both fields to missing for multiallelic variants. Consider usingsplit_multi()
to split multiallelic variants beforehand. Parameters
mt (
MatrixTable
) – Dataset.name (
str
) – Name for resulting field.
 Returns

hail.methods.
vep
(dataset, config=None, block_size=1000, name='vep', csq=False)[source]¶ Annotate variants with VEP.
Note
Requires the dataset to have a compound row key:
vep()
runs Variant Effect Predictor on the current dataset and adds the result as a row field.Examples
Add VEP annotations to the dataset:
>>> result = hl.vep(dataset, "data/vepconfiguration.json")
Notes
Installation
This VEP command only works if you have already installed VEP on your computing environment. If you use hailctl dataproc to start Hail clusters, installing VEP is achieved by specifying the –vep flag. For more detailed instructions, see Variant Effect Predictor (VEP).
Configuration
vep()
needs a configuration file to tell it how to run VEP. This is theconfig
argument to the VEP function. If you are using hailctl dataproc as mentioned above, you can just use the default argument forconfig
and everything will work. If you need to run VEP with Hail in other environments, there are detailed instructions below.The format of the configuration file is JSON, and
vep()
expects a JSON object with three fields:command (array of string) – The VEP command line to run. The string literal __OUTPUT_FORMAT_FLAG__ is replaced with –json or –vcf depending on csq.
env (object) – A map of environment variables to values to add to the environment when invoking the command. The value of each object member must be a string.
vep_json_schema (string): The type of the VEP JSON schema (as produced by the VEP when invoked with the –json option). Note: This is the oldstyle ‘parseable’ Hail type syntax. This will change.
Here is an example configuration file for invoking VEP release 85 installed in /vep with the Loftee plugin:
{ "command": [ "/vep", "format", "vcf", "__OUTPUT_FORMAT_FLAG__", "everything", "allele_number", "no_stats", "cache", "offline", "minimal", "assembly", "GRCh37", "plugin", "LoF,human_ancestor_fa:/root/.vep/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/root/.vep/loftee_data/phylocsf_gerp.sql,gerp_file:/root/.vep/loftee_data/GERP_scores.final.sorted.txt.gz", "o", "STDOUT" ], "env": { "PERL5LIB": "/vep_data/loftee" }, "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,uniparc:String,variant_allele:String}],variant_class:String}" }
The configuration files used by``hailctl dataproc`` can be found at the following locations:
GRCh37
:gs://hailusvep/vep85lofteegcloud.json
GRCh38
:gs://hailusvep/vep95GRCh38lofteegcloud.json
If no config file is specified, this function will check to see if environment variable VEP_CONFIG_URI is set with a path to a config file.
Annotations
A new row field is added in the location specified by name with type given by the type given by the json_vep_schema (if csq is
False
) ortstr
(if csq isTrue
).If csq is
True
, then the CSQ header string is also added as a global field with namename + '_csq_header'
. Parameters
dataset (
MatrixTable
orTable
) – Dataset.config (
str
) – Path to VEP configuration file.block_size (
int
) – Number of rows to process per VEP invocation.name (
str
) – Name for resulting row field.csq (
bool
) – IfTrue
, annotates with the VCF CSQ field as atstr
. IfFalse
, annotates as the vep_json_schema.
 Returns
MatrixTable
orTable
– Dataset with new rowindexed field name containing VEP annotations.