Genetics

VEPConfig()

Base class for configuring VEP.

VEPConfigGRCh37Version85(*, data_bucket, ...)

The Hail-maintained VEP configuration for GRCh37 for VEP version 85.

VEPConfigGRCh38Version95(*, data_bucket, ...)

The Hail-maintained VEP configuration for GRCh38 for VEP version 95.

balding_nichols_model(n_populations, ...[, ...])

Generate a matrix table of variants, samples, and genotypes using the Balding-Nichols or Pritchard-Stephens-Donnelly model.

concordance(left, right, *[, ...])

Calculate call concordance with another dataset.

filter_intervals(ds, intervals[, keep])

Filter rows with a list of intervals.

filter_alleles(mt, f)

Filter alternate alleles.

filter_alleles_hts(mt, f[, subset])

Filter alternate alleles and update standard GATK entry fields.

hwe_normalized_pca(call_expr[, k, ...])

Run principal component analysis (PCA) on the Hardy-Weinberg-normalized genotype call matrix.

genetic_relatedness_matrix(call_expr)

Compute the genetic relatedness matrix (GRM).

realized_relationship_matrix(call_expr)

Computes the realized relationship matrix (RRM).

impute_sex(call[, aaf_threshold, ...])

Impute sex of samples by calculating inbreeding coefficient on the X chromosome.

ld_matrix(entry_expr, locus_expr, radius[, ...])

Computes the windowed correlation (linkage disequilibrium) matrix between variants.

ld_prune(call_expr[, r2, bp_window_size, ...])

Returns a maximal subset of variants that are nearly uncorrelated within each window.

compute_charr(ds[, min_af, max_af, min_dp, ...])

Compute CHARR, the DNA sample contamination estimator.

mendel_errors(call, pedigree)

Find Mendel errors; count per variant, individual and nuclear family.

de_novo(mt, pedigree, pop_frequency_prior, *)

Call putative de novo events from trio data.

nirvana(dataset, config[, block_size, name])

Annotate variants using Nirvana.

sample_qc(mt[, name])

Compute per-sample metrics useful for quality control.

_logistic_skat(group, weight, y, x, covariates)

The logistic sequence kernel association test (SKAT).

skat(key_expr, weight_expr, y, x, covariates)

Test each keyed group of rows for association by linear or logistic SKAT test.

lambda_gc(p_value[, approximate])

Compute genomic inflation factor (lambda GC) from an Expression of p-values.

split_multi(ds[, keep_star, left_aligned, ...])

Split multiallelic variants.

split_multi_hts(ds[, keep_star, ...])

Split multiallelic variants for datasets that contain one or more fields from a standard high-throughput sequencing entry schema.

summarize_variants(mt[, show, handler])

Summarize the variants present in a dataset and print the results.

transmission_disequilibrium_test(dataset, ...)

Performs the transmission disequilibrium test on trios.

trio_matrix(dataset, pedigree[, complete_trios])

Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio.

variant_qc(mt[, name])

Compute common variant statistics (quality control metrics).

vep(dataset[, config, block_size, name, ...])

Annotate variants with VEP.

class hail.methods.VEPConfig[source]

Base class for configuring VEP.

To define a custom VEP configuration for Query on Batch, construct a new class that inherits from VEPConfig and has the following parameters defined:

  • json_type (HailType) – The type of the VEP JSON schema (as produced by VEP when invoked with the --json option).

  • data_bucket (str) – The location where the VEP data is stored.

  • data_mount (str) – The location in the container where the data should be mounted.

  • batch_run_command (list of str) – The command line to run for a VEP job for a partition.

  • batch_run_csq_header_command (list of str) – The command line to run when generating the consequence header.

  • env (dict of str to str) – A map of environment variables to values to add to the environment when invoking the command.

  • cloud (str) – The cloud where the Batch Service is located.

  • image (str) – The docker image to run VEP.

  • data_bucket_is_requester_pays (bool) – True if the data bucket is requester pays.

  • regions (list of str) – A list of regions the VEP jobs can run in.

In addition, the method command must be defined with the following signature; its output is the exact command used to invoke the VEP executable. Its inputs are consequence and tolerate_parse_error, which are user-defined parameters to vep(); part_id, the partition ID; input_file, the path to the file containing the input data; and output_file, the path to the file to which the VEP annotations are written. An example is shown below:

def command(self,
            consequence: bool,
            tolerate_parse_error: bool,
            part_id: int,
            input_file: Optional[str],
            output_file: str) -> List[str]:
    vcf_or_json = '--vcf' if consequence else '--json'
    input_file = f'--input_file {input_file}' if input_file else ''
    return f'''/vep/vep {input_file} \
        --format vcf \
        {vcf_or_json} \
        --everything \
        --allele_number \
        --no_stats \
        --cache \
        --offline \
        --minimal \
        --assembly GRCh37 \
        --dir={self.data_mount} \
        --plugin LoF,human_ancestor_fa:{self.data_mount}/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:{self.data_mount}/loftee_data/phylocsf_gerp.sql,gerp_file:{self.data_mount}/loftee_data/GERP_scores.final.sorted.txt.gz \
        -o STDOUT
'''

The following environment variables are added to the job’s environment:

  • VEP_BLOCK_SIZE - The maximum number of variants provided as input to each invocation of VEP.

  • VEP_PART_ID - Partition ID.

  • VEP_DATA_MOUNT - Location where the VEP data is mounted (same as data_mount in the config).

  • VEP_CONSEQUENCE - Integer equal to 1 if csq is True and 0 if it is False.

  • VEP_TOLERATE_PARSE_ERROR - Integer equal to 1 if tolerate_parse_error is True and 0 if it is False.

  • VEP_OUTPUT_FILE - String specifying the local path where the output TSV file with the VEP result should be written.

  • VEP_INPUT_FILE - String specifying the local path where the input VCF shard is located.

The VEP_INPUT_FILE environment variable is not available for the single job that computes the consequence header when csq=True.
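
The pieces above can be assembled into a custom configuration. The sketch below is illustrative only: the bucket, image, region, command, and json_type values are hypothetical placeholders and must be replaced with values appropriate for your VEP build and Batch environment.

from typing import List, Optional

import hail as hl
from hail.methods import VEPConfig


class MyVEPConfig(VEPConfig):
    def __init__(self):
        # All values below are hypothetical placeholders.
        self.json_type = hl.tstr  # replace with the actual VEP JSON schema type
        self.data_bucket = 'my-vep-data'
        self.data_mount = '/vep/data/'
        self.image = 'myregistry/vep:85'
        self.cloud = 'gcp'
        self.data_bucket_is_requester_pays = False
        self.regions = ['us-central1']
        self.env = {}
        self.batch_run_command = ['python3', '/run_vep.py', 'vep']
        self.batch_run_csq_header_command = ['python3', '/run_vep.py', 'csq_header']

    def command(self,
                consequence: bool,
                tolerate_parse_error: bool,
                part_id: int,
                input_file: Optional[str],
                output_file: str) -> List[str]:
        # Build and return the full VEP command line, as in the example above.
        ...

An instance of such a class is then intended to be supplied as the config argument of vep() when running on Query on Batch.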

class hail.methods.VEPConfigGRCh37Version85(*, data_bucket, data_mount, image, regions, cloud, data_bucket_is_requester_pays)[source]

The Hail-maintained VEP configuration for GRCh37 for VEP version 85.

This class takes the following constructor arguments:

  • data_bucket (str) – The location where the VEP data is stored.

  • data_mount (str) – The location in the container where the data should be mounted.

  • image (str) – The docker image to run VEP.

  • cloud (str) – The cloud where the Batch Service is located.

  • data_bucket_is_requester_pays (bool) – True if the data bucket is requester pays.

  • regions (list of str) – A list of regions the VEP jobs can run in.
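
For example, a sketch of constructing this configuration (the bucket, image, and region values are hypothetical placeholders; the resulting object can then be supplied as the config argument of vep()):

>>> vep_config = hl.methods.VEPConfigGRCh37Version85(
...     data_bucket='my-vep-grch37-data',
...     data_mount='/vep/data/',
...     image='myregistry/vep:85',
...     regions=['us-central1'],
...     cloud='gcp',
...     data_bucket_is_requester_pays=False)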

class hail.methods.VEPConfigGRCh38Version95(*, data_bucket, data_mount, image, regions, cloud, data_bucket_is_requester_pays)[source]

The Hail-maintained VEP configuration for GRCh38 for VEP version 95.

This class takes the following constructor arguments:

  • data_bucket (str) – The location where the VEP data is stored.

  • data_mount (str) – The location in the container where the data should be mounted.

  • image (str) – The docker image to run VEP.

  • cloud (str) – The cloud where the Batch Service is located.

  • data_bucket_is_requester_pays (bool) – True if the data bucket is set to requester pays.

  • regions (list of str) – A list of regions the VEP jobs can run in.

hail.methods.balding_nichols_model(n_populations, n_samples, n_variants, n_partitions=None, pop_dist=None, fst=None, af_dist=None, reference_genome='default', mixture=False, *, phased=False)[source]

Generate a matrix table of variants, samples, and genotypes using the Balding-Nichols or Pritchard-Stephens-Donnelly model.

Examples

Generate a matrix table of genotypes with 1000 variants and 100 samples across 3 populations:

>>> hl.reset_global_randomness()
>>> bn_ds = hl.balding_nichols_model(3, 100, 1000)
>>> bn_ds.show(n_rows=5, n_cols=5)
+---------------+------------+------+------+------+------+------+
| locus         | alleles    | 0.GT | 1.GT | 2.GT | 3.GT | 4.GT |
+---------------+------------+------+------+------+------+------+
| locus<GRCh37> | array<str> | call | call | call | call | call |
+---------------+------------+------+------+------+------+------+
| 1:1           | ["A","C"]  | 0/1  | 0/0  | 0/1  | 0/0  | 0/0  |
| 1:2           | ["A","C"]  | 1/1  | 1/1  | 1/1  | 1/1  | 0/1  |
| 1:3           | ["A","C"]  | 0/1  | 0/1  | 1/1  | 0/1  | 1/1  |
| 1:4           | ["A","C"]  | 0/1  | 0/0  | 0/1  | 0/0  | 0/1  |
| 1:5           | ["A","C"]  | 0/1  | 0/1  | 0/1  | 0/0  | 0/0  |
+---------------+------------+------+------+------+------+------+
showing top 5 rows
showing the first 5 of 100 columns

Generate a dataset as above but with phased genotypes:

>>> hl.reset_global_randomness()
>>> bn_ds = hl.balding_nichols_model(3, 100, 1000, phased=True)
>>> bn_ds.show(n_rows=5, n_cols=5)
+---------------+------------+------+------+------+------+------+
| locus         | alleles    | 0.GT | 1.GT | 2.GT | 3.GT | 4.GT |
+---------------+------------+------+------+------+------+------+
| locus<GRCh37> | array<str> | call | call | call | call | call |
+---------------+------------+------+------+------+------+------+
| 1:1           | ["A","C"]  | 0|0  | 0|0  | 0|0  | 0|0  | 1|0  |
| 1:2           | ["A","C"]  | 1|1  | 1|1  | 1|1  | 1|1  | 1|1  |
| 1:3           | ["A","C"]  | 1|1  | 1|1  | 0|1  | 1|1  | 1|1  |
| 1:4           | ["A","C"]  | 0|0  | 1|0  | 0|0  | 1|0  | 0|0  |
| 1:5           | ["A","C"]  | 0|0  | 0|1  | 0|0  | 0|0  | 0|0  |
+---------------+------------+------+------+------+------+------+
showing top 5 rows
showing the first 5 of 100 columns

Generate a matrix table using 4 populations, 40 samples, 150 variants, 3 partitions, population distribution [0.1, 0.2, 0.3, 0.4], \(F_{ST}\) values [.02, .06, .04, .12], and ancestral allele frequencies drawn from a truncated beta distribution with a = 0.01 and b = 2.0 over the interval [0.05, 1]:

>>> hl.reset_global_randomness()
>>> bn_ds = hl.balding_nichols_model(4, 40, 150, 3,
...          pop_dist=[0.1, 0.2, 0.3, 0.4],
...          fst=[.02, .06, .04, .12],
...          af_dist=hl.rand_beta(a=0.01, b=2.0, lower=0.05, upper=1.0))

To guarantee reproducibility, we reset Hail's global random state with reset_global_randomness() immediately prior to generating the dataset.

Notes

This method simulates a matrix table of variants, samples, and genotypes using the Balding-Nichols model, which we now define.

  • \(K\) populations are labeled by integers \(0, 1, \dots, K - 1\).

  • \(N\) samples are labeled by strings \(0, 1, \dots, N - 1\).

  • \(M\) variants are defined as 1:1:A:C, 1:2:A:C, …, 1:M:A:C.

  • The default distribution for population assignment \(\pi\) is uniform.

  • The default ancestral frequency distribution \(P_0\) is uniform on \([0.1, 0.9]\).

  • The default \(F_{ST}\) values are all \(0.1\).

The Balding-Nichols model models genotypes of individuals from a structured population comprising \(K\) homogeneous modern populations that have each diverged from a single ancestral population (a star phylogeny). Each sample is assigned a population by sampling from the categorical distribution \(\pi\). Note that the actual size of each population is random.

Variants are modeled as biallelic and unlinked. Ancestral allele frequencies are drawn independently for each variant from a frequency spectrum \(P_0\). The extent of genetic drift of each modern population from the ancestral population is defined by the corresponding \(F_{ST}\) parameter \(F_k\) (here and below, lowercase indices run over a range bounded by the corresponding uppercase parameter, e.g. \(k = 1, \ldots, K\)). For each variant and population, allele frequencies are drawn from a beta distribution whose parameters are determined by the ancestral allele frequency and \(F_{ST}\) parameter. The beta distribution gives a continuous approximation of the effect of genetic drift. We denote sample population assignments by \(k_n\), ancestral allele frequencies by \(p_m\), population allele frequencies by \(p_{k, m}\), and diploid, unphased genotype calls by \(g_{n, m}\) (0, 1, and 2 correspond to homozygous reference, heterozygous, and homozygous variant, respectively).

The generative model is then given by:

\[\begin{aligned} k_n \,&\sim\, \pi \\ p_m \,&\sim\, P_0 \\ p_{k,m} \mid p_m\,&\sim\, \mathrm{Beta}(\mu = p_m,\, \sigma^2 = F_k p_m (1 - p_m)) \\ g_{n,m} \mid k_n, p_{k, m} \,&\sim\, \mathrm{Binomial}(2, p_{k_n, m}) \end{aligned} \]

The beta distribution above is parameterized by its mean and variance; the usual shape parameters are \(a = p \frac{1 - F}{F}\) and \(b = (1 - p) \frac{1 - F}{F}\) with \(F = F_k\) and \(p = p_m\).
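
For example, with \(p = 0.5\) and \(F = 0.1\), both shape parameters equal \(0.5 \cdot \frac{0.9}{0.1} = 4.5\), so the population allele frequency has mean \(4.5 / 9 = 0.5 = p\) and variance \(4.5^2 / (9^2 \cdot 10) = 0.025 = F\,p\,(1 - p)\), as required.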

The resulting dataset has the following fields.

Global fields:

  • bn.n_populations (tint32) – Number of populations.

  • bn.n_samples (tint32) – Number of samples.

  • bn.n_variants (tint32) – Number of variants.

  • bn.n_partitions (tint32) – Number of partitions.

  • bn.pop_dist (tarray of tfloat64) – Population distribution indexed by population.

  • bn.fst (tarray of tfloat64) – \(F_{ST}\) values indexed by population.

  • bn.seed (tint32) – Random seed.

  • bn.mixture (tbool) – Value of mixture parameter.

Row fields:

  • locus (tlocus) – Variant locus (key field).

  • alleles (tarray of tstr) – Variant alleles (key field).

  • ancestral_af (tfloat64) – Ancestral allele frequency.

  • af (tarray of tfloat64) – Modern allele frequencies indexed by population.

Column fields:

  • sample_idx (tint32) - Sample index (key field).

  • pop (tint32) – Population of sample.

Entry fields:

  • GT (tcall) – Genotype call (diploid, unphased).

For the Pritchard-Stephens-Donnelly model, set mixture to True to treat pop_dist as the parameters of the Dirichlet distribution describing admixture between the modern populations. In this case, the type of pop is tarray of tfloat64 and its value is the vector of mixture proportions.
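
For example, a minimal sketch of generating an admixed dataset and inspecting the per-sample mixture proportions (output not shown):

>>> hl.reset_global_randomness()
>>> psd_ds = hl.balding_nichols_model(3, 100, 1000, mixture=True)
>>> psd_ds.cols().show(3)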

Parameters:
  • n_populations (int) – Number of modern populations.

  • n_samples (int) – Total number of samples.

  • n_variants (int) – Number of variants.

  • n_partitions (int, optional) – Number of partitions. Default is 1 partition per million entries or 8, whichever is larger.

  • pop_dist (list of float, optional) – Unnormalized population distribution, a list of length n_populations with non-negative values. Default is [1, ..., 1].

  • fst (list of float, optional) – \(F_{ST}\) values, a list of length n_populations with values in (0, 1). Default is [0.1, ..., 0.1].

  • af_dist (Float64Expression, optional) – Ancestral allele frequency distribution, given as a random expression such as rand_unif() or rand_beta(). Default is rand_unif() over the range [0.1, 0.9] with seed 0.

  • reference_genome (str or ReferenceGenome) – Reference genome to use.

  • mixture (bool) – Treat pop_dist as the parameters of a Dirichlet distribution, as in the Pritchard-Stephens-Donnelly model.

  • phased (bool) – Generate phased genotypes.

Returns:

MatrixTable – Simulated matrix table of variants, samples, and genotypes.

hail.methods.concordance(left, right, *, _localize_global_statistics=True)[source]

Calculate call concordance with another dataset.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key: locus (tlocus) and alleles (tarray of tstr).

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note

Requires the dataset to contain only diploid and unphased genotype calls. Use call() to recode genotype calls or missing() to set genotype calls to missing.

Examples

Compute concordance between two datasets and output the global concordance statistics and two tables with concordance computed per column key and per row key:

>>> global_conc, cols_conc, rows_conc = hl.concordance(dataset, dataset2)

Notes

This method computes the genotype call concordance (from the entry field GT) between two biallelic variant datasets. It requires unique sample IDs and performs an inner join on samples (only samples in both datasets will be considered). In addition, all genotype calls must be diploid and unphased.

It performs an ordered zip join of the variants. That means the variants of each dataset are sorted, with duplicate variants appearing in some random relative order, and then zipped together. When a variant appears a different number of times in the two datasets, the dataset with fewer instances is padded with “no data”. For example, if a variant is present in only one dataset, each genotype at that variant is treated as “no data” in the other.

This method returns a tuple of three objects: a nested list of list of int with global concordance summary statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.

Using the global summary result

The global summary is a list of list of int (conceptually a 5 by 5 matrix), where indices 0 through 4 along each dimension correspond to the following states:

  • 0 – No Data (missing variant or filtered entry)

  • 1 – No Call (missing genotype call)

  • 2 – Hom Ref

  • 3 – Heterozygous

  • 4 – Hom Var

The first index is the state in the left dataset and the second index is the state in the right dataset. Typical uses of the summary list are shown below.

>>> summary, samples, variants = hl.concordance(dataset, dataset2)
>>> left_homref_right_homvar = summary[2][4]
>>> left_het_right_missing = summary[3][1]
>>> left_het_right_something_else = sum(summary[3][:]) - summary[3][3]
>>> total_concordant = summary[2][2] + summary[3][3] + summary[4][4]
>>> total_discordant = sum([sum(s[2:]) for s in summary[2:]]) - total_concordant

Using the table results

Table 1: Concordance statistics by column

This table contains the column key field of left, and the following fields:

  • n_discordant (tint64) – Count of discordant calls (see below for full definition).

  • concordance (tarray of tarray of tint64) – Array of concordance per state on left and right, matching the structure of the global summary defined above.

Table 2: Concordance statistics by row

This table contains the row key fields of left, and the following fields:

  • n_discordant (tfloat64) – Count of discordant calls (see below for full definition).

  • concordance (tarray of tarray of tint64) – Array of concordance per state on left and right, matching the structure of the global summary defined above.

In these tables, the column n_discordant is provided as a convenience, because this is often one of the most useful concordance statistics. This value is the number of genotypes which were called (homozygous reference, heterozygous, or homozygous variant) in both datasets, but where the call did not match between the two.

The column concordance matches the structure of the global summary, which is detailed above. Once again, the first index into this array is the state on the left, and the second index is the state on the right. For example, concordance[1][4] is the number of “no call” genotypes on the left that were called homozygous variant on the right.
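
For instance, the per-column table can be sorted to surface the samples with the most discordant calls (a sketch reusing the tables from the example above):

>>> summary, cols_conc, rows_conc = hl.concordance(dataset, dataset2)
>>> cols_conc.order_by(hl.desc(cols_conc.n_discordant)).show(5)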

Parameters:
  • left (MatrixTable) – First dataset to compare.

  • right (MatrixTable) – Second dataset to compare.
Returns:

(list of list of int, Table, Table) – The global concordance statistics, a table with concordance statistics per column key, and a table with concordance statistics per row key.

hail.methods.filter_intervals(ds, intervals, keep=True)[source]

Filter rows with a list of intervals.

Examples

Filter to loci falling within one interval:

>>> ds_result = hl.filter_intervals(dataset, [hl.parse_locus_interval('17:38449840-38530994')])

Remove all loci within list of intervals:

>>> intervals = [hl.parse_locus_interval(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
>>> ds_result = hl.filter_intervals(dataset, intervals, keep=False)

Notes

Based on the keep argument, this method will either restrict to points in the supplied interval ranges, or remove all rows in those ranges.

When keep=True, partitions that don’t overlap any supplied interval will not be loaded at all. This enables filter_intervals() to be used for reasonably low-latency queries of small ranges of the dataset, even on large datasets.

Parameters:
  • ds (MatrixTable or Table) – Dataset to filter.

  • intervals (ArrayExpression of type tinterval) – Intervals to filter on. The point type of the interval must be a prefix of the key or equal to the first field of the key.

  • keep (bool) – If True, keep only rows that fall within any interval in intervals. If False, keep only rows that fall outside all intervals in intervals.

Returns:

MatrixTable or Table

hail.methods.filter_alleles(mt, f)[source]

Filter alternate alleles.

Note

Requires the dataset to have a compound row key: locus (tlocus) and alleles (tarray of tstr).

Examples

Keep SNPs:

>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))

Keep alleles with AC > 0:

>>> ds_result = hl.filter_alleles(ds, lambda a, allele_index: ds.info.AC[allele_index - 1] > 0)

Update the AC field of the resulting dataset:

>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i-1]))
>>> ds_result = ds_result.annotate_rows(info = updated_info)

Notes

The following new fields are generated:

  • old_locus (locus) – The old locus, before filtering and computing the minimal representation.

  • old_alleles (array<str>) – The old alleles, before filtering and computing the minimal representation.

  • old_to_new (array<int32>) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.

  • new_to_old (array<int32>) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.

If all alternate alleles of a variant are filtered out, the variant itself is filtered out.

Using f

The f argument is a function or lambda evaluated per alternate allele to determine whether that allele is kept. If f evaluates to True, the allele is kept. If f evaluates to False or missing, the allele is removed.

f is a function that takes two arguments: the allele string (of type StringExpression) and the allele index (of type Int32Expression), and returns a boolean expression. This can be either a defined function or a lambda. For example, these two usages are equivalent:

(with a lambda)

>>> ds_result = hl.filter_alleles(ds, lambda allele, i: hl.is_snp(ds.alleles[0], allele))

(with a defined function)

>>> def filter_f(allele, allele_index):
...     return hl.is_snp(ds.alleles[0], allele)
>>> ds_result = hl.filter_alleles(ds, filter_f)

Warning

filter_alleles() does not update any fields other than locus and alleles. This means that row fields like allele count (AC) and entry fields like allele depth (AD) can become meaningless unless they are also updated. You can update them with annotate_rows() and annotate_entries().

Parameters:
  • mt (MatrixTable) – Dataset.

  • f – Function from (allele: StringExpression, allele_index: Int32Expression) to BooleanExpression.
Returns:

MatrixTable

hail.methods.filter_alleles_hts(mt, f, subset=False)[source]

Filter alternate alleles and update standard GATK entry fields.

Examples

Filter to SNP alleles using the subset strategy:

>>> ds_result = hl.filter_alleles_hts(
...     ds,
...     lambda allele, _: hl.is_snp(ds.alleles[0], allele),
...     subset=True)

Update the AC field of the resulting dataset:

>>> updated_info = ds_result.info.annotate(AC = ds_result.new_to_old.map(lambda i: ds_result.info.AC[i-1]))
>>> ds_result = ds_result.annotate_rows(info = updated_info)

Notes

For usage of the f argument, see the filter_alleles() documentation.

filter_alleles_hts() requires the dataset have the GATK VCF schema, namely the following entry fields in this order:

GT: call
AD: array<int32>
DP: int32
GQ: int32
PL: array<int32>

Use MatrixTable.select_entries() to rearrange these fields if necessary.

The following new fields are generated:

  • old_locus (locus) – The old locus, before filtering and computing the minimal representation.

  • old_alleles (array<str>) – The old alleles, before filtering and computing the minimal representation.

  • old_to_new (array<int32>) – An array that maps old allele index to new allele index. Its length is the same as old_alleles. Alleles that are filtered are missing.

  • new_to_old (array<int32>) – An array that maps new allele index to the old allele index. Its length is the same as the modified alleles field.

Downcode algorithm

We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.

GT: 1/2
GQ: 10
AD: 0,50,35

0 | 1000
1 | 1000   10
2 | 1000   0     20
  +-----------------
     0     1     2

The downcode algorithm recodes occurrences of filtered alleles to occurrences of the reference allele (e.g. 1 -> 0 in our example). So the depths of filtered alleles in the AD field are added to the depth of the reference allele. Where downcoding filtered alleles merges distinct genotypes, the minimum PL is used (since PL is on a log scale, this roughly corresponds to adding probabilities). The PLs are then re-normalized (shifted) so that the most likely genotype has a PL of 0, and GT is set to this genotype. If an allele is filtered, this algorithm acts similarly to split_multi_hts().

The downcode algorithm would produce the following:

GT: 0/1
GQ: 10
AD: 35,50

0 | 20
1 | 0    10
  +-----------
    0    1

In summary:

  • GT: Downcode filtered alleles to reference.

  • AD: Columns of filtered alleles are eliminated and their values are added to the reference column, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 40,20.

  • DP: No change.

  • PL: Downcode filtered alleles to reference, combine PLs using minimum for each overloaded genotype, and shift so the overall minimum PL is 0.

  • GQ: The second-lowest PL (after shifting).

Subset algorithm

We will illustrate the behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.

GT: 1/2
GQ: 10
AD: 0,50,35

0 | 1000
1 | 1000   10
2 | 1000   0     20
  +-----------------
     0     1     2

The subset algorithm subsets the AD and PL arrays (i.e. removes entries corresponding to filtered alleles) and then sets GT to the genotype with the minimum PL. Note that if the genotype changes (as in the example), the PLs are re-normalized (shifted) so that the most likely genotype has a PL of 0. Qualitatively, subsetting corresponds to the belief that the filtered alleles are not real so we should discard any probability mass associated with them.

The subset algorithm would produce the following:

GT: 1/1
GQ: 980
AD: 0,50

0 | 980
1 | 980    0
  +-----------
     0      1

In summary:

  • GT: Set to most likely genotype based on the PLs ignoring the filtered allele(s).

  • AD: The filtered alleles’ columns are eliminated, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 25,20.

  • DP: Unchanged.

  • PL: Columns involving filtered alleles are eliminated and the remaining columns’ values are shifted so the minimum value is 0.

  • GQ: The second-lowest PL (after shifting).

Warning

filter_alleles_hts() does not update any row fields other than locus and alleles. This means that row fields like allele count (AC) can become meaningless unless they are also updated. You can update them with annotate_rows().

See also

filter_alleles()

Parameters:
  • mt (MatrixTable) – Dataset with the standard high-throughput sequencing entry schema.

  • f – Function from (allele: StringExpression, allele_index: Int32Expression) to BooleanExpression.

  • subset (bool) – Use the subset algorithm if True, the downcode algorithm if False.
Returns:

MatrixTable

hail.methods.hwe_normalized_pca(call_expr, k=10, compute_loadings=False)[source]

Run principal component analysis (PCA) on the Hardy-Weinberg-normalized genotype call matrix.

Examples

>>> eigenvalues, scores, loadings = hl.hwe_normalized_pca(dataset.GT, k=5)

Notes

This method specializes pca() for the common use case of PCA in statistical genetics, that of projecting samples to a small number of ancestry coordinates. Variants that are all homozygous reference or all homozygous alternate are unnormalizable and removed before evaluation. See pca() for more details.

Users of PLINK/GCTA should be aware that Hail computes the GRM slightly differently with regard to missing data. In Hail, the \(ij\) entry of the GRM \(MM^T\) is simply the dot product of rows \(i\) and \(j\) of \(M\); in terms of \(C\) it is

\[\frac{1}{m}\sum_{l\in\mathcal{C}_i\cap\mathcal{C}_j}\frac{(C_{il}-2p_l)(C_{jl} - 2p_l)}{2p_l(1-p_l)}\]

where \(\mathcal{C}_i = \{l \mid C_{il} \text{ is non-missing}\}\). In PLINK/GCTA the denominator \(m\) is replaced with the number of terms in the sum \(\lvert\mathcal{C}_i\cap\mathcal{C}_j\rvert\), i.e. the number of variants where both samples have non-missing genotypes. While this is arguably a better estimator of the true GRM (trading shrinkage for noise), it has the drawback that one loses the clean interpretation of the loadings and scores as features and projections.

Separately, for the PCs PLINK/GCTA output the eigenvectors of the GRM, i.e. the left singular vectors \(U_k\) instead of the component scores \(U_k S_k\). The scores have the advantage of representing true projections of the data onto features with the variance of a score reflecting the variance explained by the corresponding feature. In PC bi-plots this amounts to a change in aspect ratio; for use of PCs as covariates in regression it is immaterial.
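
The column scores are commonly joined back onto the dataset, for example to serve as covariates in downstream regressions. A minimal sketch (this assumes the returned scores table is keyed by the dataset's column key and stores the principal components in a field named scores):

>>> eigenvalues, scores, loadings = hl.hwe_normalized_pca(dataset.GT, k=5)
>>> dataset_with_pcs = dataset.annotate_cols(pca_scores=scores[dataset.col_key].scores)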

Parameters:
  • call_expr (CallExpression) – Entry-indexed call expression.

  • k (int) – Number of principal components.

  • compute_loadings (bool) – If True, compute row loadings.

Returns:

(list of float, Table, Table) – List of eigenvalues, table with column scores, table with row loadings.

hail.methods.genetic_relatedness_matrix(call_expr)[source]

Compute the genetic relatedness matrix (GRM).

Examples

>>> grm = hl.genetic_relatedness_matrix(dataset.GT)

Notes

The genetic relationship matrix (GRM) \(G\) encodes genetic correlation between each pair of samples. It is defined by \(G = MM^T\) where \(M\) is a standardized version of the genotype matrix, computed as follows. Let \(C\) be the \(n \times m\) matrix of raw genotypes in the variant dataset, with rows indexed by \(n\) samples and columns indexed by \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as

\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{2p_j(1-p_j)m}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in Hardy-Weinberg equilibrium and is further motivated in the paper Patterson, Price and Reich, 2006. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) so that the diagonal entries of the GRM are approximately 1. Equivalently,

\[G_{ik} = \frac{1}{m} \sum_{j=1}^m \frac{(C_{ij}-2p_j)(C_{kj}-2p_j)}{2 p_j (1-p_j)}\]

This method drops variants with \(p_j = 0\) or \(p_j = 1\) before computing kinship.
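
For example, the resulting block matrix can be collected locally as a NumPy array for use with external tools (practical only when the number of samples is modest):

>>> grm = hl.genetic_relatedness_matrix(dataset.GT)
>>> grm_np = grm.to_numpy()  # n_samples x n_samples symmetric matrix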

Parameters:

call_expr (CallExpression) – Entry-indexed call expression with columns corresponding to samples.

Returns:

BlockMatrix – Genetic relatedness matrix for all samples. Row and column indices correspond to matrix table column index.

hail.methods.realized_relationship_matrix(call_expr)[source]

Computes the realized relationship matrix (RRM).

Examples

>>> rrm = hl.realized_relationship_matrix(dataset.GT)

Notes

The realized relationship matrix (RRM) is defined as follows. Consider the \(n \times m\) matrix \(C\) of raw genotypes, with rows indexed by \(n\) samples and columns indexed by the \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as

\[M_{ij} = \frac{C_{ij}-2p_j} {\sqrt{\frac{m}{n} \sum_{k=1}^n (C_{kj}-2p_j)^2}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes each variant column to have empirical variance \(1/m\), which gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the \(n \times n\) sample correlation or realized relationship matrix (RRM) \(K\) as simply

\[K = MM^T\]

Note that the only difference between the realized relationship matrix and the genetic relatedness matrix (GRM) used in realized_relationship_matrix() is the variant (column) normalization: where RRM uses empirical variance, GRM uses expected variance under Hardy-Weinberg Equilibrium.

This method drops variants with zero variance before computing kinship.

Parameters:

call_expr (CallExpression) – Entry-indexed call expression on matrix table with columns corresponding to samples.

Returns:

BlockMatrix – Realized relationship matrix for all samples. Row and column indices correspond to matrix table column index.

hail.methods.impute_sex(call, aaf_threshold=0.0, include_par=False, female_threshold=0.2, male_threshold=0.8, aaf=None)[source]

Impute sex of samples by calculating inbreeding coefficient on the X chromosome.

Note

Requires the dataset to have a compound row key: locus (tlocus) and alleles (tarray of tstr).

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Remove samples where imputed sex does not equal reported sex:

>>> imputed_sex = hl.impute_sex(dataset.GT)
>>> dataset_result = dataset.filter_cols(imputed_sex[dataset.s].is_female != dataset.pheno.is_female,
...                                      keep=False)

Notes

We have used the same implementation as PLINK v1.7.

Let gr be the reference genome of the type of the locus key (as given by tlocus.reference_genome).

  1. Filter the dataset to loci on the X contig defined by gr.

  2. Calculate alternate allele frequency (AAF) for each row from the dataset.

  3. Filter to variants with AAF above aaf_threshold.

  4. Remove loci in the pseudoautosomal region, as defined by gr, unless include_par is True (it defaults to False).

  5. For each row and column with a non-missing genotype call, \(E\), the expected number of homozygotes (from population AAF), is computed as \(1.0 - (2.0*\mathrm{maf}*(1.0-\mathrm{maf}))\).

  6. For each row and column with a non-missing genotype call, \(O\), the observed number of homozygotes, is computed by interpreting 0 as heterozygote and 1 as homozygote.

  7. For each row and column with a non-missing genotype call, \(N\) is incremented by 1.

  8. For each column, \(E\), \(O\), and \(N\) are combined across variants.

  9. For each column, \(F\) is calculated as \((O - E) / (N - E)\) (a worked example follows this list).

  10. A sex is assigned to each sample with the following criteria: female when F < 0.2, male when F > 0.8. Use female_threshold and male_threshold to change this behavior.
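
As a worked example of step 9, a sample with \(O = 90\) observed homozygotes, \(E = 85\) expected homozygotes, and \(N = 100\) called variants has \(F = (90 - 85) / (100 - 85) \approx 0.33\); this lies between the default thresholds, so its imputed sex is undetermined and is_female is missing.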

Annotations

The returned column-key indexed Table has the following fields in addition to the matrix table’s column keys:

  • is_female (tbool) – True if the imputed sex is female, false if male, missing if undetermined.

  • f_stat (tfloat64) – Inbreeding coefficient.

  • n_called (tint64) – Number of variants with a genotype call.

  • expected_homs (tfloat64) – Expected number of homozygotes.

  • observed_homs (tint64) – Observed number of homozygotes.

Parameters:
  • call (CallExpression) – A genotype call for each row and column. The source dataset’s row keys must be [[locus], alleles] with types tlocus and tarray of tstr. Moreover, the alleles array must have exactly two elements (i.e. the variant must be biallelic).

  • aaf_threshold (float) – Minimum alternate allele frequency threshold.

  • include_par (bool) – Include pseudoautosomal regions.

  • female_threshold (float) – Samples are called females if F < female_threshold.

  • male_threshold (float) – Samples are called males if F > male_threshold.

  • aaf (str or None) – A field defining the alternate allele frequency for each row. If None, AAF will be computed from call.

Returns:

Table – Sex imputation statistics per sample.

hail.methods.ld_matrix(entry_expr, locus_expr, radius, coord_expr=None, block_size=None)[source]

Computes the windowed correlation (linkage disequilibrium) matrix between variants.

Examples

Consider the following dataset consisting of three variants with centimorgan coordinates and four samples:

>>> data = [{'v': '1:1:A:C',       'cm': 0.1, 's': 'a', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'd', 'GT': hl.Call([1, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'b', 'GT': hl.Call([1, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'd', 'GT': hl.Call([0, 0])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'c', 'GT': hl.Call([1, 1])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'd', 'GT': hl.missing(hl.tcall)}]
>>> ht = hl.Table.parallelize(data, hl.dtype('struct{v: str, s: str, cm: float64, GT: call}'))
>>> ht = ht.transmute(**hl.parse_variant(ht.v))
>>> mt = ht.to_matrix_table(row_key=['locus', 'alleles'], col_key=['s'], row_fields=['cm'])

Compute linkage disequilibrium between all pairs of variants on the same contig and within two megabases:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=2e6)
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
       [-0.85280287,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  1.        ]])

Within one megabase:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1e6)
>>> ld.to_numpy()
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Within one centimorgan:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1.0, coord_expr=mt.cm)
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
       [-0.85280287,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  1.        ]])

Within one centimorgan, and only calculate the upper triangle:

>>> ld = hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, radius=1.0, coord_expr=mt.cm)
>>> ld = ld.sparsify_triangle()
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
       [ 0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  1.        ]])

Notes

This method sparsifies the result of row_correlation() using linalg.utils.locus_windows() and BlockMatrix.sparsify_row_intervals() in order to only compute linkage disequilibrium between nearby variants. Use row_correlation() directly to calculate correlation without windowing.

More precisely, variants are 0-indexed by their order in the matrix table (see add_row_index()). Each variant is regarded as a vector of elements defined by entry_expr, typically the number of alternate alleles or genotype dosage. Missing values are mean-imputed within variant.

The method produces a symmetric block-sparse matrix supported in a neighborhood of the diagonal. If variants \(i\) and \(j\) are on the same contig and within radius base pairs (inclusive) then the \((i, j)\) element is their Pearson correlation coefficient. Otherwise, the \((i, j)\) element is 0.0.

Rows with a constant value (i.e., zero variance) will result in nan correlation values. To avoid this, first check that all variants vary or filter out constant variants (for example, with the help of aggregators.stats()).

If the global_position() on locus_expr is not in ascending order, this method will fail. Ascending order should hold for a matrix table keyed by locus or variant (and the associated row table), or for a table that’s been ordered by locus_expr.

Set coord_expr to use a value other than position to define the windows. This row-indexed numeric expression must be non-missing, non-nan, on the same source as locus_expr, and ascending with respect to locus position for each contig; otherwise the method will raise an error.

Warning

See the warnings in row_correlation(). In particular, for large matrices it may be preferable to run its stages separately.

entry_expr and locus_expr are implicitly aligned by row-index, though they need not be on the same source. If their sources differ in the number of rows, an error will be raised; otherwise, unintended misalignment may silently produce unexpected results.

Parameters:
  • entry_expr (Float64Expression) – Entry-indexed numeric expression on matrix table.

  • locus_expr (LocusExpression) – Row-indexed locus expression on a table or matrix table that is row-aligned with the matrix table of entry_expr.

  • radius (int or float) – Radius of window for row values.

  • coord_expr (Float64Expression, optional) – Row-indexed numeric expression for the row value on the same table or matrix table as locus_expr. By default, the row value is given by the locus position.

  • block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size().

Returns:

BlockMatrix – Windowed correlation matrix between variants. Row and column indices correspond to matrix table variant index.

hail.methods.ld_prune(call_expr, r2=0.2, bp_window_size=1000000, memory_per_core=256, keep_higher_maf=True, block_size=None)[source]

Returns a maximal subset of variants that are nearly uncorrelated within each window.

Note

Requires the dataset to contain only diploid genotype calls.

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Note

Requires the dataset to have a compound row key: locus (tlocus) and alleles (tarray of tstr).

Examples

Prune variants in linkage disequilibrium by filtering a dataset to those variants returned by ld_prune(). If the dataset contains multiallelic variants, the multiallelic variants must be filtered out or split before being passed to ld_prune().

>>> biallelic_dataset = dataset.filter_rows(hl.len(dataset.alleles) == 2)
>>> pruned_variant_table = hl.ld_prune(biallelic_dataset.GT, r2=0.2, bp_window_size=500000)
>>> filtered_ds = dataset.filter_rows(hl.is_defined(pruned_variant_table[dataset.row_key]))

Notes

This method finds a maximal subset of variants such that the squared Pearson correlation coefficient \(r^2\) of any pair at most bp_window_size base pairs apart is strictly less than r2. Each variant is represented as a vector over samples with elements given by the (mean-imputed) number of alternate alleles. In particular, even if present, phase information is ignored. Variants that do not vary across samples are dropped.

The method prunes variants in linkage disequilibrium in three stages.

  • The first, “local pruning” stage prunes correlated variants within each partition, using a local variant queue whose size is determined by memory_per_core. A larger queue may facilitate more local pruning in this stage. Minor allele frequency is not taken into account. The parallelism is the number of matrix table partitions.

  • The second, “global correlation” stage uses block-sparse matrix multiplication to compute correlation between each pair of remaining variants within bp_window_size base pairs, and then forms a graph of correlated variants. The parallelism of writing the locally-pruned matrix table as a block matrix is n_locally_pruned_variants / block_size.

  • The third, “global pruning” stage applies maximal_independent_set() to prune variants from this graph until no edges remain. This algorithm iteratively removes the variant with the highest vertex degree. If keep_higher_maf is true, then in the case of a tie for highest degree, the variant with lowest minor allele frequency is removed.

Warning

The locally-pruned matrix table and block matrix are stored as temporary files on persistent disk. See the warnings on BlockMatrix.from_entry_expr with regard to memory and Hadoop replication errors.

Parameters:
  • call_expr (CallExpression) – Entry-indexed call expression on a matrix table with row-indexed variants and column-indexed samples.

  • r2 (float) – Squared correlation threshold (exclusive upper bound). Must be in the range [0.0, 1.0].

  • bp_window_size (int) – Window size in base pairs (inclusive upper bound).

  • memory_per_core (int) – Memory in MB per core for local pruning queue.

  • keep_higher_maf (bool) – If True, break ties at each step of the global pruning stage by preferring to keep variants with higher minor allele frequency.

  • block_size (int, optional) – Block size for block matrices in the second stage. Default given by BlockMatrix.default_block_size().

Returns:

Table – Table of a maximal independent set of variants.

hail.methods.compute_charr(ds, min_af=0.05, max_af=0.95, min_dp=10, max_dp=100, min_gq=20, ref_AF=None)[source]

Compute CHARR, the DNA sample contamination estimator.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

Notes

The returned table has the sample ID field, plus the field:

  • charr (float64): CHARR contamination estimation.
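
A minimal invocation might look as follows (a sketch; as the parameter list below notes, ref_AF must be supplied when the dataset has fewer than roughly 10,000 samples):

>>> charr_result = hl.compute_charr(mt)  
>>> charr_result.show(5)  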

Note

It is possible to use gnomAD reference allele frequencies with the following:

>>> gnomad_sites = hl.experimental.load_dataset('gnomad_genome_sites', version='3.1.2') 
>>> charr_result = hl.compute_charr(mt, ref_af=(1 - gnomad_sites[mt.row_key].freq[0].AF)) 

If the dataset is loaded from a GVCF and has NON_REF alleles, drop the last allele with the following, or load it with the Hail VCF combiner:

>>> mt = mt.key_rows_by(locus=mt.locus, alleles=mt.alleles[:-1])
Parameters:
  • ds (MatrixTable or VariantDataset) – Dataset.

  • min_af – Minimum reference allele frequency to filter variants.

  • max_af – Maximum reference allele frequency to filter variants.

  • min_dp – Minimum sequencing depth to filter variants.

  • max_dp – Maximum sequencing depth to filter variants.

  • min_gq – Minimum genotype quality to filter variants

  • ref_AF – Reference AF expression. Necessary when the sample size is below 10,000.

Returns:

Table

hail.methods.mendel_errors(call, pedigree)[source]

Find Mendel errors; count per variant, individual and nuclear family.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key: locus (tlocus) and alleles (tarray of tstr).

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Find all violations of Mendelian inheritance in each (dad, mom, kid) trio in a pedigree and return four tables (all errors, errors by family, errors by individual, errors by variant):

>>> ped = hl.Pedigree.read('data/trios.fam')
>>> all_errors, per_fam, per_sample, per_variant = hl.mendel_errors(dataset['GT'], ped)

Export all mendel errors to a text file:

>>> all_errors.export('output/all_mendel_errors.tsv')

Annotate columns with the number of Mendel errors:

>>> annotated_samples = dataset.annotate_cols(mendel=per_sample[dataset.s])

Annotate rows with the number of Mendel errors:

>>> annotated_variants = dataset.annotate_rows(mendel=per_variant[dataset.locus, dataset.alleles])

Notes

The example above returns four tables, which contain Mendelian violations grouped in various ways. These tables are modeled after the PLINK mendel formats, resembling the .mendel, .fmendel, .imendel, and .lmendel formats, respectively.

First table: all Mendel errors. This table contains one row per Mendel error, keyed by the variant and proband id.

  • locus (tlocus) – Variant locus, key field.

  • alleles (tarray of tstr) – Variant alleles, key field.

  • (column key of dataset) (tstr) – Proband ID, key field.

  • fam_id (tstr) – Family ID.

  • mendel_code (tint32) – Mendel error code, see below.

Second table: errors per nuclear family. This table contains one row per nuclear family, keyed by the parents.

  • pat_id (tstr) – Paternal ID. (key field)

  • mat_id (tstr) – Maternal ID. (key field)

  • fam_id (tstr) – Family ID.

  • children (tint32) – Number of children in this nuclear family.

  • errors (tint64) – Number of Mendel errors in this nuclear family.

  • snp_errors (tint64) – Number of Mendel errors at SNPs in this nuclear family.

Third table: errors per individual. This table contains one row per individual. Each error is counted toward the proband, father, and mother according to the Implicated column in the table below.

  • (column key of dataset) (tstr) – Sample ID (key field).

  • fam_id (tstr) – Family ID.

  • errors (tint64) – Number of Mendel errors involving this individual.

  • snp_errors (tint64) – Number of Mendel errors involving this individual at SNPs.

Fourth table: errors per variant.

  • locus (tlocus) – Variant locus, key field.

  • alleles (tarray of tstr) – Variant alleles, key field.

  • errors (tint64) – Number of Mendel errors in this variant.

This method only considers complete trios (two parents and proband with defined sex). The code of each Mendel error is determined by the table below, extending the PLINK classification.

In the table, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region (PAR) of X and Y defined by the reference genome and the autosome is defined by in_autosome().

  • Auto – in autosome or in PAR or female child

  • HemiX – in non-PAR of X and male child

  • HemiY – in non-PAR of Y and male child

Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.

Code | Dad     | Mom     | Kid    | Copy State | Implicated
-----+---------+---------+--------+------------+---------------
1    | HomVar  | HomVar  | Het    | Auto       | Dad, Mom, Kid
2    | HomRef  | HomRef  | Het    | Auto       | Dad, Mom, Kid
3    | HomRef  | ~HomRef | HomVar | Auto       | Dad, Kid
4    | ~HomRef | HomRef  | HomVar | Auto       | Mom, Kid
5    | HomRef  | HomRef  | HomVar | Auto       | Kid
6    | HomVar  | ~HomVar | HomRef | Auto       | Dad, Kid
7    | ~HomVar | HomVar  | HomRef | Auto       | Mom, Kid
8    | HomVar  | HomVar  | HomRef | Auto       | Kid
9    | Any     | HomVar  | HomRef | HemiX      | Mom, Kid
10   | Any     | HomRef  | HomVar | HemiX      | Mom, Kid
11   | HomVar  | Any     | HomRef | HemiY      | Dad, Kid
12   | HomRef  | Any     | HomVar | HemiY      | Dad, Kid

Parameters:
  • call (CallExpression) – Genotype call for each row and column.

  • pedigree (Pedigree) – Sample pedigree.
Returns:

(Table, Table, Table, Table)

hail.methods.de_novo(mt, pedigree, pop_frequency_prior, *, min_gq=20, min_p=0.05, max_parent_ab=0.05, min_child_ab=0.2, min_dp_ratio=0.1, ignore_in_sample_allele_frequency=False)[source]

Call putative de novo events from trio data.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key: locus (tlocus) and alleles (tarray of tstr).

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Call de novo events:

>>> pedigree = hl.Pedigree.read('data/trios.fam')
>>> priors = hl.import_table('data/gnomadFreq.tsv', impute=True)
>>> priors = priors.transmute(**hl.parse_variant(priors.Variant)).key_by('locus', 'alleles')
>>> de_novo_results = hl.de_novo(dataset, pedigree, pop_frequency_prior=priors[dataset.row_key].AF)

Notes

This method assumes the GATK high-throughput sequencing fields exist: GT, AD, DP, GQ, PL.

This method replicates the functionality of Kaitlin Samocha’s de novo caller. The version corresponding to git commit bde3e40 is implemented in Hail with her permission and assistance.

This method produces a Table with the following fields:

  • locus (locus) – Variant locus.

  • alleles (array<str>) – Variant alleles.

  • id (str) – Proband sample ID.

  • prior (float64) – Site frequency prior. It is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7. If the ignore_in_sample_allele_frequency parameter is True, then the computed allele frequency is not included in the calculation, and the prior is the maximum of the pop_frequency_prior and 1 / 3e7.

  • proband (struct) – Proband column fields from mt.

  • father (struct) – Father column fields from mt.

  • mother (struct) – Mother column fields from mt.

  • proband_entry (struct) – Proband entry fields from mt.

  • father_entry (struct) – Father entry fields from mt.

  • mother_entry (struct) – Mother entry fields from mt.

  • is_female (bool) – True if proband is female.

  • p_de_novo (float64) – Unfiltered posterior probability that the event is de novo rather than a missed heterozygous event in a parent.

  • confidence (str) – Validation confidence. One of: 'HIGH', 'MEDIUM', 'LOW'.

The key of the table is ['locus', 'alleles', 'id'].

The model looks for de novo events in which both parents are homozygous reference and the proband is heterozygous. The model makes the simplifying assumption that when this configuration x = (AA, AA, AB) of calls occurs, exactly one of the following is true:

  • d: a de novo mutation occurred in the proband and all calls are accurate.

  • m: at least one parental allele is actually heterozygous and the proband call is accurate.

We can then estimate the posterior probability of a de novo mutation as:

\[\mathrm{P_{\text{de novo}}} = \frac{\mathrm{P}(d \mid x)}{\mathrm{P}(d \mid x) + \mathrm{P}(m \mid x)}\]

Applying Bayes rule to the numerator and denominator yields

\[\frac{\mathrm{P}(x \mid d)\,\mathrm{P}(d)}{\mathrm{P}(x \mid d)\,\mathrm{P}(d) + \mathrm{P}(x \mid m)\,\mathrm{P}(m)}\]

The prior on de novo mutation is estimated from the rate in the literature:

\[\mathrm{P}(d) = \frac{1 \, \text{mutation}}{30{,}000{,}000 \, \text{bases}}\]

The prior used for at least one alternate allele between the parents depends on the alternate allele frequency:

\[\mathrm{P}(m) = 1 - (1 - AF)^4\]
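
For example, with a prior alternate allele frequency of \(AF = 0.001\), \(\mathrm{P}(m) = 1 - (1 - 0.001)^4 \approx 0.004\), roughly five orders of magnitude larger than \(\mathrm{P}(d)\).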

The likelihoods \(\mathrm{P}(x \mid d)\) and \(\mathrm{P}(x \mid m)\) are computed from the PL (genotype likelihood) fields using these factorizations:

\[\mathrm{P}(x = (AA, AA, AB) \mid d) = \left( \begin{aligned} &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AA) \\ {} \cdot {} &\mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AA) \\ {} \cdot {} &\mathrm{P}(x_{\mathrm{proband}} = AB \mid \mathrm{proband} = AB) \end{aligned} \right) \]
\[\begin{aligned} \mathrm{P}(x = (AA, AA, AB) \mid m) = &\left( \begin{aligned} &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AB) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AA) \\ {} + {} &\mathrm{P}(x_{\mathrm{father}} = AA \mid \mathrm{father} = AA) \cdot \mathrm{P}(x_{\mathrm{mother}} = AA \mid \mathrm{mother} = AB) \end{aligned} \right) \\ &{} \cdot \mathrm{P}(x_{\mathrm{proband}} = AB \mid \mathrm{proband} = AB) \end{aligned} \]

(Technically, the second factorization assumes there is exactly (rather than at least) one alternate allele among the parents, which may be justified on the grounds that it is typically the most likely case by far.)

While this posterior probability is a good metric for grouping putative de novo mutations by validation likelihood, there exist error modes in high-throughput sequencing data that are not appropriately accounted for by the phred-scaled genotype likelihoods. To this end, a number of hard filters are applied in order to assign validation likelihood.

These filters are different for SNPs and insertions/deletions. In the below rules, the following variables are used:

  • DR refers to the ratio of the read depth in the proband to the combined read depth in the parents.

  • DP refers to the read depth (DP field) of the proband.

  • AB refers to the read allele balance of the proband (number of alternate reads divided by total reads).

  • AC refers to the count of alternate alleles across all individuals in the dataset at the site.

  • p refers to \(\mathrm{P_{\text{de novo}}}\).

  • min_p refers to the min_p function parameter.

HIGH-quality SNV:

(p > 0.99) AND (AB > 0.3) AND (AC == 1)
    OR
(p > 0.99) AND (AB > 0.3) AND (DR > 0.2)
    OR
(p > 0.5) AND (AB > 0.3) AND (AC < 10) AND (DP > 10)

MEDIUM-quality SNV:

(p > 0.5) AND (AB > 0.3)
    OR
(AC == 1)

LOW-quality SNV:

(AB > 0.2)

HIGH-quality indel:

(p > 0.99) AND (AB > 0.3) AND (AC == 1)

MEDIUM-quality indel:

(p > 0.5) AND (AB > 0.3) AND (AC < 10)

LOW-quality indel:

(AB > 0.2)

Additionally, de novo candidates are not considered if the proband GQ is smaller than the min_gq parameter, if the proband allele balance is lower than the min_child_ab parameter, if the depth ratio between the proband and parents is smaller than the min_dp_ratio parameter, if the allele balance in a parent is above the max_parent_ab parameter, or if the posterior probability p is smaller than the min_p parameter.
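For illustration only, the HIGH-quality SNV rule above translates directly into a boolean expression; the values below are hypothetical placeholders for a single candidate, not fields produced by de_novo():

>>> # Hypothetical per-candidate values: posterior, allele balance, depth ratio, allele count, depth
>>> p, ab, dr, ac, dp = 0.995, 0.42, 0.25, 1, 18
>>> is_high_quality_snv = (
...     ((p > 0.99) and (ab > 0.3) and (ac == 1))
...     or ((p > 0.99) and (ab > 0.3) and (dr > 0.2))
...     or ((p > 0.5) and (ab > 0.3) and (ac < 10) and (dp > 10)))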

Parameters:
  • mt (MatrixTable) – High-throughput sequencing dataset.

  • pedigree (Pedigree) – Sample pedigree.

  • pop_frequency_prior (Float64Expression) – Expression for population alternate allele frequency prior.

  • min_gq – Minimum proband GQ to be considered for de novo calling.

  • min_p – Minimum posterior probability to be considered for de novo calling.

  • max_parent_ab – Maximum parent allele balance.

  • min_child_ab – Minimum proband allele balance.

  • min_dp_ratio – Minimum ratio between proband read depth and parental read depth.

  • ignore_in_sample_allele_frequency – Ignore in-sample allele frequency in computing site prior. Experimental.

Returns:

Table

hail.methods.nirvana(dataset, config, block_size=500000, name='nirvana')[source]

Annotate variants using Nirvana.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

Note

Requires the dataset to have a compound row key: locus (locus) and alleles (array<str>).

nirvana() runs Nirvana on the current dataset and adds a new row field in the location specified by name.

Examples

Add Nirvana annotations to the dataset:

>>> result = hl.nirvana(dataset, "data/nirvana.properties") 

Configuration

nirvana() requires a configuration file. The format is a .properties file, where each line defines a property as a key-value pair of the form key = value. nirvana() supports the following properties:

  • hail.nirvana.dotnet – Location of dotnet. Optional, default: dotnet.

  • hail.nirvana.path – Value of the PATH environment variable when invoking Nirvana. Optional, by default PATH is not set.

  • hail.nirvana.location – Location of Nirvana.dll. Required.

  • hail.nirvana.reference – Location of reference genome. Required.

  • hail.nirvana.cache – Location of cache. Required.

  • hail.nirvana.supplementaryAnnotationDirectory – Location of Supplementary Database. Optional, no supplementary database by default.

Here is an example nirvana.properties configuration file:

hail.nirvana.location = /path/to/dotnet/netcoreapp2.0/Nirvana.dll
hail.nirvana.reference = /path/to/nirvana/References/Homo_sapiens.GRCh37.Nirvana.dat
hail.nirvana.cache = /path/to/nirvana/Cache/GRCh37/Ensembl
hail.nirvana.supplementaryAnnotationDirectory = /path/to/nirvana/SupplementaryDatabase/GRCh37

Annotations

A new row field is added in the location specified by name with the following schema:

struct {
    chromosome: str,
    refAllele: str,
    position: int32,
    altAlleles: array<str>,
    cytogeneticBand: str,
    quality: float64,
    filters: array<str>,
    jointSomaticNormalQuality: int32,
    copyNumber: int32,
    strandBias: float64,
    recalibratedQuality: float64,
    variants: array<struct {
        altAllele: str,
        refAllele: str,
        chromosome: str,
        begin: int32,
        end: int32,
        phylopScore: float64,
        isReferenceMinor: bool,
        variantType: str,
        vid: str,
        hgvsg: str,
        isRecomposedVariant: bool,
        isDecomposedVariant: bool,
        regulatoryRegions: array<struct {
            id: str,
            type: str,
            consequence: set<str>
        }>,
        clinvar: array<struct {
            id: str,
            reviewStatus: str,
            isAlleleSpecific: bool,
            alleleOrigins: array<str>,
            refAllele: str,
            altAllele: str,
            phenotypes: array<str>,
            medGenIds: array<str>,
            omimIds: array<str>,
            orphanetIds: array<str>,
            significance: str,
            lastUpdatedDate: str,
            pubMedIds: array<str>
        }>,
        cosmic: array<struct {
            id: str,
            isAlleleSpecific: bool,
            refAllele: str,
            altAllele: str,
            gene: str,
            sampleCount: int32,
            studies: array<struct {
                id: int32,
                histology: str,
                primarySite: str
            }>
        }>,
        dbsnp: struct {
            ids: array<str>
        },
        globalAllele: struct {
            globalMinorAllele: str,
            globalMinorAlleleFrequency: float64
        },
        gnomad: struct {
            coverage: str,
            allAf: float64,
            allAc: int32,
            allAn: int32,
            allHc: int32,
            afrAf: float64,
            afrAc: int32,
            afrAn: int32,
            afrHc: int32,
            amrAf: float64,
            amrAc: int32,
            amrAn: int32,
            amrHc: int32,
            easAf: float64,
            easAc: int32,
            easAn: int32,
            easHc: int32,
            finAf: float64,
            finAc: int32,
            finAn: int32,
            finHc: int32,
            nfeAf: float64,
            nfeAc: int32,
            nfeAn: int32,
            nfeHc: int32,
            othAf: float64,
            othAc: int32,
            othAn: int32,
            othHc: int32,
            asjAf: float64,
            asjAc: int32,
            asjAn: int32,
            asjHc: int32,
            failedFilter: bool
        },
        gnomadExome: struct {
            coverage: str,
            allAf: float64,
            allAc: int32,
            allAn: int32,
            allHc: int32,
            afrAf: float64,
            afrAc: int32,
            afrAn: int32,
            afrHc: int32,
            amrAf: float64,
            amrAc: int32,
            amrAn: int32,
            amrHc: int32,
            easAf: float64,
            easAc: int32,
            easAn: int32,
            easHc: int32,
            finAf: float64,
            finAc: int32,
            finAn: int32,
            finHc: int32,
            nfeAf: float64,
            nfeAc: int32,
            nfeAn: int32,
            nfeHc: int32,
            othAf: float64,
            othAc: int32,
            othAn: int32,
            othHc: int32,
            asjAf: float64,
            asjAc: int32,
            asjAn: int32,
            asjHc: int32,
            sasAf: float64,
            sasAc: int32,
            sasAn: int32,
            sasHc: int32,
            failedFilter: bool
        },
        topmed: struct {
            failedFilter: bool,
            allAc: int32,
            allAn: int32,
            allAf: float64,
            allHc: int32
        },
        oneKg: struct {
            ancestralAllele: str,
            allAf: float64,
            allAc: int32,
            allAn: int32,
            afrAf: float64,
            afrAc: int32,
            afrAn: int32,
            amrAf: float64,
            amrAc: int32,
            amrAn: int32,
            easAf: float64,
            easAc: int32,
            easAn: int32,
            eurAf: float64,
            eurAc: int32,
            eurAn: int32,
            sasAf: float64,
            sasAc: int32,
            sasAn: int32
        },
        mitomap: array<struct {
            refAllele: str,
            altAllele: str,
            diseases : array<str>,
            hasHomoplasmy: bool,
            hasHeteroplasmy: bool,
            status: str,
            clinicalSignificance: str,
            scorePercentile: float64,
            isAlleleSpecific: bool,
            chromosome: str,
            begin: int32,
            end: int32,
            variantType: str
        }>,
        transcripts: struct {
            refSeq: array<struct {
                transcript: str,
                bioType: str,
                aminoAcids: str,
                cdnaPos: str,
                codons: str,
                cdsPos: str,
                exons: str,
                introns: str,
                geneId: str,
                hgnc: str,
                consequence: array<str>,
                hgvsc: str,
                hgvsp: str,
                isCanonical: bool,
                polyPhenScore: float64,
                polyPhenPrediction: str,
                proteinId: str,
                proteinPos: str,
                siftScore: float64,
                siftPrediction: str
            }>,
            ensembl: array<struct {
                transcript: str,
                bioType: str,
                aminoAcids: str,
                cdnaPos: str,
                codons: str,
                cdsPos: str,
                exons: str,
                introns: str,
                geneId: str,
                hgnc: str,
                consequence: array<str>,
                hgvsc: str,
                hgvsp: str,
                isCanonical: bool,
                polyPhenScore: float64,
                polyPhenPrediction: str,
                proteinId: str,
                proteinPos: str,
                siftScore: float64,
                siftPrediction: str
            }>
        },
        overlappingGenes: array<str>
    }>,
    genes: array<struct {
        name: str,
        omim: array<struct {
            mimNumber: int32,
            hgnc: str,
            description: str,
            phenotypes: array<struct {
                mimNumber: int32,
                phenotype: str,
                mapping: str,
                inheritance: array<str>,
                comments: str
            }>
        }>,
        exac: struct {
            pLi: float64,
            pRec: float64,
            pNull: float64
        }
    }>
}
Parameters:
  • dataset (MatrixTable or Table) – Dataset.

  • config (str) – Path to Nirvana configuration file.

  • block_size (int) – Number of rows to process per Nirvana invocation.

  • name (str) – Name for resulting row field.

Returns:

MatrixTable or Table – Dataset with new row-indexed field name containing Nirvana annotations.

hail.methods.sample_qc(mt, name='sample_qc')[source]

Compute per-sample metrics useful for quality control.

Note

Requires the dataset to have a compound row key: locus (locus) and alleles (array<str>).

Examples

Compute sample QC metrics and remove low-quality samples:

>>> dataset = hl.sample_qc(dataset, name='sample_qc')
>>> filtered_dataset = dataset.filter_cols((dataset.sample_qc.dp_stats.mean > 20) & (dataset.sample_qc.r_ti_tv > 1.5))

Notes

This method computes summary statistics per sample from a genetic matrix and stores the results as a new column-indexed struct field in the matrix, named based on the name parameter.

If mt contains an entry field DP of type tint32, then the field dp_stats is computed. If mt contains an entry field GQ of type tint32, then the field gq_stats is computed. Both dp_stats and gq_stats are structs with four fields:

  • mean (float64) – Mean value.

  • stdev (float64) – Standard deviation (zero degrees of freedom).

  • min (int32) – Minimum value.

  • max (int32) – Maximum value.

If the dataset does not contain an entry field GT of type tcall, then an error is raised. The following fields are always computed from GT:

  • call_rate (float64) – Fraction of calls not missing or filtered. Equivalent to n_called divided by count_rows().

  • n_called (int64) – Number of non-missing calls.

  • n_not_called (int64) – Number of missing calls.

  • n_filtered (int64) – Number of filtered entries.

  • n_hom_ref (int64) – Number of homozygous reference calls.

  • n_het (int64) – Number of heterozygous calls.

  • n_hom_var (int64) – Number of homozygous alternate calls.

  • n_non_ref (int64) – Sum of n_het and n_hom_var.

  • n_snp (int64) – Number of SNP alternate alleles.

  • n_insertion (int64) – Number of insertion alternate alleles.

  • n_deletion (int64) – Number of deletion alternate alleles.

  • n_singleton (int64) – Number of private alleles. Reference alleles are never counted as singletons, even if every other allele at a site is non-reference.

  • n_transition (int64) – Number of transition (A-G, C-T) alternate alleles.

  • n_transversion (int64) – Number of transversion alternate alleles.

  • n_star (int64) – Number of star (upstream deletion) alleles.

  • r_ti_tv (float64) – Transition/Transversion ratio.

  • r_het_hom_var (float64) – Het/HomVar call ratio.

  • r_insertion_deletion (float64) – Insertion/Deletion allele ratio.

Missing values NA may result from division by zero.
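As a usage sketch (separate from the method itself), the nested QC fields can also be aggregated to summarize the cohort, for example:

>>> dataset = hl.sample_qc(dataset)
>>> dataset.aggregate_cols(hl.struct(
...     mean_call_rate=hl.agg.mean(dataset.sample_qc.call_rate),
...     mean_r_ti_tv=hl.agg.mean(dataset.sample_qc.r_ti_tv)))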

Parameters:
  • mt (MatrixTable) – Dataset.

  • name (str) – Name for resulting field.

Returns:

MatrixTable – Dataset with a new column-indexed field name.

hail.methods._logistic_skat(group, weight, y, x, covariates, max_size=46340, null_max_iterations=25, null_tolerance=1e-06, accuracy=1e-06, iterations=10000)[source]

The logistic sequence kernel association test (SKAT).

Logistic SKAT tests if the phenotype, y, is significantly associated with the genotype, x. For \(N\) samples, in a group of \(M\) variants, with \(K\) covariates, the model is given by:

\[\begin{align*} X &: R^{N \times K} \\ G &: \{0, 1, 2\}^{N \times M} \\ \\ Y &\sim \textrm{Bernoulli}(\textrm{logit}^{-1}(\beta_0 X + \beta_1 G)) \end{align*}\]

The usual null hypothesis is \(\beta_1 = 0\). SKAT tests for an association, but does not provide an effect size or other information about the association.

Wu et al. argue that, under the null hypothesis, a particular value, \(Q\), is distributed according to a generalized chi-squared distribution with parameters determined by the genotypes, weights, and residual phenotypes. The SKAT p-value is the probability of drawing even larger values of \(Q\). If \(\widehat{\beta_\textrm{null}}\) is the best-fit beta under the null model:

\[Y \sim \textrm{Bernoulli}(\textrm{logit}^{-1}(\beta_\textrm{null} X))\]

Then \(Q\) is defined by Wu et al. as:

\[\begin{align*} p_i &= \textrm{logit}^{-1}(\widehat{\beta_\textrm{null}} X) \\ r_i &= y_i - p_i \\ W_{ii} &= w_i \\ \\ Q &= r^T G W G^T r \end{align*}\]

Therefore \(r_i\), the residual phenotype, is the portion of the phenotype unexplained by the covariates alone. Also notice:

  1. Each sample’s phenotype is Bernoulli distributed with mean \(p_i\) and variance \(\sigma^2_i = p_i(1 - p_i)\), the binomial variance.

  2. \(G W G^T\) is a symmetric, positive semi-definite matrix when the weights are non-negative.

We describe below our interpretation of the mathematics as described in the main body and appendix of Wu, et al. According to the paper, the distribution of \(Q\) is given by a generalized chi-squared distribution whose weights are the eigenvalues of a symmetric matrix which we call \(Z Z^T\):

\[\begin{align*} V_{ii} &= \sigma^2_i \\ W_{ii} &= w_i \quad\quad \textrm{the weight for variant } i \\ \\ P_0 &= V - V X (X^T V X)^{-1} X^T V \\ Z Z^T &= P_0^{1/2} G W G^T P_0^{1/2} \end{align*}\]

The eigenvalues of \(Z Z^T\) and \(Z^T Z\) are the squared singular values of \(Z\); therefore, we instead focus on \(Z^T Z\). In the expressions below, we elide transpositions of symmetric matrices:

\[\begin{align*} Z Z^T &= P_0^{1/2} G W G^T P_0^{1/2} \\ Z &= P_0^{1/2} G W^{1/2} \\ Z^T Z &= W^{1/2} G^T P_0 G W^{1/2} \end{align*}\]

Before substituting the definition of \(P_0\), simplify it using the reduced QR decomposition:

\[\begin{align*} Q R &= V^{1/2} X \\ R^T Q^T &= X^T V^{1/2} \\ \\ P_0 &= V - V X (X^T V X)^{-1} X^T V \\ &= V - V X (R^T Q^T Q R)^{-1} X^T V \\ &= V - V X (R^T R)^{-1} X^T V \\ &= V - V X R^{-1} (R^T)^{-1} X^T V \\ &= V - V^{1/2} Q (R^T)^{-1} X^T V \\ &= V - V^{1/2} Q Q^T V^{1/2} \\ &= V^{1/2} (I - Q Q^T) V^{1/2} \\ \end{align*}\]

Substitute this simplified expression into \(Z\):

\[\begin{align*} Z^T Z &= W^{1/2} G^T V^{1/2} (I - Q Q^T) V^{1/2} G W^{1/2} \\ \end{align*}\]

Split this symmetric matrix by observing that \(I - Q Q^T\) is idempotent:

\[\begin{align*} I - Q Q^T &= (I - Q Q^T)(I - Q Q^T)^T \\ \\ Z &= (I - Q Q^T) V^{1/2} G W^{1/2} \\ Z &= (V^{1/2} G - Q Q^T V^{1/2} G) W^{1/2} \end{align*}\]

Finally, the squared singular values of \(Z\) are the eigenvalues of \(Z^T Z\), so \(Q\) should be distributed as follows:

\[\begin{align*} U S V^T &= Z \quad\quad \textrm{the singular value decomposition} \\ \lambda_s &= S_{ss}^2 \\ \\ Q &\sim \textrm{GeneralizedChiSquared}(\lambda, \vec{1}, \vec{0}, 0, 0) \end{align*}\]

The reported p-value is the probability, under the null hypothesis, of observing a value of \(Q\) at least as large as the one computed from the data.
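The toy NumPy sketch below traces these formulas end to end on made-up data (5 samples, 3 variants, an intercept-only null model, and unit weights). It is an illustration of the algebra above, not Hail's implementation:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> G = rng.integers(0, 3, size=(5, 3)).astype(float)   # genotypes, N x M
>>> X = np.ones((5, 1))                                  # covariates: intercept only
>>> y = np.array([0., 1., 1., 0., 1.])                   # binary phenotype
>>> p = np.full(5, y.mean())       # intercept-only logistic null fit gives a constant fitted mean
>>> r = y - p                      # residual phenotype
>>> W = np.eye(3)                  # unit variant weights
>>> Q = r @ G @ W @ G.T @ r        # the Q statistic
>>> V_half = np.diag(np.sqrt(p * (1 - p)))               # V^(1/2), binomial variances
>>> Qmat, _ = np.linalg.qr(V_half @ X)                   # reduced QR of V^(1/2) X
>>> Z = (np.eye(5) - Qmat @ Qmat.T) @ V_half @ G @ np.sqrt(W)
>>> lam = np.linalg.svd(Z, compute_uv=False) ** 2        # weights of the generalized chi-squared
>>> # The p-value is P(GeneralizedChiSquared(lam, 1, 0, 0, 0) >= Q), computed in Hail
>>> # with the Davies algorithm (see pgenchisq()).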

The SKAT method was originally described in:

Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011 Jul 15;89(1):82-93. doi: 10.1016/j.ajhg.2011.05.029. Epub 2011 Jul 7. PMID: 21737059; PMCID: PMC3135811. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3135811/

Examples

Generate a dataset with a phenotype noisily computed from the genotypes:

>>> hl.reset_global_randomness()
>>> mt = hl.balding_nichols_model(1, n_samples=100, n_variants=20)
>>> mt = mt.annotate_rows(gene = mt.locus.position // 12)
>>> mt = mt.annotate_rows(weight = 1)
>>> mt = mt.annotate_cols(phenotype = (hl.agg.sum(mt.GT.n_alt_alleles()) - 20 + hl.rand_norm(0, 1)) > 0.5)

Test if the phenotype is significantly associated with the genotype:

>>> skat = hl._logistic_skat(
...     mt.gene,
...     mt.weight,
...     mt.phenotype,
...     mt.GT.n_alt_alleles(),
...     covariates=[1.0])
>>> skat.show()
+-------+-------+----------+----------+-------+
| group |  size |   q_stat |  p_value | fault |
+-------+-------+----------+----------+-------+
| int32 | int64 |  float64 |  float64 | int32 |
+-------+-------+----------+----------+-------+
|     0 |    11 | 1.78e+02 | 1.68e-04 |     0 |
|     1 |     9 | 1.39e+02 | 1.82e-03 |     0 |
+-------+-------+----------+----------+-------+

The same test, but using the original paper’s suggested weights which are derived from the allele frequency.

>>> mt = hl.variant_qc(mt)
>>> skat = hl._logistic_skat(
...     mt.gene,
...     hl.dbeta(mt.variant_qc.AF[0], 1, 25),
...     mt.phenotype,
...     mt.GT.n_alt_alleles(),
...     covariates=[1.0])
>>> skat.show()
+-------+-------+----------+----------+-------+
| group |  size |   q_stat |  p_value | fault |
+-------+-------+----------+----------+-------+
| int32 | int64 |  float64 |  float64 | int32 |
+-------+-------+----------+----------+-------+
|     0 |    11 | 8.04e+00 | 3.50e-01 |     0 |
|     1 |     9 | 1.22e+00 | 5.04e-01 |     0 |
+-------+-------+----------+----------+-------+

Our simulated data was unweighted, so the null hypothesis appears true. In real datasets, we expect the allele frequency to correlate with effect size.

If the fault flag for a group is set to 1, the numerical integration used to calculate the p-value failed to achieve the required accuracy (by default, 1e-6). This tends to happen when the null hypothesis is true and the integration returns a (nonsensical) value greater than one.

The max_size parameter allows us to skip large genes that would cause “out of memory” errors:

>>> skat = hl._logistic_skat(
...     mt.gene,
...     mt.weight,
...     mt.phenotype,
...     mt.GT.n_alt_alleles(),
...     covariates=[1.0],
...     max_size=10)
>>> skat.show()
+-------+-------+----------+----------+-------+
| group |  size |   q_stat |  p_value | fault |
+-------+-------+----------+----------+-------+
| int32 | int64 |  float64 |  float64 | int32 |
+-------+-------+----------+----------+-------+
|     0 |    11 |       NA |       NA |    NA |
|     1 |     9 | 1.39e+02 | 1.82e-03 |     0 |
+-------+-------+----------+----------+-------+

Notes

In the SKAT R package, the “weights” are actually the square root of the weight expression from the paper. This method uses the definition from the paper.

The paper includes an explicit intercept term but this method expects the user to specify the intercept as an extra covariate with the value 1.

This method does not perform small sample size correction.

The q_stat return value is not the \(Q\) statistic from the paper. We match the output of the SKAT R package which returns \(\tilde{Q}\):

\[\tilde{Q} = \frac{Q}{2}\]
Parameters:
  • group (Expression) – Row-indexed expression indicating to which group a variant belongs. This is typically a gene name or an interval.

  • weight (Float64Expression) – Row-indexed expression for weights. Must be non-negative.

  • y (Float64Expression) – Column-indexed response (dependent variable) expression.

  • x (Float64Expression) – Entry-indexed expression for input (independent variable).

  • covariates (list of Float64Expression) – List of column-indexed covariate expressions. You must explicitly provide an intercept term if desired. You must provide at least one covariate.

  • max_size (int) – Maximum size of group on which to run the test. Groups which exceed this size will have a missing p-value and missing q statistic. Defaults to 46340.

  • null_max_iterations (int) – The maximum number of iterations when fitting the logistic null model. Defaults to 25.

  • null_tolerance (float) – The logistic regression fit of the null model converges when the error is less than this value. Defaults to 1e-6.

  • accuracy (float) – The accuracy of the p-value if fault value is zero. Defaults to 1e-6.

  • iterations (int) – The maximum number of iterations used to calculate the p-value (which has no closed form). Defaults to 10000.

Returns:

Table – One row per-group. The key is group. The row fields are:

  • group : the group parameter.

  • size : tint64, the number of variants in this group.

  • q_stat : tfloat64, the \(Q\) statistic, see Notes for why this differs from the paper.

  • p_value : tfloat64, the test p-value for the null hypothesis that the genotypes have no linear influence on the phenotypes.

  • fault : tint32, the fault flag from pgenchisq().

The global fields are:

  • n_complete_samples : tint32, the number of samples with neither a missing phenotype nor a missing covariate.

  • y_residual : tndarray of tfloat64, the residual phenotype from the null model. This may be interpreted as the component of the phenotype not explained by the covariates alone.

  • s2 : tfloat64, the variance of the residuals, \(\sigma^2\) in the paper.

  • null_fit:

    • b : tndarray vector of coefficients.

    • score : tndarray vector of score statistics.

    • fisher : tndarray, the Fisher information matrix from the null model fit.

    • mu : tndarray the expected value under the null model.

    • n_iterations : tint32 the number of iterations before termination.

    • log_lkhd : tfloat64 the log-likelihood of the final iteration.

    • converged : tbool True if the null model converged.

    • exploded : tbool True if the null model failed to converge due to numerical explosion.

hail.methods.skat(key_expr, weight_expr, y, x, covariates, logistic=False, max_size=46340, accuracy=1e-06, iterations=10000)[source]

Test each keyed group of rows for association by linear or logistic SKAT test.

Examples

Test each gene for association using the linear sequence kernel association test:

>>> skat_table = hl.skat(key_expr=burden_ds.gene,
...                      weight_expr=burden_ds.weight,
...                      y=burden_ds.burden.pheno,
...                      x=burden_ds.GT.n_alt_alleles(),
...                      covariates=[1, burden_ds.burden.cov1, burden_ds.burden.cov2])

Caution

By default, the Davies algorithm iterates up to 10k times until an accuracy of 1e-6 is achieved. Hence a reported p-value of zero with no issues may truly be as large as 1e-6. The accuracy and maximum number of iterations may be controlled by the corresponding function parameters. In general, higher accuracy requires more iterations.

Caution

To process a group with \(m\) rows, several copies of an \(m \times m\) matrix of doubles must fit in worker memory. Groups with tens of thousands of rows may exhaust worker memory causing the entire job to fail. In this case, use the max_size parameter to skip groups larger than max_size.

Warning

skat() considers the same set of columns (i.e., samples, points) for every group, namely those columns for which all covariates are defined. For each row, missing values of x are mean-imputed over these columns. As in the example, the intercept covariate 1 must be included explicitly if desired.

Notes

This method provides a scalable implementation of the score-based variance-component test originally described in Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test.

Row weights must be non-negative. Rows with missing weights are ignored. In the R package skat—which assumes rows are variants—default weights are given by evaluating the Beta(1, 25) density at the minor allele frequency. To replicate these weights in Hail using alternate allele frequencies stored in a row-indexed field AF, one can use the expression:

>>> hl.dbeta(hl.min(ds2.AF), 1.0, 25.0) ** 2

In the logistic case, the response y must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.
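For example, the logistic test can be requested by reusing the toy burden_ds from the linear example with a hypothetical Boolean coding of its phenotype:

>>> logistic_skat_table = hl.skat(key_expr=burden_ds.gene,
...                               weight_expr=burden_ds.weight,
...                               y=burden_ds.burden.pheno > 1,  # hypothetical Boolean coding
...                               x=burden_ds.GT.n_alt_alleles(),
...                               covariates=[1, burden_ds.burden.cov1, burden_ds.burden.cov2],
...                               logistic=True)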

The resulting Table provides the group’s key (id), the number of rows in the group (size), the variance component score q_stat, the SKAT p-value, and a fault flag. For the toy example above, the table has the form:

id       size    q_stat    p_value    fault
geneA       2     4.136      0.205        0
geneB       1     5.659      0.195        0
geneC       3     4.122      0.192        0

Groups larger than max_size appear with missing q_stat, p_value, and fault. The hard limit on the number of rows in a group is 46340.

Note that the variance component score q_stat agrees with Q in the R package skat, but both differ from \(Q\) in the paper by the factor \(\frac{1}{2\sigma^2}\) in the linear case and \(\frac{1}{2}\) in the logistic case, where \(\sigma^2\) is the unbiased estimator of residual variance for the linear null model. The R package also applies a “small-sample adjustment” to the null distribution in the logistic case when the sample size is less than 2000. Hail does not apply this adjustment.

The fault flag is an integer indicating whether any issues occurred when running the Davies algorithm to compute the p-value as the right tail of a weighted sum of \(\chi^2(1)\) distributions.

fault value    Description
0              no issues
1              accuracy NOT achieved
2              round-off error possibly significant
3              invalid parameters
4              unable to locate integration parameters
5              out of memory

Parameters:
  • key_expr (Expression) – Row-indexed expression for key associated to each row.

  • weight_expr (Float64Expression) – Row-indexed expression for row weights.

  • y (Float64Expression) – Column-indexed response expression. If logistic is True, all non-missing values must evaluate to 0 or 1. Note that a BooleanExpression will be implicitly converted to a Float64Expression with this property.

  • x (Float64Expression) – Entry-indexed expression for input variable.

  • covariates (list of Float64Expression) – List of column-indexed covariate expressions.

  • logistic (bool or tuple of int and float) – If false, use the linear test. If true, use the logistic test with no more than 25 logistic iterations and a convergence tolerance of 1e-6. If a tuple is given, use the logistic test with the tuple elements as the maximum number of iterations and convergence tolerance, respectively.

  • max_size (int) – Maximum size of group on which to run the test.

  • accuracy (float) – Accuracy achieved by the Davies algorithm if fault value is zero.

  • iterations (int) – Maximum number of iterations attempted by the Davies algorithm.

Returns:

Table – Table of SKAT results.

hail.methods.lambda_gc(p_value, approximate=True)[source]

Compute genomic inflation factor (lambda GC) from an Expression of p-values.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
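A minimal usage sketch on simulated uniform p-values (a hypothetical table, not real association results), for which lambda GC should be close to 1:

>>> ht = hl.utils.range_table(10000)
>>> ht = ht.annotate(p_value=hl.rand_unif(0, 1))   # hypothetical p-values
>>> hl.lambda_gc(ht.p_value)                       # approximately 1 for uniform p-values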

Parameters:
  • p_value (NumericExpression) – Row-indexed numeric expression of p-values.

  • approximate (bool) – If False, computes exact lambda GC (slower and uses more memory).

Returns:

float – Genomic inflation factor (lambda genomic control).

hail.methods.split_multi(ds, keep_star=False, left_aligned=False, *, permit_shuffle=False)[source]

Split multiallelic variants.

Warning

In order to support a wide variety of data types, this function splits only the variants on a MatrixTable, but not the genotypes. Use split_multi_hts() if possible, or split the genotypes yourself using one of the entry modification methods: MatrixTable.annotate_entries(), MatrixTable.select_entries(), MatrixTable.transmute_entries().

The resulting dataset will be keyed by the split locus and alleles.

split_multi() adds the following fields:

  • was_split (bool) – True if this variant was originally multiallelic, otherwise False.

  • a_index (int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with a_index = 1 and 1:100:A:C with a_index = 2.

  • old_locus (locus) – The original, unsplit locus.

  • old_alleles (array<str>) – The original, unsplit alleles.

All other fields are left unchanged.

Warning

This method assumes ds contains at most one non-split variant per locus. This assumption permits the most efficient implementation of the splitting algorithm. If your queries involving split_multi crash with errors about out-of-order keys, this assumption may be violated. Otherwise, this warning likely does not apply to your dataset.

If each locus in ds contains one multiallelic variant and one or more biallelic variants, you can filter to the multiallelic variants, split those, and then combine the split variants with the original biallelic variants.

For example, the following code splits a dataset mt which contains a mixture of split and non-split variants.

>>> bi = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> bi = bi.annotate_rows(a_index=1, was_split=False, old_locus=bi.locus, old_alleles=bi.alleles)
>>> multi = mt.filter_rows(hl.len(mt.alleles) > 2)
>>> split = hl.split_multi(multi)
>>> mt = split.union_rows(bi)

Example

split_multi_hts(), which splits multiallelic variants for the HTS genotype schema and updates the entry fields by downcoding the genotype, is implemented as:

>>> sm = hl.split_multi(ds)
>>> pl = hl.or_missing(
...      hl.is_defined(sm.PL),
...      (hl.range(0, 3).map(lambda i: hl.min(hl.range(0, hl.len(sm.PL))
...                     .filter(lambda j: hl.downcode(hl.unphased_diploid_gt_index_call(j), sm.a_index) == hl.unphased_diploid_gt_index_call(i))
...                     .map(lambda j: sm.PL[j])))))
>>> split_ds = sm.annotate_entries(
...     GT=hl.downcode(sm.GT, sm.a_index),
...     AD=hl.or_missing(hl.is_defined(sm.AD),
...                     [hl.sum(sm.AD) - sm.AD[sm.a_index], sm.AD[sm.a_index]]),
...     DP=sm.DP,
...     PL=pl,
...     GQ=hl.gq_from_pl(pl)).drop('old_locus', 'old_alleles')
Parameters:
  • ds (MatrixTable or Table) – An unsplit dataset.

  • keep_star (bool) – Do not filter out * alleles.

  • left_aligned (bool) – If True, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated.

  • permit_shuffle (bool) – If True, permit a data shuffle to sort out-of-order split results. This will only be required if input data has duplicate loci, one of which contains more than one alternate allele.

Returns:

MatrixTable or Table

hail.methods.split_multi_hts(ds, keep_star=False, left_aligned=False, vep_root='vep', *, permit_shuffle=False)[source]

Split multiallelic variants for datasets that contain one or more fields from a standard high-throughput sequencing entry schema.

struct {
  GT: call,
  AD: array<int32>,
  DP: int32,
  GQ: int32,
  PL: array<int32>,
  PGT: call,
  PID: str
}

For other entry fields, write your own splitting logic using MatrixTable.annotate_entries().

Examples

>>> hl.split_multi_hts(dataset).write('output/split.mt')

Warning

This method assumes ds contains at most one non-split variant per locus. This assumption permits the most efficient implementation of the splitting algorithm. If your queries involving split_multi_hts crash with errors about out-of-order keys, this assumption may be violated. Otherwise, this warning likely does not apply to your dataset.

If each locus in ds contains one multiallelic variant and one or more biallelic variants, you can filter to the multiallelic variants, split those, and then combine the split variants with the original biallelic variants.

For example, the following code splits a dataset mt which contains a mixture of split and non-split variants.

>>> bi = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> bi = bi.annotate_rows(a_index=1, was_split=False)
>>> multi = mt.filter_rows(hl.len(mt.alleles) > 2)
>>> split = hl.split_multi_hts(multi)
>>> mt = split.union_rows(bi)

Notes

We will explain by example. Consider a hypothetical 3-allelic variant:

A   C,T 0/2:7,2,6:15:45:99,50,99,0,45,99

split_multi_hts() will create two biallelic variants (one for each alternate allele) at the same position

A   C   0/0:13,2:15:45:0,45,99
A   T   0/1:9,6:15:50:50,0,99

Each multiallelic GT or PGT field is downcoded once for each alternate allele. A call for an alternate allele maps to 1 in the biallelic variant corresponding to itself and 0 otherwise. For example, in the example above, 0/2 maps to 0/0 and 0/1. The genotype 1/2 maps to 0/1 and 0/1.
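This downcoding rule can be checked directly with hl.downcode(), the same function used in the split_multi() example (here allele index 1 is C and index 2 is T):

>>> hl.eval(hl.downcode(hl.call(0, 2), 1))   # 0/2 with respect to C, downcodes to 0/0
>>> hl.eval(hl.downcode(hl.call(0, 2), 2))   # 0/2 with respect to T, downcodes to 0/1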

The biallelic alt AD entry is just the multiallelic AD entry corresponding to the alternate allele. The ref AD entry is the sum of the other multiallelic entries.

The biallelic DP is the same as the multiallelic DP.

The biallelic PL entry for a genotype g is the minimum over PL entries for multiallelic genotypes that downcode to g. For example, the PL for (A, T) at 0/1 is the minimum of the PLs for 0/1 (50) and 1/2 (45), and thus 45.

Fixing an alternate allele and biallelic variant, downcoding gives a map from multiallelic to biallelic alleles and genotypes. The biallelic AD entry for an allele is just the sum of the multiallelic AD entries for alleles that map to that allele. Similarly, the biallelic PL entry for a genotype is the minimum over multiallelic PL entries for genotypes that map to that genotype.

GQ is recomputed from PL if PL is provided and is not missing. If not, it is copied from the original GQ.

Here is a second example for a het non-ref

A   C,T 1/2:2,8,6:16:45:99,50,99,45,0,99

splits as

A   C   0/1:8,8:16:45:45,0,99
A   T   0/1:10,6:16:50:50,0,99

VCF Info Fields

Hail does not split fields in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index field a_index can be used to select the value corresponding to the split allele’s position:

>>> split_ds = hl.split_multi_hts(dataset)
>>> split_ds = split_ds.filter_rows(split_ds.info.AC[split_ds.a_index - 1] < 10,
...                                 keep = False)

VCFs split by Hail and exported to new VCFs may be incompatible with other tools, if action is not taken first. Since the “Number” of the arrays in split multiallelic sites no longer matches the structure on import (“A” for 1 per allele, for example), Hail will export these fields with number “.”.

If the desired output is one value per site, then it is possible to use MatrixTable.annotate_rows() to remap these values. Here is an example:

>>> split_ds = hl.split_multi_hts(dataset)
>>> split_ds = split_ds.annotate_rows(info = split_ds.info.annotate(AC = split_ds.info.AC[split_ds.a_index - 1]))
>>> hl.export_vcf(split_ds, 'output/export.vcf') 

The info field AC in output/export.vcf will have Number=1.

New Fields

split_multi_hts() adds the following fields:

  • was_split (bool) – True if this variant was originally multiallelic, otherwise False.

  • a_index (int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with a_index = 1 and 1:100:A:C with a_index = 2.

See also

split_multi()

Parameters:
  • ds (MatrixTable or Table) – An unsplit dataset.

  • keep_star (bool) – Do not filter out * alleles.

  • left_aligned (bool) – If True, variants are assumed to be left aligned and have unique loci. This avoids a shuffle. If the assumption is violated, an error is generated.

  • vep_root (str) – Top-level location of vep data. All variable-length VEP fields (intergenic_consequences, motif_feature_consequences, regulatory_feature_consequences, and transcript_consequences) will be split properly (i.e. a_index corresponding to the VEP allele_num).

  • permit_shuffle (bool) – If True, permit a data shuffle to sort out-of-order split results. This will only be required if input data has duplicate loci, one of which contains more than one alternate allele.

Returns:

MatrixTable or Table – A biallelic variant dataset.

hail.methods.summarize_variants(mt, show=True, *, handler=None)[source]

Summarize the variants present in a dataset and print the results.

Examples

>>> hl.summarize_variants(dataset)  
==============================
Number of variants: 346
==============================
Alleles per variant
-------------------
  2 alleles: 346 variants
==============================
Variants per contig
-------------------
  20: 346 variants
==============================
Allele type distribution
------------------------
        SNP: 301 alleles
   Deletion: 27 alleles
  Insertion: 18 alleles
==============================
Parameters:
  • mt (MatrixTable or Table) – Matrix table with a variant (locus / alleles) row key.

  • show (bool) – If True, print results instead of returning them.

  • handler

Notes

The result returned if show is False is a Struct with five fields:

  • n_variants (int): Number of variants present in the matrix table.

  • allele_types (dict [str, int]): Number of alternate alleles in each allele category.

  • contigs (dict [str, int]): Number of variants on each contig.

  • allele_counts (dict [int, int]): Number of variants broken down by number of alleles (biallelic is 2, for example).

  • r_ti_tv (float): Ratio of transition alternate alleles to transversion alternate alleles.
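For example, with show=False the returned struct can be used programmatically (a minimal sketch):

>>> stats = hl.summarize_variants(dataset, show=False)
>>> stats.n_variants          # total number of variants
>>> stats.allele_counts[2]    # number of biallelic variants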

Returns:

None or Struct – Returns None if show is True, or returns results as a struct.

hail.methods.transmission_disequilibrium_test(dataset, pedigree)[source]

Performs the transmission disequilibrium test on trios.

Note

Requires the column key to be one field of type tstr

Note

Requires the dataset to have a compound row key: locus (locus) and alleles (array<str>).

Note

Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Compute TDT association statistics and show the first two results:

>>> pedigree = hl.Pedigree.read('data/tdt_trios.fam')
>>> tdt_table = hl.transmission_disequilibrium_test(tdt_dataset, pedigree)
>>> tdt_table.show(2)  
+---------------+------------+-------+-------+----------+----------+
| locus         | alleles    |     t |     u |   chi_sq |  p_value |
+---------------+------------+-------+-------+----------+----------+
| locus<GRCh37> | array<str> | int64 | int64 |  float64 |  float64 |
+---------------+------------+-------+-------+----------+----------+
| 1:246714629   | ["C","A"]  |     0 |     4 | 4.00e+00 | 4.55e-02 |
| 2:167262169   | ["T","C"]  |    NA |    NA |       NA |       NA |
+---------------+------------+-------+-------+----------+----------+

Export variants with p-values below 0.001:

>>> tdt_table = tdt_table.filter(tdt_table.p_value < 0.001)
>>> tdt_table.export(f"output/tdt_results.tsv")

Notes

The transmission disequilibrium test compares the number of times the alternate allele is transmitted (t) versus not transmitted (u) from a heterozygous parent to an affected child. The null hypothesis holds that each case is equally likely. The TDT statistic is given by

\[(t - u)^2 \over (t + u)\]

and asymptotically follows a chi-squared distribution with one degree of freedom under the null hypothesis.
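For example, the first row of the output above (t = 0, u = 4) can be checked by hand:

>>> t, u = 0, 4
>>> chi_sq = (t - u) ** 2 / (t + u)      # 4.0
>>> hl.eval(hl.pchisqtail(chi_sq, 1))    # right-tail p-value, approximately 0.0455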

transmission_disequilibrium_test() only considers complete trios (two parents and a proband with defined sex) and only returns results for the autosome, as defined by in_autosome(), and chromosome X. Transmissions and non-transmissions are counted only for the configurations of genotypes and copy state in the table below, in order to filter out Mendel errors and configurations where transmission is guaranteed. The copy state of a locus with respect to a trio is defined as follows:

  • Auto – in autosome or in PAR of X or female child

  • HemiX – in non-PAR of X and male child

Here PAR is the pseudoautosomal region of X and Y defined by ReferenceGenome, which many variant callers map to chromosome X.

Kid       Dad       Mom       Copy State    t    u
HomRef    Het       Het       Auto          0    2
HomRef    HomRef    Het       Auto          0    1
HomRef    Het       HomRef    Auto          0    1
Het       Het       Het       Auto          1    1
Het       HomRef    Het       Auto          1    0
Het       Het       HomRef    Auto          1    0
Het       HomVar    Het       Auto          0    1
Het       Het       HomVar    Auto          0    1
HomVar    Het       Het       Auto          2    0
HomVar    Het       HomVar    Auto          1    0
HomVar    HomVar    Het       Auto          1    0
HomRef    HomRef    Het       HemiX         0    1
HomRef    HomVar    Het       HemiX         0    1
HomVar    HomRef    Het       HemiX         1    0
HomVar    HomVar    Het       HemiX         1    0

transmission_disequilibrium_test() produces a table with the following columns:

  • locus (tlocus) – Locus.

  • alleles (tarray of tstr) – Alleles.

  • t (tint32) – Number of transmitted alternate alleles.

  • u (tint32) – Number of untransmitted alternate alleles.

  • chi_sq (tfloat64) – TDT statistic.

  • p_value (tfloat64) – p-value.

Parameters:
  • dataset (MatrixTable) – Dataset.

  • pedigree (Pedigree) – Sample pedigree.

Returns:

Table – Table of TDT results.

hail.methods.trio_matrix(dataset, pedigree, complete_trios=False)[source]

Builds and returns a matrix where columns correspond to trios and entries contain genotypes for the trio.

Note

Requires the column key to be one field of type tstr

Examples

Create a trio matrix:

>>> pedigree = hl.Pedigree.read('data/case_control_study.fam')
>>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)

Notes

This method builds a new matrix table with one column per trio. If complete_trios is True, then only trios that satisfy Trio.is_complete() are included. In this new dataset, the column identifiers are the sample IDs of the trio probands. The column fields and entries of the matrix are changed in the following ways:

The new column fields consist of three structs (proband, father, mother), a Boolean field, and a string field:

  • proband (tstruct) - Column fields on the proband.

  • father (tstruct) - Column fields on the father.

  • mother (tstruct) - Column fields on the mother.

  • id (tstr) - Column key for the proband.

  • is_female (tbool) - Proband is female. True for female, False for male, missing if unknown.

  • fam_id (tstr) - Family ID.

The new entry fields are:

  • proband_entry (tstruct) - Proband entry fields.

  • father_entry (tstruct) - Father entry fields.

  • mother_entry (tstruct) - Mother entry fields.
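A small usage sketch of the new entry fields, counting entries in which all three members of the trio have a defined genotype:

>>> trio_dataset.aggregate_entries(hl.agg.count_where(
...     hl.is_defined(trio_dataset.proband_entry.GT) &
...     hl.is_defined(trio_dataset.father_entry.GT) &
...     hl.is_defined(trio_dataset.mother_entry.GT)))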

Parameters:
  • dataset (MatrixTable) – Dataset.

  • pedigree (Pedigree) – Sample pedigree.

  • complete_trios (bool) – If True, include only trios that satisfy Trio.is_complete().

Returns:

MatrixTable

hail.methods.variant_qc(mt, name='variant_qc')[source]

Compute common variant statistics (quality control metrics).

Note

Requires the dataset to have a compound row key: locus (locus) and alleles (array<str>).

Examples

>>> dataset_result = hl.variant_qc(dataset)

Notes

This method computes variant statistics from the genotype data, returning a new struct field name with the following metrics based on the fields present in the entry schema.

If mt contains an entry field DP of type tint32, then the field dp_stats is computed. If mt contains an entry field GQ of type tint32, then the field gq_stats is computed. Both dp_stats and gq_stats are structs with four fields:

  • mean (float64) – Mean value.

  • stdev (float64) – Standard deviation (zero degrees of freedom).

  • min (int32) – Minimum value.

  • max (int32) – Maximum value.

If the dataset does not contain an entry field GT of type tcall, then an error is raised. The following fields are always computed from GT:

  • AF (array<float64>) – Calculated allele frequency, one element per allele, including the reference. Sums to one. Equivalent to AC / AN.

  • AC (array<int32>) – Calculated allele count, one element per allele, including the reference. Sums to AN.

  • AN (int32) – Total number of called alleles.

  • homozygote_count (array<int32>) – Number of homozygotes per allele. One element per allele, including the reference.

  • call_rate (float64) – Fraction of calls neither missing nor filtered. Equivalent to n_called / count_cols().

  • n_called (int64) – Number of samples with a defined GT.

  • n_not_called (int64) – Number of samples with a missing GT.

  • n_filtered (int64) – Number of filtered entries.

  • n_het (int64) – Number of heterozygous samples.

  • n_non_ref (int64) – Number of samples with at least one called non-reference allele.

  • het_freq_hwe (float64) – Expected frequency of heterozygous samples under Hardy-Weinberg equilibrium. See functions.hardy_weinberg_test() for details.

  • p_value_hwe (float64) – p-value from two-sided test of Hardy-Weinberg equilibrium. See functions.hardy_weinberg_test() for details.

  • p_value_excess_het (float64) – p-value from one-sided test of Hardy-Weinberg equilibrium for excess heterozygosity. See functions.hardy_weinberg_test() for details.

Warning

het_freq_hwe and p_value_hwe are calculated as in functions.hardy_weinberg_test(), with non-diploid calls (ploidy != 2) ignored in the counts. As this test is only statistically rigorous in the biallelic setting, variant_qc() sets both fields to missing for multiallelic variants. Consider using split_multi() to split multi-allelic variants beforehand.
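A common follow-up (a sketch, with hypothetical thresholds) is to filter variants on the computed metrics, for example call rate and HWE p-value:

>>> dataset_result = hl.variant_qc(dataset)
>>> filtered = dataset_result.filter_rows(
...     (dataset_result.variant_qc.call_rate > 0.95) &
...     (dataset_result.variant_qc.p_value_hwe > 1e-6))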

Parameters:
  • mt (MatrixTable) – Dataset.

  • name (str) – Name for resulting field.

Returns:

MatrixTable

hail.methods.vep(dataset, config=None, block_size=1000, name='vep', csq=False, tolerate_parse_error=False)[source]

Annotate variants with VEP.

Note

Requires the dataset to have a compound row key: locus (locus) and alleles (array<str>).

vep() runs Variant Effect Predictor on the current dataset and adds the result as a row field.

Examples

Add VEP annotations to the dataset:

>>> result = hl.vep(dataset, "data/vep-configuration.json") 

Notes

Installation

This VEP command only works if you have already installed VEP on your computing environment. If you use hailctl dataproc to start Hail clusters, installing VEP is achieved by specifying the --vep flag. For more detailed instructions, see Variant Effect Predictor (VEP). If you use hailctl hdinsight, see Variant Effect Predictor (VEP).

Spark Configuration

vep() needs a configuration file to tell it how to run VEP. This is the config argument to the VEP function. If you are using hailctl dataproc as mentioned above, you can just use the default argument for config and everything will work. If you need to run VEP with Hail in other environments, there are detailed instructions below.

The format of the configuration file is JSON, and vep() expects a JSON object with three fields:

  • command (array of string) – The VEP command line to run. The string literal __OUTPUT_FORMAT_FLAG__ is replaced with --json or --vcf depending on csq.

  • env (object) – A map of environment variables to values to add to the environment when invoking the command. The value of each object member must be a string.

  • vep_json_schema (string) – The type of the VEP JSON schema (as produced by VEP when invoked with the --json option). Note: This is the old-style ‘parseable’ Hail type syntax. This will change.

Here is an example configuration file for invoking VEP release 85 installed in /vep with the Loftee plugin:

{
    "command": [
        "/vep",
        "--format", "vcf",
        "__OUTPUT_FORMAT_FLAG__",
        "--everything",
        "--allele_number",
        "--no_stats",
        "--cache", "--offline",
        "--minimal",
        "--assembly", "GRCh37",
        "--plugin", "LoF,human_ancestor_fa:/root/.vep/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/root/.vep/loftee_data/phylocsf_gerp.sql,gerp_file:/root/.vep/loftee_data/GERP_scores.final.sorted.txt.gz",
        "-o", "STDOUT"
    ],
    "env": {
        "PERL5LIB": "/vep_data/loftee"
    },
    "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,uniparc:String,variant_allele:String}],variant_class:String}"
}

The configuration files used by hailctl dataproc can be found at the following locations:

  • GRCh37: gs://hail-us-central1-vep/vep85-loftee-gcloud.json

  • GRCh38: gs://hail-us-central1-vep/vep95-GRCh38-loftee-gcloud.json

If no config file is specified, this function will check to see if environment variable VEP_CONFIG_URI is set with a path to a config file.

Batch Service Configuration

If no config is specified, Hail will use the user’s Service configuration parameters to find a supported VEP configuration. However, if you wish to use your own implementation of VEP, then see the documentation for VEPConfig.

Annotations

A new row field is added in the location specified by name with type given by the vep_json_schema (if csq is False) or tarray of tstr (if csq is True).

If csq is True, then the CSQ header string is also added as a global field with name name + '_csq_header'.
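For example (a sketch reusing the configuration path from the example above and requiring a working VEP installation), with csq=True the annotation is an array of CSQ strings and the header lands in a global field:

>>> result = hl.vep(dataset, "data/vep-configuration.json", csq=True) 
>>> result.vep             # row field: array<str> of CSQ strings
>>> result.vep_csq_header  # global field: the CSQ header string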

Parameters:
  • dataset (MatrixTable or Table) – Dataset.

  • config (str or VEPConfig, optional) – Path to VEP configuration file or a VEPConfig object.

  • block_size (int) – Number of rows to process per VEP invocation.

  • name (str) – Name for resulting row field.

  • csq (bool) – If True, annotates with the VCF CSQ field as a tstr. If False, annotates as the vep_json_schema.

  • tolerate_parse_error (bool) – If True, ignore invalid JSON produced by VEP and return a missing annotation.

Returns:

MatrixTable or Table – Dataset with new row-indexed field name containing VEP annotations.