Types¶

Aggregable¶

An Aggregable is a Hail data type representing a distributed row or column of a matrix. Hail exposes a number of methods to compute on aggregables depending on the data type.

Aggregable[Array[Double]]¶

sum(): Array[Double] – Compute the sum by index. All elements in the aggregable must have the same length.

Aggregable[Array[Float]]¶

sum(): Array[Float] – Compute the sum by index. All elements in the aggregable must have the same length.

Aggregable[Array[Int]]¶

sum(): Array[Int]
Compute the sum by index. All elements in the aggregable must have the same length.

Examples

Count the total number of occurrences of each allele across samples, per variant:
>>> vds_result = vds.annotate_variants_expr('va.AC = gs.map(g => g.oneHotAlleles(v)).sum()')

Aggregable[Array[Long]]¶

sum(): Array[Long] – Compute the sum by index. All elements in the aggregable must have the same length.

Aggregable[Double]¶

hist(start: Double, end: Double, bins: Int): Struct{binEdges:Array[Double],binFrequencies:Array[Long],nLess:Long,nGreater:Long}
binEdges (Array[Double]) – Array of bin cutoffs

binFrequencies (Array[Long]) – Number of elements that fall in each bin.

nLess (Long) – Number of elements less than the minimum bin

nGreater (Long) – Number of elements greater than the maximum bin

Compute frequency distributions of numeric parameters.

Examples

Compute GQ-distributions per variant:
>>> vds_result = vds.annotate_variants_expr('va.gqHist = gs.map(g => g.gq).hist(0, 100, 20)')
Compute global GQ-distribution:
>>> gq_hist = vds.query_genotypes('gs.map(g => g.gq).hist(0, 100, 100)')
Notes

The start, end, and bins params are no-scope parameters, which means that while computations like 100 / 4 are acceptable, variable references like global.nBins are not.

Bin size is calculated from (end - start) / bins

(bins + 1) breakpoints are generated from the range (start to end by binsize)

Each bin is left-inclusive, right-exclusive except the last bin, which includes the maximum value. This means that if there are N total bins, there will be N + 1 elements in binEdges. For the invocation hist(0, 3, 3), binEdges would be [0, 1, 2, 3] where the bins are [0, 1), [1, 2), [2, 3].

Arguments

start (Double) – Starting point of first bin

end (Double) – End point of last bin

bins (Int) – Number of bins to create.
max(): Double – Compute the maximum of all non-missing elements. The empty max is missing.

min(): Double – Compute the minimum of all non-missing elements. The empty min is missing.

product(): Double – Compute the product of all non-missing elements. The empty product is one.
stats(): Struct{mean:Double,stdev:Double,min:Double,max:Double,nNotMissing:Long,sum:Double}
mean (Double) – Mean value

stdev (Double) – Standard deviation

min (Double) – Minimum value

max (Double) – Maximum value

nNotMissing (Long) – Number of non-missing values

sum (Double) – Sum of all elements

Compute summary statistics about a numeric aggregable.

Examples

Compute the mean genotype quality score per variant:
>>> vds_result = vds.annotate_variants_expr('va.gqMean = gs.map(g => g.gq).stats().mean')
Compute summary statistics on the number of singleton calls per sample:
>>> [singleton_stats] = (vds.sample_qc()
...     .query_samples(['samples.map(s => sa.qc.nSingleton).stats()']))
Compute GQ and DP statistics stratified by genotype call:
>>> gq_dp = [
... 'va.homrefGQ = gs.filter(g => g.isHomRef()).map(g => g.gq).stats()',
... 'va.hetGQ = gs.filter(g => g.isHet()).map(g => g.gq).stats()',
... 'va.homvarGQ = gs.filter(g => g.isHomVar()).map(g => g.gq).stats()',
... 'va.homrefDP = gs.filter(g => g.isHomRef()).map(g => g.dp).stats()',
... 'va.hetDP = gs.filter(g => g.isHet()).map(g => g.dp).stats()',
... 'va.homvarDP = gs.filter(g => g.isHomVar()).map(g => g.dp).stats()']
>>> vds_result = vds.annotate_variants_expr(gq_dp)
Notes

The stats() aggregator can be used to replicate some of the values computed by variant_qc() and sample_qc() such as dpMean and dpStDev.
sum(): Double – Compute the sum of all non-missing elements. The empty sum is zero.

Aggregable[Float]¶

max(): Float – Compute the maximum of all non-missing elements. The empty max is missing.

min(): Float – Compute the minimum of all non-missing elements. The empty min is missing.

sum(): Float – Compute the sum of all non-missing elements. The empty sum is zero.

Aggregable[Genotype]¶

callStats(f: Genotype => Variant): Struct{AC:Array[Int],AF:Array[Double],AN:Int,GC:Array[Int]}
AC (Array[Int]) – Allele count. One element per allele including reference. There are two elements for a biallelic variant, or 4 for a variant with three alternate alleles.

AF (Array[Double]) – Allele frequency. One element per allele including reference. Sums to 1.

AN (Int) – Allele number. This is equal to the sum of AC, or 2 * the total number of called genotypes in the aggregable.

GC (Array[Int]) – Genotype count. One element per possible genotype, including reference genotypes – 3 for biallelic, 6 for triallelic, 10 for 3 alt alleles, and so on. The sum of this array is the number of called genotypes in the aggregable.

Compute four commonly-used metrics over a set of genotypes in a variant.

Examples

Compute phenotype-specific call statistics:
>>> pheno_stats = [
...   'va.case_stats = gs.filter(g => sa.pheno.isCase).callStats(g => v)',
...   'va.control_stats = gs.filter(g => !sa.pheno.isCase).callStats(g => v)']
>>> vds_result = vds.annotate_variants_expr(pheno_stats)
va.eur_stats.AC will be the allele count (AC) computed from individuals marked as “EUR”.

Arguments

f (Genotype => Variant) – Variant lambda expression such as g => v.
hardyWeinberg(): Struct{rExpectedHetFrequency:Double,pHWE:Double}
rExpectedHetFrequency (Double) – Expected rHeterozygosity based on Hardy Weinberg Equilibrium

pHWE (Double) – p-value

Compute Hardy-Weinberg equilibrium p-value.

Examples

Add a new variant annotation that calculates HWE p-value by phenotype:
>>> vds_result = vds.annotate_variants_expr([
...   'va.hweCase = gs.filter(g => sa.pheno.isCase).hardyWeinberg()',
...   'va.hweControl = gs.filter(g => !sa.pheno.isCase).hardyWeinberg()'])
Notes

Hail computes the exact p-value with mid-p-value correction, i.e. the probability of a less-likely outcome plus one-half the probability of an equally-likely outcome. See this document for details on the Levene-Haldane distribution and references.
inbreeding(af: Genotype => Double): Struct{Fstat:Double,nTotal:Long,nCalled:Long,expectedHoms:Double,observedHoms:Long}
Fstat (Double) – Inbreeding coefficient

nTotal (Long) – Number of genotypes analyzed

nCalled (Long) – number of genotypes with non-missing calls

expectedHoms (Double) – Expected number of homozygote calls

observedHoms (Long) – Total number of homozygote calls observed

Compute inbreeding metric. This aggregator is equivalent to the `–het` method in PLINK.

Examples

Calculate the inbreeding metric per sample:
>>> vds_result = (vds.variant_qc()
...     .annotate_samples_expr('sa.inbreeding = gs.inbreeding(g => va.qc.AF)'))
To obtain the same answer as PLINK, use the following series of commands:
>>> vds_result = (vds.variant_qc()
...     .filter_variants_expr('va.qc.AC > 1 && va.qc.AF >= 1e-8 && va.qc.nCalled * 2 - va.qc.AC > 1 && va.qc.AF <= 1 - 1e-8 && v.isAutosomal()')
...     .annotate_samples_expr('sa.inbreeding = gs.inbreeding(g => va.qc.AF)'))
Notes

The Inbreeding Coefficient (F) is computed as follows:

For each variant and sample with a non-missing genotype call, E, the expected number of homozygotes (computed from user-defined expression for minor allele frequency), is computed as 1.0 - (2.0*maf*(1.0-maf))

For each variant and sample with a non-missing genotype call, O, the observed number of homozygotes, is computed as 0 = heterozygote; 1 = homozygote

For each variant and sample with a non-missing genotype call, N is incremented by 1

For each sample, E, O, and N are combined across variants

F is calculated by (O - E) / (N - E)

Arguments

af (Genotype => Double) – Lambda expression for the alternate allele frequency.
infoScore(): Struct{score:Double,nIncluded:Int}
score (Double) – IMPUTE info score

nIncluded (Int) – Number of samples with non-missing genotype probability distribution

Compute the IMPUTE information score.

Examples

Calculate the info score per variant:
>>> (hc.import_gen("data/example.gen", "data/example.sample")
...    .annotate_variants_expr('va.infoScore = gs.infoScore()'))
Calculate group-specific info scores per variant:
>>> vds_result = (hc.import_gen("data/example.gen", "data/example.sample")
...    .annotate_samples_expr("sa.isCase = pcoin(0.5)")
...    .annotate_variants_expr(["va.infoScore.case = gs.filter(g => sa.isCase).infoScore()",
...                             "va.infoScore.control = gs.filter(g => !sa.isCase).infoScore()"]))
Notes

We implemented the IMPUTE info measure as described in the supplementary information from Marchini & Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics (2010).

To calculate the info score \(I_{A}\) for one SNP:

\[\begin{split}I_{A} = \begin{cases} 1 - \frac{\sum_{i=1}^{N}(f_{i} - e_{i}^2)}{2N\hat{\theta}(1 - \hat{\theta})} & \text{when } \hat{\theta} \in (0, 1) \\ 1 & \text{when } \hat{\theta} = 0, \hat{\theta} = 1\\ \end{cases}\end{split}\]

\(N\) is the number of samples with imputed genotype probabilities [\(p_{ik} = P(G_{i} = k)\) where \(k \in \{0, 1, 2\}\)]

\(e_{i} = p_{i1} + 2p_{i2}\) is the expected genotype per sample

\(f_{i} = p_{i1} + 4p_{i2}\)

\(\hat{\theta} = \frac{\sum_{i=1}^{N}e_{i}}{2N}\) is the MLE for the population minor allele frequency

Hail will not generate identical results as QCTOOL for the following reasons:

The floating point number Hail stores for each genotype probability is slightly different than the original data due to rounding and normalization of probabilities.

Hail automatically removes genotype probability distributions that do not meet certain requirements on data import with import_gen() and import_bgen().

Hail does not use the population frequency to impute genotype probabilities when a genotype probability distribution has been set to missing.

Hail calculates the same statistic for sex chromosomes as autosomes while QCTOOL incorporates sex information

Warning

The info score Hail reports will be extremely different from qctool when a SNP has a high missing rate.

If the genotype data was not imported using the import_gen() or import_bgen() commands, then the results for all variants will be score = NA and nIncluded = 0.

It only makes sense to compute the info score for an Aggregable[Genotype] per variant. While a per-sample info score will run, the result is meaningless.

Aggregable[Int]¶

max(): Int – Compute the maximum of all non-missing elements. The empty max is missing.

min(): Int – Compute the minimum of all non-missing elements. The empty min is missing.

sum(): Int – Compute the sum of all non-missing elements. The empty sum is zero.

Aggregable[Long]¶

max(): Long – Compute the maximum of all non-missing elements. The empty max is missing.

min(): Long – Compute the minimum of all non-missing elements. The empty min is missing.

product(): Long – Compute the product of all non-missing elements. The empty product is one.

sum(): Long – Compute the sum of all non-missing elements. The empty sum is zero.

Aggregable[T]¶

collect(): Array[T]
Returns an array with all of the elements in the aggregable. Order is not guaranteed.

Examples

Collect the list of sample IDs with heterozygote genotype calls per variant:
>>> vds_result = vds.annotate_variants_expr('va.hetSamples = gs.filter(g => g.isHet()).map(g => s).collect()')
va.hetSamples will have the type Array[String].
collectAsSet(): Set[T] – Returns the set of all unique elements in the aggregable.
count(): Long
Counts the number of elements in an aggregable.

Examples

Count the number of heterozygote genotype calls in an aggregable of genotypes (gs):
>>> vds_result = vds.annotate_variants_expr('va.nHets = gs.filter(g => g.isHet()).count()')
counter(): Dict[T, Long]
Counts the number of occurrences of each element in an aggregable.

Examples

Compute the number of indels in each chromosome:
>>> [indels_per_chr] = vds.query_variants(['variants.filter(v => v.altAllele().isIndel()).map(v => v.contig).counter()'])
Notes

We recommend this function is used with the Python counter object.
>>> [counter] = vds.query_variants(['variants.flatMap(v => v.altAlleles).counter()'])
>>> from collections import Counter
>>> counter = Counter(counter)
>>> print(counter.most_common(5))
[(AltAllele(C, T), 129L),
 (AltAllele(G, A), 112L),
 (AltAllele(C, A), 60L),
 (AltAllele(A, G), 46L),
 (AltAllele(T, C), 44L)]
exists(expr: T => Boolean): Boolean

Returns true if any element in the aggregator satisfies the condition given by expr and false otherwise.

Arguments

expr (T => Boolean) – Lambda expression.
filter(f: T => Boolean): Aggregable[T]
Subsets an aggregable by evaluating f for each element and keeping those elements that evaluate to true.

Examples

Compute Hardy Weinberg Equilibrium for cases only:
>>> vds_result = vds.annotate_variants_expr("va.hweCase = gs.filter(g => sa.isCase).hardyWeinberg()")
Arguments

f (T => Boolean) – Boolean lambda expression.
flatMap(f: T => Set[U]): Aggregable[U]
Returns a new aggregable by applying a function f to each element and concatenating the resulting sets.

Compute a list of genes per sample with loss of function variants (result does not have duplicate entries):
>>> vds_result = vds.annotate_samples_expr('sa.lof_genes = gs.filter(g => va.consequence == "LOF" && g.nNonRefAlleles() > 0).flatMap(g => va.genes.toSet()).collect()')
Arguments

f (T => Set[U]) – Lambda expression.
flatMap(a: T => Array[U]): Aggregable[U]
Returns a new aggregable by applying a function f to each element and concatenating the resulting arrays.

Examples

Compute a list of genes per sample with loss of function variants (result may have duplicate entries):
>>> vds_result = vds.annotate_samples_expr('sa.lof_genes = gs.filter(g => va.consequence == "LOF" && g.nNonRefAlleles() > 0).flatMap(g => va.genes).collect()')
forall(expr: T => Boolean): Boolean

Returns a true if all elements in the array satisfies the condition given by expr and false otherwise.

Arguments

expr (T => Boolean) – Lambda expression.
fraction(a: T => Boolean): Double
Computes the ratio of the number of occurrences for which a boolean condition evaluates to true, divided by the number of included elements in the aggregable.

Examples

Filter variants with a call rate less than 95%:
>>> vds_result = vds.filter_variants_expr('gs.fraction(g => g.isCalled()) > 0.90')
Compute the differential missingness at SNPs and indels:
>>> exprs = ['sa.SNPmissingness = gs.filter(g => v.altAllele().isSNP()).fraction(g => g.isNotCalled())',
...          'sa.indelmissingness = gs.filter(g => v.altAllele().isIndel()).fraction(g => g.isNotCalled())']
>>> vds_result = vds.annotate_samples_expr(exprs)
map(f: T => U): Aggregable[U]
Change the type of an aggregable by evaluating f for each element.

Examples

Convert an aggregable of genotypes (gs) to an aggregable of genotype quality scores and then compute summary statistics:
>>> vds_result = vds.annotate_variants_expr("va.gqStats = gs.map(g => g.gq).stats()")
Arguments

f (T => U) – Lambda expression.
take(n: Int): Array[T]
Take the first n items of an aggregable.

Examples

Collect the first 5 sample IDs with at least one alternate allele per variant:
>>> vds_result = vds.annotate_variants_expr("va.nonRefSamples = gs.filter(g => g.nNonRefAlleles() > 0).map(g => s).take(5)")
Arguments

n (Int) – Number of items to take.
takeBy(f: T => String, n: Int): Array[T]

Returns the first n items of an aggregable in ascending order, ordered by the result of f. NA always appears last. If the aggregable contains less than n items, then the result will contain as many elements as the aggregable contains.

Arguments

f (T => String) – Lambda expression for mapping an aggregable to an ordered value.

n (Int) – Number of items to take.
takeBy(f: T => Double, n: Int): Array[T]
Returns the first n items of an aggregable in ascending order, ordered by the result of f. NA always appears last. If the aggregable contains less than n items, then the result will contain as many elements as the aggregable contains.

Note that NaN always appears after any finite or infinite floating-point numbers but before NA. For example, consider an aggregable containing these elements:
Infinity, -1, 1, 0, -Infinity, NA, NaN
The expression gs.takeBy(x => x, 7) would return the array:
[-Infinity, -1, 0, 1, Infinity, NaN, NA]
The expression gs.takeBy(x => -x, 7) would return the array:
[Infinity, 1, 0, -1, -Infinity, NaN, NA]
Arguments

f (T => Double) – Lambda expression for mapping an aggregable to an ordered value.

n (Int) – Number of items to take.
takeBy(f: T => Float, n: Int): Array[T]
Returns the first n items of an aggregable in ascending order, ordered by the result of f. NA always appears last. If the aggregable contains less than n items, then the result will contain as many elements as the aggregable contains.

Note that NaN always appears after any finite or infinite floating-point numbers but before NA. For example, consider an aggregable containing these elements:
Infinity, -1, 1, 0, -Infinity, NA, NaN
The expression gs.takeBy(x => x, 7) would return the array:
[-Infinity, -1, 0, 1, Infinity, NaN, NA]
The expression gs.takeBy(x => -x, 7) would return the array:
[Infinity, 1, 0, -1, -Infinity, NaN, NA]
Arguments

f (T => Float) – Lambda expression for mapping an aggregable to an ordered value.

n (Int) – Number of items to take.
takeBy(f: T => Long, n: Int): Array[T]
Returns the first n items of an aggregable in ascending order, ordered by the result of f. NA always appears last. If the aggregable contains less than n items, then the result will contain as many elements as the aggregable contains.

Examples

Consider an aggregable gs containing these elements:
7, 6, 3, NA, 1, 2, NA, 4, 5, -1
The expression gs.takeBy(x => x, 5) would return the array:
[-1, 1, 2, 3, 4]
The expression gs.takeBy(x => -x, 5) would return the array:
[7, 6, 5, 4, 3]
The expression gs.takeBy(x => x, 10) would return the array:
[-1, 1, 2, 3, 4, 5, 6, 7, NA, NA]
Returns the 10 samples with the least number of singletons:
>>> samplesMostSingletons = (vds
...   .sample_qc()
...   .query_samples('samples.takeBy(s => sa.qc.nSingleton, 10)'))
Arguments

f (T => Long) – Lambda expression for mapping an aggregable to an ordered value.

n (Int) – Number of items to take.
takeBy(f: T => Int, n: Int): Array[T]
Returns the first n items of an aggregable in ascending order, ordered by the result of f. NA always appears last. If the aggregable contains less than n items, then the result will contain as many elements as the aggregable contains.

Examples

Consider an aggregable gs containing these elements:
7, 6, 3, NA, 1, 2, NA, 4, 5, -1
The expression gs.takeBy(x => x, 5) would return the array:
[-1, 1, 2, 3, 4]
The expression gs.takeBy(x => -x, 5) would return the array:
[7, 6, 5, 4, 3]
The expression gs.takeBy(x => x, 10) would return the array:
[-1, 1, 2, 3, 4, 5, 6, 7, NA, NA]
Returns the 10 samples with the least number of singletons:
>>> samplesMostSingletons = (vds
...   .sample_qc()
...   .query_samples('samples.takeBy(s => sa.qc.nSingleton, 10)'))
Arguments

f (T => Int) – Lambda expression for mapping an aggregable to an ordered value.

n (Int) – Number of items to take.

AltAllele¶

An AltAllele is a Hail data type representing an alternate allele in the Variant Dataset.

alt: String – Alternate allele base sequence.

category(): String – the alt allele type, i.e one of SNP, Insertion, Deletion, Star, MNP, Complex

isComplex(): Boolean – True if not a SNP, MNP, star, insertion, or deletion.

isDeletion(): Boolean – True if v.ref begins with and is longer than v.alt.

isIndel(): Boolean – True if an insertion or a deletion.

isInsertion(): Boolean – True if v.alt begins with and is longer than v.ref.

isMNP(): Boolean – True if v.ref and v.alt are the same length and differ in more than one position.

isSNP(): Boolean – True if v.ref and v.alt are the same length and differ in one position.

isStar(): Boolean – True if v.alt is *.

isTransition(): Boolean – True if a purine-purine or pyrimidine-pyrimidine SNP.

isTransversion(): Boolean – True if a purine-pyrimidine SNP.

ref: String – Reference allele base sequence.

Array¶

An Array is a collection of items that all have the same data type (ex: Int, String) and are indexed. Arrays can be constructed by specifying [item1, item2, ...] and they are 0-indexed.

An example of constructing an array and accessing an element is:

let a = [1, 10, 3, 7] in a[1]
result: 10

They can also be nested such as Array[Array[Int]]:

let a = [[1, 2, 3], [4, 5], [], [6, 7]] in a[1]
result: [4, 5]

Array[Array[T]]¶

flatten(): Array[T]
Flattens a nested array by concatenating all its rows into a single array.
let a = [[1, 3], [2, 4]] in a.flatten()
result: [1, 3, 2, 4]

Array[Boolean]¶

sort(ascending: Boolean): Array[Boolean]

Sort the collection with the ordering specified by ascending.

Arguments

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sort(): Array[Boolean] – Sort the collection in ascending order.

Array[Double]¶

max(): Double – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Double – Median value of the collection.

min(): Double – Smallest element in the collection.

product(): Double – Product of all elements in the collection (returns 1 if empty).

sort(ascending: Boolean): Array[Double]

Sort the collection with the ordering specified by ascending.

Arguments

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sort(): Array[Double] – Sort the collection in ascending order.

sum(): Double – Sum of all elements in the collection.

Array[Float]¶

max(): Float – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Float – Median value of the collection.

min(): Float – Smallest element in the collection.

product(): Float – Product of all elements in the collection (returns 1 if empty).

sort(ascending: Boolean): Array[Float]

Sort the collection with the ordering specified by ascending.

Arguments

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sort(): Array[Float] – Sort the collection in ascending order.

sum(): Float – Sum of all elements in the collection.

Array[Int]¶

max(): Int – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Int – Median value of the collection.

min(): Int – Smallest element in the collection.

product(): Int – Product of all elements in the collection (returns 1 if empty).

sort(ascending: Boolean): Array[Int]

Sort the collection with the ordering specified by ascending.

Arguments

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sort(): Array[Int] – Sort the collection in ascending order.

sum(): Int – Sum of all elements in the collection.

Array[Long]¶

max(): Long – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Long – Median value of the collection.

min(): Long – Smallest element in the collection.

product(): Long – Product of all elements in the collection (returns 1 if empty).

sort(ascending: Boolean): Array[Long]

Sort the collection with the ordering specified by ascending.

Arguments

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sort(): Array[Long] – Sort the collection in ascending order.

sum(): Long – Sum of all elements in the collection.

Array[String]¶

mkString(delimiter: String): String
Concatenates all elements of this array into a single string where each element is separated by the delimiter.
let a = ["a", "b", "c"] in a.mkString("::")
result: "a::b::c"
Arguments

delimiter (String) – String that separates each element.
sort(ascending: Boolean): Array[String]

Sort the collection with the ordering specified by ascending.

Arguments

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sort(): Array[String] – Sort the collection in ascending order.

Array[T]¶

[:j]: Array[T]
Returns a slice of the array from the first element until the j*th* element (0-indexed). Negative indices are interpreted as offsets from the end of the array.
let a = [0, 2, 4, 6, 8, 10] in a[:4]
result: [0, 2, 4, 6]
let a = [0, 2, 4, 6, 8, 10] in a[:-4]
result: [0, 2]
Arguments

j (Int) – End index of the slice (not included in result).
[:]: Array[T]
Returns a copy of the array.
let a = [0, 2, 4, 6] in a[:]
result: [0, 2, 4, 6]
[i:j]: Array[T]
Returns a slice of the array from the i*th* element until the j*th* element (both 0-indexed). Negative indices are interpreted as offsets from the end of the array.
let a = [0, 2, 4, 6, 8, 10] in a[2:4]
result: [4, 6]
let a = [0, 2, 4, 6, 8, 10] in a[-3:-1]
result: [6, 8]
A handy way to understand the behavior of negative indicies is to recall this rule:
s[-i:-j] == s[s.length - i, s.length - j]
Arguments

i (Int) – Starting index of the slice.

j (Int) – End index of the slice (not included in result).
[i:]: Array[T]
Returns a slice of the array from the i*th* element (0-indexed) to the end. Negative indices are interpreted as offsets from the end of the array.
let a = [0, 2, 4, 6, 8, 10] in a[3:]
result: [6, 8, 10]
let a = [0, 2, 4, 6, 8, 10] in a[-5:]
result: [2, 4, 6, 8, 10]
Arguments

i (Int) – Starting index of the slice.
[i]: T
Returns the i*th* element (0-indexed) of the array, or throws an exception if i is an invalid index.
let a = [0, 2, 4, 6, 8, 10] in a[2]
result: 4
Arguments

i (Int) – Index of the element to return.
append(a: T): Array[T] – Returns the result of adding the element a to the end of this Array.
exists(expr: T => Boolean): Boolean
Returns a boolean which is true if any element in the array satisfies the condition given by expr. false otherwise.
let a = [1, 2, 3, 4, 5, 6] in a.exists(e => e > 4)
result: true
Arguments

expr (T => Boolean) – Lambda expression.
extend(a: Array[T]): Array[T] – Returns the concatenation of this Array followed by Array a.
filter(expr: T => Boolean): Array[T]
Returns a new array subsetted to the elements where expr evaluates to true.
let a = [1, 4, 5, 6, 10] in a.filter(e => e % 2 == 0)
result: [4, 6, 10]
Arguments

expr (T => Boolean) – Lambda expression.
find(expr: T => Boolean): T
Returns the first non-missing element of the array for which expr is true. If no element satisfies the predicate, find returns NA.
let a = ["cat", "dog", "rabbit"] in a.find(e => 'bb' ~ e)
result: "rabbit"
Arguments

expr (T => Boolean) – Lambda expression.
flatMap(expr: T => Array[U]): Array[U]
Returns a new array by applying a function to each subarray and concatenating the resulting arrays.
let a = [[1, 2, 3], [4, 5], [6]] in a.flatMap(e => e + 1)
result: [2, 3, 4, 5, 6, 7]
Arguments

expr (T => Array[U]) – Lambda expression.
forall(expr: T => Boolean): Boolean
Returns a boolean which is true if all elements in the array satisfies the condition given by expr and false otherwise.
let a = [0, 2, 4, 6, 8, 10] in a.forall(e => e % 2 == 0)
result: true
Arguments

expr (T => Boolean) – Lambda expression.
groupBy(a: T => U): Dict[U, Array[T]]

head(): T – Selects the first element.

isEmpty(): Boolean – Returns true if the number of elements is equal to 0. false otherwise.

length(): Int – Number of elements in the collection.
map(expr: T => U): Array[U]
Returns a new array produced by applying expr to each element.
let a = [0, 1, 2, 3] in a.map(e => pow(2, e))
result: [1, 2, 4, 8]
Arguments

expr (T => U) – Lambda expression.
size(): Int – Number of elements in the collection.

sortBy(f: T => String, ascending: Boolean): Array[T]

Sort the collection with the ordering specified by ascending after evaluating f for each element.

Arguments

f (T => String) – Lambda expression.

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sortBy(f: T => String): Array[T]

Sort the collection in ascending order after evaluating f for each element.

Arguments

f (T => String) – Lambda expression.

sortBy(f: T => Double, ascending: Boolean): Array[T]

Sort the collection with the ordering specified by ascending after evaluating f for each element.

Arguments

f (T => Double) – Lambda expression.

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sortBy(f: T => Double): Array[T]

Sort the collection in ascending order after evaluating f for each element.

Arguments

f (T => Double) – Lambda expression.

sortBy(f: T => Float, ascending: Boolean): Array[T]

Sort the collection with the ordering specified by ascending after evaluating f for each element.

Arguments

f (T => Float) – Lambda expression.

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sortBy(f: T => Float): Array[T]

Sort the collection in ascending order after evaluating f for each element.

Arguments

f (T => Float) – Lambda expression.

sortBy(f: T => Long, ascending: Boolean): Array[T]

Sort the collection with the ordering specified by ascending after evaluating f for each element.

Arguments

f (T => Long) – Lambda expression.

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sortBy(f: T => Long): Array[T]

Sort the collection in ascending order after evaluating f for each element.

Arguments

f (T => Long) – Lambda expression.

sortBy(f: T => Int, ascending: Boolean): Array[T]

Sort the collection with the ordering specified by ascending after evaluating f for each element.

Arguments

f (T => Int) – Lambda expression.

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sortBy(f: T => Int): Array[T]

Sort the collection in ascending order after evaluating f for each element.

Arguments

f (T => Int) – Lambda expression.

sortBy(f: T => Boolean, ascending: Boolean): Array[T]

Sort the collection with the ordering specified by ascending after evaluating f for each element.

Arguments

f (T => Boolean) – Lambda expression.

ascending (Boolean) – If true, sort the collection in ascending order. Otherwise, sort in descending order.

sortBy(f: T => Boolean): Array[T]

Sort the collection in ascending order after evaluating f for each element.

Arguments

f (T => Boolean) – Lambda expression.

tail(): Array[T] – Selects all elements except the first.

toArray(): Array[T] – Convert collection to an Array.

toSet(): Set[T] – Convert collection to a Set.

Boolean¶

max(a: Boolean): Boolean – Returns the maximum value.

min(a: Boolean): Boolean – Returns the minimum value.

toDouble(): Double – Convert value to a Double. Returns 1.0 if true, else 0.0.

toFloat(): Float – Convert value to a Float. Returns 1.0 if true, else 0.0.

toInt(): Int – Convert value to an Integer. Returns 1 if true, else 0.

toLong(): Long – Convert value to a Long. Returns 1L if true, else 0L.

Call¶

A Call is a Hail data type representing a genotype call (ex: 0/0) in the Variant Dataset.

gt: Int – the integer gt = k*(k+1)/2 + j for call j/k (0 = 0/0, 1 = 0/1, 2 = 1/1, 3 = 0/2, etc.).

gtj(): Int – the index of allele j for call j/k (0 = ref, 1 = first alt allele, etc.).

gtk(): Int – the index of allele k for call j/k (0 = ref, 1 = first alt allele, etc.).

isCalled(): Boolean – True if the call is not ./..

isCalledNonRef(): Boolean – True if either isHet or isHomVar is true.

isHet(): Boolean – True if this call is heterozygous.

isHetNonRef(): Boolean – True if this call is j/k with j>0.

isHetRef(): Boolean – True if this call is 0/k with k>0.

isHomRef(): Boolean – True if this call is 0/0.

isHomVar(): Boolean – True if this call is j/j with j>0.

isNotCalled(): Boolean – True if the call is ./..

nNonRefAlleles(): Int – the number of called alternate alleles.

oneHotAlleles(v: Variant): Array[Int]

Produce an array of called counts for each allele in the variant (including reference). For example, calling this function with a biallelic variant on hom-ref, het, and hom-var calls will produce [2, 0], [1, 1], and [0, 2] respectively.

Arguments

v (Variant) – Variant

oneHotGenotype(v: Variant): Array[Int]

Produces an array with one element for each possible genotype in the variant, where the called genotype is 1 and all else 0. For example, calling this function with a biallelic variant on hom-ref, het, and hom-var calls will produce [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.

Arguments

v (Variant) – Variant

toGenotype(): Genotype – Convert this call to a Genotype.

Dict¶

A Dict is an unordered collection of key-value pairs. Each key can only appear once in the collection.

[k]: U

Returns the value for k, or throws an exception if the key is not found.

Arguments

k (T) – Key in the Dict to query.

contains(k: T): Boolean

Returns true if the Dict has a key equal to k, otherwise false.

Arguments

k (T) – Key name to query.

get(a: T): U – Returns the value of the Dict for key k, or returns NA if the key is not found.

isEmpty(): Boolean – Returns true if the number of elements is equal to 0. false otherwise.

keySet(): Set[T] – Returns a Set containing the keys of the Dict.

keys(): Array[T] – Returns an Array containing the keys of the Dict.

mapValues(expr: U => V): Dict[T, V]

Returns a new Dict produced by applying expr to each value. The keys are unmodified.

Arguments

expr (U => V) – Lambda expression.

size(): Int – Number of elements in the collection.

values(): Array[U] – Returns an Array containing the values of the Dict.

Double¶

abs(): Double – Returns the absolute value of a number.

max(a: Double): Double – Returns the maximum value.

min(a: Double): Double – Returns the minimum value.

signum(): Int – Returns the sign of a number (1, 0, or -1).

toDouble(): Double – Convert value to a Double.

toFloat(): Float – Convert value to a Float.

toInt(): Int – Convert value to an Integer.

toLong(): Long – Convert value to a Long.

Float¶

abs(): Float – Returns the absolute value of a number.

max(a: Float): Float – Returns the maximum value.

min(a: Float): Float – Returns the minimum value.

signum(): Int – Returns the sign of a number (1, 0, or -1).

toDouble(): Double – Convert value to a Double.

toFloat(): Float – Convert value to a Float.

toInt(): Int – Convert value to an Integer.

toLong(): Long – Convert value to a Long.

Genotype¶

A Genotype is a Hail data type representing a genotype in the Variant Dataset. It is referred to as g in the expression language.

ad: Array[Int] – allelic depth for each allele.

call(): Call – the integer gt = k*(k+1)/2 + j for call j/k (0 = 0/0, 1 = 0/1, 2 = 1/1, 3 = 0/2, etc.).

dosage: Double – the expected number of non-reference alleles based on genotype probabilities.

dp: Int – the total number of informative reads.

fakeRef: Boolean – True if this genotype was downcoded in split_multi(). This can happen if a 1/2 call is split to 0/1, 0/1.

fractionReadsRef(): Double – the ratio of ref reads to the sum of all informative reads.

gp: Array[Double] – the linear-scaled probabilities.

gq: Int – the difference between the two smallest PL entries.

gt: Int – the integer gt = k*(k+1)/2 + j for call j/k (0 = 0/0, 1 = 0/1, 2 = 1/1, 3 = 0/2, etc.).

gtj(): Int – the index of allele j for call j/k (0 = ref, 1 = first alt allele, etc.).

gtk(): Int – the index of allele k for call j/k (0 = ref, 1 = first alt allele, etc.).

isCalled(): Boolean – True if the genotype is not ./..

isCalledNonRef(): Boolean – True if either g.isHet or g.isHomVar is true.

isHet(): Boolean – True if this call is heterozygous.

isHetNonRef(): Boolean – True if this call is j/k with j>0.

isHetRef(): Boolean – True if this call is 0/k with k>0.

isHomRef(): Boolean – True if this call is 0/0.

isHomVar(): Boolean – True if this call is j/j with j>0.

isLinearScale: Boolean – True if the data was imported from import_gen() or import_bgen().

isNotCalled(): Boolean – True if the genotype is ./..

nNonRefAlleles(): Int – the number of called alternate alleles.

od(): Int – od = dp - ad.sum.

oneHotAlleles(v: Variant): Array[Int]

Produce an array of called counts for each allele in the variant (including reference). For example, calling this function with a biallelic variant on hom-ref, het, and hom-var genotypes will produce [2, 0], [1, 1], and [0, 2] respectively.

Arguments

v (Variant) – Variant

oneHotGenotype(v: Variant): Array[Int]

Produces an array with one element for each possible genotype in the variant, where the called genotype is 1 and all else 0. For example, calling this function with a biallelic variant on hom-ref, het, and hom-var genotypes will produce [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.

Arguments

v (Variant) – Variant

pAB(): Double – p-value for pulling the given allelic depth from a binomial distribution with mean 0.5. Missing if the call is not heterozygous.

pl: Array[Int]

phred-scaled normalized genotype likelihood values. The conversion between g.pl (Phred-scaled likelihoods) and g.gp (linear-scaled probabilities) assumes a uniform prior.

Int¶

abs(): Int – Returns the absolute value of a number.

max(a: Int): Int – Returns the maximum value.

min(a: Int): Int – Returns the minimum value.

signum(): Int – Returns the sign of a number (1, 0, or -1).

toDouble(): Double – Convert value to a Double.

toFloat(): Float – Convert value to a Float.

toInt(): Int – Convert value to an Integer.

toLong(): Long – Convert value to a Long.

Interval¶

An Interval is a Hail data type representing a range of genomic locations in the Variant Dataset.

contains(locus: Locus): Boolean
Returns true if the locus is in the interval.
let i = Interval(Locus("1", 1000), Locus("1", 2000)) in i.contains(Locus("1", 1500))
result: true
Arguments

locus (Locus) – Locus
end: Locus – Locus at the end of the interval (exclusive).

start: Locus – Locus at the start of the interval (inclusive).

Locus¶

A Locus is a Hail data type representing a specific genomic location in the Variant Dataset.

contig: String – String representation of contig.

position: Int – Chromosomal position.

Long¶

abs(): Long – Returns the absolute value of a number.

max(a: Long): Long – Returns the maximum value.

min(a: Long): Long – Returns the minimum value.

signum(): Int – Returns the sign of a number (1, 0, or -1).

toDouble(): Double – Convert value to a Double.

toFloat(): Float – Convert value to a Float.

toInt(): Int – Convert value to an Integer.

toLong(): Long – Convert value to a Long.

Set¶

A Set is an unordered collection with no repeated values of a given data type (ex: Int, String). Sets can be constructed by specifying [item1, item2, ...].toSet().

let s = ["rabbit", "cat", "dog", "dog"].toSet()
result: Set("cat", "dog", "rabbit")

They can also be nested such as Set[Set[Int]]:

let s = [[1, 2, 3].toSet(), [4, 5, 5].toSet()].toSet()
result: Set(Set(1, 2, 3), Set(4, 5))

Set[Double]¶

max(): Double – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Double – Median value of the collection.

min(): Double – Smallest element in the collection.

product(): Double – Product of all elements in the collection (returns 1 if empty).

sum(): Double – Sum of all elements in the collection.

Set[Float]¶

max(): Float – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Float – Median value of the collection.

min(): Float – Smallest element in the collection.

product(): Float – Product of all elements in the collection (returns 1 if empty).

sum(): Float – Sum of all elements in the collection.

Set[Int]¶

max(): Int – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Int – Median value of the collection.

min(): Int – Smallest element in the collection.

product(): Int – Product of all elements in the collection (returns 1 if empty).

sum(): Int – Sum of all elements in the collection.

Set[Long]¶

max(): Long – Largest element in the collection.

mean(): Double – Mean value of the collection.

median(): Long – Median value of the collection.

min(): Long – Smallest element in the collection.

product(): Long – Product of all elements in the collection (returns 1 if empty).

sum(): Long – Sum of all elements in the collection.

Set[Set[T]]¶

flatten(): Set[T]
Flattens a nested set by concatenating all its elements into a single set.
let s = [[1, 2].toSet(), [3, 4].toSet()].toSet() in s.flatten()
result: Set(1, 2, 3, 4)

Set[String]¶

mkString(delimiter: String): String
Concatenates all elements of this set into a single string where each element is separated by the delimiter.
let s = [1, 2, 3].toSet() in s.mkString(",")
result: "1,2,3"
Arguments

delimiter (String) – String that separates each element.

Set[T]¶

add(a: T): Set[T] – Returns the result of adding the element a to this Set.
contains(x: T): Boolean
Returns true if the element x is contained in the set, otherwise false.
let s = [1, 2, 3].toSet() in s.contains(5)
result: false
Arguments

x (T) – Value to test.
difference(a: Set[T]): Set[T] – Returns the elements of this Set that are not in Set a.
exists(expr: T => Boolean): Boolean
Returns a boolean which is true if any element in the set satisfies the condition given by expr and false otherwise.
let s = [0, 2, 4, 6, 8, 10].toSet() in s.exists(e => e % 2 == 1)
result: false
Arguments

expr (T => Boolean) – Lambda expression.
filter(expr: T => Boolean): Set[T]
Returns a new set subsetted to the elements where expr evaluates to true.
let s = [1, 4, 5, 6, 10].toSet() in s.filter(e => e >= 5)
result: Set(5, 6, 10)
Arguments

expr (T => Boolean) – Lambda expression.
find(expr: T => Boolean): T
Returns the first non-missing element of the array for which expr is true. If no element satisfies the predicate, find returns NA.
let s = [1, 2, 3].toSet() in s.find(e => e % 3 == 0)
result: 3
Arguments

expr (T => Boolean) – Lambda expression.
flatMap(expr: T => Set[U]): Set[U]
Returns a new set by applying a function to each subset and concatenating the resulting sets.
let s = [["a", "b", "c"].toSet(), ["d", "e"].toSet(), ["f"].toSet()].toSet() in s.flatMap(e => e + "1")
result: Set("a1", "b1", "c1", "d1", "e1", "f1")
Arguments

expr (T => Set[U]) – Lambda expression.
forall(expr: T => Boolean): Boolean
Returns a boolean which is true if all elements in the set satisfies the condition given by expr and false otherwise.
let s = [0.1, 0.5, 0.3, 1.0, 2.5, 3.0].toSet() in s.forall(e => e > 1.0 == 0)
result: false
Arguments

expr (T => Boolean) – Lambda expression.
groupBy(a: T => U): Dict[U, Set[T]]

head(): T – Select one element.

intersection(a: Set[T]): Set[T] – Returns the intersection of this Set and Set a.

isEmpty(): Boolean – Returns true if the number of elements is equal to 0. false otherwise.

issubset(a: Set[T]): Boolean – Returns true if this Set is a subset of Set a.
map(expr: T => U): Set[U]
Returns a new set produced by applying expr to each element.
let s = [1, 2, 3].toSet() in s.map(e => e * 3)
result: Set(3, 6, 9)
Arguments

expr (T => U) – Lambda expression.
remove(a: T): Set[T] – Returns the result of removing the element a from this Set.

size(): Int – Number of elements in the collection.

tail(): Set[T] – Select all elements except the element returned by head.

toArray(): Array[T] – Convert collection to an Array.

toSet(): Set[T] – Convert collection to a Set.

union(a: Set[T]): Set[T] – Returns the union of this Set and Set a.

String¶

[:j]: String
Returns a slice of the string from the first unicode-codepoint until the j*th* unicode-codepoint (0-indexed). Negative indices are interpreted as offsets from the end of the string.
let a = "abcdef" in a[:4]
result: "abcd"
let a = "abcdef" in a[:-3]
result: "abc"
Arguments

j (Int) – End index of the slice (not included in result).
[:]: String
Returns a copy of the string.
let a = "abcd" in a[:]
result: "abcd"
[i:j]: String
Returns a slice of the string from the i*th* unicode-codepoint until the j*th* unicode-codepoint (both 0-indexed). Negative indices are interpreted as offsets from the end of the string instead of the beginning.
let a = "abcdef" in a[2:4]
result: "cd"
let a = "abcdef" in a[-3:-1]
result: "de"
A handy way to understand the behavior of negative indicies is to recall this rule, which holds for any positive integers i and j:
s[-i:-j] == s[s.length - i, s.length - j]
Arguments

i (Int) – Starting index of the slice.

j (Int) – End index of the slice (not included in result).
[i:]: String
Returns a slice of the string from the i*th* unicode-codepoint (0-indexed) to the end. Negative indices are interpreted as offsets from the end of the string.
let a = "abcdef" in a[3:]
result: "def"
let a = "abcdef" in a[-5:]
result: "bcdef"
Arguments

i (Int) – Starting index of the slice.
[i]: String
Returns the i*th* element (0-indexed) of the string, or throws an exception if i is an invalid index.
let s = "genetics" in s[6]
result: "c"
Arguments

i (Int) – Index of the character to return.
entropy(): Double

Computes the Shannon entropy in bits of the character frequency distribution.

length(): Int – Length of the string.

max(a: String): String – Returns the maximum value.

min(a: String): String – Returns the minimum value.
replace(pattern1: String, pattern2: String): String
Replaces each substring of this string that matches the given regular expression (pattern1) with the given replacement (pattern2).
let s = "1kg-NA12878" in a.replace("1kg-", "")
result: "NA12878"
Arguments

pattern1 (String) – Substring to replace.

pattern2 (String) – Replacement string.
split(delim: String, n: Int): Array[String]
Returns an array of strings, split on the given regular expression delimiter with the pattern applied n times. See the documentation on regular expression syntax delimiter. If you need to split on special characters, escape them with double backslash (\).
let s = "1kg-NA12878" in s.split("-")
result: ["1kg", "NA12878"]
Arguments

delim (String) – Regular expression delimiter.

n (Int) – Number of times the pattern is applied. See the Java documentation for more information.
split(delim: String): Array[String]
Returns an array of strings, split on the given regular expression delimiter. See the documentation on regular expression syntax delimiter. If you need to split on special characters, escape them with double backslash (\).
let s = "1kg-NA12878" in s.split("-")
result: ["1kg", "NA12878"]
Arguments

delim (String) – Regular expression delimiter.
toDouble(): Double – Convert value to a Double.

toFloat(): Float – Convert value to a Float.

toInt(): Int – Convert value to an Integer.

toLong(): Long – Convert value to a Long.

Struct¶

A Struct is like a Python tuple where the fields are named and the set of fields is fixed.

An example of constructing and accessing the fields in a Struct is

let s = {gene: "ACBD", function: "LOF", nHet: 12} in s.gene
result: "ACBD"

A field of the Struct can also be another Struct. For example, va.info.AC selects the struct info from the struct va, and then selects the array AC from the struct info.

Variant¶

A Variant is a Hail data type representing a variant in the Variant Dataset. It is referred to as v in the expression language.

The pseudoautosomal region (PAR) is currently defined with respect to reference GRCh37:

X: 60001 - 2699520, 154931044 - 155260560
Y: 10001 - 2649520, 59034050 - 59363566

Most callers assign variants in PAR to X.

alt(): String – Alternate allele sequence. Assumes biallelic.

altAllele(): AltAllele – The alternate allele. Assumes biallelic.

altAlleles: Array[AltAllele] – The alternate alleles.

contig: String – String representation of contig, exactly as imported. NB: Hail stores contigs as strings. Use double-quotes when checking contig equality.

inXNonPar(): Boolean – True if chromosome is X and start is not in pseudoautosomal region of X.

inXPar(): Boolean – True if chromosome is X and start is in pseudoautosomal region of X.

inYNonPar(): Boolean – True if chromosome is Y and start is not in pseudoautosomal region of Y.

inYPar(): Boolean – True if chromosome is Y and start is in pseudoautosomal region of Y. NB: most callers assign variants in PAR to X.

isAutosomal(): Boolean – True if chromosome is not X, not Y, and not MT.

isBiallelic(): Boolean – True if v has one alternate allele.

locus(): Locus – Chromosomal locus (chr, pos) of this variant

nAlleles(): Int – Number of alleles.

nAltAlleles(): Int – Number of alternate alleles, equal to nAlleles - 1.

nGenotypes(): Int – Number of genotypes.

ref: String – Reference allele sequence.

start: Int – SNP position or start of an indel.