.. sec-types: .. testsetup:: vds = hc.read("data/example.vds").annotate_variants_expr('va.genes = ["ACBD", "DCBA"]') ===== Types ===== .. _aggregable: ---------- Aggregable ---------- An ``Aggregable`` is a Hail data type representing a distributed row or column of a matrix. Hail exposes a number of methods to compute on aggregables depending on the data type. .. _aggregable[array[double]]: Aggregable[Array[Double]] ========================= - **sum()**: *Array[Double]* -- Compute the sum by index. All elements in the aggregable must have the same length. .. _aggregable[array[float]]: Aggregable[Array[Float]] ======================== - **sum()**: *Array[Float]* -- Compute the sum by index. All elements in the aggregable must have the same length. .. _aggregable[array[int]]: Aggregable[Array[Int]] ====================== - **sum()**: *Array[Int]* Compute the sum by index. All elements in the aggregable must have the same length. **Examples** Count the total number of occurrences of each allele across samples, per variant: >>> vds_result = vds.annotate_variants_expr('va.AC = gs.map(g => g.oneHotAlleles(v)).sum()') .. _aggregable[array[long]]: Aggregable[Array[Long]] ======================= - **sum()**: *Array[Long]* -- Compute the sum by index. All elements in the aggregable must have the same length. .. _aggregable[double]: Aggregable[Double] ================== - **hist(start: Double, end: Double, bins: Int)**: *Struct{binEdges:Array[Double],binFrequencies:Array[Long],nLess:Long,nGreater:Long}* .. container:: annotation - **binEdges** (*Array[Double]*) -- Array of bin cutoffs - **binFrequencies** (*Array[Long]*) -- Number of elements that fall in each bin. - **nLess** (*Long*) -- Number of elements less than the minimum bin - **nGreater** (*Long*) -- Number of elements greater than the maximum bin Compute frequency distributions of numeric parameters. **Examples** Compute GQ-distributions per variant: >>> vds_result = vds.annotate_variants_expr('va.gqHist = gs.map(g => g.gq).hist(0, 100, 20)') Compute global GQ-distribution: >>> gq_hist = vds.query_genotypes('gs.map(g => g.gq).hist(0, 100, 100)') **Notes** - The start, end, and bins params are no-scope parameters, which means that while computations like 100 / 4 are acceptable, variable references like ``global.nBins`` are not. - Bin size is calculated from (``end`` - ``start``) / ``bins`` - (``bins`` + 1) breakpoints are generated from the range (start to end by binsize) - Each bin is left-inclusive, right-exclusive except the last bin, which includes the maximum value. This means that if there are N total bins, there will be N + 1 elements in ``binEdges``. For the invocation ``hist(0, 3, 3)``, ``binEdges`` would be ``[0, 1, 2, 3]`` where the bins are ``[0, 1), [1, 2), [2, 3]``. **Arguments** - **start** (*Double*) -- Starting point of first bin - **end** (*Double*) -- End point of last bin - **bins** (*Int*) -- Number of bins to create. - **max()**: *Double* -- Compute the maximum of all non-missing elements. The empty max is missing. - **min()**: *Double* -- Compute the minimum of all non-missing elements. The empty min is missing. - **product()**: *Double* -- Compute the product of all non-missing elements. The empty product is one. - **stats()**: *Struct{mean:Double,stdev:Double,min:Double,max:Double,nNotMissing:Long,sum:Double}* .. container:: annotation - **mean** (*Double*) -- Mean value - **stdev** (*Double*) -- Standard deviation - **min** (*Double*) -- Minimum value - **max** (*Double*) -- Maximum value - **nNotMissing** (*Long*) -- Number of non-missing values - **sum** (*Double*) -- Sum of all elements Compute summary statistics about a numeric aggregable. **Examples** Compute the mean genotype quality score per variant: >>> vds_result = vds.annotate_variants_expr('va.gqMean = gs.map(g => g.gq).stats().mean') Compute summary statistics on the number of singleton calls per sample: >>> [singleton_stats] = (vds.sample_qc() ... .query_samples(['samples.map(s => sa.qc.nSingleton).stats()'])) Compute GQ and DP statistics stratified by genotype call: >>> gq_dp = [ ... 'va.homrefGQ = gs.filter(g => g.isHomRef()).map(g => g.gq).stats()', ... 'va.hetGQ = gs.filter(g => g.isHet()).map(g => g.gq).stats()', ... 'va.homvarGQ = gs.filter(g => g.isHomVar()).map(g => g.gq).stats()', ... 'va.homrefDP = gs.filter(g => g.isHomRef()).map(g => g.dp).stats()', ... 'va.hetDP = gs.filter(g => g.isHet()).map(g => g.dp).stats()', ... 'va.homvarDP = gs.filter(g => g.isHomVar()).map(g => g.dp).stats()'] >>> vds_result = vds.annotate_variants_expr(gq_dp) **Notes** The ``stats()`` aggregator can be used to replicate some of the values computed by :py:meth:`~hail.VariantDataset.variant_qc` and :py:meth:`~hail.VariantDataset.sample_qc` such as ``dpMean`` and ``dpStDev``. - **sum()**: *Double* -- Compute the sum of all non-missing elements. The empty sum is zero. .. _aggregable[float]: Aggregable[Float] ================= - **max()**: *Float* -- Compute the maximum of all non-missing elements. The empty max is missing. - **min()**: *Float* -- Compute the minimum of all non-missing elements. The empty min is missing. - **sum()**: *Float* -- Compute the sum of all non-missing elements. The empty sum is zero. .. _aggregable[genotype]: Aggregable[Genotype] ==================== - **callStats(f: Genotype => Variant)**: *Struct{AC:Array[Int],AF:Array[Double],AN:Int,GC:Array[Int]}* .. container:: annotation - **AC** (*Array[Int]*) -- Allele count. One element per allele **including reference**. There are two elements for a biallelic variant, or 4 for a variant with three alternate alleles. - **AF** (*Array[Double]*) -- Allele frequency. One element per allele including reference. Sums to 1. - **AN** (*Int*) -- Allele number. This is equal to the sum of AC, or 2 * the total number of called genotypes in the aggregable. - **GC** (*Array[Int]*) -- Genotype count. One element per possible genotype, including reference genotypes -- 3 for biallelic, 6 for triallelic, 10 for 3 alt alleles, and so on. The sum of this array is the number of called genotypes in the aggregable. Compute four commonly-used metrics over a set of genotypes in a variant. **Examples** Compute phenotype-specific call statistics: >>> pheno_stats = [ ... 'va.case_stats = gs.filter(g => sa.pheno.isCase).callStats(g => v)', ... 'va.control_stats = gs.filter(g => !sa.pheno.isCase).callStats(g => v)'] >>> vds_result = vds.annotate_variants_expr(pheno_stats) ``va.eur_stats.AC`` will be the allele count (AC) computed from individuals marked as "EUR". **Arguments** - **f** (*Genotype => Variant*) -- Variant lambda expression such as ``g => v``. - **hardyWeinberg()**: *Struct{rExpectedHetFrequency:Double,pHWE:Double}* .. container:: annotation - **rExpectedHetFrequency** (*Double*) -- Expected rHeterozygosity based on Hardy Weinberg Equilibrium - **pHWE** (*Double*) -- p-value Compute Hardy-Weinberg equilibrium p-value. **Examples** Add a new variant annotation that calculates HWE p-value by phenotype: >>> vds_result = vds.annotate_variants_expr([ ... 'va.hweCase = gs.filter(g => sa.pheno.isCase).hardyWeinberg()', ... 'va.hweControl = gs.filter(g => !sa.pheno.isCase).hardyWeinberg()']) **Notes** Hail computes the exact p-value with mid-p-value correction, i.e. the probability of a less-likely outcome plus one-half the probability of an equally-likely outcome. See this `document `__ for details on the Levene-Haldane distribution and references. - **inbreeding(af: Genotype => Double)**: *Struct{Fstat:Double,nTotal:Long,nCalled:Long,expectedHoms:Double,observedHoms:Long}* .. container:: annotation - **Fstat** (*Double*) -- Inbreeding coefficient - **nTotal** (*Long*) -- Number of genotypes analyzed - **nCalled** (*Long*) -- number of genotypes with non-missing calls - **expectedHoms** (*Double*) -- Expected number of homozygote calls - **observedHoms** (*Long*) -- Total number of homozygote calls observed Compute inbreeding metric. This aggregator is equivalent to the `\`--het\` method in PLINK `_. **Examples** Calculate the inbreeding metric per sample: >>> vds_result = (vds.variant_qc() ... .annotate_samples_expr('sa.inbreeding = gs.inbreeding(g => va.qc.AF)')) To obtain the same answer as `PLINK `_, use the following series of commands: >>> vds_result = (vds.variant_qc() ... .filter_variants_expr('va.qc.AC > 1 && va.qc.AF >= 1e-8 && va.qc.nCalled * 2 - va.qc.AC > 1 && va.qc.AF <= 1 - 1e-8 && v.isAutosomal()') ... .annotate_samples_expr('sa.inbreeding = gs.inbreeding(g => va.qc.AF)')) **Notes** The Inbreeding Coefficient (F) is computed as follows: #. For each variant and sample with a non-missing genotype call, ``E``, the expected number of homozygotes (computed from user-defined expression for minor allele frequency), is computed as ``1.0 - (2.0*maf*(1.0-maf))`` #. For each variant and sample with a non-missing genotype call, ``O``, the observed number of homozygotes, is computed as ``0 = heterozygote; 1 = homozygote`` #. For each variant and sample with a non-missing genotype call, ``N`` is incremented by 1 #. For each sample, ``E``, ``O``, and ``N`` are combined across variants #. ``F`` is calculated by ``(O - E) / (N - E)`` **Arguments** - **af** (*Genotype => Double*) -- Lambda expression for the alternate allele frequency. - **infoScore()**: *Struct{score:Double,nIncluded:Int}* .. container:: annotation - **score** (*Double*) -- IMPUTE info score - **nIncluded** (*Int*) -- Number of samples with non-missing genotype probability distribution Compute the IMPUTE information score. **Examples** Calculate the info score per variant: >>> (hc.import_gen("data/example.gen", "data/example.sample") ... .annotate_variants_expr('va.infoScore = gs.infoScore()')) Calculate group-specific info scores per variant: >>> vds_result = (hc.import_gen("data/example.gen", "data/example.sample") ... .annotate_samples_expr("sa.isCase = pcoin(0.5)") ... .annotate_variants_expr(["va.infoScore.case = gs.filter(g => sa.isCase).infoScore()", ... "va.infoScore.control = gs.filter(g => !sa.isCase).infoScore()"])) **Notes** We implemented the IMPUTE info measure as described in the `supplementary information from Marchini & Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics (2010) `_. To calculate the info score :math:`I_{A}` for one SNP: .. math:: I_{A} = \begin{cases} 1 - \frac{\sum_{i=1}^{N}(f_{i} - e_{i}^2)}{2N\hat{\theta}(1 - \hat{\theta})} & \text{when } \hat{\theta} \in (0, 1) \\ 1 & \text{when } \hat{\theta} = 0, \hat{\theta} = 1\\ \end{cases} - :math:`N` is the number of samples with imputed genotype probabilities [:math:`p_{ik} = P(G_{i} = k)` where :math:`k \in \{0, 1, 2\}`] - :math:`e_{i} = p_{i1} + 2p_{i2}` is the expected genotype per sample - :math:`f_{i} = p_{i1} + 4p_{i2}` - :math:`\hat{\theta} = \frac{\sum_{i=1}^{N}e_{i}}{2N}` is the MLE for the population minor allele frequency Hail will not generate identical results as `QCTOOL `_ for the following reasons: - The floating point number Hail stores for each genotype probability is slightly different than the original data due to rounding and normalization of probabilities. - Hail automatically removes genotype probability distributions that do not meet certain requirements on data import with :py:meth:`~hail.HailContext.import_gen` and :py:meth:`~hail.HailContext.import_bgen`. - Hail does not use the population frequency to impute genotype probabilities when a genotype probability distribution has been set to missing. - Hail calculates the same statistic for sex chromosomes as autosomes while QCTOOL incorporates sex information .. warning:: - The info score Hail reports will be extremely different from qctool when a SNP has a high missing rate. - If the genotype data was not imported using the :py:meth:`~hail.HailContext.import_gen` or :py:meth:`~hail.HailContext.import_bgen` commands, then the results for all variants will be ``score = NA`` and ``nIncluded = 0``. - It only makes sense to compute the info score for an Aggregable[Genotype] per variant. While a per-sample info score will run, the result is meaningless. .. _aggregable[int]: Aggregable[Int] =============== - **max()**: *Int* -- Compute the maximum of all non-missing elements. The empty max is missing. - **min()**: *Int* -- Compute the minimum of all non-missing elements. The empty min is missing. - **sum()**: *Int* -- Compute the sum of all non-missing elements. The empty sum is zero. .. _aggregable[long]: Aggregable[Long] ================ - **max()**: *Long* -- Compute the maximum of all non-missing elements. The empty max is missing. - **min()**: *Long* -- Compute the minimum of all non-missing elements. The empty min is missing. - **product()**: *Long* -- Compute the product of all non-missing elements. The empty product is one. - **sum()**: *Long* -- Compute the sum of all non-missing elements. The empty sum is zero. .. _aggregable[t]: Aggregable[T] ============= - **collect()**: *Array[T]* Returns an array with all of the elements in the aggregable. Order is not guaranteed. **Examples** Collect the list of sample IDs with heterozygote genotype calls per variant: >>> vds_result = vds.annotate_variants_expr('va.hetSamples = gs.filter(g => g.isHet()).map(g => s).collect()') ``va.hetSamples`` will have the type ``Array[String]``. - **collectAsSet()**: *Set[T]* -- Returns the set of all unique elements in the aggregable. - **count()**: *Long* Counts the number of elements in an aggregable. **Examples** Count the number of heterozygote genotype calls in an aggregable of genotypes (``gs``): >>> vds_result = vds.annotate_variants_expr('va.nHets = gs.filter(g => g.isHet()).count()') - **counter()**: *Dict[T, Long]* Counts the number of occurrences of each element in an aggregable. **Examples** Compute the number of indels in each chromosome: >>> [indels_per_chr] = vds.query_variants(['variants.filter(v => v.altAllele().isIndel()).map(v => v.contig).counter()']) **Notes** We recommend this function is used with the `Python counter object `_. >>> [counter] = vds.query_variants(['variants.flatMap(v => v.altAlleles).counter()']) >>> from collections import Counter >>> counter = Counter(counter) >>> print(counter.most_common(5)) [(AltAllele(C, T), 129L), (AltAllele(G, A), 112L), (AltAllele(C, A), 60L), (AltAllele(A, G), 46L), (AltAllele(T, C), 44L)] - **exists(expr: T => Boolean)**: *Boolean* Returns true if **any** element in the aggregator satisfies the condition given by ``expr`` and false otherwise. **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **filter(f: T => Boolean)**: *Aggregable[T]* Subsets an aggregable by evaluating ``f`` for each element and keeping those elements that evaluate to true. **Examples** Compute Hardy Weinberg Equilibrium for cases only: >>> vds_result = vds.annotate_variants_expr("va.hweCase = gs.filter(g => sa.isCase).hardyWeinberg()") **Arguments** - **f** (*T => Boolean*) -- Boolean lambda expression. - **flatMap(f: T => Set[U])**: *Aggregable[U]* Returns a new aggregable by applying a function ``f`` to each element and concatenating the resulting sets. Compute a list of genes per sample with loss of function variants (result does not have duplicate entries): >>> vds_result = vds.annotate_samples_expr('sa.lof_genes = gs.filter(g => va.consequence == "LOF" && g.nNonRefAlleles() > 0).flatMap(g => va.genes.toSet()).collect()') **Arguments** - **f** (*T => Set[U]*) -- Lambda expression. - **flatMap(a: T => Array[U])**: *Aggregable[U]* Returns a new aggregable by applying a function ``f`` to each element and concatenating the resulting arrays. **Examples** Compute a list of genes per sample with loss of function variants (result may have duplicate entries): >>> vds_result = vds.annotate_samples_expr('sa.lof_genes = gs.filter(g => va.consequence == "LOF" && g.nNonRefAlleles() > 0).flatMap(g => va.genes).collect()') - **forall(expr: T => Boolean)**: *Boolean* Returns a true if **all** elements in the array satisfies the condition given by ``expr`` and false otherwise. **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **fraction(a: T => Boolean)**: *Double* Computes the ratio of the number of occurrences for which a boolean condition evaluates to true, divided by the number of included elements in the aggregable. **Examples** Filter variants with a call rate less than 95%: >>> vds_result = vds.filter_variants_expr('gs.fraction(g => g.isCalled()) > 0.90') Compute the differential missingness at SNPs and indels: >>> exprs = ['sa.SNPmissingness = gs.filter(g => v.altAllele().isSNP()).fraction(g => g.isNotCalled())', ... 'sa.indelmissingness = gs.filter(g => v.altAllele().isIndel()).fraction(g => g.isNotCalled())'] >>> vds_result = vds.annotate_samples_expr(exprs) - **map(f: T => U)**: *Aggregable[U]* Change the type of an aggregable by evaluating ``f`` for each element. **Examples** Convert an aggregable of genotypes (``gs``) to an aggregable of genotype quality scores and then compute summary statistics: >>> vds_result = vds.annotate_variants_expr("va.gqStats = gs.map(g => g.gq).stats()") **Arguments** - **f** (*T => U*) -- Lambda expression. - **take(n: Int)**: *Array[T]* Take the first ``n`` items of an aggregable. **Examples** Collect the first 5 sample IDs with at least one alternate allele per variant: >>> vds_result = vds.annotate_variants_expr("va.nonRefSamples = gs.filter(g => g.nNonRefAlleles() > 0).map(g => s).take(5)") **Arguments** - **n** (*Int*) -- Number of items to take. - **takeBy(f: T => String, n: Int)**: *Array[T]* Returns the first ``n`` items of an aggregable in ascending order, ordered by the result of ``f``. ``NA`` always appears last. If the aggregable contains less than ``n`` items, then the result will contain as many elements as the aggregable contains. **Arguments** - **f** (*T => String*) -- Lambda expression for mapping an aggregable to an ordered value. - **n** (*Int*) -- Number of items to take. - **takeBy(f: T => Double, n: Int)**: *Array[T]* Returns the first ``n`` items of an aggregable in ascending order, ordered by the result of ``f``. ``NA`` always appears last. If the aggregable contains less than ``n`` items, then the result will contain as many elements as the aggregable contains. Note that ``NaN`` always appears after any finite or infinite floating-point numbers but before ``NA``. For example, consider an aggregable containing these elements:: Infinity, -1, 1, 0, -Infinity, NA, NaN The expression ``gs.takeBy(x => x, 7)`` would return the array:: [-Infinity, -1, 0, 1, Infinity, NaN, NA] The expression ``gs.takeBy(x => -x, 7)`` would return the array:: [Infinity, 1, 0, -1, -Infinity, NaN, NA] **Arguments** - **f** (*T => Double*) -- Lambda expression for mapping an aggregable to an ordered value. - **n** (*Int*) -- Number of items to take. - **takeBy(f: T => Float, n: Int)**: *Array[T]* Returns the first ``n`` items of an aggregable in ascending order, ordered by the result of ``f``. ``NA`` always appears last. If the aggregable contains less than ``n`` items, then the result will contain as many elements as the aggregable contains. Note that ``NaN`` always appears after any finite or infinite floating-point numbers but before ``NA``. For example, consider an aggregable containing these elements:: Infinity, -1, 1, 0, -Infinity, NA, NaN The expression ``gs.takeBy(x => x, 7)`` would return the array:: [-Infinity, -1, 0, 1, Infinity, NaN, NA] The expression ``gs.takeBy(x => -x, 7)`` would return the array:: [Infinity, 1, 0, -1, -Infinity, NaN, NA] **Arguments** - **f** (*T => Float*) -- Lambda expression for mapping an aggregable to an ordered value. - **n** (*Int*) -- Number of items to take. - **takeBy(f: T => Long, n: Int)**: *Array[T]* Returns the first ``n`` items of an aggregable in ascending order, ordered by the result of ``f``. ``NA`` always appears last. If the aggregable contains less than ``n`` items, then the result will contain as many elements as the aggregable contains. **Examples** Consider an aggregable ``gs`` containing these elements:: 7, 6, 3, NA, 1, 2, NA, 4, 5, -1 The expression ``gs.takeBy(x => x, 5)`` would return the array:: [-1, 1, 2, 3, 4] The expression ``gs.takeBy(x => -x, 5)`` would return the array:: [7, 6, 5, 4, 3] The expression ``gs.takeBy(x => x, 10)`` would return the array:: [-1, 1, 2, 3, 4, 5, 6, 7, NA, NA] Returns the 10 samples with the least number of singletons: >>> samplesMostSingletons = (vds ... .sample_qc() ... .query_samples('samples.takeBy(s => sa.qc.nSingleton, 10)')) **Arguments** - **f** (*T => Long*) -- Lambda expression for mapping an aggregable to an ordered value. - **n** (*Int*) -- Number of items to take. - **takeBy(f: T => Int, n: Int)**: *Array[T]* Returns the first ``n`` items of an aggregable in ascending order, ordered by the result of ``f``. ``NA`` always appears last. If the aggregable contains less than ``n`` items, then the result will contain as many elements as the aggregable contains. **Examples** Consider an aggregable ``gs`` containing these elements:: 7, 6, 3, NA, 1, 2, NA, 4, 5, -1 The expression ``gs.takeBy(x => x, 5)`` would return the array:: [-1, 1, 2, 3, 4] The expression ``gs.takeBy(x => -x, 5)`` would return the array:: [7, 6, 5, 4, 3] The expression ``gs.takeBy(x => x, 10)`` would return the array:: [-1, 1, 2, 3, 4, 5, 6, 7, NA, NA] Returns the 10 samples with the least number of singletons: >>> samplesMostSingletons = (vds ... .sample_qc() ... .query_samples('samples.takeBy(s => sa.qc.nSingleton, 10)')) **Arguments** - **f** (*T => Int*) -- Lambda expression for mapping an aggregable to an ordered value. - **n** (*Int*) -- Number of items to take. .. _altallele: --------- AltAllele --------- An ``AltAllele`` is a Hail data type representing an alternate allele in the Variant Dataset. - **alt**: *String* -- Alternate allele base sequence. - **category()**: *String* -- the alt allele type, i.e one of SNP, Insertion, Deletion, Star, MNP, Complex - **isComplex()**: *Boolean* -- True if not a SNP, MNP, star, insertion, or deletion. - **isDeletion()**: *Boolean* -- True if ``v.ref`` begins with and is longer than ``v.alt``. - **isIndel()**: *Boolean* -- True if an insertion or a deletion. - **isInsertion()**: *Boolean* -- True if ``v.alt`` begins with and is longer than ``v.ref``. - **isMNP()**: *Boolean* -- True if ``v.ref`` and ``v.alt`` are the same length and differ in more than one position. - **isSNP()**: *Boolean* -- True if ``v.ref`` and ``v.alt`` are the same length and differ in one position. - **isStar()**: *Boolean* -- True if ``v.alt`` is ``*``. - **isTransition()**: *Boolean* -- True if a purine-purine or pyrimidine-pyrimidine SNP. - **isTransversion()**: *Boolean* -- True if a purine-pyrimidine SNP. - **ref**: *String* -- Reference allele base sequence. .. _array: ----- Array ----- An ``Array`` is a collection of items that all have the same data type (ex: Int, String) and are indexed. Arrays can be constructed by specifying ``[item1, item2, ...]`` and they are 0-indexed. An example of constructing an array and accessing an element is: .. code-block:: text :emphasize-lines: 2 let a = [1, 10, 3, 7] in a[1] result: 10 They can also be nested such as Array[Array[Int]]: .. code-block:: text :emphasize-lines: 2 let a = [[1, 2, 3], [4, 5], [], [6, 7]] in a[1] result: [4, 5] .. _array[array[t]]: Array[Array[T]] =============== - **flatten()**: *Array[T]* Flattens a nested array by concatenating all its rows into a single array. .. code-block:: text :emphasize-lines: 2 let a = [[1, 3], [2, 4]] in a.flatten() result: [1, 3, 2, 4] .. _array[boolean]: Array[Boolean] ============== - **sort(ascending: Boolean)**: *Array[Boolean]* Sort the collection with the ordering specified by ``ascending``. **Arguments** - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sort()**: *Array[Boolean]* -- Sort the collection in ascending order. .. _array[double]: Array[Double] ============= - **max()**: *Double* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Double* -- Median value of the collection. - **min()**: *Double* -- Smallest element in the collection. - **product()**: *Double* -- Product of all elements in the collection (returns 1 if empty). - **sort(ascending: Boolean)**: *Array[Double]* Sort the collection with the ordering specified by ``ascending``. **Arguments** - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sort()**: *Array[Double]* -- Sort the collection in ascending order. - **sum()**: *Double* -- Sum of all elements in the collection. .. _array[float]: Array[Float] ============ - **max()**: *Float* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Float* -- Median value of the collection. - **min()**: *Float* -- Smallest element in the collection. - **product()**: *Float* -- Product of all elements in the collection (returns 1 if empty). - **sort(ascending: Boolean)**: *Array[Float]* Sort the collection with the ordering specified by ``ascending``. **Arguments** - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sort()**: *Array[Float]* -- Sort the collection in ascending order. - **sum()**: *Float* -- Sum of all elements in the collection. .. _array[int]: Array[Int] ========== - **max()**: *Int* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Int* -- Median value of the collection. - **min()**: *Int* -- Smallest element in the collection. - **product()**: *Int* -- Product of all elements in the collection (returns 1 if empty). - **sort(ascending: Boolean)**: *Array[Int]* Sort the collection with the ordering specified by ``ascending``. **Arguments** - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sort()**: *Array[Int]* -- Sort the collection in ascending order. - **sum()**: *Int* -- Sum of all elements in the collection. .. _array[long]: Array[Long] =========== - **max()**: *Long* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Long* -- Median value of the collection. - **min()**: *Long* -- Smallest element in the collection. - **product()**: *Long* -- Product of all elements in the collection (returns 1 if empty). - **sort(ascending: Boolean)**: *Array[Long]* Sort the collection with the ordering specified by ``ascending``. **Arguments** - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sort()**: *Array[Long]* -- Sort the collection in ascending order. - **sum()**: *Long* -- Sum of all elements in the collection. .. _array[string]: Array[String] ============= - **mkString(delimiter: String)**: *String* Concatenates all elements of this array into a single string where each element is separated by the ``delimiter``. .. code-block:: text :emphasize-lines: 2 let a = ["a", "b", "c"] in a.mkString("::") result: "a::b::c" **Arguments** - **delimiter** (*String*) -- String that separates each element. - **sort(ascending: Boolean)**: *Array[String]* Sort the collection with the ordering specified by ``ascending``. **Arguments** - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sort()**: *Array[String]* -- Sort the collection in ascending order. .. _array[t]: Array[T] ======== - **[:\j]**: *Array[T]* Returns a slice of the array from the first element until the j*th* element (0-indexed). Negative indices are interpreted as offsets from the end of the array. .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[:4] result: [0, 2, 4, 6] .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[:-4] result: [0, 2] **Arguments** - **j** (*Int*) -- End index of the slice (not included in result). - **[:]**: *Array[T]* Returns a copy of the array. .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6] in a[:] result: [0, 2, 4, 6] - **[\i:\j]**: *Array[T]* Returns a slice of the array from the i*th* element until the j*th* element (both 0-indexed). Negative indices are interpreted as offsets from the end of the array. .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[2:4] result: [4, 6] .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[-3:-1] result: [6, 8] A handy way to understand the behavior of negative indicies is to recall this rule: .. code-block:: text s[-i:-j] == s[s.length - i, s.length - j] **Arguments** - **i** (*Int*) -- Starting index of the slice. - **j** (*Int*) -- End index of the slice (not included in result). - **[\i:]**: *Array[T]* Returns a slice of the array from the i*th* element (0-indexed) to the end. Negative indices are interpreted as offsets from the end of the array. .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[3:] result: [6, 8, 10] .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[-5:] result: [2, 4, 6, 8, 10] **Arguments** - **i** (*Int*) -- Starting index of the slice. - **[i]**: *T* Returns the i*th* element (0-indexed) of the array, or throws an exception if ``i`` is an invalid index. .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a[2] result: 4 **Arguments** - **i** (*Int*) -- Index of the element to return. - **append(a: T)**: *Array[T]* -- Returns the result of adding the element `a` to the end of this Array. - **exists(expr: T => Boolean)**: *Boolean* Returns a boolean which is true if **any** element in the array satisfies the condition given by ``expr``. false otherwise. .. code-block:: text :emphasize-lines: 2 let a = [1, 2, 3, 4, 5, 6] in a.exists(e => e > 4) result: true **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **extend(a: Array[T])**: *Array[T]* -- Returns the concatenation of this Array followed by Array `a`. - **filter(expr: T => Boolean)**: *Array[T]* Returns a new array subsetted to the elements where ``expr`` evaluates to true. .. code-block:: text :emphasize-lines: 2 let a = [1, 4, 5, 6, 10] in a.filter(e => e % 2 == 0) result: [4, 6, 10] **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **find(expr: T => Boolean)**: *T* Returns the first non-missing element of the array for which expr is true. If no element satisfies the predicate, find returns NA. .. code-block:: text :emphasize-lines: 2 let a = ["cat", "dog", "rabbit"] in a.find(e => 'bb' ~ e) result: "rabbit" **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **flatMap(expr: T => Array[U])**: *Array[U]* Returns a new array by applying a function to each subarray and concatenating the resulting arrays. .. code-block:: text :emphasize-lines: 2 let a = [[1, 2, 3], [4, 5], [6]] in a.flatMap(e => e + 1) result: [2, 3, 4, 5, 6, 7] **Arguments** - **expr** (*T => Array[U]*) -- Lambda expression. - **forall(expr: T => Boolean)**: *Boolean* Returns a boolean which is true if **all** elements in the array satisfies the condition given by ``expr`` and false otherwise. .. code-block:: text :emphasize-lines: 2 let a = [0, 2, 4, 6, 8, 10] in a.forall(e => e % 2 == 0) result: true **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **groupBy(a: T => U)**: *Dict[U, Array[T]]* - **head()**: *T* -- Selects the first element. - **isEmpty()**: *Boolean* -- Returns true if the number of elements is equal to 0. false otherwise. - **length()**: *Int* -- Number of elements in the collection. - **map(expr: T => U)**: *Array[U]* Returns a new array produced by applying ``expr`` to each element. .. code-block:: text :emphasize-lines: 2 let a = [0, 1, 2, 3] in a.map(e => pow(2, e)) result: [1, 2, 4, 8] **Arguments** - **expr** (*T => U*) -- Lambda expression. - **size()**: *Int* -- Number of elements in the collection. - **sortBy(f: T => String, ascending: Boolean)**: *Array[T]* Sort the collection with the ordering specified by ``ascending`` after evaluating ``f`` for each element. **Arguments** - **f** (*T => String*) -- Lambda expression. - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sortBy(f: T => String)**: *Array[T]* Sort the collection in ascending order after evaluating ``f`` for each element. **Arguments** - **f** (*T => String*) -- Lambda expression. - **sortBy(f: T => Double, ascending: Boolean)**: *Array[T]* Sort the collection with the ordering specified by ``ascending`` after evaluating ``f`` for each element. **Arguments** - **f** (*T => Double*) -- Lambda expression. - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sortBy(f: T => Double)**: *Array[T]* Sort the collection in ascending order after evaluating ``f`` for each element. **Arguments** - **f** (*T => Double*) -- Lambda expression. - **sortBy(f: T => Float, ascending: Boolean)**: *Array[T]* Sort the collection with the ordering specified by ``ascending`` after evaluating ``f`` for each element. **Arguments** - **f** (*T => Float*) -- Lambda expression. - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sortBy(f: T => Float)**: *Array[T]* Sort the collection in ascending order after evaluating ``f`` for each element. **Arguments** - **f** (*T => Float*) -- Lambda expression. - **sortBy(f: T => Long, ascending: Boolean)**: *Array[T]* Sort the collection with the ordering specified by ``ascending`` after evaluating ``f`` for each element. **Arguments** - **f** (*T => Long*) -- Lambda expression. - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sortBy(f: T => Long)**: *Array[T]* Sort the collection in ascending order after evaluating ``f`` for each element. **Arguments** - **f** (*T => Long*) -- Lambda expression. - **sortBy(f: T => Int, ascending: Boolean)**: *Array[T]* Sort the collection with the ordering specified by ``ascending`` after evaluating ``f`` for each element. **Arguments** - **f** (*T => Int*) -- Lambda expression. - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sortBy(f: T => Int)**: *Array[T]* Sort the collection in ascending order after evaluating ``f`` for each element. **Arguments** - **f** (*T => Int*) -- Lambda expression. - **sortBy(f: T => Boolean, ascending: Boolean)**: *Array[T]* Sort the collection with the ordering specified by ``ascending`` after evaluating ``f`` for each element. **Arguments** - **f** (*T => Boolean*) -- Lambda expression. - **ascending** (*Boolean*) -- If true, sort the collection in ascending order. Otherwise, sort in descending order. - **sortBy(f: T => Boolean)**: *Array[T]* Sort the collection in ascending order after evaluating ``f`` for each element. **Arguments** - **f** (*T => Boolean*) -- Lambda expression. - **tail()**: *Array[T]* -- Selects all elements except the first. - **toArray()**: *Array[T]* -- Convert collection to an Array. - **toSet()**: *Set[T]* -- Convert collection to a Set. .. _boolean: ------- Boolean ------- - **max(a: Boolean)**: *Boolean* -- Returns the maximum value. - **min(a: Boolean)**: *Boolean* -- Returns the minimum value. - **toDouble()**: *Double* -- Convert value to a Double. Returns 1.0 if true, else 0.0. - **toFloat()**: *Float* -- Convert value to a Float. Returns 1.0 if true, else 0.0. - **toInt()**: *Int* -- Convert value to an Integer. Returns 1 if true, else 0. - **toLong()**: *Long* -- Convert value to a Long. Returns 1L if true, else 0L. .. _call: ---- Call ---- A ``Call`` is a Hail data type representing a genotype call (ex: 0/0) in the Variant Dataset. - **gt**: *Int* -- the integer ``gt = k*(k+1)/2 + j`` for call ``j/k`` (0 = 0/0, 1 = 0/1, 2 = 1/1, 3 = 0/2, etc.). - **gtj()**: *Int* -- the index of allele ``j`` for call ``j/k`` (0 = ref, 1 = first alt allele, etc.). - **gtk()**: *Int* -- the index of allele ``k`` for call ``j/k`` (0 = ref, 1 = first alt allele, etc.). - **isCalled()**: *Boolean* -- True if the call is not ``./.``. - **isCalledNonRef()**: *Boolean* -- True if either ``isHet`` or ``isHomVar`` is true. - **isHet()**: *Boolean* -- True if this call is heterozygous. - **isHetNonRef()**: *Boolean* -- True if this call is ``j/k`` with ``j>0``. - **isHetRef()**: *Boolean* -- True if this call is ``0/k`` with ``k>0``. - **isHomRef()**: *Boolean* -- True if this call is ``0/0``. - **isHomVar()**: *Boolean* -- True if this call is ``j/j`` with ``j>0``. - **isNotCalled()**: *Boolean* -- True if the call is ``./.``. - **nNonRefAlleles()**: *Int* -- the number of called alternate alleles. - **oneHotAlleles(v: Variant)**: *Array[Int]* Produce an array of called counts for each allele in the variant (including reference). For example, calling this function with a biallelic variant on hom-ref, het, and hom-var calls will produce ``[2, 0]``, ``[1, 1]``, and ``[0, 2]`` respectively. **Arguments** - **v** (*Variant*) -- :ref:`variant` - **oneHotGenotype(v: Variant)**: *Array[Int]* Produces an array with one element for each possible genotype in the variant, where the called genotype is 1 and all else 0. For example, calling this function with a biallelic variant on hom-ref, het, and hom-var calls will produce ``[1, 0, 0]``, ``[0, 1, 0]``, and ``[0, 0, 1]`` respectively. **Arguments** - **v** (*Variant*) -- :ref:`variant` - **toGenotype()**: *Genotype* -- Convert this call to a Genotype. .. _dict: ---- Dict ---- A ``Dict`` is an unordered collection of key-value pairs. Each key can only appear once in the collection. - **[k]**: *U* Returns the value for ``k``, or throws an exception if the key is not found. **Arguments** - **k** (*T*) -- Key in the Dict to query. - **contains(k: T)**: *Boolean* Returns true if the Dict has a key equal to ``k``, otherwise false. **Arguments** - **k** (*T*) -- Key name to query. - **get(a: T)**: *U* -- Returns the value of the Dict for key ``k``, or returns ``NA`` if the key is not found. - **isEmpty()**: *Boolean* -- Returns true if the number of elements is equal to 0. false otherwise. - **keySet()**: *Set[T]* -- Returns a Set containing the keys of the Dict. - **keys()**: *Array[T]* -- Returns an Array containing the keys of the Dict. - **mapValues(expr: U => V)**: *Dict[T, V]* Returns a new Dict produced by applying ``expr`` to each value. The keys are unmodified. **Arguments** - **expr** (*U => V*) -- Lambda expression. - **size()**: *Int* -- Number of elements in the collection. - **values()**: *Array[U]* -- Returns an Array containing the values of the Dict. .. _double: ------ Double ------ - **abs()**: *Double* -- Returns the absolute value of a number. - **max(a: Double)**: *Double* -- Returns the maximum value. - **min(a: Double)**: *Double* -- Returns the minimum value. - **signum()**: *Int* -- Returns the sign of a number (1, 0, or -1). - **toDouble()**: *Double* -- Convert value to a Double. - **toFloat()**: *Float* -- Convert value to a Float. - **toInt()**: *Int* -- Convert value to an Integer. - **toLong()**: *Long* -- Convert value to a Long. .. _float: ----- Float ----- - **abs()**: *Float* -- Returns the absolute value of a number. - **max(a: Float)**: *Float* -- Returns the maximum value. - **min(a: Float)**: *Float* -- Returns the minimum value. - **signum()**: *Int* -- Returns the sign of a number (1, 0, or -1). - **toDouble()**: *Double* -- Convert value to a Double. - **toFloat()**: *Float* -- Convert value to a Float. - **toInt()**: *Int* -- Convert value to an Integer. - **toLong()**: *Long* -- Convert value to a Long. .. _genotype: -------- Genotype -------- A ``Genotype`` is a Hail data type representing a genotype in the Variant Dataset. It is referred to as ``g`` in the expression language. - **ad**: *Array[Int]* -- allelic depth for each allele. - **call()**: *Call* -- the integer ``gt = k*(k+1)/2 + j`` for call ``j/k`` (0 = 0/0, 1 = 0/1, 2 = 1/1, 3 = 0/2, etc.). - **dosage**: *Double* -- the expected number of non-reference alleles based on genotype probabilities. - **dp**: *Int* -- the total number of informative reads. - **fakeRef**: *Boolean* -- True if this genotype was downcoded in :py:meth:`~hail.VariantDataset.split_multi`. This can happen if a ``1/2`` call is split to ``0/1``, ``0/1``. - **fractionReadsRef()**: *Double* -- the ratio of ref reads to the sum of all *informative* reads. - **gp**: *Array[Double]* -- the linear-scaled probabilities. - **gq**: *Int* -- the difference between the two smallest PL entries. - **gt**: *Int* -- the integer ``gt = k*(k+1)/2 + j`` for call ``j/k`` (0 = 0/0, 1 = 0/1, 2 = 1/1, 3 = 0/2, etc.). - **gtj()**: *Int* -- the index of allele ``j`` for call ``j/k`` (0 = ref, 1 = first alt allele, etc.). - **gtk()**: *Int* -- the index of allele ``k`` for call ``j/k`` (0 = ref, 1 = first alt allele, etc.). - **isCalled()**: *Boolean* -- True if the genotype is not ``./.``. - **isCalledNonRef()**: *Boolean* -- True if either ``g.isHet`` or ``g.isHomVar`` is true. - **isHet()**: *Boolean* -- True if this call is heterozygous. - **isHetNonRef()**: *Boolean* -- True if this call is ``j/k`` with ``j>0``. - **isHetRef()**: *Boolean* -- True if this call is ``0/k`` with ``k>0``. - **isHomRef()**: *Boolean* -- True if this call is ``0/0``. - **isHomVar()**: *Boolean* -- True if this call is ``j/j`` with ``j>0``. - **isLinearScale**: *Boolean* -- True if the data was imported from :py:meth:`~hail.HailContext.import_gen` or :py:meth:`~hail.HailContext.import_bgen`. - **isNotCalled()**: *Boolean* -- True if the genotype is ``./.``. - **nNonRefAlleles()**: *Int* -- the number of called alternate alleles. - **od()**: *Int* -- ``od = dp - ad.sum``. - **oneHotAlleles(v: Variant)**: *Array[Int]* Produce an array of called counts for each allele in the variant (including reference). For example, calling this function with a biallelic variant on hom-ref, het, and hom-var genotypes will produce ``[2, 0]``, ``[1, 1]``, and ``[0, 2]`` respectively. **Arguments** - **v** (*Variant*) -- :ref:`variant` - **oneHotGenotype(v: Variant)**: *Array[Int]* Produces an array with one element for each possible genotype in the variant, where the called genotype is 1 and all else 0. For example, calling this function with a biallelic variant on hom-ref, het, and hom-var genotypes will produce ``[1, 0, 0]``, ``[0, 1, 0]``, and ``[0, 0, 1]`` respectively. **Arguments** - **v** (*Variant*) -- :ref:`variant` - **pAB()**: *Double* -- p-value for pulling the given allelic depth from a binomial distribution with mean 0.5. Missing if the call is not heterozygous. - **pl**: *Array[Int]* phred-scaled normalized genotype likelihood values. The conversion between ``g.pl`` (Phred-scaled likelihoods) and ``g.gp`` (linear-scaled probabilities) assumes a uniform prior. .. _int: --- Int --- - **abs()**: *Int* -- Returns the absolute value of a number. - **max(a: Int)**: *Int* -- Returns the maximum value. - **min(a: Int)**: *Int* -- Returns the minimum value. - **signum()**: *Int* -- Returns the sign of a number (1, 0, or -1). - **toDouble()**: *Double* -- Convert value to a Double. - **toFloat()**: *Float* -- Convert value to a Float. - **toInt()**: *Int* -- Convert value to an Integer. - **toLong()**: *Long* -- Convert value to a Long. .. _interval: -------- Interval -------- An ``Interval`` is a Hail data type representing a range of genomic locations in the Variant Dataset. - **contains(locus: Locus)**: *Boolean* Returns true if the ``locus`` is in the interval. .. code-block:: text :emphasize-lines: 2 let i = Interval(Locus("1", 1000), Locus("1", 2000)) in i.contains(Locus("1", 1500)) result: true **Arguments** - **locus** (*Locus*) -- :ref:`locus` - **end**: *Locus* -- :ref:`locus` at the end of the interval (exclusive). - **start**: *Locus* -- :ref:`locus` at the start of the interval (inclusive). .. _locus: ----- Locus ----- A ``Locus`` is a Hail data type representing a specific genomic location in the Variant Dataset. - **contig**: *String* -- String representation of contig. - **position**: *Int* -- Chromosomal position. .. _long: ---- Long ---- - **abs()**: *Long* -- Returns the absolute value of a number. - **max(a: Long)**: *Long* -- Returns the maximum value. - **min(a: Long)**: *Long* -- Returns the minimum value. - **signum()**: *Int* -- Returns the sign of a number (1, 0, or -1). - **toDouble()**: *Double* -- Convert value to a Double. - **toFloat()**: *Float* -- Convert value to a Float. - **toInt()**: *Int* -- Convert value to an Integer. - **toLong()**: *Long* -- Convert value to a Long. .. _set: --- Set --- A ``Set`` is an unordered collection with no repeated values of a given data type (ex: Int, String). Sets can be constructed by specifying ``[item1, item2, ...].toSet()``. .. code-block:: text :emphasize-lines: 2 let s = ["rabbit", "cat", "dog", "dog"].toSet() result: Set("cat", "dog", "rabbit") They can also be nested such as Set[Set[Int]]: .. code-block:: text :emphasize-lines: 2 let s = [[1, 2, 3].toSet(), [4, 5, 5].toSet()].toSet() result: Set(Set(1, 2, 3), Set(4, 5)) .. _set[double]: Set[Double] =========== - **max()**: *Double* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Double* -- Median value of the collection. - **min()**: *Double* -- Smallest element in the collection. - **product()**: *Double* -- Product of all elements in the collection (returns 1 if empty). - **sum()**: *Double* -- Sum of all elements in the collection. .. _set[float]: Set[Float] ========== - **max()**: *Float* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Float* -- Median value of the collection. - **min()**: *Float* -- Smallest element in the collection. - **product()**: *Float* -- Product of all elements in the collection (returns 1 if empty). - **sum()**: *Float* -- Sum of all elements in the collection. .. _set[int]: Set[Int] ======== - **max()**: *Int* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Int* -- Median value of the collection. - **min()**: *Int* -- Smallest element in the collection. - **product()**: *Int* -- Product of all elements in the collection (returns 1 if empty). - **sum()**: *Int* -- Sum of all elements in the collection. .. _set[long]: Set[Long] ========= - **max()**: *Long* -- Largest element in the collection. - **mean()**: *Double* -- Mean value of the collection. - **median()**: *Long* -- Median value of the collection. - **min()**: *Long* -- Smallest element in the collection. - **product()**: *Long* -- Product of all elements in the collection (returns 1 if empty). - **sum()**: *Long* -- Sum of all elements in the collection. .. _set[set[t]]: Set[Set[T]] =========== - **flatten()**: *Set[T]* Flattens a nested set by concatenating all its elements into a single set. .. code-block:: text :emphasize-lines: 2 let s = [[1, 2].toSet(), [3, 4].toSet()].toSet() in s.flatten() result: Set(1, 2, 3, 4) .. _set[string]: Set[String] =========== - **mkString(delimiter: String)**: *String* Concatenates all elements of this set into a single string where each element is separated by the ``delimiter``. .. code-block:: text :emphasize-lines: 2 let s = [1, 2, 3].toSet() in s.mkString(",") result: "1,2,3" **Arguments** - **delimiter** (*String*) -- String that separates each element. .. _set[t]: Set[T] ====== - **add(a: T)**: *Set[T]* -- Returns the result of adding the element `a` to this Set. - **contains(x: T)**: *Boolean* Returns true if the element ``x`` is contained in the set, otherwise false. .. code-block:: text :emphasize-lines: 2 let s = [1, 2, 3].toSet() in s.contains(5) result: false **Arguments** - **x** (*T*) -- Value to test. - **difference(a: Set[T])**: *Set[T]* -- Returns the elements of this Set that are not in Set `a`. - **exists(expr: T => Boolean)**: *Boolean* Returns a boolean which is true if **any** element in the set satisfies the condition given by ``expr`` and false otherwise. .. code-block:: text :emphasize-lines: 2 let s = [0, 2, 4, 6, 8, 10].toSet() in s.exists(e => e % 2 == 1) result: false **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **filter(expr: T => Boolean)**: *Set[T]* Returns a new set subsetted to the elements where ``expr`` evaluates to true. .. code-block:: text :emphasize-lines: 2 let s = [1, 4, 5, 6, 10].toSet() in s.filter(e => e >= 5) result: Set(5, 6, 10) **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **find(expr: T => Boolean)**: *T* Returns the first non-missing element of the array for which expr is true. If no element satisfies the predicate, find returns NA. .. code-block:: text :emphasize-lines: 2 let s = [1, 2, 3].toSet() in s.find(e => e % 3 == 0) result: 3 **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **flatMap(expr: T => Set[U])**: *Set[U]* Returns a new set by applying a function to each subset and concatenating the resulting sets. .. code-block:: text :emphasize-lines: 2 let s = [["a", "b", "c"].toSet(), ["d", "e"].toSet(), ["f"].toSet()].toSet() in s.flatMap(e => e + "1") result: Set("a1", "b1", "c1", "d1", "e1", "f1") **Arguments** - **expr** (*T => Set[U]*) -- Lambda expression. - **forall(expr: T => Boolean)**: *Boolean* Returns a boolean which is true if **all** elements in the set satisfies the condition given by ``expr`` and false otherwise. .. code-block:: text :emphasize-lines: 2 let s = [0.1, 0.5, 0.3, 1.0, 2.5, 3.0].toSet() in s.forall(e => e > 1.0 == 0) result: false **Arguments** - **expr** (*T => Boolean*) -- Lambda expression. - **groupBy(a: T => U)**: *Dict[U, Set[T]]* - **head()**: *T* -- Select one element. - **intersection(a: Set[T])**: *Set[T]* -- Returns the intersection of this Set and Set `a`. - **isEmpty()**: *Boolean* -- Returns true if the number of elements is equal to 0. false otherwise. - **issubset(a: Set[T])**: *Boolean* -- Returns true if this Set is a subset of Set `a`. - **map(expr: T => U)**: *Set[U]* Returns a new set produced by applying ``expr`` to each element. .. code-block:: text :emphasize-lines: 2 let s = [1, 2, 3].toSet() in s.map(e => e * 3) result: Set(3, 6, 9) **Arguments** - **expr** (*T => U*) -- Lambda expression. - **remove(a: T)**: *Set[T]* -- Returns the result of removing the element `a` from this Set. - **size()**: *Int* -- Number of elements in the collection. - **tail()**: *Set[T]* -- Select all elements except the element returned by ``head``. - **toArray()**: *Array[T]* -- Convert collection to an Array. - **toSet()**: *Set[T]* -- Convert collection to a Set. - **union(a: Set[T])**: *Set[T]* -- Returns the union of this Set and Set `a`. .. _string: ------ String ------ - **[:\j]**: *String* Returns a slice of the string from the first unicode-codepoint until the j*th* unicode-codepoint (0-indexed). Negative indices are interpreted as offsets from the end of the string. .. code-block:: text :emphasize-lines: 2 let a = "abcdef" in a[:4] result: "abcd" .. code-block:: text :emphasize-lines: 2 let a = "abcdef" in a[:-3] result: "abc" **Arguments** - **j** (*Int*) -- End index of the slice (not included in result). - **[:]**: *String* Returns a copy of the string. .. code-block:: text :emphasize-lines: 2 let a = "abcd" in a[:] result: "abcd" - **[\i:\j]**: *String* Returns a slice of the string from the i*th* unicode-codepoint until the j*th* unicode-codepoint (both 0-indexed). Negative indices are interpreted as offsets from the end of the string instead of the beginning. .. code-block:: text :emphasize-lines: 2 let a = "abcdef" in a[2:4] result: "cd" .. code-block:: text :emphasize-lines: 2 let a = "abcdef" in a[-3:-1] result: "de" A handy way to understand the behavior of negative indicies is to recall this rule, which holds for any positive integers i and j: .. code-block:: text s[-i:-j] == s[s.length - i, s.length - j] **Arguments** - **i** (*Int*) -- Starting index of the slice. - **j** (*Int*) -- End index of the slice (not included in result). - **[\i:]**: *String* Returns a slice of the string from the i*th* unicode-codepoint (0-indexed) to the end. Negative indices are interpreted as offsets from the end of the string. .. code-block:: text :emphasize-lines: 2 let a = "abcdef" in a[3:] result: "def" .. code-block:: text :emphasize-lines: 2 let a = "abcdef" in a[-5:] result: "bcdef" **Arguments** - **i** (*Int*) -- Starting index of the slice. - **[i]**: *String* Returns the i*th* element (0-indexed) of the string, or throws an exception if ``i`` is an invalid index. .. code-block:: text :emphasize-lines: 2 let s = "genetics" in s[6] result: "c" **Arguments** - **i** (*Int*) -- Index of the character to return. - **entropy()**: *Double* Computes the `Shannon entropy `__ in bits of the character frequency distribution. - **length()**: *Int* -- Length of the string. - **max(a: String)**: *String* -- Returns the maximum value. - **min(a: String)**: *String* -- Returns the minimum value. - **replace(pattern1: String, pattern2: String)**: *String* Replaces each substring of this string that matches the given `regular expression `_ (``pattern1``) with the given replacement (``pattern2``). .. code-block:: text :emphasize-lines: 2 let s = "1kg-NA12878" in a.replace("1kg-", "") result: "NA12878" **Arguments** - **pattern1** (*String*) -- Substring to replace. - **pattern2** (*String*) -- Replacement string. - **split(delim: String, n: Int)**: *Array[String]* Returns an array of strings, split on the given regular expression delimiter with the pattern applied ``n`` times. See the documentation on `regular expression syntax `_ delimiter. If you need to split on special characters, escape them with double backslash (\\\\). .. code-block:: text :emphasize-lines: 2 let s = "1kg-NA12878" in s.split("-") result: ["1kg", "NA12878"] **Arguments** - **delim** (*String*) -- Regular expression delimiter. - **n** (*Int*) -- Number of times the pattern is applied. See the `Java documentation `_ for more information. - **split(delim: String)**: *Array[String]* Returns an array of strings, split on the given regular expression delimiter. See the documentation on `regular expression syntax `_ delimiter. If you need to split on special characters, escape them with double backslash (\\\\). .. code-block:: text :emphasize-lines: 2 let s = "1kg-NA12878" in s.split("-") result: ["1kg", "NA12878"] **Arguments** - **delim** (*String*) -- Regular expression delimiter. - **toDouble()**: *Double* -- Convert value to a Double. - **toFloat()**: *Float* -- Convert value to a Float. - **toInt()**: *Int* -- Convert value to an Integer. - **toLong()**: *Long* -- Convert value to a Long. .. _struct: ------ Struct ------ A ``Struct`` is like a Python tuple where the fields are named and the set of fields is fixed. An example of constructing and accessing the fields in a ``Struct`` is .. code-block:: text :emphasize-lines: 2 let s = {gene: "ACBD", function: "LOF", nHet: 12} in s.gene result: "ACBD" A field of the ``Struct`` can also be another ``Struct``. For example, ``va.info.AC`` selects the struct ``info`` from the struct ``va``, and then selects the array ``AC`` from the struct ``info``. .. _variant: ------- Variant ------- A ``Variant`` is a Hail data type representing a variant in the Variant Dataset. It is referred to as ``v`` in the expression language. The `pseudoautosomal region `_ (PAR) is currently defined with respect to reference `GRCh37 `_: - X: 60001 - 2699520, 154931044 - 155260560 - Y: 10001 - 2649520, 59034050 - 59363566 Most callers assign variants in PAR to X. - **alt()**: *String* -- Alternate allele sequence. **Assumes biallelic.** - **altAllele()**: *AltAllele* -- The :ref:`alternate allele `. **Assumes biallelic.** - **altAlleles**: *Array[AltAllele]* -- The :ref:`alternate alleles `. - **contig**: *String* -- String representation of contig, exactly as imported. *NB: Hail stores contigs as strings. Use double-quotes when checking contig equality.* - **inXNonPar()**: *Boolean* -- True if chromosome is X and start is not in pseudoautosomal region of X. - **inXPar()**: *Boolean* -- True if chromosome is X and start is in pseudoautosomal region of X. - **inYNonPar()**: *Boolean* -- True if chromosome is Y and start is not in pseudoautosomal region of Y. - **inYPar()**: *Boolean* -- True if chromosome is Y and start is in pseudoautosomal region of Y. *NB: most callers assign variants in PAR to X.* - **isAutosomal()**: *Boolean* -- True if chromosome is not X, not Y, and not MT. - **isBiallelic()**: *Boolean* -- True if `v` has one alternate allele. - **locus()**: *Locus* -- Chromosomal locus (chr, pos) of this variant - **nAlleles()**: *Int* -- Number of alleles. - **nAltAlleles()**: *Int* -- Number of alternate alleles, equal to ``nAlleles - 1``. - **nGenotypes()**: *Int* -- Number of genotypes. - **ref**: *String* -- Reference allele sequence. - **start**: *Int* -- SNP position or start of an indel.