Aggregators
The aggregators
module is exposed as hl.agg
, e.g. hl.agg.sum
.
|
Collect records into an array. |
|
Collect records into a set. |
|
Count the number of records. |
|
Count the number of records where a predicate is |
|
Count the occurrences of each unique record and return a dictionary. |
|
Returns |
|
Returns |
|
Take n records of expr, optionally ordered by ordering. |
|
Compute the minimum expr. |
|
Compute the maximum expr. |
|
Compute the sum of all records of expr. |
|
Compute the coordinate-wise sum of all records of expr. |
|
Compute the mean value of records of expr. |
|
Compute an array of approximate quantiles. |
|
Compute the approximate median. |
|
Compute a number of useful statistics about expr. |
|
Compute the product of all records of expr. |
|
Compute the fraction of records where predicate is |
|
Performs test of Hardy-Weinberg equilibrium. |
|
Explode an array or set expression to aggregate the elements of all records. |
|
Filter records according to a predicate. |
|
Compute inbreeding statistics on calls. |
|
Compute useful call statistics. |
|
Compute the IMPUTE information score. |
|
Compute binned counts of a numeric expression. |
|
Compute multivariate linear regression statistics. |
|
Computes the Pearson correlation coefficient between x and y. |
|
Compute aggregation statistics stratified by one or more groups. |
|
Aggregate an array element-wise using a user-specified aggregation function. |
|
Downsample (x, y) coordinate datapoints. |
|
Produce a summary of the distribution of values. |
- hail.expr.aggregators.collect(expr)[source]
Collect records into an array.
Examples
Collect the ID field where HT is greater than 68:
>>> table1.aggregate(hl.agg.filter(table1.HT > 68, hl.agg.collect(table1.ID))) [2, 3]
Notes
The element order of the resulting array is not guaranteed, and in some cases is non-deterministic.
Use
collect_as_set()
to collect unique items.Warning
Collecting a large number of items can cause out-of-memory exceptions.
- Parameters:
expr (
Expression
) – Expression to collect.- Returns:
ArrayExpression
– Array of all expr records.
- hail.expr.aggregators.collect_as_set(expr)[source]
Collect records into a set.
Examples
Collect the unique ID field where HT is greater than 68:
>>> table1.aggregate(hl.agg.filter(table1.HT > 68, hl.agg.collect_as_set(table1.ID))) {2, 3}
Note that when collecting a set-typed field with
collect_as_set()
, the values becomefrozenset
s because Python does not permit the keys of a dictionary to be mutable:>>> table1.aggregate(hl.agg.filter(table1.HT > 68, hl.agg.collect_as_set(hl.set({table1.ID})))) {frozenset({3}), frozenset({2})}
Warning
Collecting a large number of items can cause out-of-memory exceptions.
- Parameters:
expr (
Expression
) – Expression to collect.- Returns:
SetExpression
– Set of unique expr records.
- hail.expr.aggregators.count()[source]
Count the number of records.
Examples
Group by the SEX field and count the number of rows in each category:
>>> (table1.group_by(table1.SEX) ... .aggregate(n=hl.agg.count()) ... .show()) +-----+-------+ | SEX | n | +-----+-------+ | str | int64 | +-----+-------+ | "F" | 2 | | "M" | 2 | +-----+-------+
- Returns:
Expression
of typetint64
– Total number of records.
- hail.expr.aggregators.count_where(condition)[source]
Count the number of records where a predicate is
True
.Examples
Count the number of individuals with HT greater than 68:
>>> table1.aggregate(hl.agg.count_where(table1.HT > 68)) 2
- Parameters:
condition (
BooleanExpression
) – Criteria for inclusion.- Returns:
Expression
of typetint64
– Total number of records where condition isTrue
.
- hail.expr.aggregators.counter(expr, *, weight=None)[source]
Count the occurrences of each unique record and return a dictionary.
Examples
Count the number of individuals for each unique SEX value:
>>> table1.aggregate(hl.agg.counter(table1.SEX)) {'F': 2, 'M': 2}
Compute the total height for each unique SEX value:
>>> table1.aggregate(hl.agg.counter(table1.SEX, weight=table1.HT)) {'F': 130, 'M': 137}
Note that when counting a set-typed field, the values become
frozenset
s because Python does not permit the keys of a dictionary to be mutable:>>> table1.aggregate(hl.agg.counter(hl.set({table1.SEX}), weight=table1.HT)) {frozenset({'F'}): 130, frozenset({'M'}): 137}
Notes
If you need a more complex grouped aggregation than
counter()
supports, try usinggroup_by()
.This aggregator method returns a dict expression whose key type is the same type as expr and whose value type is
Expression
of typetint64
. This dict contains a key for each unique value of expr, and the value is the number of times that key was observed.Ensure that the result can be stored in memory on a single machine.
Warning
Using
counter()
with a large number of unique items can cause out-of-memory exceptions.- Parameters:
expr (
Expression
) – Expression to count by key.weight (
NumericExpression
, optional) – Expression by which to weight each occurence (when unspecified, it is effectively1
)
- Returns:
DictExpression
– Dictionary with the number of occurrences of each unique record.
- hail.expr.aggregators.any(condition)[source]
Returns
True
if condition isTrue
for any record.Examples
>>> (table1.group_by(table1.SEX) ... .aggregate(any_over_70 = hl.agg.any(table1.HT > 70)) ... .show()) +-----+-------------+ | SEX | any_over_70 | +-----+-------------+ | str | bool | +-----+-------------+ | "F" | False | | "M" | True | +-----+-------------+
Notes
If there are no records to aggregate, the result is
False
.Missing records are not considered. If every record is missing, the result is also
False
.- Parameters:
condition (
BooleanExpression
) – Condition to test.- Returns:
- hail.expr.aggregators.all(condition)[source]
Returns
True
if condition isTrue
for every record.Examples
>>> (table1.group_by(table1.SEX) ... .aggregate(all_under_70 = hl.agg.all(table1.HT < 70)) ... .show()) +-----+--------------+ | SEX | all_under_70 | +-----+--------------+ | str | bool | +-----+--------------+ | "F" | False | | "M" | False | +-----+--------------+
Notes
If there are no records to aggregate, the result is
True
.Missing records are not considered. If every record is missing, the result is also
True
.- Parameters:
condition (
BooleanExpression
) – Condition to test.- Returns:
- hail.expr.aggregators.take(expr, n, ordering=None)[source]
Take n records of expr, optionally ordered by ordering.
Examples
Take 3 elements of field X:
>>> table1.aggregate(hl.agg.take(table1.X, 3)) [5, 6, 7]
Take the ID and HT fields, ordered by HT (descending):
>>> table1.aggregate(hl.agg.take(hl.struct(ID=table1.ID, HT=table1.HT), ... 3, ... ordering=-table1.HT)) [Struct(ID=2, HT=72), Struct(ID=3, HT=70), Struct(ID=1, HT=65)]
Notes
The resulting array can include fewer than n elements if there are fewer than n total records.
The ordering argument may be an expression, a function, or
None
.If ordering is an expression, this expression’s type should be one with a natural ordering (e.g. numeric).
If ordering is a function, it will be evaluated on each record of expr to compute the value used for ordering. In the above example,
ordering=-table1.HT
andordering=lambda x: -x.HT
would be equivalent.If ordering is
None
, then there is no guaranteed ordering on the elements taken, and and the results may be non-deterministic.Missing values are always sorted last.
- Parameters:
expr (
Expression
) – Expression to store.n (
Expression
of typetint32
) – Number of records to take.ordering (
Expression
or function ((arg) ->Expression
) or None) – Optional ordering on records.
- Returns:
ArrayExpression
– Array of up to n records of expr.
- hail.expr.aggregators.min(expr)[source]
Compute the minimum expr.
Examples
Compute the minimum value of HT:
>>> table1.aggregate(hl.agg.min(table1.HT)) 60
Notes
This function returns the minimum non-missing value. If there are no non-missing values, then the result is missing.
For back-compatibility reasons, this function also ignores NaN, in contrast with
functions.min()
. The behavior is similar tofunctions.nanmin()
.- Parameters:
expr (
NumericExpression
) – Numeric expression.- Returns:
NumericExpression
– Minimum value of all expr records, same type as expr.
- hail.expr.aggregators.max(expr)[source]
Compute the maximum expr.
Examples
Compute the maximum value of HT:
>>> table1.aggregate(hl.agg.max(table1.HT)) 72
Notes
This function returns the maximum non-missing value. If there are no non-missing values, then the result is missing.
For back-compatibility reasons, this function also ignores NaN, in contrast with
functions.max()
. The behavior is similar tofunctions.nanmax()
.- Parameters:
expr (
NumericExpression
) – Numeric expression.- Returns:
NumericExpression
– Maximum value of all expr records, same type as expr.
- hail.expr.aggregators.sum(expr)[source]
Compute the sum of all records of expr.
Examples
Compute the sum of field C1:
>>> table1.aggregate(hl.agg.sum(table1.C1)) 25
Notes
Missing values are ignored (treated as zero).
If expr is an expression of type
tint32
,tint64
, ortbool
, then the result is an expression of typetint64
. If expr is an expression of typetfloat32
ortfloat64
, then the result is an expression of typetfloat64
.Warning
Boolean values are cast to integers before computing the sum.
- Parameters:
expr (
NumericExpression
) – Numeric expression.- Returns:
Expression
of typetint64
ortfloat64
– Sum of records of expr.
- hail.expr.aggregators.array_sum(expr)[source]
Compute the coordinate-wise sum of all records of expr.
Examples
Compute the sum of C1 and C2:
>>> table1.aggregate(hl.agg.array_sum([table1.C1, table1.C2])) [25, 282]
Notes
All records must have the same length. Each coordinate is summed independently as described in
sum()
.- Parameters:
expr (
ArrayNumericExpression
)- Returns:
ArrayExpression
with element typetint64
ortfloat64
- hail.expr.aggregators.mean(expr)[source]
Compute the mean value of records of expr.
Examples
Compute the mean of field HT:
>>> table1.aggregate(hl.agg.mean(table1.HT)) 66.75
Notes
Missing values are ignored.
- Parameters:
expr (
NumericExpression
) – Numeric expression.- Returns:
Expression
of typetfloat64
– Mean value of records of expr.
- hail.expr.aggregators.approx_quantiles(expr, qs, k=100)[source]
Compute an array of approximate quantiles.
Examples
Estimate the median of the HT field.
>>> table1.aggregate(hl.agg.approx_quantiles(table1.HT, 0.5)) 64
Estimate the quartiles of the HT field.
>>> table1.aggregate(hl.agg.approx_quantiles(table1.HT, [0, 0.25, 0.5, 0.75, 1])) [50, 60, 64, 71, 86]
Warning
This is an approximate and nondeterministic method.
- Parameters:
expr (
Expression
) – Expression to collect.qs (
NumericExpression
orArrayNumericExpression
) – Number or array of numbers between 0 and 1.k (
int
) – Parameter controlling the accuracy vs. memory usage tradeoff. Increasing k increases both memory use and accuracy.
- Returns:
NumericExpression
orArrayNumericExpression
– If qs is a single number, returns the estimated quantile. If qs is an array, returns the array of estimated quantiles.
- hail.expr.aggregators.approx_median(expr, k=100)[source]
Compute the approximate median. This function is a shorthand for approx_quantiles(expr, .5, k)
Examples
Estimate the median of the HT field.
>>> table1.aggregate(hl.agg.approx_median(table1.HT)) 64
Warning
This is an approximate and nondeterministic method.
- Parameters:
expr (
Expression
) – Expression to collect.k (
int
) – Parameter controlling the accuracy vs. memory usage tradeoff. Increasing k increases both memory use and accuracy.
See also
- Returns:
NumericExpression
– The estimated median.
- hail.expr.aggregators.stats(expr)[source]
Compute a number of useful statistics about expr.
Examples
Compute statistics about field HT:
>>> table1.aggregate(hl.agg.stats(table1.HT)) Struct(mean=66.75, stdev=4.656984002549289, min=60.0, max=72.0, n=4, sum=267.0)
Notes
Computes a struct with the following fields:
min (
tfloat64
) - Minimum value.max (
tfloat64
) - Maximum value.mean (
tfloat64
) - Mean value,stdev (
tfloat64
) - Standard deviation.n (
tint64
) - Number of non-missing records.sum (
tfloat64
) - Sum.
- Parameters:
expr (
NumericExpression
) – Numeric expression.- Returns:
StructExpression
– Struct expression with fields mean, stdev, min, max, n, and sum.
- hail.expr.aggregators.product(expr)[source]
Compute the product of all records of expr.
Examples
Compute the product of field C1:
>>> table1.aggregate(hl.agg.product(table1.C1)) 440
Notes
Missing values are ignored (treated as one).
If expr is an expression of type
tint32
,tint64
ortbool
, then the result is an expression of typetint64
. If expr is an expression of typetfloat32
ortfloat64
, then the result is an expression of typetfloat64
.Warning
Boolean values are cast to integers before computing the product.
- Parameters:
expr (
NumericExpression
) – Numeric expression.- Returns:
Expression
of typetint64
ortfloat64
– Product of records of expr.
- hail.expr.aggregators.fraction(predicate)[source]
Compute the fraction of records where predicate is
True
.Examples
Compute the fraction of rows where SEX is “F” and HT > 65:
>>> table1.aggregate(hl.agg.fraction((table1.SEX == 'F') & (table1.HT > 65))) 0.25
Notes
Missing values for predicate are treated as
False
.- Parameters:
predicate (
BooleanExpression
) – Boolean predicate.- Returns:
Expression
of typetfloat64
– Fraction of records where predicate isTrue
.
- hail.expr.aggregators.hardy_weinberg_test(expr, one_sided=False)[source]
Performs test of Hardy-Weinberg equilibrium.
Examples
Test each row of a dataset:
>>> dataset_result = dataset.annotate_rows(hwe = hl.agg.hardy_weinberg_test(dataset.GT))
Test each row on a sub-population:
>>> dataset_result = dataset.annotate_rows( ... hwe_eas = hl.agg.filter(dataset.pop == 'EAS', ... hl.agg.hardy_weinberg_test(dataset.GT)))
Notes
This method performs the test described in
functions.hardy_weinberg_test()
based solely on the counts of homozygous reference, heterozygous, and homozygous variant calls.The resulting struct expression has two fields:
het_freq_hwe (
tfloat64
) - Expected frequency of heterozygous calls under Hardy-Weinberg equilibrium.p_value (
tfloat64
) - p-value from test of Hardy-Weinberg equilibrium.
By default, Hail computes the exact p-value with mid-p-value correction, i.e. the probability of a less-likely outcome plus one-half the probability of an equally-likely outcome. See this document for details on the Levene-Haldane distribution and references.
To perform one-sided exact test of excess heterozygosity with mid-p-value correction instead, set one_sided=True and the p-value returned will be from the one-sided exact test.
Warning
Non-diploid calls (
ploidy != 2
) are ignored in the counts. While the counts are defined for multiallelic variants, this test is only statistically rigorous in the biallelic setting; usesplit_multi()
to split multiallelic variants beforehand.- Parameters:
expr (
CallExpression
) – Call to test for Hardy-Weinberg equilibrium.one_sided (
bool
) –False
by default. WhenTrue
, perform one-sided test for excess heterozygosity.
- Returns:
StructExpression
– Struct expression with fields het_freq_hwe and p_value.
- hail.expr.aggregators.explode(f, array_agg_expr)[source]
Explode an array or set expression to aggregate the elements of all records.
Examples
Compute the mean of all elements in fields C1, C2, and C3:
>>> table1.aggregate(hl.agg.explode(lambda elt: hl.agg.mean(elt), [table1.C1, table1.C2, table1.C3])) 24.833333333333332
Compute the set of all observed elements in the filters field (
Set[String]
):>>> dataset.aggregate_rows(hl.agg.explode(lambda elt: hl.agg.collect_as_set(elt), dataset.filters)) set()
Notes
This method can be used with aggregator functions to aggregate the elements of collection types (
tarray
andtset
).- Parameters:
f (Function from
Expression
toExpression
) – Aggregation function to apply to each element of the exploded array.array_agg_expr (
CollectionExpression
) – Expression of typetarray
ortset
.
- Returns:
Expression
– Aggregation expression.
- hail.expr.aggregators.filter(condition, aggregation)[source]
Filter records according to a predicate.
Examples
Collect the ID field where HT >= 70:
>>> table1.aggregate(hl.agg.filter(table1.HT >= 70, hl.agg.collect(table1.ID))) [2, 3]
Notes
This method can be used with aggregator functions to remove records from aggregation.
- Parameters:
condition (
BooleanExpression
) – Filter expression.aggregation (
Expression
) – Aggregation expression to filter.
- Returns:
Expression
– Aggregable expression.
- hail.expr.aggregators.inbreeding(expr, prior)[source]
Compute inbreeding statistics on calls.
Examples
Compute inbreeding statistics per column:
>>> dataset_result = dataset.annotate_cols(IB = hl.agg.inbreeding(dataset.GT, dataset.variant_qc.AF[1])) >>> dataset_result.IB.show(width=100) +------------------+-----------+-------------+------------------+------------------+ | s | IB.f_stat | IB.n_called | IB.expected_homs | IB.observed_homs | +------------------+-----------+-------------+------------------+------------------+ | str | float64 | int64 | float64 | int64 | +------------------+-----------+-------------+------------------+------------------+ | "C1046::HG02024" | 2.79e-01 | 9 | 7.61e+00 | 8 | | "C1046::HG02025" | -4.41e-01 | 9 | 7.61e+00 | 7 | | "C1046::HG02026" | -4.41e-01 | 9 | 7.61e+00 | 7 | | "C1047::HG00731" | 2.79e-01 | 9 | 7.61e+00 | 8 | | "C1047::HG00732" | 2.79e-01 | 9 | 7.61e+00 | 8 | | "C1047::HG00733" | 2.79e-01 | 9 | 7.61e+00 | 8 | | "C1048::HG02024" | -4.41e-01 | 9 | 7.61e+00 | 7 | | "C1048::HG02025" | -4.41e-01 | 9 | 7.61e+00 | 7 | | "C1048::HG02026" | -4.41e-01 | 9 | 7.61e+00 | 7 | | "C1049::HG00731" | 2.79e-01 | 9 | 7.61e+00 | 8 | +------------------+-----------+-------------+------------------+------------------+ showing top 10 rows
Notes
E
is total number of expected homozygous calls, given by the sum of1 - 2.0 * prior * (1 - prior)
across records.O
is the observed number of homozygous calls across records.N
is the number of non-missing calls.F
is the inbreeding coefficient, and is computed by:(O - E) / (N - E)
.This method returns a struct expression with four fields:
- Parameters:
expr (
CallExpression
) – Call expression.prior (
Expression
of typetfloat64
) – Alternate allele frequency prior.
- Returns:
StructExpression
– Struct expression with fields f_stat, n_called, expected_homs, observed_homs.
- hail.expr.aggregators.call_stats(call, alleles)[source]
Compute useful call statistics.
Examples
Compute call statistics per row:
>>> dataset_result = dataset.annotate_rows(gt_stats = hl.agg.call_stats(dataset.GT, dataset.alleles)) >>> dataset_result.rows().key_by('locus').select('gt_stats').show(width=120) +---------------+--------------+---------------------+-------------+---------------------------+ | locus | gt_stats.AC | gt_stats.AF | gt_stats.AN | gt_stats.homozygote_count | +---------------+--------------+---------------------+-------------+---------------------------+ | locus<GRCh37> | array<int32> | array<float64> | int32 | array<int32> | +---------------+--------------+---------------------+-------------+---------------------------+ | 20:10579373 | [199,1] | [9.95e-01,5.00e-03] | 200 | [99,0] | | 20:10579398 | [198,2] | [9.90e-01,1.00e-02] | 200 | [99,1] | | 20:10627772 | [198,2] | [9.90e-01,1.00e-02] | 200 | [98,0] | | 20:10633237 | [108,92] | [5.40e-01,4.60e-01] | 200 | [31,23] | | 20:10636995 | [198,2] | [9.90e-01,1.00e-02] | 200 | [98,0] | | 20:10639222 | [175,25] | [8.75e-01,1.25e-01] | 200 | [78,3] | | 20:13763601 | [198,2] | [9.90e-01,1.00e-02] | 200 | [98,0] | | 20:16223922 | [87,101] | [4.63e-01,5.37e-01] | 188 | [28,35] | | 20:17479617 | [191,9] | [9.55e-01,4.50e-02] | 200 | [91,0] | +---------------+--------------+---------------------+-------------+---------------------------+
Notes
This method is meaningful for computing call metrics per variant, but not especially meaningful for computing metrics per sample.
This method returns a struct expression with three fields:
AC (
tarray
oftint32
) - Allele counts. One element for each allele, including the reference.AF (
tarray
oftfloat64
) - Allele frequencies. One element for each allele, including the reference.AN (
tint32
) - Allele number. The total number of called alleles, or the number of non-missing calls * 2.homozygote_count (
tarray
oftint32
) - Homozygote genotype counts for each allele, including the reference. Only diploid genotype calls are counted.
- Parameters:
call (
CallExpression
)alleles (
ArrayExpression
of strings orInt32Expression
) – Variant alleles array, or number of alleles (including reference).
- Returns:
StructExpression
– Struct expression with fields AC, AF, AN, and homozygote_count.
- hail.expr.aggregators.info_score(gp)[source]
Compute the IMPUTE information score.
Examples
Calculate the info score per variant:
>>> gen_mt = hl.import_gen('data/example.gen', sample_file='data/example.sample') >>> gen_mt = gen_mt.annotate_rows(info_score = hl.agg.info_score(gen_mt.GP))
Calculate group-specific info scores per variant:
>>> gen_mt = hl.import_gen('data/example.gen', sample_file='data/example.sample') >>> gen_mt = gen_mt.annotate_cols(is_case = hl.rand_bool(0.5)) >>> gen_mt = gen_mt.annotate_rows(info_score = hl.agg.group_by(gen_mt.is_case, hl.agg.info_score(gen_mt.GP)))
Notes
The result of
info_score()
is a struct with two fields:score (
float64
) – Info score.n_included (
int32
) – Number of non-missing samples included in the calculation.
We implemented the IMPUTE info measure as described in the supplementary information from Marchini & Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics (2010). To calculate the info score \(I_{A}\) for one SNP:
\[I_{A} = \begin{cases} 1 - \frac{\sum_{i=1}^{N}(f_{i} - e_{i}^2)}{2N\hat{\theta}(1 - \hat{\theta})} & \text{when } \hat{\theta} \in (0, 1) \\ 1 & \text{when } \hat{\theta} = 0, \hat{\theta} = 1\\ \end{cases}\]\(N\) is the number of samples with imputed genotype probabilities [\(p_{ik} = P(G_{i} = k)\) where \(k \in \{0, 1, 2\}\)]
\(e_{i} = p_{i1} + 2p_{i2}\) is the expected genotype per sample
\(f_{i} = p_{i1} + 4p_{i2}\)
\(\hat{\theta} = \frac{\sum_{i=1}^{N}e_{i}}{2N}\) is the MLE for the population minor allele frequency
Hail will not generate identical results to QCTOOL for the following reasons:
Hail automatically removes genotype probability distributions that do not meet certain requirements on data import with
import_gen()
andimport_bgen()
.Hail does not use the population frequency to impute genotype probabilities when a genotype probability distribution has been set to missing.
Hail calculates the same statistic for sex chromosomes as autosomes while QCTOOL incorporates sex information.
The floating point number Hail stores for each genotype probability is slightly different than the original data due to rounding and normalization of probabilities.
Warning
The info score Hail reports will be extremely different from QCTOOL when a SNP has a high missing rate.
If the gp array must contain 3 elements, and its elements may not be missing.
If the genotype data was not imported using the
import_gen()
orimport_bgen()
functions, then the results for all variants will bescore = NA
andn_included = 0
.It only makes semantic sense to compute the info score per variant. While the aggregator will run in any context if its arguments are the right type, the results are only meaningful in a narrow context.
- Parameters:
gp (
ArrayNumericExpression
) – Genotype probability array. Must have 3 elements, all of which must be defined.- Returns:
StructExpression
– Struct with fields score and n_included.
- hail.expr.aggregators.hist(expr, start, end, bins)[source]
Compute binned counts of a numeric expression.
Examples
Compute a histogram of field GQ:
>>> dataset.aggregate_entries(hl.agg.hist(dataset.GQ, 0, 100, 10)) Struct(bin_edges=[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0], bin_freq=[2194L, 637L, 2450L, 1081L, 518L, 402L, 11168L, 1918L, 1379L, 11973L]), n_smaller=0, n_greater=0)
Notes
This method returns a struct expression with four fields:
bin_edges (
tarray
oftfloat64
): Bin edges. Bin i contains values in the left-inclusive, right-exclusive range[ bin_edges[i], bin_edges[i+1] )
.bin_freq (
tarray
oftint64
): Bin frequencies. The number of records found in each bin.n_smaller (
tint64
): The number of records smaller than the start of the first bin.n_larger (
tint64
): The number of records larger than the end of the last bin.
- Parameters:
- Returns:
StructExpression
– Struct expression with fields bin_edges, bin_freq, n_smaller, and n_larger.
- hail.expr.aggregators.linreg(y, x, nested_dim=1, weight=None)[source]
Compute multivariate linear regression statistics.
Examples
Regress HT against an intercept (1), SEX, and C1:
>>> table1.aggregate(hl.agg.linreg(table1.HT, [1, table1.SEX == 'F', table1.C1])) Struct(beta=[88.50000000000014, 81.50000000000057, -10.000000000000068], standard_error=[14.430869689661844, 59.70552738231206, 7.000000000000016], t_stat=[6.132686518775844, 1.365032746099571, -1.428571428571435], p_value=[0.10290201427537926, 0.40250974549499974, 0.3888002244284281], multiple_standard_error=4.949747468305833, multiple_r_squared=0.7175792507204611, adjusted_r_squared=0.1527377521613834, f_stat=1.2704081632653061, multiple_p_value=0.5314327326007864, n=4)
Regress blood pressure against an intercept (1), genotype, age, and the interaction of genotype and age:
>>> ds_ann = ds.annotate_rows(linreg = ... hl.agg.linreg(ds.pheno.blood_pressure, ... [1, ... ds.GT.n_alt_alleles(), ... ds.pheno.age, ... ds.GT.n_alt_alleles() * ds.pheno.age]))
Warning
As in the example, the intercept covariate
1
must be included explicitly if desired.Notes
In relation to lm.summary in R,
linreg(y, x = [1, mt.x1, mt.x2])
computessummary(lm(y ~ x1 + x2))
andlinreg(y, x = [mt.x1, mt.x2], nested_dim=0)
computessummary(lm(y ~ x1 + x2 - 1))
.More generally, nested_dim defines the number of effects to fit in the nested (null) model, with the effects on the remaining covariates fixed to zero.
- The returned struct has ten fields:
beta (
tarray
oftfloat64
): Estimated regression coefficient for each covariate.standard_error (
tarray
oftfloat64
): Estimated standard error for each covariate.t_stat (
tarray
oftfloat64
): t-statistic for each covariate.multiple_standard_error (
tfloat64
): Estimated standard deviation of the random error.multiple_r_squared (
tfloat64
): Coefficient of determination for nested models.adjusted_r_squared (
tfloat64
): Adjusted multiple_r_squared taking into account degrees of freedom.f_stat (
tfloat64
): F-statistic for nested models.multiple_p_value (
tfloat64
): p-value for the F-test of nested models.n (
tint64
): Number of samples included in the regression. A sample is included if and only if y, all elements of x, and weight (if set) are non-missing.
All but the last field are missing if n is less than or equal to the number of covariates or if the covariates are linearly dependent.
If set, the weight parameter generalizes the model to weighted least squares, useful for heteroscedastic (diagonal but non-constant) variance.
Warning
If any weight is negative, the resulting statistics will be
nan
.- Parameters:
y (
Float64Expression
) – Response (dependent variable).x (
Float64Expression
orlist
ofFloat64Expression
) – Covariates (independent variables).nested_dim (
int
) – The null model includes the first nested_dim covariates. Must be between 0 and k (the length of x).weight (
Float64Expression
, optional) – Non-negative weight for weighted least squares.
- Returns:
StructExpression
– Struct of regression results.
- hail.expr.aggregators.corr(x, y)[source]
Computes the Pearson correlation coefficient between x and y.
Examples
>>> ds.aggregate_cols(hl.agg.corr(ds.pheno.age, ds.pheno.blood_pressure)) 0.16592876044845484
Notes
Only records where both x and y are non-missing will be included in the calculation.
In the case that there are no non-missing pairs, the result will be missing.
See also
- Parameters:
x (
Expression
of typetfloat64
)y (
Expression
of typetfloat64
)
- Returns:
- hail.expr.aggregators.group_by(group, agg_expr)[source]
Compute aggregation statistics stratified by one or more groups.
Examples
Compute linear regression statistics stratified by SEX:
>>> table1.aggregate(hl.agg.group_by(table1.SEX, ... hl.agg.linreg(table1.HT, table1.C1, nested_dim=0))) { 'F': Struct(beta=[6.153846153846154], standard_error=[0.7692307692307685], t_stat=[8.000000000000009], p_value=[0.07916684832113098], multiple_standard_error=11.4354374979373, multiple_r_squared=0.9846153846153847, adjusted_r_squared=0.9692307692307693, f_stat=64.00000000000014, multiple_p_value=0.07916684832113098, n=2), 'M': Struct(beta=[34.25], standard_error=[1.75], t_stat=[19.571428571428573], p_value=[0.03249975499062629], multiple_standard_error=4.949747468305833, multiple_r_squared=0.9973961101073441, adjusted_r_squared=0.9947922202146882, f_stat=383.0408163265306, multiple_p_value=0.03249975499062629, n=2) }
Compute call statistics stratified by population group and case status:
>>> ann = ds.annotate_rows(call_stats=hl.agg.group_by(hl.struct(pop=ds.pop, is_case=ds.is_case), ... hl.agg.call_stats(ds.GT, ds.alleles)))
- Parameters:
group (
Expression
orlist
ofExpression
) – Group to stratify the result by.agg_expr (
Expression
) – Aggregation or scan expression to compute per grouping.
- Returns:
DictExpression
– Dictionary where the keys are group and the values are the result of computing agg_expr for each unique value of group.
- hail.expr.aggregators.array_agg(f, array)[source]
Aggregate an array element-wise using a user-specified aggregation function.
Examples
Start with a range table with an array of random boolean values:
>>> ht = hl.utils.range_table(100) >>> ht = ht.annotate(arr = hl.range(0, 5).map(lambda _: hl.rand_bool(0.5)))
Aggregate to compute the fraction
True
per element:>>> ht.aggregate(hl.agg.array_agg(lambda element: hl.agg.fraction(element), ht.arr)) [0.54, 0.55, 0.46, 0.52, 0.48]
Notes
This function requires that all values of array have the same length. If two values have different lengths, then an exception will be thrown.
The f argument should be a function taking one argument, an expression of the element type of array, and returning an expression including aggregation(s). The type of the aggregated expression returned by
array_agg()
is an array of elements of the return type of f.- Parameters:
f – Aggregation function to apply to each element of the exploded array.
array (
ArrayExpression
) – Array to aggregate.
- Returns:
- hail.expr.aggregators.downsample(x, y, label=None, n_divisions=500)[source]
Downsample (x, y) coordinate datapoints.
- Parameters:
x (
NumericExpression
) – X-values to be downsampled.y (
NumericExpression
) – Y-values to be downsampled.label (
StringExpression
orArrayExpression
) – Additional data for each (x, y) coordinate. Can pass in multiple fields in anArrayExpression
.n_divisions (
int
) – Factor by which to downsample (default value = 500). A lower input results in fewer output datapoints.
- Returns:
ArrayExpression
– Expression for downsampled coordinate points (x, y). The element type of the array isttuple
oftfloat64
,tfloat64
, andtarray
oftstr
- hail.expr.aggregators.approx_cdf(expr, k=100, *, _raw=False)[source]
Produce a summary of the distribution of values.
Notes
This method returns a struct containing two arrays: values and ranks. The values array contains an ordered sample of values seen. The ranks array is one longer, and contains the approximate ranks for the corresponding values.
These represent a summary of the CDF of the distribution of values. In particular, for any value x = values(i) in the summary, we estimate that there are ranks(i) values strictly less than x, and that there are ranks(i+1) values less than or equal to x. For any value y (not necessarily in the summary), we estimate CDF(y) to be ranks(i), where i is such that values(i-1) < y ≤ values(i).
An alternative intuition is that the summary encodes a compressed approximation to the sorted list of values. For example, values=[0,2,5,6,9] and ranks=[0,3,4,5,8,10] represents the approximation [0,0,0,2,5,6,6,6,9,9], with the value values(i) occupying indices ranks(i) (inclusive) to ranks(i+1) (exclusive).
The returned struct also contains an array _compaction_counts, which is used internally to support downstream error estimation.
Warning
This is an approximate and nondeterministic method.
- Parameters:
expr (
Expression
) – Expression to collect.k (
int
) – Parameter controlling the accuracy vs. memory usage tradeoff.
- Returns:
StructExpression
– Struct containing values and ranks arrays.