Aggregators
The aggregators module is exposed as hl.agg, e.g. hl.agg.sum.
collect (expr) 
Collect records into an array. 
collect_as_set (expr) 
Collect records into a set. 
count ([expr]) 
Count the number of records. 
count_where (condition) 
Count the number of records where a predicate is True.
counter (expr) 
Count the occurrences of each unique record and return a dictionary. 
any (condition) 
Returns True if condition is True for any record. 
all (condition) 
Returns True if condition is True for every record. 
take (expr, n[, ordering]) 
Take n records of expr, optionally ordered by ordering. 
min (expr) 
Compute the minimum expr. 
max (expr) 
Compute the maximum expr. 
sum (expr) 
Compute the sum of all records of expr. 
array_sum (expr) 
Compute the coordinate-wise sum of all records of expr.
mean (expr) 
Compute the mean value of records of expr. 
stats (expr) 
Compute a number of useful statistics about expr. 
product (expr) 
Compute the product of all records of expr. 
fraction (predicate) 
Compute the fraction of records where predicate is True.
hardy_weinberg (expr) 
Compute Hardy-Weinberg Equilibrium (HWE) p-value and heterozygosity ratio.
explode (expr) 
Explode an array or set expression to aggregate the elements of all records. 
filter (condition, expr) 
Filter records according to a predicate. 
inbreeding (expr, prior) 
Compute inbreeding statistics on calls. 
call_stats (call, alleles) 
Compute useful call statistics. 
info_score (gp) 
Compute the IMPUTE information score. 
hist (expr, start, end, bins) 
Compute binned counts of a numeric expression. 

hail.expr.aggregators.collect(expr) → hail.expr.expressions.typed_expressions.ArrayExpression
Collect records into an array.
Examples
Collect the ID field where HT is greater than 68:
>>> table1.aggregate(agg.collect(agg.filter(table1.HT > 68, table1.ID)))
[2, 3]
Notes
The element order of the resulting array is not guaranteed, and in some cases is nondeterministic.
Use collect_as_set() to collect unique items.
Warning
Collecting a large number of items can cause out-of-memory exceptions.
Parameters: expr (Expression) – Expression to collect.
Returns: ArrayExpression – Array of all expr records.

hail.expr.aggregators.collect_as_set(expr) → hail.expr.expressions.typed_expressions.SetExpression
Collect records into a set.
Examples
Collect the unique ID field where HT is greater than 68:
>>> table1.aggregate(agg.collect_as_set(agg.filter(table1.HT > 68, table1.ID)))
{2, 3}
Warning
Collecting a large number of items can cause out-of-memory exceptions.
Parameters: expr (Expression) – Expression to collect.
Returns: SetExpression – Set of unique expr records.

hail.expr.aggregators.count(expr=None) → hail.expr.expressions.typed_expressions.Int64Expression
Count the number of records.
Examples
Group by the SEX field and count the number of rows in each category:
>>> (table1.group_by(table1.SEX)
...        .aggregate(n=agg.count())
...        .show())
+-----+-------+
| SEX |     n |
+-----+-------+
| str | int64 |
+-----+-------+
| M   |     2 |
| F   |     2 |
+-----+-------+
Notes
If expr is not provided, then this method will count the number of records aggregated. If expr is provided, it should make use of filter() or explode() so that the number of records aggregated changes.
Parameters: expr (Expression, or None) – Expression to count.
Returns: Expression of type tint64 – Total number of records.

hail.expr.aggregators.count_where(condition) → hail.expr.expressions.typed_expressions.Int64Expression
Count the number of records where a predicate is True.
Examples
Count the number of individuals with HT greater than 68:
>>> table1.aggregate(agg.count_where(table1.HT > 68))
2
Parameters: condition (BooleanExpression) – Criteria for inclusion.
Returns: Expression of type tint64 – Total number of records where condition is True.

hail.expr.aggregators.counter(expr) → hail.expr.expressions.typed_expressions.DictExpression
Count the occurrences of each unique record and return a dictionary.
Examples
Count the number of individuals for each unique SEX value:
>>> table1.aggregate(agg.counter(table1.SEX))
{'M': 2, 'F': 2}
Notes
This aggregator method returns a dict expression whose key type is the same type as expr and whose value type is Expression of type tint64. This dict contains a key for each unique value of expr, and the value is the number of times that key was observed.
Ensure that the result can be stored in memory on a single machine.
Warning
Using counter() with a large number of unique items can cause out-of-memory exceptions.
Parameters: expr (Expression) – Expression to count by key.
Returns: DictExpression – Dictionary with the number of occurrences of each unique record.
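As a rough local analogy (plain Python, not Hail's implementation), the dictionary that counter() describes has the same shape as the result of collections.Counter. The helper name counter_sketch and the record order are hypothetical; only the counts per SEX value come from the example above.

```python
from collections import Counter

def counter_sketch(records):
    """Map each unique record to the number of times it was observed."""
    return dict(Counter(records))

# Hypothetical record order; the example above only reports the totals.
sex_values = ["M", "F", "M", "F"]
sex_counts = counter_sketch(sex_values)
print(sex_counts)
```

Because one entry is kept per unique value, memory use grows with the number of distinct records, which is the reason for the warning above.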

hail.expr.aggregators.any(condition) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if condition is True for any record.
Examples
>>> (table1.group_by(table1.SEX)
...        .aggregate(any_over_70 = agg.any(table1.HT > 70))
...        .show())
+-----+-------------+
| SEX | any_over_70 |
+-----+-------------+
| str | bool        |
+-----+-------------+
| M   | true        |
| F   | false       |
+-----+-------------+
Notes
If there are no records to aggregate, the result is False.
Missing records are not considered. If every record is missing, the result is also False.
Parameters: condition (BooleanExpression) – Condition to test.
Returns: BooleanExpression

hail.expr.aggregators.all(condition) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if condition is True for every record.
Examples
>>> (table1.group_by(table1.SEX)
...        .aggregate(all_under_70 = agg.all(table1.HT < 70))
...        .show())
+-----+--------------+
| SEX | all_under_70 |
+-----+--------------+
| str | bool         |
+-----+--------------+
| M   | false        |
| F   | false        |
+-----+--------------+
Notes
If there are no records to aggregate, the result is True.
Missing records are not considered. If every record is missing, the result is also True.
Parameters: condition (BooleanExpression) – Condition to test.
Returns: BooleanExpression
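The empty and all-missing cases for any() and all() can be checked with a small plain-Python sketch. This is only an illustration of the semantics stated in the notes, not Hail code; the names any_sketch and all_sketch are hypothetical, and None stands in for a missing record.

```python
def any_sketch(records):
    """True if at least one non-missing record is True."""
    return any(r for r in records if r is not None)

def all_sketch(records):
    """True if every non-missing record is True."""
    return all(r for r in records if r is not None)

# No records, or only missing records: any -> False, all -> True.
print(any_sketch([]), all_sketch([]))
print(any_sketch([None, None]), all_sketch([None, None]))
# Missing records are skipped rather than treated as False.
print(any_sketch([None, True]), all_sketch([None, True]))
```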

hail.expr.aggregators.take(expr, n, ordering=None) → hail.expr.expressions.typed_expressions.ArrayExpression
Take n records of expr, optionally ordered by ordering.
Examples
Take 3 elements of field X:
>>> table1.aggregate(agg.take(table1.X, 3))
[5, 6, 7]
Take the ID and HT fields, ordered by HT (descending):
>>> table1.aggregate(agg.take(hl.struct(ID=table1.ID, HT=table1.HT),
...                           3,
...                           ordering=-table1.HT))
[Struct(ID=2, HT=72), Struct(ID=3, HT=70), Struct(ID=1, HT=65)]
Notes
The resulting array can include fewer than n elements if there are fewer than n total records.
The ordering argument may be an expression, a function, or None.
If ordering is an expression, this expression’s type should be one with a natural ordering (e.g. numeric).
If ordering is a function, it will be evaluated on each record of expr to compute the value used for ordering. In the above example, ordering=-table1.HT and ordering=lambda x: -x.HT would be equivalent.
If ordering is None, then there is no guaranteed ordering on the elements taken, and the results may be nondeterministic.
Missing values are always sorted last.
Parameters:
 expr (Expression) – Expression to store.
 n (Expression of type tint32) – Number of records to take.
 ordering (Expression or function ((arg) -> Expression) or None) – Optional ordering on records.
Returns: ArrayExpression – Array of up to n records of expr.
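The ordering rules can be sketched in plain Python as follows. This is an analogy for the behaviour described in the notes, not Hail's implementation: take_sketch is a hypothetical helper, None stands in for a missing value, and the rows are chosen to be consistent with the HT values shown in the examples on this page (the row with a missing HT is invented to show the sort-last rule).

```python
def take_sketch(records, n, ordering=None):
    """Return up to n records, sorted ascending by ordering(record) if given.

    Records whose ordering key is missing (None) are placed last.
    """
    if ordering is None:
        # No ordering requested: any n records would be a valid result.
        return list(records)[:n]

    def sort_key(record):
        key = ordering(record)
        # (False, key) sorts before (True, 0), so missing keys go last.
        return (key is None, 0 if key is None else key)

    return sorted(records, key=sort_key)[:n]

rows = [
    {"ID": 1, "HT": 65},
    {"ID": 2, "HT": 72},
    {"ID": 3, "HT": 70},
    {"ID": 4, "HT": 60},
    {"ID": 5, "HT": None},  # hypothetical row with a missing HT
]

def descending_ht(row):
    """Negate HT so that an ascending sort yields descending heights."""
    return None if row["HT"] is None else -row["HT"]

top3 = take_sketch(rows, 3, ordering=descending_ht)
print([row["ID"] for row in top3])
```

Negating the key is what turns the ascending sort into a descending one, and asking for more records than exist simply returns all of them.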

hail.expr.aggregators.min(expr) → hail.expr.expressions.typed_expressions.NumericExpression
Compute the minimum expr.
Examples
Compute the minimum value of HT:
>>> table1.aggregate(agg.min(table1.HT))
60
Notes
This method returns the minimum non-missing value. If there are no non-missing values, then the result is missing.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: NumericExpression – Minimum value of all expr records, same type as expr.

hail.expr.aggregators.max(expr) → hail.expr.expressions.typed_expressions.NumericExpression
Compute the maximum expr.
Examples
Compute the maximum value of HT:
>>> table1.aggregate(agg.max(table1.HT))
72
Notes
This method returns the maximum non-missing value. If there are no non-missing values, then the result is missing.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: NumericExpression – Maximum value of all expr records, same type as expr.

hail.expr.aggregators.sum(expr)
Compute the sum of all records of expr.
Examples
Compute the sum of field C1:
>>> table1.aggregate(agg.sum(table1.C1))
25
Notes
Missing values are ignored (treated as zero).
If expr is an expression of type tint32, tint64, or tbool, then the result is an expression of type tint64. If expr is an expression of type tfloat32 or tfloat64, then the result is an expression of type tfloat64.
Warning
Boolean values are cast to integers before computing the sum.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: Expression of type tint64 or tfloat64 – Sum of records of expr.
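The handling of missing and boolean records can be illustrated with a short plain-Python sketch. It mirrors the rules in the notes but is not Hail's implementation; sum_sketch is a hypothetical name, None stands in for a missing value, and Python's own int/float types stand in for tint64/tfloat64.

```python
def sum_sketch(records):
    """Sum records, skipping missing values and casting booleans to integers."""
    total = 0
    for record in records:
        if record is None:
            continue  # missing values contribute nothing (treated as zero)
        if isinstance(record, bool):
            record = int(record)  # True -> 1, False -> 0
        total += record
    return total

print(sum_sketch([3, None, 4]))
print(sum_sketch([True, False, True, None]))
print(sum_sketch([1.5, None, 2.5]))
```

Summing a boolean expression this way gives the same number as counting the records where it is True.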

hail.expr.aggregators.array_sum(expr) → hail.expr.expressions.typed_expressions.ArrayExpression
Compute the coordinate-wise sum of all records of expr.
Examples
Compute the sum of C1 and C2:
>>> table1.aggregate(agg.array_sum([table1.C1, table1.C2]))
[25, 46]
Notes
All records must have the same length. Each coordinate is summed independently as described in sum().
Parameters: expr (ArrayNumericExpression)
Returns: ArrayNumericExpression

hail.expr.aggregators.mean(expr) → hail.expr.expressions.typed_expressions.Float64Expression
Compute the mean value of records of expr.
Examples
Compute the mean of field HT:
>>> table1.aggregate(agg.mean(table1.HT))
66.75
Notes
Missing values are ignored.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: Expression of type tfloat64 – Mean value of records of expr.

hail.expr.aggregators.stats(expr) → hail.expr.expressions.typed_expressions.StructExpression
Compute a number of useful statistics about expr.
Examples
Compute statistics about field HT:
>>> table1.aggregate(agg.stats(table1.HT))
Struct(min=60.0, max=72.0, sum=267.0, stdev=4.65698400255, n=4, mean=66.75)
Notes
Computes a struct with the following fields:
 min (tfloat64) – Minimum value.
 max (tfloat64) – Maximum value.
 mean (tfloat64) – Mean value.
 stdev (tfloat64) – Standard deviation.
 n (tint64) – Number of non-missing records.
 sum (tfloat64) – Sum.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: StructExpression – Struct expression with fields mean, stdev, min, max, n, and sum.
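The fields can be reproduced locally with a small plain-Python sketch. It is an illustration only: stats_sketch is a hypothetical helper, None stands in for a missing record, and the HT values are chosen to be consistent with the examples on this page. The example output matches the population standard deviation (dividing by n), which is what the sketch computes.

```python
import math

def stats_sketch(records):
    """Compute min, max, mean, stdev, n and sum over non-missing records."""
    values = [float(r) for r in records if r is not None]
    n = len(values)
    if n == 0:
        raise ValueError("no non-missing records to summarise")
    total = sum(values)
    mean = total / n
    # Population variance: average squared deviation from the mean.
    variance = sum((v - mean) ** 2 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stdev": math.sqrt(variance),
        "n": n,
        "sum": total,
    }

ht_stats = stats_sketch([65, 72, 70, 60, None])
print(ht_stats)
```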

hail.expr.aggregators.product(expr)
Compute the product of all records of expr.
Examples
Compute the product of field C1:
>>> table1.aggregate(agg.product(table1.C1))
440
Notes
Missing values are ignored (treated as one).
If expr is an expression of type tint32, tint64 or tbool, then the result is an expression of type tint64. If expr is an expression of type tfloat32 or tfloat64, then the result is an expression of type tfloat64.
Warning
Boolean values are cast to integers before computing the product.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: Expression of type tint64 or tfloat64 – Product of records of expr.

hail.expr.aggregators.fraction(predicate) → hail.expr.expressions.typed_expressions.Float64Expression
Compute the fraction of records where predicate is True.
Examples
Compute the fraction of rows where SEX is “F” and HT > 65:
>>> table1.aggregate(agg.fraction((table1.SEX == 'F') & (table1.HT > 65)))
0.25
Notes
Missing values for predicate are treated as False.
Parameters: predicate (BooleanExpression) – Boolean predicate.
Returns: Expression of type tfloat64 – Fraction of records where predicate is True.

hail.expr.aggregators.hardy_weinberg(expr) → hail.expr.expressions.typed_expressions.StructExpression
Compute Hardy-Weinberg Equilibrium (HWE) p-value and heterozygosity ratio.
Examples
Compute HWE statistics per row of a dataset:
>>> dataset_result = dataset.annotate_rows(hwe = agg.hardy_weinberg(dataset.GT))
Compute HWE statistics for a single population:
>>> dataset_result = dataset.annotate_rows(
...     hwe_eas = agg.hardy_weinberg(agg.filter(dataset.pop == 'EAS', dataset.GT)))
Notes
This method returns a struct expression with the following fields:
 r_expected_het_freq (tfloat64) – Ratio of observed to expected heterozygote frequency.
 p_hwe (tfloat64) – Hardy-Weinberg p-value.
Hail computes the exact p-value with mid-p-value correction, i.e. the probability of a less-likely outcome plus one-half the probability of an equally-likely outcome. See this document for details on the Levene-Haldane distribution and references.
Warning
Non-diploid calls (ploidy != 2) are not included in statistics. It is assumed the row is biallelic. Use split_multi() to split multiallelic variants before computing statistics.
Parameters: expr (CallExpression) – Call for which to compute Hardy-Weinberg statistics.
Returns: StructExpression – Struct expression with fields r_expected_het_freq and p_hwe.

hail.expr.aggregators.explode(expr) → hail.expr.expressions.base_expression.Aggregable
Explode an array or set expression to aggregate the elements of all records.
Examples
Compute the mean of all elements in fields C1, C2, and C3:
>>> table1.aggregate(agg.mean(agg.explode([table1.C1, table1.C2, table1.C3])))
24.8333333333
Compute the set of all observed elements in the filters field (Set[String]):
>>> dataset.aggregate_rows(agg.collect_as_set(agg.explode(dataset.filters)))
{'VQSRTrancheSNP99.80to99.90',
 'VQSRTrancheINDEL99.95to100.00',
 'VQSRTrancheINDEL99.00to99.50',
 'VQSRTrancheINDEL97.00to99.00',
 'VQSRTrancheSNP99.95to100.00',
 'VQSRTrancheSNP99.60to99.80',
 'VQSRTrancheINDEL99.50to99.90',
 'VQSRTrancheSNP99.90to99.95',
 'VQSRTrancheINDEL96.00to97.00'}
Notes
This method can be used with aggregator functions to aggregate the elements of collection types (tarray and tset).
The result of the explode() and filter() methods is an Aggregable expression which can be used only in aggregator methods.
Parameters: expr (CollectionExpression) – Expression of type tarray or tset.
Returns: Aggregable – Aggregable expression.

hail.expr.aggregators.filter(condition, expr) → hail.expr.expressions.base_expression.Aggregable
Filter records according to a predicate.
Examples
Collect the ID field where HT >= 70:
>>> table1.aggregate(agg.collect(agg.filter(table1.HT >= 70, table1.ID)))
[2, 3]
Notes
This method can be used with aggregator functions to remove records from aggregation.
The result of the explode() and filter() methods is an Aggregable expression which can be used only in aggregator methods.
Parameters:
 condition (BooleanExpression or function ((arg) -> BooleanExpression)) – Filter expression, or a function to evaluate for each record.
 expr (Expression) – Expression to filter.
Returns: Aggregable – Aggregable expression.
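One way to picture how filter() and explode() reshape the stream of records before an aggregator sees it is the plain-Python analogy below. It is not Hail code: filter_sketch and explode_sketch are hypothetical generator helpers, the HT values are chosen to be consistent with the examples on this page, and the FILTERS sets are invented for illustration.

```python
def filter_sketch(rows, condition, expr):
    """Yield expr(row) only for rows where condition(row) is true."""
    return (expr(row) for row in rows if condition(row))

def explode_sketch(rows, expr):
    """Yield every element of the collection expr(row), across all rows."""
    return (element for row in rows for element in expr(row))

rows = [
    {"ID": 1, "HT": 65, "FILTERS": {"LowQual"}},
    {"ID": 2, "HT": 72, "FILTERS": set()},
    {"ID": 3, "HT": 70, "FILTERS": {"LowQual", "LowDepth"}},
    {"ID": 4, "HT": 60, "FILTERS": {"LowDepth"}},
]

# Analogue of collect(filter(HT >= 70, ID)).
tall_ids = list(filter_sketch(rows, lambda r: r["HT"] >= 70, lambda r: r["ID"]))
print(tall_ids)

# Analogue of collect_as_set(explode(FILTERS)).
observed_filters = set(explode_sketch(rows, lambda r: r["FILTERS"]))
print(observed_filters)
```

In both cases the helper only changes which records flow into the aggregation; the aggregation itself (here list and set) is applied afterwards, which is why the results are usable only inside aggregator methods.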

hail.expr.aggregators.inbreeding(expr, prior) → hail.expr.expressions.typed_expressions.StructExpression
Compute inbreeding statistics on calls.
Examples
Compute inbreeding statistics per column:
>>> dataset_result = dataset.annotate_cols(IB = agg.inbreeding(dataset.GT, dataset.variant_qc.AF))
>>> dataset_result.cols().show()
+----------------+--------------+-------------+------------------+------------------+
| s              |    IB.f_stat | IB.n_called | IB.expected_homs | IB.observed_homs |
+----------------+--------------+-------------+------------------+------------------+
| str            |      float64 |       int64 |          float64 |            int64 |
+----------------+--------------+-------------+------------------+------------------+
| C1046::HG02024 | -1.23867e-01 |         338 |      2.96180e+02 |              291 |
| C1046::HG02025 |  2.02944e-02 |         339 |      2.97151e+02 |              298 |
| C1046::HG02026 |  5.47269e-02 |         336 |      2.94742e+02 |              297 |
| C1047::HG00731 | -1.89046e-02 |         337 |      2.95779e+02 |              295 |
| C1047::HG00732 |  1.38718e-01 |         337 |      2.95202e+02 |              301 |
| C1047::HG00733 |  3.50684e-01 |         338 |      2.96418e+02 |              311 |
| C1048::HG02024 | -1.95603e-01 |         338 |      2.96180e+02 |              288 |
| C1048::HG02025 |  2.02944e-02 |         339 |      2.97151e+02 |              298 |
| C1048::HG02026 |  6.74296e-02 |         338 |      2.96180e+02 |              299 |
| C1049::HG00731 | -1.00467e-02 |         337 |      2.95418e+02 |              295 |
+----------------+--------------+-------------+------------------+------------------+
Notes
E is the total number of expected homozygous calls, given by the sum of 1 - 2.0 * prior * (1 - prior) across records.
O is the observed number of homozygous calls across records.
N is the number of non-missing calls.
F is the inbreeding coefficient, and is computed by: (O - E) / (N - E).
This method returns a struct expression with four fields:
 f_stat (tfloat64) – F, the inbreeding coefficient.
 n_called (tint64) – N, the number of non-missing calls.
 expected_homs (tfloat64) – E, the expected number of homozygous calls.
 observed_homs (tint64) – O, the observed number of homozygous calls.
Parameters:
 expr (CallExpression) – Call expression.
 prior (Expression of type tfloat64) – Alternate allele frequency prior.
Returns: StructExpression – Struct expression with fields f_stat, n_called, expected_homs, observed_homs.
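The definitions of E, O, N and F in the notes translate directly into a few lines of plain Python. This is a sketch of the arithmetic only, not Hail's implementation: inbreeding_sketch and f_stat are hypothetical names, a diploid call is modelled as a tuple of two allele indices, and None stands in for a missing call.

```python
def f_stat(observed_homs, expected_homs, n_called):
    """Inbreeding coefficient F = (O - E) / (N - E)."""
    return (observed_homs - expected_homs) / (n_called - expected_homs)

def inbreeding_sketch(calls, priors):
    """Accumulate N, E and O over (call, prior) pairs, skipping missing calls."""
    n_called = 0
    expected_homs = 0.0
    observed_homs = 0
    for call, prior in zip(calls, priors):
        if call is None:
            continue  # missing calls do not count towards N, E or O
        n_called += 1
        expected_homs += 1 - 2.0 * prior * (1 - prior)
        if call[0] == call[1]:
            observed_homs += 1
    return {
        "f_stat": f_stat(observed_homs, expected_homs, n_called),
        "n_called": n_called,
        "expected_homs": expected_homs,
        "observed_homs": observed_homs,
    }

# Three called genotypes and one missing call, all at allele frequency 0.5.
result = inbreeding_sketch([(0, 0), (0, 1), (1, 1), None], [0.5, 0.5, 0.5, 0.5])
print(result)

# Recomputing F from the rounded O, E and N of the first row in the table above.
print(f_stat(291, 296.180, 338))
```

A negative F means fewer homozygous calls were observed than expected, a positive F means more. The formula is undefined when N equals E, for example when every prior is 0 or 1.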

hail.expr.aggregators.call_stats(call, alleles) → hail.expr.expressions.typed_expressions.StructExpression
Compute useful call statistics.
Examples
Compute call statistics per row:
>>> dataset_result = dataset.annotate_rows(gt_stats = agg.call_stats(dataset.GT, dataset.alleles))
>>> dataset_result.rows().key_by('locus').select('gt_stats').show()
+---------------+--------------+----------------+-------------+---------------------------+
| locus         | gt_stats.AC  | gt_stats.AF    | gt_stats.AN | gt_stats.homozygote_count |
+---------------+--------------+----------------+-------------+---------------------------+
| locus<GRCh37> | array<int32> | array<float64> |       int32 | array<int32>              |
+---------------+--------------+----------------+-------------+---------------------------+
| 20:10579373   | [199,1]      | [0.995,0.005]  |         200 | [99,0]                    |
| 20:13695607   | [177,23]     | [0.885,0.115]  |         200 | [77,0]                    |
| 20:13698129   | [198,2]      | [0.99,0.01]    |         200 | [98,0]                    |
| 20:14306896   | [142,58]     | [0.71,0.29]    |         200 | [51,9]                    |
| 20:14306953   | [121,79]     | [0.605,0.395]  |         200 | [38,17]                   |
| 20:15948325   | [172,2]      | [0.989,0.012]  |         174 | [85,0]                    |
| 20:15948326   | [174,8]      | [0.956,0.043]  |         182 | [83,0]                    |
| 20:17479423   | [199,1]      | [0.995,0.005]  |         200 | [99,0]                    |
| 20:17600357   | [79,121]     | [0.395,0.605]  |         200 | [24,45]                   |
| 20:17640833   | [193,3]      | [0.985,0.015]  |         196 | [95,0]                    |
+---------------+--------------+----------------+-------------+---------------------------+
Notes
This method is meaningful for computing call metrics per variant, but not especially meaningful for computing metrics per sample.
This method returns a struct expression with four fields:
 AC (tarray of tint32) – Allele counts. One element for each allele, including the reference.
 AF (tarray of tfloat64) – Allele frequencies. One element for each allele, including the reference.
 AN (tint32) – Allele number. The total number of called alleles, or the number of non-missing calls * 2.
 homozygote_count (tarray of tint32) – Homozygote genotype counts for each allele, including the reference. Only diploid genotype calls are counted.
Parameters:
 call (CallExpression)
 alleles (ArrayStringExpression) – Variant alleles.
Returns: StructExpression – Struct expression with fields AC, AF, AN, and homozygote_count.
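The relationship between the four fields can be illustrated with a plain-Python sketch for diploid calls. It is not Hail's implementation: call_stats_sketch is a hypothetical helper, a call is modelled as a tuple of allele indices, None stands in for a missing call, and returning None for AF when nothing is called is an assumption of the sketch.

```python
def call_stats_sketch(calls, n_alleles):
    """Compute AC, AF, AN and homozygote_count from a list of calls."""
    allele_counts = [0] * n_alleles
    homozygote_count = [0] * n_alleles
    for call in calls:
        if call is None:
            continue  # missing calls contribute no alleles
        for allele in call:
            allele_counts[allele] += 1
        # Only diploid calls with two identical alleles count as homozygotes.
        if len(call) == 2 and call[0] == call[1]:
            homozygote_count[call[0]] += 1
    allele_number = sum(allele_counts)
    if allele_number > 0:
        allele_freqs = [count / allele_number for count in allele_counts]
    else:
        allele_freqs = None  # assumption: no frequencies without called alleles
    return {
        "AC": allele_counts,
        "AF": allele_freqs,
        "AN": allele_number,
        "homozygote_count": homozygote_count,
    }

gt_stats = call_stats_sketch([(0, 0), (0, 1), (1, 1), None, (0, 0)], n_alleles=2)
print(gt_stats)
```

With four non-missing diploid calls, AN is 8 (calls * 2), AC sums to AN, and AF is AC divided by AN, matching the pattern in the table above.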

hail.expr.aggregators.info_score(gp) → hail.expr.expressions.typed_expressions.StructExpression
Compute the IMPUTE information score.
Examples
Calculate the info score per variant:
>>> gen_mt = hl.import_gen('data/example.gen', sample_file='data/example.sample')
>>> gen_mt = gen_mt.annotate_rows(info_score = hl.agg.info_score(gen_mt.GP))
Calculate group-specific info scores per variant:
>>> gen_mt = hl.import_gen('data/example.gen', sample_file='data/example.sample')
>>> gen_mt = gen_mt.annotate_cols(is_case = hl.rand_bool(0.5))
>>> gen_mt = gen_mt.annotate_rows(info_score_case = hl.agg.info_score(hl.agg.filter(gen_mt.is_case, gen_mt.GP)),
...                               info_score_ctrl = hl.agg.info_score(hl.agg.filter(~gen_mt.is_case, gen_mt.GP)))
Notes
The result of info_score() is a struct with two fields:
 score (float64) – Info score.
 n_included (int32) – Number of non-missing samples included in the calculation.
We implemented the IMPUTE info measure as described in the supplementary information from Marchini & Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics (2010). To calculate the info score \(I_{A}\) for one SNP:
\[\begin{split}I_{A} = \begin{cases} 1 - \frac{\sum_{i=1}^{N}(f_{i} - e_{i}^2)}{2N\hat{\theta}(1 - \hat{\theta})} & \text{when } \hat{\theta} \in (0, 1) \\ 1 & \text{when } \hat{\theta} = 0, \hat{\theta} = 1\\ \end{cases}\end{split}\]
 \(N\) is the number of samples with imputed genotype probabilities [\(p_{ik} = P(G_{i} = k)\) where \(k \in \{0, 1, 2\}\)]
 \(e_{i} = p_{i1} + 2p_{i2}\) is the expected genotype per sample
 \(f_{i} = p_{i1} + 4p_{i2}\)
 \(\hat{\theta} = \frac{\sum_{i=1}^{N}e_{i}}{2N}\) is the MLE for the population minor allele frequency
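The formula can be evaluated by hand with a short plain-Python sketch. It follows the definitions of \(e_{i}\), \(f_{i}\) and \(\hat{\theta}\) given here but is not Hail's implementation: info_score_sketch is a hypothetical name, each genotype probability triple is a tuple (p0, p1, p2), None stands in for a sample whose distribution is missing, and returning a score of None when no samples are included is an assumption of the sketch.

```python
def info_score_sketch(gps):
    """Return (score, n_included) for a list of (p0, p1, p2) triples."""
    included = [gp for gp in gps if gp is not None]
    n = len(included)
    if n == 0:
        return None, 0  # assumption: no score without any included sample
    e = [p1 + 2 * p2 for _, p1, p2 in included]  # expected genotype
    f = [p1 + 4 * p2 for _, p1, p2 in included]  # expected squared genotype
    theta = sum(e) / (2 * n)
    if theta == 0 or theta == 1:
        return 1.0, n
    numerator = sum(f_i - e_i ** 2 for f_i, e_i in zip(f, e))
    score = 1 - numerator / (2 * n * theta * (1 - theta))
    return score, n

# Genotypes known with certainty: each f_i - e_i^2 is zero, so the score is 1.
certain = info_score_sketch([(1, 0, 0), (0, 1, 0), (0, 0, 1)])
print(certain)

# Probabilities equal to Hardy-Weinberg proportions at theta = 0.5: the score is 0.
uninformative = info_score_sketch([(0.25, 0.5, 0.25), (0.25, 0.5, 0.25), None])
print(uninformative)
```

The term \(f_{i} - e_{i}^2\) is the variance of sample i's genotype under its imputed distribution, so the score falls from 1 towards 0 as the imputation becomes less certain.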
Hail will not generate identical results to QCTOOL for the following reasons:
 Hail automatically removes genotype probability distributions that do not meet certain requirements on data import with import_gen() and import_bgen().
 Hail does not use the population frequency to impute genotype probabilities when a genotype probability distribution has been set to missing.
 Hail calculates the same statistic for sex chromosomes as autosomes while QCTOOL incorporates sex information.
 The floating point number Hail stores for each genotype probability is slightly different than the original data due to rounding and normalization of probabilities.
Warning
 The info score Hail reports can differ substantially from QCTOOL when a SNP has a high missing rate.
 The gp array must contain 3 elements, and its elements may not be missing.
 If the genotype data was not imported using the import_gen() or import_bgen() functions, then the results for all variants will be score = NA and n_included = 0.
 It only makes semantic sense to compute the info score per variant. While the aggregator will run in any context if its arguments are the right type, the results are only meaningful in a narrow context.
Parameters: gp (ArrayNumericExpression) – Genotype probability array. Must have 3 elements, all of which must be defined.
Returns: StructExpression – Struct with fields score and n_included.

hail.expr.aggregators.hist(expr, start, end, bins) → hail.expr.expressions.typed_expressions.StructExpression
Compute binned counts of a numeric expression.
Examples
Compute a histogram of field GQ:
>>> dataset.aggregate_entries(agg.hist(dataset.GQ, 0, 100, 10))
Struct(bin_edges=[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
       bin_freq=[2194, 637, 2450, 1081, 518, 402, 11168, 1918, 1379, 11973],
       n_smaller=0,
       n_larger=0)
Notes
This method returns a struct expression with four fields:
 bin_edges (tarray of tfloat64): Bin edges. Bin i contains values in the left-inclusive, right-exclusive range [ bin_edges[i], bin_edges[i+1] ).
 bin_freq (tarray of tint64): Bin frequencies. The number of records found in each bin.
 n_smaller (tint64): The number of records smaller than the start of the first bin.
 n_larger (tint64): The number of records larger than the end of the last bin.
Parameters:
 expr (NumericExpression) – Target numeric expression.
 start (int or float) – Start of histogram range.
 end (int or float) – End of histogram range.
 bins (int or float) – Number of bins.
Returns: StructExpression – Struct expression with fields bin_edges, bin_freq, n_smaller, and n_larger.
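The binning rule can be sketched in plain Python as below. This illustrates the four fields and is not Hail's implementation: hist_sketch is a hypothetical helper and None stands in for a missing record. The notes do not say where a value exactly equal to end goes, so counting it in the last bin is an assumption of this sketch.

```python
def hist_sketch(values, start, end, bins):
    """Count values into equal-width bins covering [start, end]."""
    width = (end - start) / bins
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = 0
    n_larger = 0
    for value in values:
        if value is None:
            continue  # missing records are not counted anywhere
        if value < start:
            n_smaller += 1
        elif value > end:
            n_larger += 1
        elif value == end:
            bin_freq[-1] += 1  # assumption: the last bin includes its right edge
        else:
            # Left-inclusive, right-exclusive: value 10 with width 10 lands in bin 1.
            index = min(int((value - start) / width), bins - 1)
            bin_freq[index] += 1
    return {
        "bin_edges": bin_edges,
        "bin_freq": bin_freq,
        "n_smaller": n_smaller,
        "n_larger": n_larger,
    }

gq_hist = hist_sketch([0, 5, 10, 99, 100, -1, 101, None], 0, 100, 10)
print(gq_hist)
```

There is always one more bin edge than there are bins, and the non-missing records are split between bin_freq, n_smaller and n_larger.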