GroupedTable

class hail.GroupedTable(parent: hail.table.Table, groups)

Table grouped by row that can be aggregated into a new table.

There are only two operations on a grouped table, GroupedTable.partition_hint() and GroupedTable.aggregate().

Methods

__init__         Initialize self.
aggregate        Aggregate by group, used after Table.group_by().
partition_hint   Set the target number of partitions for aggregation.

aggregate(**named_exprs)

Aggregate by group, used after Table.group_by().

Examples

Compute the mean value of X and the sum of Z per unique ID:

>>> table_result = (table1.group_by(table1.ID)
...                       .aggregate(meanX = agg.mean(table1.X), sumZ = agg.sum(table1.Z)))

Group by a height bin and compute sex ratio per bin:

>>> table_result = (table1.group_by(height_bin = table1.HT // 20)
...                       .aggregate(fraction_female = agg.fraction(table1.SEX == 'F')))

Parameters: named_exprs (varargs of Expression) – Aggregation expressions.
Returns: Table – Aggregated table.

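To make the result of the first example concrete, the following is a minimal pure-Python sketch of what a group-by followed by a mean and a sum computes. It does not use Hail and is not how Hail implements aggregation; the rows and the helper group_and_aggregate are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for table1; the values are made up.
rows = [
    {"ID": 1, "X": 5, "Z": 4},
    {"ID": 1, "X": 7, "Z": 6},
    {"ID": 2, "X": 6, "Z": 3},
]


def group_and_aggregate(rows, key):
    """Return one output row per distinct key value, with meanX and sumZ."""
    # Collect the rows that share each key value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Reduce every group to a single row. Keys are sorted here only to
    # make the output order deterministic.
    result = []
    for k in sorted(groups):
        members = groups[k]
        result.append({
            key: k,
            "meanX": sum(r["X"] for r in members) / len(members),
            "sumZ": sum(r["Z"] for r in members),
        })
    return result


table_result = group_and_aggregate(rows, "ID")
```

Each distinct ID yields exactly one output row, so the result can be much smaller than the input when many rows share a key.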
partition_hint(n) → GroupedTable

Set the target number of partitions for aggregation.

Examples

Use partition_hint in a Table.group_by() / GroupedTable.aggregate() pipeline:

>>> table_result = (table1.group_by(table1.ID)
...                       .partition_hint(5)
...                       .aggregate(meanX = agg.mean(table1.X), sumZ = agg.sum(table1.Z)))

Notes

Hail’s query optimizer cannot yet sample records at every stage of a pipeline, so some operations need an explicit hint.

The default number of partitions for GroupedTable.aggregate() is the number of partitions in the upstream table. If the aggregation greatly reduces the size of the table, providing a hint for the target number of partitions can accelerate downstream operations.
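A rough back-of-the-envelope sketch of why the hint matters, in plain Python with made-up numbers (this is arithmetic only and does not call Hail): a partition cannot hold less than one row without being empty, so when an aggregation leaves fewer rows than partitions, some partitions must be empty. The helper min_empty_partitions and all figures below are hypothetical.

```python
def min_empty_partitions(n_rows, n_partitions):
    """Lower bound on empty partitions: at most n_rows partitions can hold a row."""
    return max(0, n_partitions - n_rows)


# Hypothetical pipeline: 1,000 upstream partitions, 50 distinct group keys.
upstream_partitions = 1000
n_groups = 50  # rows remaining after the aggregation

# Default behaviour described above: keep the upstream partition count.
empty_by_default = min_empty_partitions(n_groups, upstream_partitions)

# With a hint such as partition_hint(5), no partition is forced to be empty.
empty_with_hint = min_empty_partitions(n_groups, 5)
```

Under these assumptions at least 950 of the 1,000 default partitions carry no data, which is per-partition overhead that a smaller target avoids in downstream operations.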

Parameters: n (int) – Number of partitions.
Returns: GroupedTable – Same grouped table with a partition hint.