GroupedTable
- class hail.GroupedTable
Table grouped by row that can be aggregated into a new table.
There are only two operations on a grouped table: GroupedTable.partition_hint() and GroupedTable.aggregate().
Attributes
Methods
aggregate() – Aggregate by group, used after Table.group_by().
partition_hint() – Set the target number of partitions for aggregation.
- aggregate(**named_exprs)
Aggregate by group, used after Table.group_by().
Examples
Compute the mean value of X and the sum of Z per unique ID:
>>> table_result = (table1.group_by(table1.ID)
...                       .aggregate(meanX = hl.agg.mean(table1.X), sumZ = hl.agg.sum(table1.Z)))
Group by a height bin and compute sex ratio per bin:
>>> table_result = (table1.group_by(height_bin = table1.HT // 20)
...                       .aggregate(fraction_female = hl.agg.fraction(table1.SEX == 'F')))
Notes
The resulting table has a key field for each group and a value field for each aggregation. The names of the aggregation expressions must be distinct from the names of the groups.
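The shape of the result described above can be illustrated without Hail. The following is a minimal plain-Python sketch of group-then-aggregate semantics, not Hail's implementation: the `rows` data and the `group_and_aggregate` helper are hypothetical and exist only for illustration. It produces one output record per group, with the group key plus one value field per named aggregation, and rejects an aggregation name that collides with the group name.

```python
from collections import defaultdict

# Hypothetical rows standing in for table1; the values are made up.
rows = [
    {"ID": 1, "X": 5, "Z": 4},
    {"ID": 1, "X": 7, "Z": 2},
    {"ID": 2, "X": 6, "Z": 3},
]


def group_and_aggregate(rows, key, **aggregators):
    """Group rows by `key`, then apply each named aggregator to every group."""
    # Aggregation names must be distinct from the group name.
    if key in aggregators:
        raise ValueError(f"aggregation name {key!r} collides with the group name")

    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # One output record per group: the key field plus one field per aggregation.
    return [
        {key: group_key, **{name: fn(members) for name, fn in aggregators.items()}}
        for group_key, members in sorted(groups.items())
    ]


result = group_and_aggregate(
    rows,
    "ID",
    meanX=lambda members: sum(r["X"] for r in members) / len(members),
    sumZ=lambda members: sum(r["Z"] for r in members),
)

for record in result:
    print(record)
```

Each printed record has the key field `ID` followed by the value fields `meanX` and `sumZ`, mirroring the key/value layout of the aggregated table.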
- Parameters:
named_exprs (varargs of Expression) – Aggregation expressions.
- Returns:
Table – Aggregated table.
- partition_hint(n)
Set the target number of partitions for aggregation.
Examples
Use partition_hint in a Table.group_by() / GroupedTable.aggregate() pipeline:
>>> table_result = (table1.group_by(table1.ID)
...                       .partition_hint(5)
...                       .aggregate(meanX = hl.agg.mean(table1.X), sumZ = hl.agg.sum(table1.Z)))
Notes
Until Hail’s query optimizer is intelligent enough to sample records at all stages of a pipeline, it can be necessary in some places to provide some explicit hints.
The default number of partitions for GroupedTable.aggregate() is the number of partitions in the upstream table. If the aggregation greatly reduces the size of the table, providing a hint for the target number of partitions can accelerate downstream operations.
- Parameters:
n (int) – Number of partitions.
- Returns:
GroupedTable – Same grouped table with a partition hint.
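To see why a hint can matter when an aggregation shrinks a table, consider a toy model in plain Python. It does not reflect how Hail actually assigns rows to partitions: the `assign_partitions` helper, the modulo placement rule, and the upstream partition count of 200 are all assumptions chosen for illustration. The point is only that a handful of groups spread over many partitions leaves most of them empty.

```python
def assign_partitions(group_keys, n_partitions):
    """Toy model: place each integer group key in partition `key % n_partitions`."""
    partitions = [[] for _ in range(n_partitions)]
    for key in group_keys:
        partitions[key % n_partitions].append(key)
    return partitions


# Assume only 5 groups survive the aggregation.
group_keys = list(range(5))

# Default behaviour in this model: keep a hypothetical upstream count of 200.
upstream = assign_partitions(group_keys, 200)

# With a hint of 5, as in the partition_hint(5) example.
hinted = assign_partitions(group_keys, 5)

empty_upstream = sum(1 for p in upstream if not p)
empty_hinted = sum(1 for p in hinted if not p)

print(f"empty partitions without a hint: {empty_upstream} of {len(upstream)}")
print(f"empty partitions with a hint:    {empty_hinted} of {len(hinted)}")
```

In this model, downstream steps that visit every partition would do far less bookkeeping with the hinted layout, which is the kind of saving the note above describes.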