Scans

The scan module is exposed as hl.scan, e.g. hl.scan.sum.

The functions in this module perform rolling aggregations along the rows of a table, or along the rows or columns of a matrix table. The value of the scan at a given row (or column) is the result of applying the corresponding aggregator to all previous rows (or columns). Scans directly over entries are not currently supported.

For example, the count aggregator can be used as hl.scan.count to add an index along the rows of a table or the rows or columns of a matrix table; the two statements below produce identical tables:

>>> ht_with_idx = ht.add_index()
>>> ht_with_idx = ht.annotate(idx=hl.scan.count())

For example, to compute a cumulative sum for a row field in a table:

>>> ht_scan = ht.select(ht.Z, cum_sum=hl.scan.sum(ht.Z))
>>> ht_scan.show()
+-------+-------+---------+
|    ID |     Z | cum_sum |
+-------+-------+---------+
| int32 | int32 |   int64 |
+-------+-------+---------+
|     1 |     4 |       0 |
|     2 |     3 |       4 |
|     3 |     3 |       7 |
|     4 |     2 |      10 |
+-------+-------+---------+

Note that the cumulative sum is exclusive of the current row’s value. On a matrix table, to compute the cumulative number of non-reference genotype calls along the genome:

>>> ds_scan = ds.select_rows(ds.variant_qc.n_non_ref,
...                          cum_n_non_ref=hl.scan.sum(ds.variant_qc.n_non_ref))
>>> ds_scan.rows().show()
+---------------+------------+-----------+---------------+
| locus         | alleles    | n_non_ref | cum_n_non_ref |
+---------------+------------+-----------+---------------+
| locus<GRCh37> | array<str> |     int64 |         int64 |
+---------------+------------+-----------+---------------+
| 20:10579373   | ["C","T"]  |         1 |             0 |
| 20:10579398   | ["C","T"]  |         1 |             1 |
| 20:10627772   | ["C","T"]  |         2 |             2 |
| 20:10633237   | ["G","A"]  |        69 |             4 |
| 20:10636995   | ["C","T"]  |         2 |            73 |
| 20:10639222   | ["G","A"]  |        22 |            75 |
| 20:13763601   | ["A","G"]  |         2 |            97 |
| 20:16223922   | ["T","C"]  |        66 |            99 |
| 20:17479617   | ["G","A"]  |         9 |           165 |
+---------------+------------+-----------+---------------+

Scans over column fields can be done in a similar manner.

Danger

Computing the result of certain aggregators, such as hardy_weinberg_test(), can be very expensive when done for every row in a scan.”

See the Aggregators module for documentation on the behavior of specific aggregators.