Scans
The scan module is exposed as hl.scan, e.g. hl.scan.sum.
The functions in this module perform rolling aggregations along the rows of a table, or along the rows or columns of a matrix table. The value of the scan at a given row (or column) is the result of applying the corresponding aggregator to all previous rows (or columns). Scans directly over entries are not currently supported.
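The rule described above ("all previous rows") amounts to an exclusive prefix aggregation. The following is a minimal pure-Python sketch of that idea, not Hail code: the helper exclusive_scan and the sample values are hypothetical and exist only for illustration.

```python
from itertools import accumulate
import operator


def exclusive_scan(values, op, initial):
    """Return, for each position, the aggregate of all *previous* values.

    Hypothetical helper for illustration only; it is not part of Hail.
    """
    # accumulate(..., initial=...) yields len(values) + 1 running results;
    # dropping the last one leaves the state *before* each row is included.
    running = list(accumulate(values, op, initial=initial))
    return running[:-1]


z = [4, 3, 3, 2]  # illustrative row values

# Analogue of scanning with a sum aggregator: exclusive cumulative sum.
cum_sum = exclusive_scan(z, operator.add, 0)

# Analogue of scanning with a count aggregator: the number of previous
# rows, which is a zero-based row index.
idx = exclusive_scan(z, lambda n, _: n + 1, 0)

print(cum_sum)  # [0, 4, 7, 10]
print(idx)      # [0, 1, 2, 3]
```

The first row always receives the aggregator's empty-state result (0 for both sum and count), because no rows precede it.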
For example, the count aggregator can be used as hl.scan.count to add an index along the rows of a table or the rows or columns of a matrix table; the two statements below produce identical tables:
>>> ht_with_idx = ht.add_index()
>>> ht_with_idx = ht.annotate(idx=hl.scan.count())
As another example, to compute a cumulative sum for a row field in a table:
>>> ht_scan = ht.select(ht.Z, cum_sum=hl.scan.sum(ht.Z))
>>> ht_scan.show()
+-------+-------+---------+
|    ID |     Z | cum_sum |
+-------+-------+---------+
| int32 | int32 |   int64 |
+-------+-------+---------+
|     1 |     4 |       0 |
|     2 |     3 |       4 |
|     3 |     3 |       7 |
|     4 |     2 |      10 |
+-------+-------+---------+
Note that the cumulative sum is exclusive of the current row’s value.
On a matrix table, to compute the cumulative number of non-reference genotype calls along the genome:
>>> ds_scan = ds.select_rows(ds.variant_qc.n_non_ref,
...                          cum_n_non_ref=hl.scan.sum(ds.variant_qc.n_non_ref))
>>> ds_scan.rows().show()
+---------------+------------+-----------+---------------+
| locus         | alleles    | n_non_ref | cum_n_non_ref |
+---------------+------------+-----------+---------------+
| locus<GRCh37> | array<str> |     int64 |         int64 |
+---------------+------------+-----------+---------------+
| 20:10579373   | ["C","T"]  |         1 |             0 |
| 20:10579398   | ["C","T"]  |         1 |             1 |
| 20:10627772   | ["C","T"]  |         2 |             2 |
| 20:10633237   | ["G","A"]  |        69 |             4 |
| 20:10636995   | ["C","T"]  |         2 |            73 |
| 20:10639222   | ["G","A"]  |        22 |            75 |
| 20:13763601   | ["A","G"]  |         2 |            97 |
| 20:16223922   | ["T","C"]  |        66 |            99 |
| 20:17479617   | ["G","A"]  |         9 |           165 |
+---------------+------------+-----------+---------------+
Scans over column fields can be done in a similar manner.
Danger
Computing the result of certain aggregators, such as hardy_weinberg_test(), can be very expensive when done for every row in a scan.
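One way to picture this warning: an aggregation computes its final result once, whereas a scan needs a result at every row. The sketch below is a pure-Python toy model under that assumption, not a description of Hail's internals; ToyAggregator and its methods are hypothetical names.

```python
class ToyAggregator:
    """Toy stand-in for an aggregator with a costly final computation.

    Hypothetical model for illustration; it does not reflect Hail internals.
    """

    def __init__(self):
        self.state = 0
        self.result_calls = 0

    def update(self, value):
        # Cheap per-row state update.
        self.state += value

    def result(self):
        # Imagine an expensive statistical computation here.
        self.result_calls += 1
        return self.state


rows = [4, 3, 3, 2]  # illustrative values

# Aggregation: update on every row, compute the result once at the end.
agg = ToyAggregator()
for v in rows:
    agg.update(v)
total = agg.result()

# Scan: a result is needed at every row, before that row is included.
scan = ToyAggregator()
scanned = []
for v in rows:
    scanned.append(scan.result())
    scan.update(v)

print(agg.result_calls, scan.result_calls)  # 1 4
```

In this model the number of result computations grows with the number of rows, which is why an aggregator with a costly result step becomes far more expensive as a scan.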
See the Aggregators module for documentation on the behavior of specific aggregators.