hail.vds.truncate_reference_blocks

hail.vds.truncate_reference_blocks(ds, *, max_ref_block_base_pairs=None, ref_block_winsorize_fraction=None)

Cap reference blocks at a maximum length to permit faster interval filtering.

Examples

Truncate reference blocks to 5 kilobases:

>>> vds2 = hl.vds.truncate_reference_blocks(vds, max_ref_block_base_pairs=5000) 

Truncate the longest 1% of reference blocks to the length of the 99th percentile block:

>>> vds2 = hl.vds.truncate_reference_blocks(vds, ref_block_winsorize_fraction=0.01) 
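To make the relationship between the winsorize fraction and the resulting length cap concrete, here is a minimal pure-Python sketch. It is an illustration only, not Hail's implementation: the helper names `winsorize_cutoff` and `cap_lengths` are hypothetical, and the simple rank-based quantile used here is an assumption (Hail may compute the percentile differently, for example approximately).

```python
def winsorize_cutoff(block_lengths, fraction):
    """Return the length that the longest `fraction` of blocks would be capped to.

    Hypothetical helper for illustration; not part of the Hail API.
    """
    if not block_lengths:
        raise ValueError("block_lengths must be non-empty")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    ordered = sorted(block_lengths)
    n = len(ordered)
    # Number of blocks in the upper tail that will be shortened.
    n_capped = round(fraction * n)
    # The cutoff is the longest block that is NOT in that tail.
    return ordered[max(n - n_capped - 1, 0)]


def cap_lengths(block_lengths, cutoff):
    """Cap every length at `cutoff`, leaving shorter blocks unchanged."""
    return [min(length, cutoff) for length in block_lengths]


# 100 blocks: 98 short ones, one moderately long, one extreme outlier.
lengths = [100] * 98 + [250, 90_000]
cutoff = winsorize_cutoff(lengths, 0.01)   # the longest 1% is a single block
capped = cap_lengths(lengths, cutoff)
```

In this toy input a single outlier dominates the maximum, so winsorizing 1% of the blocks brings the longest block down from 90,000 bases to 250 while leaving the other 99 blocks untouched.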

Notes

After this function has been run, every reference block is at most ref_block_max_length bases long, and that value is stored in the global fields. vds.filter_intervals() can then filter the reference data to a set of intervals by reading only ref_block_max_length bases ahead of each interval. As a result, narrow interval queries do roughly O(data kept) work rather than O(all reference data) work.
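The read-ahead idea can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not how Hail is implemented: blocks are modeled as `(start, end)` tuples with inclusive ends, sorted by start, and `overlapping_blocks` is a hypothetical helper name.

```python
from bisect import bisect_left


def overlapping_blocks(blocks, start, end, max_len):
    """Return the blocks that overlap the inclusive interval [start, end].

    Assumes `blocks` is sorted by start position and that no block spans
    more than `max_len` bases. Hypothetical helper, not part of Hail.
    """
    # A block starting before `start - max_len` must end before `start`,
    # so the scan can begin at the first block with start >= start - max_len.
    # A 1-tuple sorts before any 2-tuple with the same first element.
    lo = bisect_left(blocks, (start - max_len,))
    hits = []
    for block_start, block_end in blocks[lo:]:
        if block_start > end:
            break  # sorted input: every later block starts past the interval
        if block_end >= start:
            hits.append((block_start, block_end))
    return hits


# Blocks capped at 100 bases each.
blocks = [(1, 100), (101, 200), (201, 300), (301, 400)]
hits = overlapping_blocks(blocks, 250, 260, max_len=100)
```

Without a known bound on block length, a block starting arbitrarily far upstream could still cover the query interval, so the scan would have to start at the beginning of the data. With the bound, the scan window is the interval width plus `max_len`.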

It is also possible to patch an existing VDS to store the max reference block length with vds.store_ref_block_max_length().

Parameters:
  • ds (VariantDataset or MatrixTable) – Dataset whose reference blocks are truncated.

  • max_ref_block_base_pairs – Maximum size of reference blocks, in base pairs.

  • ref_block_winsorize_fraction – Fraction of the reference block length distribution to truncate (winsorize). For example, 0.01 caps the longest 1% of blocks at the 99th-percentile length.

Returns:

VariantDataset or MatrixTable