VariantDatasetCombiner

class hail.vds.combiner.VariantDatasetCombiner

A restartable and failure-tolerant method for combining one or more GVCFs and Variant Datasets.

Examples

A Variant Dataset comprises one or more sequences. A new Variant Dataset is constructed from GVCF files and/or extant Variant Datasets. For example, the following produces a new Variant Dataset from four GVCF files containing whole genome sequences:

import hail as hl

gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]

combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
)

combiner.run()

vds = hl.vds.read_vds('gs://bucket/dataset.vds')

The following combines four new samples from GVCFs with multiple extant Variant Datasets:

gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]

vdses = [
    'gs://bucket/hgdp.vds',
    'gs://bucket/1kg.vds'
]

combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    save_path='gs://1-day-temp-bucket/',
    gvcf_paths=gvcfs,
    vds_paths=vdses,
    use_genome_default_intervals=True,
)

combiner.run()

vds = hl.vds.read_vds('gs://bucket/dataset.vds')

The speed of the Variant Dataset Combiner critically depends on data partitioning. Although the partitioning is fully customizable, two high-quality default partitioning strategies are available, one for exomes and one for genomes. They are enabled with use_exome_default_intervals=True and use_genome_default_intervals=True, respectively.
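
For example, a combiner for exome data looks the same as the genome examples above but uses the exome default instead. A minimal sketch, assuming hypothetical exome GVCF paths:

import hail as hl

# Hypothetical paths; any list of exome GVCFs works the same way.
exome_gvcfs = [
    'gs://bucket/exome_sample_001.g.vcf.bgz',
    'gs://bucket/exome_sample_002.g.vcf.bgz',
]

combiner = hl.vds.new_combiner(
    output_path='gs://bucket/exomes.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=exome_gvcfs,
    use_exome_default_intervals=True,  # exome partitioning strategy
)
combiner.run()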

The combiner serializes itself to save_path so that it can be restarted after failure.
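
For example, if a run fails partway through, the saved plan can be loaded and resumed. A minimal sketch, assuming a hypothetical save path; hl.vds.load_combiner reads a plan back from storage (the static load method documented below is the class-level equivalent):

import hail as hl

# Hypothetical plan location; use the save_path given to new_combiner.
combiner = hl.vds.load_combiner('gs://1-day-temp-bucket/combiner-plan.json')
combiner.run()  # resumes from the last checkpointed step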

Parameters:
  • save_path (str) – The location to store this VariantDatasetCombiner plan. A failed execution can be restarted using this plan.

  • output_path (str) – The location to store the new VariantDataset.

  • temp_path (str) – The location to store temporary intermediates. We recommend using a bucket with an automatic deletion or lifecycle policy.

  • reference_genome (ReferenceGenome) – The reference genome to which all inputs (GVCFs and Variant Datasets) are aligned.

  • branch_factor (int) – The number of Variant Datasets to combine at once.

  • target_records (int) – The target number of variants per partition.

  • gvcf_batch_size (int) – The number of GVCFs to combine into a Variant Dataset at once.

  • contig_recoding (dict mapping str to str or None) – This mapping is applied to GVCF contigs before importing them into Hail. It is used to handle GVCFs containing invalid contig names, for example GRCh38 GVCFs that use the contig name “1” instead of the correct “chr1” (a sketch follows this parameter list).

  • vdses (list of VDSMetadata) – A list of Variant Datasets to combine. Each dataset is identified by a VDSMetadata, which is a pair of a path and the number of samples in said Variant Dataset.

  • gvcfs (list of str) – A list of paths of GVCF files to combine.

  • gvcf_sample_names (list of str or None) – List of names to use for the samples from the GVCF files. Must be the same length as gvcfs. Must be specified if gvcf_external_header is specified.

  • gvcf_external_header (str or None) – A path to a file containing a VCF header which is applied to all GVCFs. Must be specified if gvcf_sample_names is specified.

  • gvcf_import_intervals (list of Interval) – A list of intervals defining how to partition the GVCF files. The same partitioning is used for all GVCF files. Finer partitioning yields more parallelism but less work per task.

  • gvcf_info_to_keep (list of str or None) – GVCF INFO fields to keep in the gvcf_info entry field. By default, all fields except END and DP are kept.

  • gvcf_reference_entry_fields_to_keep (list of str or None) – Genotype fields to keep in the reference data. If unspecified, the first 10,000 reference block rows are sampled, and every field found to be defined, other than GT, AD, and PL, becomes an entry field of the reference data in the resulting dataset.
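
As referenced in the contig_recoding parameter above, the following sketch recodes GRCh37-style contig names in GVCFs aligned to GRCh38; the paths and the exact mapping are illustrative assumptions:

import hail as hl

# Map GRCh37-style contig names to the GRCh38 names Hail expects.
contig_recoding = {str(i): f'chr{i}' for i in range(1, 23)}
contig_recoding.update({'X': 'chrX', 'Y': 'chrY', 'MT': 'chrM'})

combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=['gs://bucket/sample_10127.g.vcf.bgz'],
    use_genome_default_intervals=True,
    contig_recoding=contig_recoding,
)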

Attributes

default_exome_interval_size

A reasonable partition size in basepairs given the density of exomes.

default_genome_interval_size

A reasonable partition size in basepairs given the density of genomes.

finished

Have all GVCFs and input Variant Datasets been combined?

gvcf_batch_size

The number of GVCFs to combine into a Variant Dataset at once.

Methods

load

Load a VariantDatasetCombiner from path.

run

Combine the specified GVCFs and Variant Datasets.

save

Save a VariantDatasetCombiner to its save_path.

step

Run one layer of combinations.

to_dict

A serializable representation of this combiner.

__eq__(other)

Return self == other.

default_exome_interval_size = 60000000

A reasonable partition size in basepairs given the density of exomes.

default_genome_interval_size = 1200000

A reasonable partition size in basepairs given the density of genomes.
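For example, at this default a roughly 3.1 Gbp reference such as GRCh38 is divided into on the order of 2,600 import intervals (3.1e9 / 1.2e6 ≈ 2,600).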

property finished

Have all GVCFs and input Variant Datasets been combined?

property gvcf_batch_size

The number of GVCFs to combine into a Variant Dataset at once.

static load(path)

Load a VariantDatasetCombiner from path.

run()

Combine the specified GVCFs and Variant Datasets.

save()

Save a VariantDatasetCombiner to its save_path.

step()

Run one layer of combinations.

run() is equivalent to calling step() repeatedly until all GVCFs and Variant Datasets have been combined.
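
A minimal sketch of driving the combiner manually with step(), checkpointing the plan before each layer in the same spirit as run(); combiner here is any constructed or loaded combiner:

while not combiner.finished:
    combiner.save()  # checkpoint the plan so a failure in this layer is restartable
    combiner.step()  # combine one layer of GVCFs and intermediate Variant Datasets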

to_dict()

A serializable representation of this combiner.
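
For example, the dictionary can be inspected or written out by hand. A sketch, assuming the returned representation is JSON-compatible, as its use in persisting plans to save_path suggests:

import json

plan = combiner.to_dict()
print(json.dumps(plan, indent=2))  # inspect the plan representation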