VariantDatasetCombiner
- class hail.vds.combiner.VariantDatasetCombiner[source]
A restartable and failure-tolerant method for combining one or more GVCFs and Variant Datasets.
Examples
A Variant Dataset comprises one or more sequences. A new Variant Dataset is constructed from GVCF files and/or extant Variant Datasets. For example, the following produces a new Variant Dataset from four GVCF files containing whole genome sequences:
```python
gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]
combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
)
combiner.run()
vds = hl.read_vds('gs://bucket/dataset.vds')
```
The following combines four new samples from GVCFs with multiple extant Variant Datasets:
```python
gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]
vdses = [
    'gs://bucket/hgdp.vds',
    'gs://bucket/1kg.vds',
]
combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    save_path='gs://1-day-temp-bucket/combiner-plan.json',
    gvcf_paths=gvcfs,
    vds_paths=vdses,
    use_genome_default_intervals=True,
)
combiner.run()
vds = hl.read_vds('gs://bucket/dataset.vds')
```
The speed of the Variant Dataset Combiner critically depends on data partitioning. Although the partitioning is fully customizable, two high-quality partitioning strategies are available by default, one for exomes and one for genomes. These partitioning strategies can be enabled, respectively, with the parameters use_exome_default_intervals=True and use_genome_default_intervals=True.

The combiner serializes itself to save_path so that it can be restarted after failure.
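To build intuition for the two default partition sizes (60,000,000 basepairs for exomes, 1,200,000 for genomes, per the class attributes below), here is a rough back-of-the-envelope sketch of the import partition counts they imply. This is illustrative arithmetic only, not code from Hail; the 3.1 billion basepair genome length is an assumed approximation for a human reference genome.

```python
import math

# Approximate basepair length of a human reference genome (assumption).
GENOME_LENGTH = 3_100_000_000

# Default partition sizes taken from VariantDatasetCombiner's class attributes.
DEFAULT_EXOME_INTERVAL_SIZE = 60_000_000
DEFAULT_GENOME_INTERVAL_SIZE = 1_200_000

# Sparse exome data tolerates large intervals; dense genome data needs many more.
exome_partitions = math.ceil(GENOME_LENGTH / DEFAULT_EXOME_INTERVAL_SIZE)
genome_partitions = math.ceil(GENOME_LENGTH / DEFAULT_GENOME_INTERVAL_SIZE)

print(exome_partitions)   # → 52
print(genome_partitions)  # → 2584
```

The two orders of magnitude between the counts reflect the density difference between exome and genome data: each partition targets a similar amount of work, not a similar span of the genome.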
- Parameters:
  - save_path (str) – The file path to store this VariantDatasetCombiner plan. A failed or interrupted execution can be restarted using this plan.
  - output_path (str) – The location to store the new VariantDataset.
  - temp_path (str) – The location to store temporary intermediates. We recommend using a bucket with an automatic deletion or lifecycle policy.
  - reference_genome (ReferenceGenome) – The reference genome to which all inputs (GVCFs and Variant Datasets) are aligned.
  - branch_factor (int) – The number of Variant Datasets to combine at once.
  - target_records (int) – The target number of variants per partition.
  - gvcf_batch_size (int) – The number of GVCFs to combine into a Variant Dataset at once.
  - contig_recoding (dict mapping str to str, or None) – This mapping is applied to GVCF contigs before importing them into Hail. This is used to handle GVCFs containing invalid contig names. For example, GRCh38 GVCFs which contain the contig "1" instead of the correct "chr1".
  - vdses (list of VDSMetadata) – A list of Variant Datasets to combine. Each dataset is identified by a VDSMetadata, which is a pair of a path and the number of samples in said Variant Dataset.
  - gvcfs (list of str) – A list of paths of GVCF files to combine.
  - gvcf_sample_names (list of str, or None) – List of names to use for the samples from the GVCF files. Must be the same length as gvcfs. Must be specified if gvcf_external_header is specified.
  - gvcf_external_header (str or None) – A path to a file containing a VCF header which is applied to all GVCFs. Must be specified if gvcf_sample_names is specified.
  - gvcf_import_intervals (list of Interval) – A list of intervals defining how to partition the GVCF files. The same partitioning is used for all GVCF files. Finer partitioning yields more parallelism but less work per task.
  - gvcf_info_to_keep (list of str, or None) – GVCF INFO fields to keep in the gvcf_info entry field. By default, all fields except END and DP are kept.
  - gvcf_reference_entry_fields_to_keep (list of str, or None) – Genotype fields to keep in the reference table. If empty, the first 10,000 reference block rows of mt will be sampled and all fields found to be defined other than GT, AD, and PL will be entry fields in the resulting reference matrix in the dataset.
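As an illustration of the contig_recoding parameter's shape, the following builds a mapping from bare GRCh37-style contig names to the "chr"-prefixed GRCh38 names. This is a sketch; verify the contig names actually present in your GVCF headers before using such a mapping.

```python
# Map bare contig names to GRCh38-style names ("1" -> "chr1", etc.).
contig_recoding = {str(i): f"chr{i}" for i in range(1, 23)}
contig_recoding.update({"X": "chrX", "Y": "chrY", "MT": "chrM"})

print(contig_recoding["1"])   # → chr1
print(contig_recoding["MT"])  # → chrM
```

A dict of this shape can then be passed as the contig_recoding argument when constructing the combiner.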
Attributes
- default_exome_interval_size – A reasonable partition size in basepairs given the density of exomes.
- default_genome_interval_size – A reasonable partition size in basepairs given the density of genomes.
- finished – Have all GVCFs and input Variant Datasets been combined?
- gvcf_batch_size – The number of GVCFs to combine into a Variant Dataset at once.
Methods
- load – Load a VariantDatasetCombiner from path.
- run – Combine the specified GVCFs and Variant Datasets.
- save – Save a VariantDatasetCombiner to its save_path.
- step – Run one layer of combinations.
- to_dict – A serializable representation of this combiner.
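Because the combiner works in layers, with each layer merging up to branch_factor datasets into one, the total number of layers grows logarithmically with the number of inputs. The following is a simplified model of that layer count, not Hail's actual scheduler:

```python
import math

def num_combine_layers(n_inputs: int, branch_factor: int) -> int:
    """Number of merge layers needed to reduce n_inputs datasets to one,
    merging up to branch_factor datasets per step (simplified model)."""
    layers = 0
    while n_inputs > 1:
        # Each layer replaces every group of branch_factor datasets with one.
        n_inputs = math.ceil(n_inputs / branch_factor)
        layers += 1
    return layers

print(num_combine_layers(4, 100))       # → 1  (four GVCFs fit in one layer)
print(num_combine_layers(10_000, 100))  # → 2
```

This is why the combiner scales to very large cohorts: even tens of thousands of inputs need only a handful of layers at a moderate branch_factor.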
- default_exome_interval_size = 60000000
A reasonable partition size in basepairs given the density of exomes.
- default_genome_interval_size = 1200000
A reasonable partition size in basepairs given the density of genomes.
- property finished
Have all GVCFs and input Variant Datasets been combined?
- property gvcf_batch_size
The number of GVCFs to combine into a Variant Dataset at once.
- static load(path)[source]
  Load a VariantDatasetCombiner from path.
- save()[source]
  Save a VariantDatasetCombiner to its save_path.
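The save/load pair is what makes the combiner restartable: save() writes the plan to save_path, and load() reconstructs an equivalent combiner from it. A simplified sketch of that round-trip follows, using a plain dict and JSON; the field names here are hypothetical stand-ins, and Hail's actual on-disk format is its own serialization of the combiner state.

```python
import json
import os
import tempfile

# A toy stand-in for a combiner plan (hypothetical fields, for illustration).
plan = {
    "output_path": "gs://bucket/dataset.vds",
    "gvcf_paths": ["gs://bucket/sample_10123.g.vcf.bgz"],
    "finished": False,
}

# "save": serialize the plan to its save_path.
save_path = os.path.join(tempfile.mkdtemp(), "combiner-plan.json")
with open(save_path, "w") as f:
    json.dump(plan, f)

# "load": a restarted job reconstructs the same plan from disk and
# resumes from whatever work the plan records as already done.
with open(save_path) as f:
    restored = json.load(f)

print(restored == plan)  # → True
```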