VCF Combiner

Hail has functionality for combining single-sample GVCFs into a multi-sample matrix table. This process is sometimes called “joint calling”, although the implementation in Hail does not use cohort-level information to reassign individual genotype calls.

The resulting matrix table is different from a matrix table imported from a project VCF; see below for a synopsis of how to use this object.

Running the Hail GVCF combiner

The run_combiner() function is the primary entry point to the Hail GVCF combiner. A typical script for running the combiner on Google Cloud Dataproc using hailctl dataproc might look like this:

import hail as hl
hl.init(log='/home/hail/combiner.log')

path_to_input_list = 'gs://path/to/input_files.txt'  # a file with one GVCF path per line

inputs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        inputs.append(line.strip())

output_file = 'gs://path/to/combined/output.mt'  # output destination
temp_bucket = 'gs://my-temp-bucket'  # bucket for storing intermediate files
# one of the four partitioning arguments described below is required
hl.experimental.run_combiner(inputs,
                             out_file=output_file,
                             tmp_path=temp_bucket,
                             reference_genome='GRCh38',
                             use_genome_default_intervals=True)

A command-line tool is also provided as a convenient wrapper around this function. It can be run with the following syntax:

python3 -m hail.experimental.vcf_combiner SAMPLE_MAP OUT_FILE TMP_PATH [OPTIONAL ARGS...]

The following command is equivalent to the Python pipeline above:

python3 -m hail.experimental.vcf_combiner \
    gs://path/to/input_files.txt \
    gs://path/to/combined/output.mt \
    gs://my-temp-bucket \
    --use-genome-default-intervals \
    --reference-genome GRCh38 \
    --log /home/hail/combiner.log

Pipeline structure

The Hail GVCF combiner merges GVCFs hierarchically, parameterized by the branch_factor setting. The number of rounds of merges is defined as math.ceil(math.log(N_GVCFS, BRANCH_FACTOR)). With the default branch factor of 100, merging between 101 and 10,000 inputs uses 2 rounds, and merging between 10,001 and 1,000,000 inputs requires 3 rounds.
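
For example, the number of merge rounds can be computed directly from this formula; the following standalone snippet (the function and variable names here are illustrative, not part of the Hail API) reproduces the figures above:

import math

def n_rounds(n_gvcfs, branch_factor=100):
    # number of hierarchical merge rounds needed to combine n_gvcfs inputs
    return math.ceil(math.log(n_gvcfs, branch_factor))

n_rounds(10_000)     # 2
n_rounds(10_001)     # 3
n_rounds(1_000_000)  # 3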

The combiner will print the execution plan before it runs:

2020-04-01 08:37:32 Hail: INFO: GVCF combiner plan:
    Branch factor: 4
    Batch size: 4
    Combining 50 input files in 3 phases with 6 total jobs.
        Phase 1: 4 jobs corresponding to 13 intermediate output files.
        Phase 2: 1 job corresponding to 4 intermediate output files.
        Phase 3: 1 job corresponding to 1 final output file.

Pain points

The combiner can take some time on large numbers of GVCFs. This time is split between single-machine planning and compilation work that happens only on the driver machine, and jobs that take advantage of the entire cluster. For this reason, it is recommended to use clusters with autoscaling functionality to reduce the overall cost of the pipeline.

For users running on Google Dataproc, see the Google Cloud documentation for full instructions on creating autoscaling policies.

A typical YAML policy might look like:

basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    gracefulDecommissionTimeout: 120s
    scaleDownFactor: 1.0
    scaleUpFactor: 1.0
secondaryWorkerConfig:
  maxInstances: MAX_PREEMPTIBLE_INSTANCES
  weight: 1
workerConfig:
  maxInstances: 2
  minInstances: 2
  weight: 1

For MAX_PREEMPTIBLE_INSTANCES, you should fill in a value based on the number of GVCFs you are merging. For sample sizes up to about 10,000, a value of 100 should be fine.

You can start a cluster with this autoscaling policy using hailctl:

hailctl dataproc start cluster_name ...args... --autoscaling-policy=policy_id_or_uri

Working with sparse matrix tables

Sparse matrix tables are a new method of representing VCF-style data in a space-efficient way. The ‘sparse’ modifier refers to the fact that these datasets contain sample-level reference blocks, just like the input GVCFs: most of the entries in the matrix are missing, because each missing entry falls within a reference block defined at an earlier locus. While unfamiliar, this representation (1) is incrementally mergeable with other sparse matrix tables, and (2) scales with N_GVCFS, whereas project VCFs scale with N_GVCFS^1.5. The schema of a sparse matrix table also differs from the schema of a dense project VCF imported to a matrix table: instead of the GT, AD, and PL fields found in a project VCF, it has LGT, LAD, and LPL fields that carry the same information but require additional functions to interpret in combination with a sample’s local alleles field, LA.
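
To inspect these fields on a combiner output, one can print the schema (the path below is a placeholder):

import hail as hl

mt = hl.read_matrix_table('gs://path/to/combined/output.mt')
mt.describe()  # entry fields include LA, LGT, LAD, LPL, and END rather than GT, AD, and PL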

Sample Level Reference Blocks

GVCFs represent blocks of homozygous reference calls of similar qualities using one record. For example:

#CHROM  POS    ID  REF  ALT  INFO       FORMAT    SAMPLE_1
chr1    14523  .   C    .    END=15000  GT:DP:GQ  0/0:19:40

This record indicates that sample SAMPLE_1 is homozygous reference from position 14,523 through position 15,000, with an approximate GQ of 40 across the roughly 500-base-pair block. In short-read sequencing, two adjacent loci in a sample’s genome will be covered by mostly the same reads, so the quality information about these two loci is highly correlated; reference blocks explicitly represent regions of reference alleles with similar quality information.

A sparse matrix table has an entry field END that corresponds to the GVCF INFO field END. It has the same meaning, but applies only to the single column (sample) in which it resides. In the example above, the sparse matrix table will have no defined entries for this sample between chr1:14524 and chr1:15000, inclusive.
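
For example, a minimal sketch of isolating the reference-block entries (assuming mt is a sparse matrix table with an END entry field):

# reference-block entries are exactly those with a defined END field
ref_blocks = mt.filter_entries(hl.is_defined(mt.END))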

Local Alleles

The LA field constitutes a record’s local alleles, or the alleles that appeared in the original GVCF for that sample. LA is used to interpret the values of LGT (local genotype), LAD (local allele depth), and LPL (local phred-scaled genotype likelihoods). This is best explained through example:

Variant Information
-------------------
locus: chr22:10678889
alleles: ["CAT", "C", "TAT"]

Sample1 (reference block, CAT/CAT)
-------------------------
DP: 8
GQ: 21
LA: [0]
LGT: 0/0
LAD: NA
LPL: NA
END: 10678898

equivalent GT: 0/0
equivalent AD: [8, 0, 0]
equivalent PL: [0, 21, 42*, 21, 42*, 42*]

Sample1 (called CAT/TAT)
-------------------------
DP: 9
GQ: 77
LA: [0, 2]
LGT: 0/1
LAD: [3, 6]
LPL: [137, 0, 77]
END: NA

equivalent GT: 0/2
equivalent AD: [3, 0, 6]
equivalent PL: [137, 137*, 137*, 0, 137*, 77]

The LA field for the first sample only includes the reference allele (0), since this locus was the beginning of a reference block in the original GVCF. In a reference block, LA will typically be an array with only one value, 0. The LA field for the second sample, which contains a variant allele, includes the reference and that allele (0 and 2, respectively). PL entries above marked with an asterisk refer to genotypes with alleles not observed in the original GVCF; the actual value produced by a tool like GATK will be large (a low-likelihood value), but not exactly the above. As with standard GT fields, it is possible to use CallExpression.unphased_diploid_gt_index() to compute the LGT’s corresponding index into the LPL array.
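
For example, a short sketch of recovering global fields from local ones, using functions documented below (assuming mt is a sparse matrix table with LGT and LA entry fields):

# translate each local genotype call into a global genotype call
mt = mt.annotate_entries(GT=hl.experimental.lgt_to_gt(mt.LGT, mt.LA))

# index of the unphased LGT call into the LPL array
mt = mt.annotate_entries(LPL_index=mt.LGT.unphased_diploid_gt_index())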

Functions

There are a number of functions for working with sparse data. Of particular importance is densify(), which transforms a sparse matrix table to a dense project-VCF-like matrix table on the fly. While computationally expensive, this operation is necessary for many downstream analyses, and should be thought of as roughly costing as much as reading a matrix table created by importing a dense project VCF.
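
For example, a typical downstream workflow might split multiallelic variants and then densify (the path is a placeholder):

mt = hl.read_matrix_table('gs://path/to/combined/output.mt')
mt = hl.experimental.sparse_split_multi(mt)
dense_mt = hl.experimental.densify(mt)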

run_combiner(sample_paths, out_file, tmp_path, *)

Run the Hail VCF combiner, performing a hierarchical merge to create a combined sparse matrix table.

densify(sparse_mt)

Convert sparse matrix table to a dense VCF-like representation by expanding reference blocks.

sparse_split_multi(sparse_mt, *[, …])

Splits multiallelic variants on a sparse matrix table.

lgt_to_gt(lgt, la)

Transform LGT into GT using local alleles array.

hail.experimental.run_combiner(sample_paths, out_file, tmp_path, *, intervals=None, import_interval_size=None, use_genome_default_intervals=False, use_exome_default_intervals=False, header=None, sample_names=None, branch_factor=100, batch_size=100, target_records=30000, overwrite=False, reference_genome='default', contig_recoding=None, key_by_locus_and_alleles=False)[source]

Run the Hail VCF combiner, performing a hierarchical merge to create a combined sparse matrix table.

Partitioning

The partitioning of input GVCFs, which determines the maximum parallelism per file, is controlled by the four parameters below. One of these parameters must be passed to this function.

  • intervals – User-supplied intervals.

  • import_interval_size – Use intervals of this uniform size across the genome.

  • use_genome_default_intervals – Use intervals of typical uniform size for whole genome GVCFs.

  • use_exome_default_intervals – Use intervals of typical uniform size for exome GVCFs.

It is recommended that new users include either use_genome_default_intervals or use_exome_default_intervals.

Note also that the partitioning of the final, combined matrix table does not depend on the GVCF input partitioning.

Parameters
  • sample_paths (list of str) – Paths to individual GVCFs.

  • out_file (str) – Path to final combined matrix table.

  • tmp_path (str) – Path for intermediate output.

  • intervals (list of Interval or None) – Import GVCFs with specified partition intervals.

  • import_interval_size (int or None) – Import GVCFs with uniform partition intervals of specified size.

  • use_genome_default_intervals (bool) – Import GVCFs with uniform partition intervals of default size for whole-genome data.

  • use_exome_default_intervals (bool) – Import GVCFs with uniform partition intervals of default size for exome data.

  • header (str or None) – External header file to use as GVCF header for all inputs. If defined, sample_names must be defined as well.

  • sample_names (list of str or None) – Sample names, to be used with header.

  • branch_factor (int) – Combiner branch factor.

  • batch_size (int) – Combiner batch size.

  • target_records (int) – Target records per partition in each combiner phase after the first.

  • overwrite (bool) – Overwrite output file, if it exists.

  • reference_genome (str) – Reference genome for GVCF import.

  • contig_recoding (dict of (str, str), optional) – Mapping from contig name in GVCFs to contig name in the reference genome. All contigs must be present in the reference_genome, so this is useful for mapping differently-formatted data onto known references.

  • key_by_locus_and_alleles (bool) – Key by both locus and alleles in the final output.

Returns

None

hail.experimental.densify(sparse_mt)[source]

Convert sparse matrix table to a dense VCF-like representation by expanding reference blocks.

Parameters

sparse_mt (MatrixTable) – Sparse MatrixTable to densify. The first row key field must be named locus and have type locus. Must have an END entry field of type int32.

Returns

  • MatrixTable – The densified MatrixTable. The END entry field is dropped.

While computationally expensive, this operation is necessary for many downstream analyses, and should be thought of as roughly costing as much as reading a matrix table created by importing a dense project VCF.

hail.experimental.sparse_split_multi(sparse_mt, *, filter_changed_loci=False)[source]

Splits multiallelic variants on a sparse matrix table.

Analogous to split_multi_hts() (splits entry fields) for sparse representations.

Takes a dataset formatted like the output of run_combiner(). The splitting will add was_split and a_index fields, as split_multi() does. This function drops the LA (local alleles) field, as it re-computes entry fields based on the new, split global alleles.

Variants are split thus:

  • A row with only one (reference) or two (reference and alternate) alleles is unchanged, as local and global alleles are the same.

  • A row with multiple alternate alleles will be split, with one row for each alternate allele, and each row will contain two alleles: ref and alt. The reference and alternate allele will be minrepped using min_rep().

The split multi logic handles the following entry fields:

struct {
    LGT: call
    LAD: array<int32>
    DP: int32
    GQ: int32
    LPL: array<int32>
    RGQ: int32
    LPGT: call
    LA: array<int32>
    END: int32
}

All fields except for LA are optional, and only handled if they exist.

  • LA is used to find the corresponding local allele index for the desired global a_index, and then dropped from the resulting dataset. If LA does not contain the global a_index, calls will be downcoded to hom ref and PL will be set to missing.

  • LGT and LPGT are downcoded using the corresponding local a_index. They are renamed to GT and PGT respectively, as the resulting call is no longer local.

  • LAD is used to create an AD field consisting of the allele depths corresponding to the reference and global a_index alleles.

  • DP is preserved unchanged.

  • GQ is recalculated from the updated PL, if it exists, but otherwise preserved unchanged.

  • PL array elements are calculated from the minimum LPL value for all allele pairs that downcode to the desired one. (This logic is identical to the PL logic in split_multi_hts(); see the sketch after this list.) If a row has an alternate allele but it is not present in LA, the PL field is set to missing. The PL for ref/<NON_REF> in that case can be drawn from RGQ.

  • RGQ (the reference genotype quality) is preserved unchanged.

  • END is untouched.
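
To make the downcoding rule for PL concrete, the following is a minimal standalone sketch in plain Python (not Hail’s implementation; the function name is illustrative):

from itertools import combinations_with_replacement

def downcode_pl(pl, n_alleles, a_index):
    # index of the unordered diploid genotype (j, k), j <= k, in a PL array
    def gt_index(j, k):
        return k * (k + 1) // 2 + j

    new_pl = [None, None, None]  # biallelic genotypes: 0/0, 0/1, 1/1
    for j, k in combinations_with_replacement(range(n_alleles), 2):
        # downcode each allele: the split-out allele maps to 1, all others to 0
        dj = 1 if j == a_index else 0
        dk = 1 if k == a_index else 0
        g = gt_index(min(dj, dk), max(dj, dk))
        v = pl[gt_index(j, k)]
        if new_pl[g] is None or v < new_pl[g]:
            new_pl[g] = v
    return new_pl

# splitting out the first alternate allele of a triallelic site:
downcode_pl([0, 10, 20, 30, 40, 50], n_alleles=3, a_index=1)  # [0, 10, 20]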

Notes

This version of split-multi doesn’t deal with duplicate loci, in which case the explode could possibly result in out-of-order rows; the actual split_multi function also doesn’t handle that case.

It also checks that min-repping will not change the locus and will error if it does.

Unlike the normal split_multi function, sparse_split_multi will not filter out * alleles. This is because a row with a bi-allelic spanning deletion may contain reference blocks that start at this position for other samples.

Parameters
  • sparse_mt (MatrixTable) – Sparse MatrixTable to split.

  • filter_changed_loci (bool) – Rather than erroring if any REF/ALT pair changes locus under min_rep(), filter that variant instead.

Returns

MatrixTable – The split MatrixTable in sparse format.

hail.experimental.lgt_to_gt(lgt, la)[source]

Transform LGT into GT using local alleles array.

Parameters
  • lgt (CallExpression) – The LGT call to transform.

  • la (ArrayExpression of int32) – The matching local alleles (LA) array.

Returns

CallExpression