.. _sec-vds:

Variant Dataset
===============

The :class:`.VariantDataset` is an extra layer of abstraction on top of the Hail Matrix Table for working
with large sequencing datasets. It was initially developed in response to the gnomAD project's need
to combine, represent, and analyze 150,000 whole genomes. It has since been used on datasets as
large as 955,000 whole exomes. The :class:`.VariantDatasetCombiner` produces a
:class:`.VariantDataset` by combining any number of GVCF and/or :class:`.VariantDataset` files.

.. warning::

    Hail 0.1 also had a Variant Dataset class. Although pieces of the interfaces are similar, the two
    classes are not interchangeable and do not represent the same data.


.. currentmodule:: hail.vds

.. rubric:: Variant Dataset

.. autosummary::
    :nosignatures:
    :toctree: ./
    :template: class2.rst

    VariantDataset

.. autosummary::
    :toctree: ./

    read_vds
    filter_samples
    filter_variants
    filter_intervals
    filter_chromosomes
    sample_qc
    split_multi
    interval_coverage
    impute_sex_chromosome_ploidy
    impute_sex_chr_ploidy_from_interval_coverage
    to_dense_mt
    to_merged_sparse_mt
    truncate_reference_blocks
    merge_reference_blocks
    lgt_to_gt
    local_to_global
    store_ref_block_max_length

.. currentmodule:: hail.vds.combiner

.. rubric:: Variant Dataset Combiner

.. autosummary::
    :nosignatures:
    :toctree: ./
    :template: class2.rst

    VDSMetadata
    VariantDatasetCombiner


.. autosummary::
    :toctree: ./

    new_combiner
    load_combiner

The data model of :class:`.VariantDataset`
------------------------------------------

A VariantDataset is the Hail implementation of a data structure called the
"scalable variant call representation", or SVCR.

The Scalable Variant Call Representation (SVCR)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Like the project VCF (multi-sample VCF) representation, the scalable variant
call representation is a variant-by-sample matrix of records. There are two
fundamental differences, however:

1.  The scalable variant call representation is **sparse**. It is not a dense
    matrix with every entry populated. Reference calls are defined as intervals
    (reference blocks) exactly as they appear in the original GVCFs. Compared to
    a VCF representation, this stores **less data but more information**, and
    makes it possible to keep reference information about every site in the
    genome, not just sites at which there is variation in the current cohort. A
    VariantDataset has a component table of reference information,
    ``vds.reference_data``, which contains the sparse matrix of reference blocks.
    This matrix is keyed by locus (not locus and alleles), and contains an
    ``END`` field which denotes the last position included in the current
    reference block.


2.  The scalable variant call representation uses **local alleles**. In a VCF,
    the fields GT, AD, PL, etc. contain information that refers to alleles in the
    VCF by index. At highly multiallelic sites, the number of elements in the
    AD/PL lists grows very large (for diploid calls, PL has one entry per
    possible genotype, so its length grows quadratically with the number of
    alleles), **even though the information content does not change**. To avoid
    this superlinear scaling, the SVCR renames these fields to their "local"
    versions: LGT, LAD, LPL, etc., and adds a new field, LA (local alleles). The
    information in the local fields refers to the alleles defined per row of the
    matrix indirectly through the LA list.

    For instance, if a sample has the following information in its GVCF:

    .. code::

         Ref=G Alt=T GT=0/1 AD=5,6 PL=102,0,150

    If the alternate alleles C,T are discovered in the cohort, this sample's
    entry would look like:

    .. code::

         LA=0,2 LGT=0/1 LAD=5,6 LPL=102,0,150

    The "1" allele referred to in LGT, and the allele to which the reads in the
    second position of LAD belong, is not the allele with absolute index 1
    (**C**), but rather the allele whose index is in position 1 of the LA list.
    The *index* at position 1 (zero-based) of the LA list is 2, and the allele
    with absolute index 2 is **T**. Local alleles make it possible to keep the
    data small to match its inherent information content.
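
The index arithmetic behind local alleles can be sketched in plain Python. This
is an illustration only, not Hail's implementation; for real data the module
provides :func:`.vds.lgt_to_gt` and :func:`.vds.local_to_global`. The helper
names below are hypothetical, and the sketch assumes a diploid call at a row
whose alleles are G (reference), C and T.

.. code:: python

    def local_to_global_call(lgt, la):
        """Map each local allele index in a call to its row-level (global) index."""
        return tuple(la[i] for i in lgt)


    def local_to_global_ad(lad, la, n_alleles, fill=0):
        """Spread local allele depths into a list with one slot per row allele.

        Alleles absent from LA receive `fill` (an assumption of this sketch).
        """
        ad = [fill] * n_alleles
        for local_idx, global_idx in enumerate(la):
            ad[global_idx] = lad[local_idx]
        return ad


    def n_diploid_genotypes(n_alleles):
        """Length of a diploid PL list at a site with `n_alleles` alleles."""
        return n_alleles * (n_alleles + 1) // 2


    alleles = ["G", "C", "T"]  # assumed row alleles, reference first
    la, lgt, lad = [0, 2], (0, 1), [5, 6]

    gt = local_to_global_call(lgt, la)
    ad = local_to_global_ad(lad, la, len(alleles))
    called_alleles = [alleles[i] for i in gt]

Here ``gt`` is ``(0, 2)``, ``called_alleles`` is ``['G', 'T']``, and ``ad`` is
``[5, 0, 6]``. The local LPL list keeps three entries regardless of the cohort,
whereas a global PL list at a site with 10 alleles would need
``n_diploid_genotypes(10)``, that is 55, entries.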

Component tables
^^^^^^^^^^^^^^^^

The :class:`.VariantDataset` is made up of two component matrix tables -- the
``reference_data`` and the ``variant_data``.

The ``reference_data`` matrix table is a sparse matrix of reference blocks. It
has row key ``locus``, but does not have an ``alleles`` key or field. The column
key is the sample ID. The entries indicate regions of reference calls with
similar sequencing metadata (depth, quality, etc.), starting from
``vds.reference_data.locus.position`` and ending at ``vds.reference_data.END``
(inclusive). There is no ``GT`` call field because all calls in the reference
data are implicitly homozygous reference. (In the future, a table of ploidy by
interval may be included to allow for proper representation of structural
variation, but there is currently no standard representation for this.) A
record from a component GVCF is included in the ``reference_data`` if it
defines the END INFO field; if the GT of such a record is not reference, the
Hail VDS combiner throws an error.
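
To make the inclusive ``END`` convention concrete, the following plain-Python
sketch models the reference blocks of a single sample on a single contig. The
``RefBlock`` type, its ``dp`` field, and ``find_block`` are hypothetical names
chosen for illustration; they are not part of Hail.

.. code:: python

    from typing import List, NamedTuple, Optional


    class RefBlock(NamedTuple):
        start: int  # position of the block's locus
        end: int    # END: last position covered, inclusive
        dp: int     # example per-block metadata (depth)


    def find_block(blocks: List[RefBlock], position: int) -> Optional[RefBlock]:
        """Return the block covering `position`, or None if no block covers it."""
        for block in blocks:
            if block.start <= position <= block.end:  # END is inclusive
                return block
        return None


    def block_length(block: RefBlock) -> int:
        """Number of positions a block covers, counting both endpoints."""
        return block.end - block.start + 1


    blocks = [RefBlock(100, 149, 30), RefBlock(150, 150, 12), RefBlock(200, 260, 25)]

    covering_149 = find_block(blocks, 149)
    covering_150 = find_block(blocks, 150)
    covering_175 = find_block(blocks, 175)

Position 149 still belongs to the first block because ``END`` is inclusive,
position 150 falls in the single-base second block, and position 175 has no
reference information for this sample, so ``covering_175`` is ``None``.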


The ``variant_data`` matrix table is a sparse matrix of non-reference calls.
This table contains the complete schema from the component GVCFs, aside from
fields which are known to be defined only for reference blocks (e.g. END or
MIN_DP). A record from a component GVCF is included in the ``variant_data`` if
it does not define the END INFO field. This means that some records of the
``variant_data`` can be no-call (``./.``) or reference, depending on the
semantics of the variant caller that produced the GVCFs.
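
The rule that assigns a GVCF record to a component table can be summarized as
a small decision function. This is a sketch of the rule as stated above, not
the combiner's code: the record layout (a dict with an ``info`` dict and a
``GT`` tuple of allele indices) and the function name are assumptions, and
edge cases such as no-calls in records that define END are not modeled.

.. code:: python

    def component_for_record(record):
        """Return the name of the component table a GVCF record would land in."""
        if "END" in record["info"]:
            # Reference blocks must be homozygous reference.
            if any(allele != 0 for allele in record["GT"]):
                raise ValueError("record defines END but its GT is not reference")
            return "reference_data"
        # Everything without END goes to variant_data, even reference calls.
        return "variant_data"


    ref_block = {"info": {"END": 149}, "GT": (0, 0)}
    het_call = {"info": {}, "GT": (0, 1)}
    hom_ref_site = {"info": {}, "GT": (0, 0)}

    destinations = [
        component_for_record(r) for r in (ref_block, het_call, hom_ref_site)
    ]

The third record shows why ``variant_data`` can contain reference calls: it
has a homozygous reference genotype but no END field, so it is routed by the
absence of END rather than by its genotype.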

Building analyses on the :class:`.VariantDataset`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Analyses operating on sequencing data can largely be grouped into three categories
according to the functionality they use.

1.  **Analyses that use prebuilt methods**. Some analyses can be supported by using
    only the utility functions defined in the ``hl.vds`` module, like
    :func:`.vds.sample_qc`.

2.  **Analyses that use variant data and/or reference data separately.** Some
    pipelines need to interrogate properties of the component tables
    individually. Examples might include singleton analysis or burden tests
    (which need to look only at the variant data) or coverage analysis (which
    looks only at reference data). These pipelines should explicitly extract and
    manipulate the component tables with ``vds.variant_data`` and
    ``vds.reference_data``.

3.  **Analyses that use the full variant-by-sample matrix with variant and reference data**.
    Many pipelines require variant and reference data together. There are helper
    functions provided for producing either the sparse (containing reference
    blocks) or dense (reference information is filled in at each variant site)
    representations. For more information, see the documentation for
    :func:`.vds.to_dense_mt` and :func:`.vds.to_merged_sparse_mt`.
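
As a rough sketch, the three styles might look as follows in a single
function. It uses only functions and attributes named on this page; the
function name ``summarize_vds`` and the choice of row counts as the
"analysis" are placeholders, and the path passed in is assumed to point at an
existing VariantDataset.

.. code:: python

    def summarize_vds(vds_path):
        """Touch each of the three analysis styles on one VariantDataset."""
        # Imported inside the function so that defining it does not require Hail.
        import hail as hl

        vds = hl.vds.read_vds(vds_path)

        # 1. Prebuilt method: per-sample QC metrics computed by hl.vds.
        qc_table = hl.vds.sample_qc(vds)

        # 2. Component tables used separately.
        n_variant_rows = vds.variant_data.count_rows()
        n_reference_rows = vds.reference_data.count_rows()

        # 3. Variant and reference data together: a dense matrix table with
        #    reference information filled in at each variant site.
        dense_mt = hl.vds.to_dense_mt(vds)

        return qc_table, n_variant_rows, n_reference_rows, dense_mt

Calling ``summarize_vds`` with the path to a VariantDataset returns the QC
table, the two row counts, and the dense matrix table. A pipeline that needs
the reference blocks kept as blocks would use :func:`.vds.to_merged_sparse_mt`
in place of :func:`.vds.to_dense_mt` in the third step.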