Annotation Database

This database contains a curated collection of variant annotations in Hail-friendly format, for use in Hail analysis pipelines.

Currently, the annotate_variants_db() VDS method associated with this database works only if you are running Hail on the Google Cloud Platform.

To incorporate these annotations in your own Hail analysis pipeline, select the annotations you want to query from the documentation below, then copy the generated Hail code into your own analysis script.

For example, a simple Hail script to load a VCF into a VDS, annotate the VDS with CADD raw and PHRED scores using this database, and inspect the schema could look something like this:

import hail
from pprint import pprint

hc = hail.HailContext()

vds = (
    hc
    .import_vcf('gs://annotationdb/test/sample.vcf')
    .split_multi()
    .annotate_variants_db([
        'va.cadd'
    ])
)

pprint(vds.variant_schema)

This code would return the following schema:

Struct{
    rsid: String,
    qual: Double,
    filters: Set[String],
    info: Struct{
        ...
    },
    cadd: Struct{
        RawScore: Double,
        PHRED: Double
    }
}

Database Query

Select annotations by clicking on the checkboxes in the documentation, and the appropriate Hail command will be generated in the panel below.

Use the “Copy to clipboard” button to copy the generated Hail code, and paste the command into your own Hail script.

vds = (
    hc
    .read('my.vds')
    .split_multi()
    .annotate_variants_db([
        ...
    ])
)

Documentation

These annotations have been collected from a variety of publications and their accompanying datasets (usually text files). Links to the relevant publications and raw data downloads are included where applicable.


Important Notes

Multiallelic variants

Annotations in the database are keyed by biallelic variants. For some annotations, this means Hail’s split_multi() method has been used to split multiallelic variants into biallelics.

Warning

We recommend running split_multi() on your VDS before using annotate_variants_db(). annotate_variants_db() still runs if you have not split multiallelic variants, but any multiallelics in your VDS will not be annotated. Once these variants are split, the resulting biallelic variants can be annotated by the database.
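
To illustrate why unsplit multiallelics are left unannotated, here is a toy model in plain Python. It does not use Hail: the dict `toy_db`, the helper names `lookup` and `split`, and the scores are all hypothetical, and the real split_multi() does more than this naive stand-in. The only point is that a lookup keyed on a single alternate allele cannot match a site that carries several.

```python
# Toy model only: a dict stands in for the database, keyed by biallelic variants.
# Keys are (contig, position, ref, alt); the values are made-up scores.
toy_db = {
    ("1", 1000, "A", "T"): {"score": 0.1},
    ("1", 1000, "A", "G"): {"score": 0.7},
}


def lookup(variant):
    """Return the annotation for a variant, or None when no key matches."""
    contig, pos, ref, alts = variant
    if len(alts) != 1:
        # A multiallelic site has no single alt allele, so no key can match it.
        return None
    return toy_db.get((contig, pos, ref, alts[0]))


def split(variant):
    """Naive stand-in for split_multi(): one biallelic variant per alt allele."""
    contig, pos, ref, alts = variant
    return [(contig, pos, ref, [alt]) for alt in alts]


multiallelic = ("1", 1000, "A", ["T", "G"])

# Looked up as-is, the multiallelic site finds nothing.
unsplit_result = lookup(multiallelic)

# After splitting, each biallelic variant finds its own entry.
split_results = [lookup(v) for v in split(multiallelic)]
```

In this sketch `unsplit_result` is `None`, while `split_results` holds one annotation per alternate allele.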

VEP annotations

VEP annotations are included in this database under the root va.vep. To add VEP annotations, the annotate_variants_db() method runs Hail’s vep() method on your VDS. This means that your cluster must be properly initialized as described in the Running VEP section in this discussion post.

Warning

If you want to add VEP annotations to your VDS, make sure to add the initialization action gs://hail-common/vep/vep/vep85-init.sh when starting your cluster.

Gene-level annotations

Annotations beginning with va.gene. are gene-level annotations that can be used to annotate variants in your VDS. These gene-level annotations are stored in the database as keytables keyed by HGNC gene symbols.

By default, if an annotation beginning with va.gene. is given to annotate_variants_db() and no gene_key parameter is specified, the function will run VEP and parse the VEP output to define one gene symbol per variant in the VDS.

For each variant, the logic used to extract one gene symbol from the VEP output is as follows:

  • Collect all consequences found in canonical transcripts

  • Designate the most severe consequence in the collection, as defined by this hierarchy (from most severe to least severe):

    • Transcript ablation
    • Splice acceptor variant
    • Splice donor variant
    • Stop gained
    • Frameshift variant
    • Stop lost
    • Start lost
    • Transcript amplification
    • Inframe insertion
    • Missense variant
    • Protein altering variant
    • Incomplete terminal codon variant
    • Stop retained variant
    • Synonymous variant
    • Splice region variant
    • Coding sequence variant
    • Mature miRNA variant
    • 5’ UTR variant
    • 3’ UTR variant
    • Non-coding transcript exon variant
    • Intron variant
    • NMD transcript variant
    • Non-coding transcript variant
    • Upstream gene variant
    • Downstream gene variant
    • TFBS ablation
    • TFBS amplification
    • TF binding site variant
    • Regulatory region ablation
    • Regulatory region amplification
    • Feature elongation
    • Regulatory region variant
    • Feature truncation
    • Intergenic variant
  • If a canonical transcript with the most severe consequence exists, take that gene and transcript. Otherwise, take a non-canonical transcript with the most severe consequence.
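
The selection logic above can be sketched in plain Python. This is an illustration under stated assumptions, not the code annotate_variants_db() actually runs: it assumes each transcript is a dict with VEP-style fields `canonical`, `consequence_terms`, and `gene_symbol`, it writes the hierarchy as Sequence Ontology terms, and it reads the final step as "fall back to non-canonical transcripts only when no canonical transcript is present". The names `SEVERITY`, `severity_rank`, and `pick_gene` are hypothetical.

```python
# Hierarchy from the list above, most severe first, written as SO terms (assumed).
SEVERITY = [
    "transcript_ablation", "splice_acceptor_variant", "splice_donor_variant",
    "stop_gained", "frameshift_variant", "stop_lost", "start_lost",
    "transcript_amplification", "inframe_insertion", "missense_variant",
    "protein_altering_variant", "incomplete_terminal_codon_variant",
    "stop_retained_variant", "synonymous_variant", "splice_region_variant",
    "coding_sequence_variant", "mature_miRNA_variant", "5_prime_UTR_variant",
    "3_prime_UTR_variant", "non_coding_transcript_exon_variant",
    "intron_variant", "NMD_transcript_variant", "non_coding_transcript_variant",
    "upstream_gene_variant", "downstream_gene_variant", "TFBS_ablation",
    "TFBS_amplification", "TF_binding_site_variant",
    "regulatory_region_ablation", "regulatory_region_amplification",
    "feature_elongation", "regulatory_region_variant", "feature_truncation",
    "intergenic_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY)}


def severity_rank(transcript):
    """Rank of the transcript's most severe term (lower rank = more severe)."""
    terms = transcript.get("consequence_terms", [])
    # Unknown terms and empty lists sort after every known term.
    return min((RANK.get(t, len(SEVERITY)) for t in terms), default=len(SEVERITY))


def pick_gene(transcripts):
    """Pick one gene symbol for a variant, preferring canonical transcripts."""
    canonical = [t for t in transcripts if t.get("canonical") == 1]
    pool = canonical if canonical else transcripts  # assumed fallback rule
    if not pool:
        return None
    return min(pool, key=severity_rank).get("gene_symbol")


# Made-up transcripts for one variant; GENE_C is non-canonical.
example = [
    {"gene_symbol": "GENE_A", "canonical": 1,
     "consequence_terms": ["intron_variant"]},
    {"gene_symbol": "GENE_B", "canonical": 1,
     "consequence_terms": ["missense_variant", "splice_region_variant"]},
    {"gene_symbol": "GENE_C", "consequence_terms": ["stop_gained"]},
]
chosen = pick_gene(example)
```

Here `chosen` is `"GENE_B"`: the non-canonical stop-gained transcript is ignored because canonical transcripts exist, and missense outranks intron among the canonical ones.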

Though this is the default logic, you may wish to define gene symbols differently. One way to do so while still using the VEP output is to add VEP annotations to your VDS, derive a gene symbol variant annotation from the VEP output however you wish, and then pass that annotation to annotate_variants_db() through the gene_key parameter.

Here’s an example that uses the gene symbol from the first VEP transcript:

import hail
from pprint import pprint

hc = hail.HailContext()

vds = (
    hc
    .import_vcf('gs://annotationdb/test/sample.vcf')
    .split_multi()
    .annotate_variants_db('va.vep')
    .annotate_variants_expr('va.my_gene = va.vep.transcript_consequences[0].gene_symbol')
    .annotate_variants_db('va.gene.constraint.pli', gene_key='va.my_gene')
)

pprint(vds.variant_schema)

This code would return:

Struct{
    rsid: String,
    qual: Double,
    filters: Set[String],
    info: Struct{
        ...
    },
    vep: Struct{
        ...
    },
    my_gene: String,
    gene: Struct{
        constraint: Struct{
            pli: Double
        }
    }
}

Suggest additions or edits

Please contact Andrea Ganna (aganna@broadinstitute.org) or Liam Abbott (labbott@broadinstitute.org) with any questions or suggestions.