VariantDataset

class hail.VariantDataset(hc, jvds)[source]

Hail’s primary representation of genomic data, a matrix keyed by sample and variant.

Variant datasets may be generated from other formats using the HailContext import methods, constructed from a variant-keyed KeyTable using VariantDataset.from_table(), and simulated using balding_nichols_model().

Once a variant dataset has been written to disk with write(), use read() to load the variant dataset into the environment.

>>> vds = hc.read("data/example.vds")
Variables:hc (HailContext) – Hail Context.

Attributes

colkey_schema Returns the signature of the column key (sample) contained in this VDS.
genotype_schema Returns the signature of the genotypes contained in this VDS.
global_schema Returns the signature of the global annotations contained in this VDS.
globals Return global annotations as a Python object.
num_samples Number of samples.
rowkey_schema Returns the signature of the row key (variant) contained in this VDS.
sample_annotations Return a dict of sample annotations.
sample_ids Return sampleIDs.
sample_schema Returns the signature of the sample annotations contained in this VDS.
variant_schema Returns the signature of the variant annotations contained in this VDS.

Methods

__init__
aggregate_by_key Aggregate by user-defined key and aggregation expressions to produce a KeyTable.
annotate_alleles_expr Annotate alleles with expression.
annotate_genotypes_expr Annotate genotypes with expression.
annotate_global Add global annotations from Python objects.
annotate_global_expr Annotate global with expression.
annotate_samples_expr Annotate samples with expression.
annotate_samples_table Annotate samples with a key table.
annotate_variants_db Annotate variants using the Hail annotation database.
annotate_variants_expr Annotate variants with expression.
annotate_variants_table Annotate variants with a key table.
annotate_variants_vds Annotate variants with variant annotations from .vds file.
cache Mark this variant dataset to be cached in memory.
concordance Calculate call concordance with another variant dataset.
count Returns number of samples and variants in the dataset.
count_variants Count number of variants in variant dataset.
deduplicate Remove duplicate variants.
delete_va_attribute Removes an attribute from a variant annotation field.
drop_samples Removes all samples from variant dataset.
drop_variants Discard all variants, variant annotations and genotypes.
export_gen Export variant dataset as GEN and SAMPLE file.
export_genotypes Export genotype-level information to delimited text file.
export_plink Export variant dataset as PLINK2 BED, BIM and FAM.
export_samples Export sample information to delimited text file.
export_variants Export variant information to delimited text file.
export_vcf Export variant dataset as a .vcf or .vcf.bgz file.
file_version File version of variant dataset.
filter_alleles Filter a user-defined set of alternate alleles for each variant.
filter_genotypes Filter genotypes based on expression.
filter_intervals Filter variants with an interval or list of intervals.
filter_multi Filter out multi-allelic sites.
filter_samples_expr Filter samples with the expression language.
filter_samples_list Filter samples with a list of samples.
filter_samples_table Filter samples with a table keyed by sample ID.
filter_variants_expr Filter variants with the expression language.
filter_variants_list Filter variants with a list of variants.
filter_variants_table Filter variants with a Variant keyed key table.
from_table Construct a sites-only variant dataset from a key table.
genotypes_table Generate a fully expanded genotype table.
grm Compute the Genetic Relatedness Matrix (GRM).
hardcalls Drop all genotype fields except the GT field.
ibd Compute matrix of identity-by-descent estimations.
ibd_prune Prune samples from the VariantDataset based on ibd() PI_HAT measures of relatedness.
impute_sex Impute sex of samples by calculating inbreeding coefficient on the X chromosome.
join Join two variant datasets.
ld_matrix Computes the linkage disequilibrium (correlation) matrix for the variants in this VDS.
ld_prune Prune variants in linkage disequilibrium (LD).
linreg Test each variant for association using linear regression.
linreg3 Test each variant for association with multiple phenotypes using linear regression.
linreg_burden Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the linear regression model.
linreg_multi_pheno Test each variant for association with multiple phenotypes using linear regression.
lmmreg Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.
logreg Test each variant for association using logistic regression.
logreg_burden Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the logistic regression model.
make_table Produce a key with one row per variant and one or more columns per sample.
mendel_errors Find Mendel errors; count per variant, individual and nuclear family.
min_rep Gives minimal, left-aligned representation of alleles.
naive_coalesce Naively descrease the number of partitions.
num_partitions Number of partitions.
pc_relate Compute relatedness estimates between individuals using a variant of the PC-Relate method.
pca Run Principal Component Analysis (PCA) on the matrix of genotypes.
persist Persist this variant dataset to memory and/or disk.
query_genotypes Performs aggregation queries over genotypes, and returns Python object(s).
query_genotypes_typed Performs aggregation queries over genotypes, and returns Python object(s) and type(s).
query_samples Performs aggregation queries over samples and sample annotations, and returns Python object(s).
query_samples_typed Performs aggregation queries over samples and sample annotations, and returns Python object(s) and type(s).
query_variants Performs aggregation queries over variants and variant annotations, and returns Python object(s).
query_variants_typed Performs aggregation queries over variants and variant annotations, and returns Python object(s) and type(s).
rename_samples Rename samples.
repartition Increase or decrease the number of variant dataset partitions.
rrm Computes the Realized Relationship Matrix (RRM).
same True if the two variant datasets have the same variants, samples, genotypes, and annotation schemata and values.
sample_qc Compute per-sample QC metrics.
sample_variants Downsample variants to a given fraction of the dataset.
samples_table Convert samples and sample annotations to KeyTable.
set_va_attributes Sets attributes for a variant annotation.
split_multi Split multiallelic variants.
storage_level Returns the storage (persistence) level of the variant dataset.
summarize Returns a summary of useful information about the dataset.
tdt Find transmitted and untransmitted variants; count per variant and nuclear family.
union Take the union of datasets vertically (include all variants).
unpersist Unpersists this VDS from memory/disk.
variant_qc Compute common variant statistics (quality control metrics).
variants_table Convert variants and variant annotations to a KeyTable.
vep Annotate variants with VEP.
was_split True if multiallelic variants have been split into multiple biallelic variants.
write Write variant dataset as VDS file.
aggregate_by_key(key_exprs, agg_exprs)[source]

Aggregate by user-defined key and aggregation expressions to produce a KeyTable. Equivalent to a group-by operation in SQL.

Examples

Compute the number of LOF heterozygote calls per gene per sample:

>>> kt_result = (vds
...     .aggregate_by_key(['Sample = s', 'Gene = va.gene'],
...                        'nHet = g.filter(g => g.isHet() && va.consequence == "LOF").count()')
...     .export("test.tsv"))

This will produce a KeyTable with 3 columns (Sample, Gene, nHet).

Parameters:
  • key_exprs (str or list of str) – Named expression(s) for which fields are keys.
  • agg_exprs (str or list of str) – Named aggregation expression(s).
Return type:

KeyTable

annotate_alleles_expr(expr, propagate_gq=False)[source]

Annotate alleles with expression.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

To create a variant annotation va.nNonRefSamples: Array[Int] where the ith entry of the array is the number of samples carrying the ith alternate allele:

>>> vds_result = vds.annotate_alleles_expr('va.nNonRefSamples = gs.filter(g => g.isCalledNonRef()).count()')

Notes

This method is similar to annotate_variants_expr(). annotate_alleles_expr() dynamically splits multi-allelic sites, evaluates each expression on each split allele separately, and for each expression annotates with an array with one element per alternate allele. In the splitting, genotypes are downcoded and each alternate allele is represented using its minimal representation (see split_multi() for more details).

Parameters:
  • expr (str or list of str) – Annotation expression.
  • propagate_gq (bool) – Propagate GQ instead of computing from (split) PL.
Returns:

Annotated variant dataset.

Return type:

VariantDataset

annotate_genotypes_expr(expr)[source]

Annotate genotypes with expression.

Examples

Convert the genotype schema to a TStruct with two fields GT and CASE_HET:

>>> vds_result = vds.annotate_genotypes_expr('g = {GT: g.gt, CASE_HET: sa.pheno.isCase && g.isHet()}')

Assume a VCF is imported with generic=True and the resulting genotype schema is a Struct and the field GTA is a Call type. Use the .toGenotype() method in the expression language to convert a Call to a Genotype. vds_gta will have a genotype schema equal to TGenotype

>>> vds_gta = (hc.import_vcf('data/example3.vcf.bgz', generic=True, call_fields=['GTA'])
...                 .annotate_genotypes_expr('g = g.GTA.toGenotype()'))

Notes

annotate_genotypes_expr() evaluates the expression given by expr and assigns the result of the right hand side to the annotation path specified by the left-hand side (must begin with g). This is analogous to annotate_variants_expr() and annotate_samples_expr() where the annotation paths are va and sa respectively.

expr is in genotype context so the following symbols are in scope:

  • g: genotype annotation
  • v (Variant): Variant
  • va: variant annotations
  • s (Sample): sample
  • sa: sample annotations
  • global: global annotations

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Warning

  • If the resulting genotype schema is not TGenotype, subsequent function calls on the annotated variant dataset may not work such as pca() and linreg().
  • Hail performance may be significantly slower if the annotated variant dataset does not have a genotype schema equal to TGenotype.
  • Genotypes are immutable. For example, if g is initially of type Genotype, the expression g.gt = g.gt + 1 will return a Struct with one field gt of type Int and NOT a Genotype with the gt incremented by 1.
Parameters:expr (str or list of str) – Annotation expression.
Returns:Annotated variant dataset.
Return type:VariantDataset
annotate_global(path, annotation, annotation_type)[source]

Add global annotations from Python objects.

Examples

Add populations as a global annotation:

>>> vds_result = vds.annotate_global('global.populations',
...                                     ['EAS', 'AFR', 'EUR', 'SAS', 'AMR'],
...                                     TArray(TString()))

Notes

This method registers new global annotations in a VDS. These annotations can then be accessed through expressions in downstream operations. The Hail data type must be provided and must match the given annotation parameter.

Parameters:
  • path (str) – annotation path starting in ‘global’
  • annotation – annotation to add to global
  • annotation_type (Type) – Hail type of annotation
Returns:

Annotated variant dataset.

Return type:

VariantDataset

annotate_global_expr(expr)[source]

Annotate global with expression.

Example

Annotate global with an array of populations:

>>> vds = vds.annotate_global_expr('global.pops = ["FIN", "AFR", "EAS", "NFE"]')

Create, then overwrite, then drop a global annotation:

>>> vds = vds.annotate_global_expr('global.pops = ["FIN", "AFR", "EAS"]')
>>> vds = vds.annotate_global_expr('global.pops = ["FIN", "AFR", "EAS", "NFE"]')
>>> vds = vds.annotate_global_expr('global.pops = drop(global, pops)')

The expression namespace contains only one variable:

  • global: global annotations
Parameters:expr (str or list of str) – Annotation expression
Returns:Annotated variant dataset.
Return type:VariantDataset
annotate_samples_expr(expr)[source]

Annotate samples with expression.

Examples

Compute per-sample GQ statistics for hets:

>>> vds_result = (vds.annotate_samples_expr('sa.gqHetStats = gs.filter(g => g.isHet()).map(g => g.gq).stats()')
...     .export_samples('output/samples.txt', 'sample = s, het_gq_mean = sa.gqHetStats.mean'))

Compute the list of genes with a singleton LOF per sample:

>>> variant_annotations_table = hc.import_table('data/consequence.tsv', impute=True).key_by('Variant')
>>> vds_result = (vds.annotate_variants_table(variant_annotations_table, root='va.consequence')
...     .annotate_variants_expr('va.isSingleton = gs.map(g => g.nNonRefAlleles()).sum() == 1')
...     .annotate_samples_expr('sa.LOF_genes = gs.filter(g => va.isSingleton && g.isHet() && va.consequence == "LOF").map(g => va.gene).collect()'))

To create an annotation for only a subset of samples based on an existing annotation:

>>> vds_result = vds.annotate_samples_expr('sa.newpheno = if (sa.pheno.cohortName == "cohort1") sa.pheno.bloodPressure else NA: Double')

Note

For optimal performance, be sure to explicitly give the alternative (NA) the same type as the consequent (sa.pheno.bloodPressure).

Notes

expr is in sample context so the following symbols are in scope:

  • s (Sample): sample
  • sa: sample annotations
  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype for sample s
Parameters:expr (str or list of str) – Annotation expression.
Returns:Annotated variant dataset.
Return type:VariantDataset
annotate_samples_table(table, root=None, expr=None, vds_key=None, product=False)[source]

Annotate samples with a key table.

Examples

To annotates samples using samples1.tsv with type imputation:

>>> table = hc.import_table('data/samples1.tsv', impute=True).key_by('Sample')
>>> vds_result = vds.annotate_samples_table(table, root='sa.pheno')

Given this file

$ cat data/samples1.tsv
Sample      Height  Status  Age
PT-1234     154.1   ADHD    24
PT-1236     160.9   Control 19
PT-1238     NA      ADHD    89
PT-1239     170.3   Control 55

the three new sample annotations are sa.pheno.Height: Double, sa.pheno.Status: String, and sa.pheno.Age: Int.

To annotate without type imputation, resulting in all String types:

>>> annotations = hc.import_table('data/samples1.tsv').key_by('Sample')
>>> vds_result = vds.annotate_samples_table(annotations, root='sa.phenotypes')

Detailed examples

Let’s import annotations from a CSV file with missing data and special characters

$ cat data/samples2.tsv
Batch,PT-ID
1kg,PT-0001
1kg,PT-0002
study1,PT-0003
study3,PT-0003
.,PT-0004
1kg,PT-0005
.,PT-0006
1kg,PT-0007

In this case, we should:

  • Pass the non-default delimiter ,
  • Pass the non-default missing value .
>>> annotations = hc.import_table('data/samples2.tsv', delimiter=',', missing='.').key_by('PT-ID')
>>> vds_result = vds.annotate_samples_table(annotations, root='sa.batch')

Let’s import annotations from a file with no header and sample IDs that need to be transformed. Suppose the vds sample IDs are of the form NA#####. This file has no header line, and the sample ID is hidden in a field with other information.

$ cat data/samples3.tsv
1kg_NA12345   female
1kg_NA12346   male
1kg_NA12348   female
pgc_NA23415   male
pgc_NA23418   male

To import it:

>>> annotations = (hc.import_table('data/samples3.tsv', no_header=True)
...                  .annotate('sample = f0.split("_")[1]')
...                  .key_by('sample'))
>>> vds_result = vds.annotate_samples_table(annotations,
...                             expr='sa.sex = table.f1, sa.batch = table.f0.split("_")[0]')

Notes

This method takes as an argument a KeyTable object. Hail has a default join strategy for tables keyed by String, which is to join by sample ID. If the table is keyed by something else, like population or cohort, then the vds_key argument must be passed to describe the key in the dataset to use for the join. This argument expects a list of Hail expressions whose types match, in order, the table’s key types.

Each expression in the list vds_key has the following symbols in scope:

  • s (String): sample ID
  • sa: sample annotations

The root and expr arguments

Note

One of root or expr is required, but not both.

The expr parameter expects an annotation expression involving sa (the existing sample annotations in the dataset) and table (a struct containing the columns in the table), like sa.col1 = table.col1, sa.col2 = table.col2 or sa = merge(sa, table). The root parameter expects an annotation path beginning in sa, like sa.annotations. Passing root='sa.annotations' is exactly the same as passing expr='sa.annotations = table'.

expr has the following symbols in scope:

  • sa: sample annotations
  • table: See note.

Note

The value of table inside root/expr depends on the number of values in the key table, as well as the product argument. There are three behaviors based on the number of values and one branch for product being true and false, for a total of six modes:

Number of value columns product Type of table Value of table
More than 2 False Struct Struct with an element for each column.
1 False T The value column.
0 False Boolean Existence of any matching key.
More than 2 True Array[Struct] An array with a struct for each matching key.
1 True Array[T] An array with a value for each matching key.
0 True Int The number of matching keys.

Common uses for the expr argument

Put annotations on the top level under sa

expr='sa = merge(sa, table)'

Annotate only specific annotations from the table

expr='sa.annotations = select(table, toKeep1, toKeep2, toKeep3)'

The above is equivalent to

expr='''sa.annotations.toKeep1 = table.toKeep1,
    sa.annotations.toKeep2 = table.toKeep2,
    sa.annotations.toKeep3 = table.toKeep3'''

Finally, for more information about importing key tables from text, see the documentation for HailContext.import_table().

Parameters:
  • table (KeyTable) – Key table.
  • root (str or None) – Sample annotation path to store text table. (This or expr required).
  • expr (str or None) – Annotation expression. (This or root required).
  • vds_key (str, list of str, or None.) – Join key for the dataset, if not sample ID.
  • product (bool) – Join with all matching keys (see note).
Returns:

Annotated variant dataset.

Return type:

VariantDataset

annotate_variants_db(annotations, gene_key=None)[source]

Annotate variants using the Hail annotation database.

Warning

Experimental. Supported only while running Hail on the Google Cloud Platform.

Documentation describing the annotations that are accessible through this method can be found here.

Examples

Annotate variants with CADD raw and PHRED scores:

>>> vds = vds.annotate_variants_db(['va.cadd.RawScore', 'va.cadd.PHRED']) 

Annotate variants with gene-level PLI score, using the VEP-generated gene symbol to map variants to genes:

>>> pli_vds = vds.annotate_variants_db('va.gene.constraint.pli') 

Again annotate variants with gene-level PLI score, this time using the existing va.gene_symbol annotation to map variants to genes:

>>> vds = vds.annotate_variants_db('va.gene.constraint.pli', gene_key='va.gene_symbol') 

Notes

Annotations in the database are bi-allelic, so splitting multi-allelic variants in the VDS before using this method is recommended to capture all appropriate annotations from the database. To do this, run split_multi() prior to annotating variants with this method:

>>> vds = vds.split_multi().annotate_variants_db(['va.cadd.RawScore', 'va.cadd.PHRED']) 

To add VEP annotations, or to add gene-level annotations without a predefined gene symbol for each variant, the annotate_variants_db() method runs Hail’s vep() method on the VDS. This means that your cluster must be properly initialized to run VEP.

Warning

If you want to add VEP annotations to your VDS, make sure to add the initialization action gs://hail-common/vep/vep/vep85-init.sh when starting your cluster.

Parameters:
  • annotations (str or list of str) – List of annotations to import from the database.
  • gene_key (str) – Existing variant annotation used to map variants to gene symbols if importing gene-level annotations. If not provided, the method will add VEP annotations and parse them as described in the database documentation to obtain one gene symbol per variant.
Returns:

Annotated variant dataset.

Return type:

VariantDataset

annotate_variants_expr(expr)[source]

Annotate variants with expression.

Examples

Compute GQ statistics about heterozygotes per variant:

>>> vds_result = vds.annotate_variants_expr('va.gqHetStats = gs.filter(g => g.isHet()).map(g => g.gq).stats()')

Collect a list of sample IDs with non-ref calls in LOF variants:

>>> vds_result = vds.annotate_variants_expr('va.nonRefSamples = gs.filter(g => g.isCalledNonRef()).map(g => s).collect()')

Substitute a custom string for the rsID field:

>>> vds_result = vds.annotate_variants_expr('va.rsid = str(v)')

Notes

expr is in variant context so the following symbols are in scope:

  • v (Variant): Variant
  • va: variant annotations
  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype for variant v

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Parameters:expr (str or list of str) – Annotation expression or list of annotation expressions.
Returns:Annotated variant dataset.
Return type:VariantDataset
annotate_variants_table(table, root=None, expr=None, vds_key=None, product=False)[source]

Annotate variants with a key table.

Examples

Add annotations from a variant-keyed tab separated file:

>>> table = hc.import_table('data/variant-lof.tsv', impute=True).key_by('v')
>>> vds_result = vds.annotate_variants_table(table, root='va.lof')

Add annotations from a locus-keyed TSV:

>>> kt = hc.import_table('data/locus-table.tsv', impute=True).key_by('Locus')
>>> vds_result = vds.annotate_variants_table(table, root='va.scores')

Add annotations from a gene-and-type-keyed TSV:

>>> table = hc.import_table('data/locus-metadata.tsv', impute=True).key_by(['gene', 'type'])
>>> vds_result = (vds.annotate_variants_table(table,
...       root='va.foo',
...       vds_key=['va.gene', 'if (va.score > 10) "Type1" else "Type2"']))

Annotate variants with the target in a GATK interval list file:

>>> intervals = KeyTable.import_interval_list('data/exons2.interval_list')
>>> vds_result = vds.annotate_variants_table(intervals, root='va.exon')

Annotate variants with all targets from matching intervals in a GATK interval list file:

>>> intervals = KeyTable.import_interval_list('data/exons2.interval_list')
>>> vds_result = vds.annotate_variants_table(intervals, root='va.exons', product=True)

Annotate variants using a UCSC BED file, marking each variant true/false for an overlap with any interval:

>>> intervals = KeyTable.import_bed('data/file2.bed')
>>> vds_result = vds.annotate_variants_table(intervals, root='va.bed')

Notes

This method takes as an argument a KeyTable object. Hail has default join strategies for tables keyed by Variant, Locus, or Interval.

Join strategies:

If the key is a Variant, then a variant in the dataset will match a variant in the table that is equivalent. Be careful, however: 1:1:A:T does not match 1:1:A:T,C, and vice versa.

If the key is a Locus, then a variant in the dataset will match any locus in the table which is equivalent to v.locus (same chromosome and position).

If the key is an Interval, then a variant in the dataset will match any interval in the table that contains the variant’s locus (chromosome and position).

If the key is not one of the above three types (a String representing gene ID, for instance), or if another join strategy should be used for a key of one of these three types (join with a locus object in variant annotations, for instance) for these types, then the vds_key argument should be passed. This argument expects a list of expressions whose types match, in order, the table’s key types. Note that using vds_key is slower than annotation with a standard key type.

Each expression in the list vds_key has the following symbols in scope:

  • v (Variant): Variant
  • va: variant annotations

The root and expr arguments

Note

One of root or expr is required, but not both.

The expr parameter expects an annotation assignment involving va (the existing variant annotations in the dataset) and table (the values(s) in the table), like va.col1 = table.col1, va.col2 = table.col2 or va = merge(va, table). The root parameter expects an annotation path beginning in va, like va.annotations. Passing root='va.annotations' is the same as passing expr='va.annotations = table'.

expr has the following symbols in scope:

  • va: variant annotations
  • table: See note.

Note

The value of table inside root/expr depends on the number of values in the key table, as well as the product argument. There are three behaviors based on the number of values and one branch for product being true and false, for a total of six modes:

Number of value columns product Type of table Value of table
More than 2 False Struct Struct with an element for each column.
1 False T The value column.
0 False Boolean Existence of any matching key.
More than 2 True Array[Struct] An array with a struct for each matching key.
1 True Array[T] An array with a value for each matching key.
0 True Int The number of matching keys.

Common uses for the expr argument

Put annotations on the top level under va:

expr='va = merge(va, table)'

Annotate only specific annotations from the table:

expr='va.annotations = select(table, toKeep1, toKeep2, toKeep3)'

The above is roughly equivalent to:

expr='''va.annotations.toKeep1 = table.toKeep1,
    va.annotations.toKeep2 = table.toKeep2,
    va.annotations.toKeep3 = table.toKeep3'''

Finally, for more information about importing key tables from text, see the documentation for HailContext.import_table().

Parameters:
  • table (KeyTable) – Key table.
  • root (str or None) – Variant annotation path to store text table. (This or expr required).
  • expr (str or None) – Annotation expression. (This or root required).
  • vds_key (str, list of str, or None.) – Join key for the dataset. Much slower than default joins.
  • product (bool) – Join with all matching keys (see note).
Returns:

Annotated variant dataset.

Return type:

VariantDataset

annotate_variants_vds(other, expr=None, root=None)[source]

Annotate variants with variant annotations from .vds file.

Examples

Import a second variant dataset with annotations to merge into vds:

>>> vds1 = vds.annotate_variants_expr('va = drop(va, anno1)')
>>> vds2 = (hc.read("data/example2.vds")
...           .annotate_variants_expr('va = select(va, anno1, toKeep1, toKeep2, toKeep3)'))

Copy the anno1 annotation from other to va.annot:

>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va.annot = vds.anno1')

Merge the variant annotations from the two vds together and places them at va:

>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va = merge(va, vds)')

Select a subset of the annotations from other:

>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va.annotations = select(vds, toKeep1, toKeep2, toKeep3)')

The previous expression is equivalent to:

>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va.annotations.toKeep1 = vds.toKeep1, ' +
...                                       'va.annotations.toKeep2 = vds.toKeep2, ' +
...                                       'va.annotations.toKeep3 = vds.toKeep3')

Notes

Using this method requires one of the two optional arguments: expr and root. They specify how to insert the annotations from other into the this vds’s variant annotations.

The root argument copies all the variant annotations from other to the specified annotation path.

The expr argument expects an annotation assignment whose scope includes, va, the variant annotations in the current VDS, and vds, the variant annotations in other.

VDSes with multi-allelic variants may produce surprising results because all alternate alleles are considered part of the variant key. For example:

  • The variant 22:140012:A:T,TTT will not be annotated by 22:140012:A:T or 22:140012:A:TTT
  • The variant 22:140012:A:T will not be annotated by 22:140012:A:T,TTT

It is possible that an unsplit variant dataset contains no multiallelic variants, so ignore any warnings Hail prints if you know that to be the case. Otherwise, run split_multi() before annotate_variants_vds().

Parameters:
  • other (VariantDataset) – Variant dataset to annotate with.
  • root (str) – Sample annotation path to add variant annotations.
  • expr (str) – Annotation expression.
Returns:

Annotated variant dataset.

Return type:

VariantDataset

cache()[source]

Mark this variant dataset to be cached in memory.

cache() is the same as persist("MEMORY_ONLY").

Return type:VariantDataset
colkey_schema

Returns the signature of the column key (sample) contained in this VDS.

Examples

>>> print(vds.colkey_schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(vds.colkey_schema)
Return type:Type
concordance(right)[source]

Calculate call concordance with another variant dataset.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Example

>>> comparison_vds = hc.read('data/example2.vds')
>>> summary, samples, variants = vds.concordance(comparison_vds)

Notes

This method computes the genotype call concordance between two bialellic variant datasets. It performs an inner join on samples (only samples in both datasets will be considered), and an outer join on variants. If a variant is only in one dataset, then each genotype is treated as “no data” in the other. This method returns a tuple of three objects: a nested list of list of int with global concordance summary statistics, a key table with sample concordance statistics, and a key table with variant concordance statistics.

Using the global summary result

The global summary is a list of list of int (conceptually a 5 by 5 matrix), where the indices have special meaning:

  1. No Data (missing variant)
  2. No Call (missing genotype call)
  3. Hom Ref
  4. Heterozygous
  5. Hom Var

The first index is the state in the left dataset (the one on which concordance was called), and the second index is the state in the right dataset (the argument to the concordance method call). Typical uses of the summary list are shown below.

>>> summary, samples, variants = vds.concordance(hc.read('data/example2.vds'))
>>> left_homref_right_homvar = summary[2][4]
>>> left_het_right_missing = summary[3][1]
>>> left_het_right_something_else = sum(summary[3][:]) - summary[3][3]
>>> total_concordant = summary[2][2] + summary[3][3] + summary[4][4]
>>> total_discordant = sum([sum(s[2:]) for s in summary[2:]]) - total_concordant

Using the key table results

Columns of the sample key table:

  • s (String) – Sample ID, key column.
  • nDiscordant (Long) – Count of discordant calls (see below for full definition).
  • concordance (Array[Array[Long]]) – Array of concordance per state on left and right, matching the structure of the global summary defined above.

Columns of the variant key table:

  • v (Variant) – Key column.
  • nDiscordant (Long) – Count of discordant calls (see below for full definition).
  • concordance (Array[Array[Long]]) – Array of concordance per state on left and right, matches the structure of the global summary defined above.

The two key tables produced by the concordance method can be queried with KeyTable.query(), exported to text with KeyTable.export(), and used to annotate a variant dataset with VariantDataset.annotate_variants_table(), among other things.

In these tables, the column nDiscordant is provided as a convenience, because this is often one of the most useful concordance statistics. This value is the number of genotypes which were called (homozygous reference, heterozygous, or homozygous variant) in both datasets, but where the call did not match between the two.

The column concordance matches the structure of the global summmary, which is detailed above. Once again, the first index into this array is the state on the left, and the second index is the state on the right. For example, concordance[1][4] is the number of “no call” genotypes on the left that were called homozygous variant on the right.

Parameters:right (VariantDataset) – right hand variant dataset for concordance
Returns:The global concordance statistics, a key table with sample concordance statistics, and a key table with variant concordance statistics.
Return type:(list of list of int, KeyTable, KeyTable)
count()[source]

Returns number of samples and variants in the dataset.

Examples

>>> samples, variants = vds.count()

Notes

This is also the fastest way to force evaluation of a Hail pipeline.

Returns:The sample and variant counts.
Return type:(int, int)
count_variants()[source]

Count number of variants in variant dataset.

Return type:long
deduplicate()[source]

Remove duplicate variants.

Returns:Deduplicated variant dataset.
Return type:VariantDataset
delete_va_attribute(ann_path, attribute)[source]

Removes an attribute from a variant annotation field. Attributes are key/value pairs that can be attached to a variant annotation field.

The following attributes are read from the VCF header when importing a VCF and written to the VCF header when exporting a VCF:

  • INFO fields attributes (attached to (va.info.*)):
    • ‘Number’: The arity of the field. Can take values
      • 0 (Boolean flag),
      • 1 (single value),
      • R (one value per allele, including the reference),
      • A (one value per non-reference allele),
      • G (one value per genotype), and
      • . (any number of values)
        • When importing: The value in read from the VCF INFO field definition
        • When exporting: The default value is 0 for Boolean, . for Arrays and 1 for all other types
      • ‘Description’ (default is ‘’)
  • FILTER entries in the VCF header are generated based on the attributes of va.filters. Each key/value pair in the attributes will generate a FILTER entry in the VCF with ID = key and Description = value.
Parameters:
  • ann_path (str) – Variant annotation path starting with ‘va’, period-delimited.
  • attribute (str) – The attribute to remove (key).
Returns:

Annotated dataset with the updated variant annotation without the attribute.

Return type:

VariantDataset

drop_samples()[source]

Removes all samples from variant dataset.

The variants, variant annotations, and global annnotations will remain, producing a sites-only variant dataset.

Returns:Sites-only variant dataset.
Return type:VariantDataset
drop_variants()[source]

Discard all variants, variant annotations and genotypes.

Samples, sample annotations and global annotations are retained. This is the same as filter_variants_expr('false'), but much faster.

Examples

>>> vds_result = vds.drop_variants()
Returns:Samples-only variant dataset.
Return type:VariantDataset
export_gen(output, precision=4)[source]

Export variant dataset as GEN and SAMPLE file.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Import genotype probability data, filter variants based on INFO score, and export data to a GEN and SAMPLE file:

>>> vds3 = hc.import_bgen("data/example3.bgen", sample_file="data/example3.sample")
>>> (vds3.filter_variants_expr("gs.infoScore().score >= 0.9")
...      .export_gen("output/infoscore_filtered"))

Notes

Writes out the internal VDS to a GEN and SAMPLE fileset in the Oxford spec.

The first 6 columns of the resulting GEN file are the following:

  • Chromosome (v.contig)
  • Variant ID (va.varid if defined, else Chromosome:Position:Ref:Alt)
  • rsID (va.rsid if defined, else ”.”)
  • position (v.start)
  • reference allele (v.ref)
  • alternate allele (v.alt)

Genotype probabilities:

  • 3 probabilities per sample (pHomRef, pHet, pHomVar).
  • Any filtered genotypes will be output as (0.0, 0.0, 0.0).
  • If the input data contained Phred-scaled likelihoods, the probabilities in the GEN file will be the normalized genotype probabilities assuming a uniform prior.
  • If the input data did not have genotype probabilities such as data imported using import_plink(), all genotype probabilities will be (0.0, 0.0, 0.0).

The sample file has 3 columns:

  • ID_1 and ID_2 are identical and set to the sample ID (s).
  • The third column (“missing”) is set to 0 for all samples.
Parameters:
  • output (str) – Output file base. Will write GEN and SAMPLE files.
  • precision (int) – Number of digits after the decimal point each probability is truncated to.
export_genotypes(output, expr, types=False, export_ref=False, export_missing=False, parallel=False)[source]

Export genotype-level information to delimited text file.

Examples

Export genotype information with identifiers that form the header:

>>> vds.export_genotypes('output/genotypes.tsv', 'SAMPLE=s, VARIANT=v, GQ=g.gq, DP=g.dp, ANNO1=va.anno1, ANNO2=va.anno2')

Export the same information without identifiers, resulting in a file with no header:

>>> vds.export_genotypes('output/genotypes.tsv', 's, v, g.gq, g.dp, va.anno1, va.anno2')

Notes

export_genotypes() outputs one line per cell (genotype) in the data set, though HomRef and missing genotypes are not output by default if the genotype schema is equal to TGenotype. Use the export_ref and export_missing parameters to force export of HomRef and missing genotypes, respectively.

The expr argument is a comma-separated list of fields or expressions, all of which must be of the form IDENTIFIER = <expression>, or else of the form <expression>. If some fields have identifiers and some do not, Hail will throw an exception. The accessible namespace includes g, s, sa, v, va, and global.

Warning

If the genotype schema does not have the type TGenotype, all genotypes will be exported unless the value of g is missing. Use filter_genotypes() to filter out genotypes based on an expression before exporting.

Parameters:
  • output (str) – Output path.
  • expr (str) – Export expression for values to export.
  • types (bool) – Write types of exported columns to a file at (output + ”.types”)
  • export_ref (bool) – If true, export reference genotypes. Only applicable if the genotype schema is TGenotype.
  • export_missing (bool) – If true, export missing genotypes.
  • parallel (bool) – If true, writes a set of files (one per partition) rather than serially concatenating these files.

Export variant dataset as PLINK2 BED, BIM and FAM.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Import data from a VCF file, split multi-allelic variants, and export to a PLINK binary file:

>>> vds.split_multi().export_plink('output/plink')

Notes

fam_expr can be used to set the fields in the FAM file. The following fields can be assigned:

  • famID: String
  • id: String
  • matID: String
  • patID: String
  • isFemale: Boolean
  • isCase: Boolean or qPheno: Double

If no assignment is given, the value is missing and the missing value is used: 0 for IDs and sex and -9 for phenotype. Only one of isCase or qPheno can be assigned.

fam_expr is in sample context only and the following symbols are in scope:

  • s (Sample): sample
  • sa: sample annotations
  • global: global annotations

The BIM file ID field is set to CHR:POS:REF:ALT.

This code:

>>> vds.split_multi().export_plink('output/plink')

will behave similarly to the PLINK VCF conversion command

plink --vcf /path/to/file.vcf --make-bed --out sample --const-fid --keep-allele-order

except:

  • The order among split multi-allelic alternatives in the BED file may disagree.
  • PLINK uses the rsID for the BIM file ID.
Parameters:
  • output (str) – Output file base. Will write BED, BIM, and FAM files.
  • fam_expr (str) – Expression for FAM file fields.
export_samples(output, expr, types=False)[source]

Export sample information to delimited text file.

Examples

Export some sample QC metrics:

>>> (vds.sample_qc()
...     .export_samples('output/samples.tsv', 'SAMPLE = s, CALL_RATE = sa.qc.callRate, NHET = sa.qc.nHet'))

This will produce a file with a header and three columns. To produce a file with no header, just leave off the assignment to the column identifier:

>>> (vds.sample_qc()
...     .export_samples('output/samples.tsv', 's, sa.qc.rTiTv'))

Notes

One line per sample will be exported. As export_samples() runs in sample context, the following symbols are in scope:

  • s (Sample): sample
  • sa: sample annotations
  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype for sample s
Parameters:
  • output (str) – Output file.
  • expr (str) – Export expression for values to export.
  • types (bool) – Write types of exported columns to a file at (output + ”.types”).
export_variants(output, expr, types=False, parallel=False)[source]

Export variant information to delimited text file.

Examples

Export a four column TSV with v, va.pass, va.filters, and one computed field: 1 - va.qc.callRate.

>>> vds.export_variants('output/file.tsv',
...        'VARIANT = v, PASS = va.pass, FILTERS = va.filters, MISSINGNESS = 1 - va.qc.callRate')

It is also possible to export without identifiers, which will result in a file with no header. In this case, the expressions should look like the examples below:

>>> vds.export_variants('output/file.tsv', 'v, va.pass, va.qc.AF')

Note

If any field is named, all fields must be named.

In the common case that a group of annotations needs to be exported (for example, the annotations produced by variantqc), one can use the struct.* syntax. This syntax produces one column per field in the struct, and names them according to the struct field name.

For example, the following invocation (assuming va.qc was generated by variant_qc()):

>>> vds.export_variants('output/file.tsv', 'variant = v, va.qc.*')

will produce the following set of columns:

variant  callRate  AC  AF  nCalled  ...

Note that using the .* syntax always results in named arguments, so it is not possible to export header-less files in this manner. However, naming the “splatted” struct will apply the name in front of each column like so:

>>> vds.export_variants('output/file.tsv', 'variant = v, QC = va.qc.*')

which produces these columns:

variant  QC.callRate  QC.AC  QC.AF  QC.nCalled  ...

Notes

This module takes a comma-delimited list of fields or expressions to print. These fields will be printed in the order they appear in the expression in the header and on each line.

One line per variant in the VDS will be printed. The accessible namespace includes:

  • v (Variant): Variant
  • va: variant annotations
  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype for variant v

Designating output with an expression

Much like the filtering methods, this method uses the Hail expression language. While the filtering methods expect an expression that evaluates to true or false, this method expects a comma-separated list of fields to print. These fields take the form IDENTIFIER = <expression>.

Parameters:
  • output (str) – Output file.
  • expr (str) – Export expression for values to export.
  • types (bool) – Write types of exported columns to a file at (output + ”.types”)
  • parallel (bool) – If true, writes a set of files (one per partition) rather than serially concatenating these files.
export_vcf(output, append_to_header=None, export_pp=False, parallel=False)[source]

Export variant dataset as a .vcf or .vcf.bgz file.

Examples

Export to VCF as a block-compressed file:

>>> vds.export_vcf('output/example.vcf.bgz')

Notes

export_vcf() writes the VDS to disk in VCF format as described in the VCF 4.2 spec.

Use the .vcf.bgz extension rather than .vcf in the output file name for blocked GZIP compression.

Note

We strongly recommended compressed (.bgz extension) and parallel output (parallel=True) when exporting large VCFs.

Consider the workflow of importing VCF to VDS and immediately exporting VDS to VCF:

>>> vds.export_vcf('output/example_out.vcf')

The example_out.vcf header will contain the FORMAT, FILTER, and INFO lines present in example.vcf. However, it will not contain CONTIG lines or lines added by external tools (such as bcftools and GATK) unless they are explicitly inserted using the append_to_header option.

Hail only exports the contents of va.info to the INFO field. No other annotations besides va.info are exported.

The genotype schema must have the type TGenotype or TStruct. If the type is TGenotype, then the FORMAT fields will be GT, AD, DP, GQ, and PL (or PP if export_pp is True). If the type is TStruct, then the exported FORMAT fields will be the names of each field of the Struct. Each field must have a type of String, Char, Int, Double, or Call. Arrays and Sets are also allowed as long as they are not nested. For example, a field with type Array[Int] can be exported but not a field with type Array[Array[Int]]. Nested Structs are also not allowed.

Caution

If samples or genotypes are filtered after import, the value stored in va.info.AC value may no longer reflect the number of called alternate alleles in the filtered VDS. If the filtered VDS is then exported to VCF, downstream tools may produce erroneous results. The solution is to create new annotations in va.info or overwrite existing annotations. For example, in order to produce an accurate AC field, one can run variant_qc() and copy the va.qc.AC field to va.info.AC:

>>> (vds.filter_genotypes('g.gq >= 20')
...     .variant_qc()
...     .annotate_variants_expr('va.info.AC = va.qc.AC')
...     .export_vcf('output/example.vcf.bgz'))
Parameters:
  • output (str) – Path of .vcf file to write.
  • append_to_header (str or None) – Path of file to append to VCF header.
  • export_pp (bool) – If true, export linear-scaled probabilities (Hail’s pp field on genotype) as the VCF PP FORMAT field.
  • parallel (bool) – If true, return a set of VCF files (one per partition) rather than serially concatenating these files.
file_version()[source]

File version of variant dataset.

Return type:int
filter_alleles(expr, annotation='va = va', subset=True, keep=True, filter_altered_genotypes=False, max_shift=100, keep_star=False)[source]

Filter a user-defined set of alternate alleles for each variant. If all alternate alleles of a variant are filtered, the variant itself is filtered. The expr expression is evaluated for each alternate allele, but not for the reference allele (i.e. aIndex will never be zero).

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

To remove alternate alleles with zero allele count and update the alternate allele count annotation with the new indices:

>>> vds_result = vds.filter_alleles('va.info.AC[aIndex - 1] == 0',
...     annotation='va.info.AC = aIndices[1:].map(i => va.info.AC[i - 1])',
...     keep=False)

Note that we skip the first element of aIndices because we are mapping between the old and new allele indices, not the alternate allele indices.

Notes

If filter_altered_genotypes is true, genotypes that contain filtered-out alleles are set to missing.

filter_alleles() implements two algorithms for filtering alleles: subset and downcode. We will illustrate their behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.

GT: 1/2
GQ: 10
AD: 0,50,35

0 | 1000
1 | 1000   10
2 | 1000   0     20
  +-----------------
     0     1     2

Subset algorithm

The subset algorithm (the default, subset=True) subsets the AD and PL arrays (i.e. removes entries corresponding to filtered alleles) and then sets GT to the genotype with the minimum PL. Note that if the genotype changes (as in the example), the PLs are re-normalized (shifted) so that the most likely genotype has a PL of 0. Qualitatively, subsetting corresponds to the belief that the filtered alleles are not real so we should discard any probability mass associated with them.

The subset algorithm would produce the following:

GT: 1/1
GQ: 980
AD: 0,50

0 | 980
1 | 980    0
  +-----------
     0      1

In summary:

  • GT: Set to most likely genotype based on the PLs ignoring the filtered allele(s).
  • AD: The filtered alleles’ columns are eliminated, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 25,20.
  • DP: No change.
  • PL: The filtered alleles’ columns are eliminated and the remaining columns shifted so the minimum value is 0.
  • GQ: The second-lowest PL (after shifting).

Downcode algorithm

The downcode algorithm (subset=False) recodes occurances of filtered alleles to occurances of the reference allele (e.g. 1 -> 0 in our example). So the depths of filtered alleles in the AD field are added to the depth of the reference allele. Where downcodeing filtered alleles merges distinct genotypes, the minimum PL is used (since PL is on a log scale, this roughly corresponds to adding probabilities). The PLs are then re-normalized (shifted) so that the most likely genotype has a PL of 0, and GT is set to this genotype. If an allele is filtered, this algorithm acts similarly to split_multi().

The downcoding algorithm would produce the following:

GT: 0/1
GQ: 10
AD: 35,50

0 | 20
1 | 0    10
  +-----------
    0    1

In summary:

  • GT: Downcode filtered alleles to reference.
  • AD: The filtered alleles’ columns are eliminated and their value is added to the reference, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 40,20.
  • DP: No change.
  • PL: Downcode filtered alleles to reference, combine PLs using minimum for each overloaded genotype, and shift so the overall minimum PL is 0.
  • GQ: The second-lowest PL (after shifting).

Expression Variables

The following symbols are in scope for expr:

  • v (Variant): Variant
  • va: variant annotations
  • aIndex (Int): the index of the allele being tested

The following symbols are in scope for annotation:

  • v (Variant): Variant
  • va: variant annotations
  • aIndices (Array[Int]): the array of old indices (such that aIndices[newIndex] = oldIndex and aIndices[0] = 0)
Parameters:
  • expr (str) – Boolean filter expression involving v (variant), va (variant annotations), and aIndex (allele index)
  • annotation (str) – Annotation modifying expression involving v (new variant), va (old variant annotations), and aIndices (maps from new to old indices)
  • subset (bool) – If true, subsets PL and AD, otherwise downcodes the PL and AD. Genotype and GQ are set based on the resulting PLs.
  • keep (bool) – If true, keep variants matching expr
  • filter_altered_genotypes (bool) – If true, genotypes that contain filtered-out alleles are set to missing.
  • max_shift (int) – maximum number of base pairs by which a split variant can move. Affects memory usage, and will cause Hail to throw an error if a variant that moves further is encountered.
  • keepStar (bool) – If true, keep variants where the only allele left is a * allele.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_genotypes(expr, keep=True)[source]

Filter genotypes based on expression.

Examples

Filter genotypes by allele balance dependent on genotype call:

>>> vds_result = vds.filter_genotypes('let ab = g.ad[1] / g.ad.sum() in ' +
...                      '((g.isHomRef() && ab <= 0.1) || ' +
...                      '(g.isHet() && ab >= 0.25 && ab <= 0.75) || ' +
...                      '(g.isHomVar() && ab >= 0.9))')

Notes

expr is in genotype context so the following symbols are in scope:

  • s (Sample): sample
  • v (Variant): Variant
  • sa: sample annotations
  • va: variant annotations
  • global: global annotations

For more information, see the documentation on data representation, annotations, and the expression language.

Caution

When expr evaluates to missing, the genotype will be removed regardless of whether keep=True or keep=False.

Parameters:
  • expr (str) – Boolean filter expression.
  • keep (bool) – Keep genotypes where expr evaluates to true.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_intervals(intervals, keep=True)[source]

Filter variants with an interval or list of intervals.

Examples

Filter to one interval:

>>> vds_result = vds.filter_intervals(Interval.parse('17:38449840-38530994'))

Another way of writing this same query:

>>> vds_result = vds.filter_intervals(Interval(Locus('17', 38449840), Locus('17', 38530994)))

Two identical ways of parsing a list of intervals:

>>> intervals = map(Interval.parse, ['1:50M-75M', '2:START-400000', '3-22'])
>>> intervals = [Interval.parse(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]

Use this interval list to filter:

>>> vds_result = vds.filter_intervals(intervals)

Notes

This method takes an argument of Interval or list of Interval.

Based on the keep argument, this method will either restrict to variants in the supplied interval ranges, or remove all variants in those ranges. Note that intervals are left-inclusive, and right-exclusive. The below interval includes the locus 15:100000 but not 15:101000.

>>> interval = Interval.parse('15:100000-101000')

This method performs predicate pushdown when keep=True, meaning that data shards that don’t overlap any supplied interval will not be loaded at all. This property enables filter_intervals to be used for reasonably low-latency queries of small ranges of the genome, even on large datasets. Suppose we are interested in variants on chromosome 15 between 100000 and 200000. This implementation with filter_variants_expr() may come to mind first:

>>> vds_filtered = vds.filter_variants_expr('v.contig == "15" && v.start >= 100000 && v.start < 200000')

However, it is much faster (and easier!) to use this method:

>>> vds_filtered = vds.filter_intervals(Interval.parse('15:100000-200000'))

Note

A KeyTable keyed by interval can be used to filter a dataset efficiently as well. See the documentation for filter_variants_table() for an example. This is useful for using interval files to filter a dataset.

Parameters:
  • intervals (Interval or list of Interval) – Interval(s) to keep or remove.
  • keep (bool) – Keep variants overlapping an interval if True, remove variants overlapping an interval if False.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_multi()[source]

Filter out multi-allelic sites.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

This method is much less computationally expensive than split_multi(), and can also be used to produce a variant dataset that can be used with methods that do not support multiallelic variants.

Returns:Dataset with no multiallelic sites, which can be used for biallelic-only methods.
Return type:VariantDataset
filter_samples_expr(expr, keep=True)[source]

Filter samples with the expression language.

Examples

Filter samples by phenotype (assumes sample annotation sa.isCase exists and is a Boolean variable):

>>> vds_result = vds.filter_samples_expr("sa.isCase")

Remove samples with an ID that matches a regular expression:

>>> vds_result = vds.filter_samples_expr('"^NA" ~ s' , keep=False)

Filter samples from sample QC metrics and write output to a new variant dataset:

>>> (vds.sample_qc()
...     .filter_samples_expr('sa.qc.callRate >= 0.99 && sa.qc.dpMean >= 10')
...     .write("output/filter_samples.vds"))

Notes

expr is in sample context so the following symbols are in scope:

  • s (Sample): sample
  • sa: sample annotations
  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype for sample s

For more information, see the documentation on data representation, annotations, and the expression language.

Caution

When expr evaluates to missing, the sample will be removed regardless of whether keep=True or keep=False.

Parameters:
  • expr (str) – Boolean filter expression.
  • keep (bool) – Keep samples where expr evaluates to true.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_samples_list(samples, keep=True)[source]

Filter samples with a list of samples.

Examples

>>> to_remove = ['NA12878', 'NA12891', 'NA12892']
>>> vds_result = vds.filter_samples_list(to_remove, keep=False)

Read list from a file:

>>> to_remove = [s.strip() for s in open('data/exclude_samples.txt')]
>>> vds_result = vds.filter_samples_list(to_remove, keep=False)
Parameters:
  • samples (list of str) – List of samples to keep or remove.
  • keep (bool) – If true, keep samples in samples, otherwise remove them.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_samples_table(table, keep=True)[source]

Filter samples with a table keyed by sample ID.

Examples

Keep samples in a text file:

>>> table = hc.import_table('data/samples1.tsv').key_by('Sample')
>>> vds_filtered = vds.filter_samples_table(table, keep=True)

Remove samples in a text file with 1 field, and no header:

>>> to_remove = hc.import_table('data/exclude_samples.txt', no_header=True).key_by('f0')
>>> vds_filtered = vds.filter_samples_table(to_remove, keep=False)

Notes

This method filters out or filters to the keys of a table. The table must have a key of type String.

Parameters:
  • table (KeyTable) – Key table.
  • keep (bool) – If true, keep only the keys in table, otherwise remove them.
Returns:

Filtered dataset.

Return type:

VariantDataset

filter_variants_expr(expr, keep=True)[source]

Filter variants with the expression language.

Examples

Keep variants in the gene CHD8 (assumes the variant annotation va.gene exists):

>>> vds_result = vds.filter_variants_expr('va.gene == "CHD8"')

Remove all variants on chromosome 1:

>>> vds_result = vds.filter_variants_expr('v.contig == "1"', keep=False)

Caution

The double quotes on "1" are necessary because v.contig is of type String.

Notes

The following symbols are in scope for expr:

  • v (Variant): Variant
  • va: variant annotations
  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype for variant v

For more information, see the Overview and the Expression Language.

Caution

When expr evaluates to missing, the variant will be removed regardless of whether keep=True or keep=False.

Parameters:
  • expr (str) – Boolean filter expression.
  • keep (bool) – Keep variants where expr evaluates to true.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_variants_list(variants, keep=True)[source]

Filter variants with a list of variants.

Examples

Filter VDS down to a list of variants:

>>> vds_filtered = vds.filter_variants_list([Variant.parse('20:10626633:G:GC'), 
...                                          Variant.parse('20:10019093:A:G')], keep=True)

Notes

This method performs predicate pushdown when keep=True, meaning that data shards that don’t overlap with any supplied variant will not be loaded at all. This property enables filter_variants_list to be used for reasonably low-latency queries of one or more variants, even on large datasets.

Parameters:
  • variants (list of Variant) – List of variants to keep or remove.
  • keep (bool) – If true, keep variants in variants, otherwise remove them.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

filter_variants_table(table, keep=True)[source]

Filter variants with a Variant keyed key table.

Example

Filter variants of a VDS to those appearing in a text file:

>>> kt = hc.import_table('data/sample_variants.txt', key='Variant', impute=True)
>>> filtered_vds = vds.filter_variants_table(kt, keep=True)

Keep all variants whose chromosome and position (locus) appear in a file with a chromosome:position column:

>>> kt = hc.import_table('data/locus-table.tsv', impute=True).key_by('Locus')
>>> filtered_vds = vds.filter_variants_table(kt, keep=True)

Remove all variants which overlap an interval in a UCSC BED file:

>>> kt = KeyTable.import_bed('data/file2.bed')
>>> filtered_vds = vds.filter_variants_table(kt, keep=False)

Notes

This method takes a key table as an argument, which must be keyed by one of the following:

  • Interval
  • Locus
  • Variant

If the key is a Variant, then a variant in the dataset will be kept or removed based on finding a complete match in the table. Be careful, however: 1:1:A:T does not match 1:1:A:T,C, and vice versa.

If the key is a Locus, then a variant in the dataset will be kept or removed based on finding a locus in the table that matches by chromosome and position.

If the key is an Interval, then a variant in the dataset will be kept or removed based on finding an interval in the table that contains the variant’s chromosome and position.

Parameters:
  • table (KeyTable) – Key table object.
  • keep (bool) – If true, keep only matches in table, otherwise remove them.
Returns:

Filtered variant dataset.

Return type:

VariantDataset

static from_table(table)[source]

Construct a sites-only variant dataset from a key table.

Examples

Import a text table and construct a sites-only VDS:

>>> table = hc.import_table('data/variant-lof.tsv', types={'v': TVariant()}).key_by('v')
>>> sites_vds = VariantDataset.from_table(table)

Notes

The key table must be keyed by one column of type TVariant.

All columns in the key table become variant annotations in the result. For example, a key table with key column v (Variant) and column gene (String) will produce a sites-only variant dataset with a va.gene variant annotation.

Parameters:table (KeyTable) – Variant-keyed table.
Returns:Sites-only variant dataset.
Return type:VariantDataset
genotype_schema

Returns the signature of the genotypes contained in this VDS.

Examples

>>> print(vds.genotype_schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(vds.genotype_schema)
Return type:Type
genotypes_table()[source]

Generate a fully expanded genotype table.

Examples

>>> gs = vds.genotypes_table()

Notes

This produces a (massive) flat table from all the genotypes in the dataset. The table has columns:

  • v (Variant) - Variant (key column).
  • va (Variant annotation schema) - Variant annotations.
  • s (String) - Sample ID (key column).
  • sa (Sample annotation schema) - Sample annotations.
  • g (Genotype schema) - Genotype or generic genotype.

Caution

This table has a row for each variant/sample pair. The genotype key table for a dataset with 10M variants and 10K samples will contain 100 billion rows. Writing or exporting this table will produce a file much larger than the equivalent VDS.

Returns:Key table with a row for each genotype.
Return type:KeyTable
global_schema

Returns the signature of the global annotations contained in this VDS.

Examples

>>> print(vds.global_schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(vds.global_schema)
Return type:Type
globals

Return global annotations as a Python object.

Returns:Dataset global annotations.
Return type:Struct
grm()[source]

Compute the Genetic Relatedness Matrix (GRM).

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

>>> km = vds.grm()

Notes

The genetic relationship matrix (GRM) \(G\) encodes genetic correlation between each pair of samples. It is defined by \(G = MM^T\) where \(M\) is a standardized version of the genotype matrix, computed as follows. Let \(C\) be the \(n \times m\) matrix of raw genotypes in the variant dataset, with rows indexed by \(n\) samples and columns indexed by \(m\) bialellic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as

\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{2p_j(1-p_j)m}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in Hardy-Weinberg equilibrium and is further motivated in the paper Patterson, Price and Reich, 2006. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) so that the diagonal entries of the GRM are approximately 1. Equivalently,

\[G_{ik} = \frac{1}{m} \sum_{j=1}^m \frac{(C_{ij}-2p_j)(C_{kj}-2p_j)}{2 p_j (1-p_j)} \]
Returns:Genetic Relatedness Matrix for all samples.
Return type:KinshipMatrix
hardcalls()[source]

Drop all genotype fields except the GT field.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

A hard-called variant dataset is about two orders of magnitude smaller than a standard sequencing dataset. Use this method to create a smaller, faster representation for downstream processing that only requires the GT field.

Returns:Variant dataset with no genotype metadata.
Return type:VariantDataset
ibd(maf=None, bounded=True, min=None, max=None)[source]

Compute matrix of identity-by-descent estimations.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

To calculate a full IBD matrix, using minor allele frequencies computed from the variant dataset itself:

>>> vds.ibd()

To calculate an IBD matrix containing only pairs of samples with PI_HAT in [0.2, 0.9], using minor allele frequencies stored in va.panel_maf:

>>> vds.ibd(maf='va.panel_maf', min=0.2, max=0.9)

Notes

The implementation is based on the IBD algorithm described in the PLINK paper.

ibd() requires the dataset to be bi-allelic (otherwise run split_multi() or otherwise run filter_multi()) and does not perform LD pruning. Linkage disequilibrium may bias the result so consider filtering variants first.

The resulting KeyTable entries have the type: { i: String, j: String, ibd: { Z0: Double, Z1: Double, Z2: Double, PI_HAT: Double }, ibs0: Long, ibs1: Long, ibs2: Long }. The key list is: *i: String, j: String*.

Conceptually, the output is a symmetric, sample-by-sample matrix. The output key table has the following form

i           j       ibd.Z0  ibd.Z1  ibd.Z2  ibd.PI_HAT ibs0 ibs1    ibs2
sample1     sample2 1.0000  0.0000  0.0000  0.0000 ...
sample1     sample3 1.0000  0.0000  0.0000  0.0000 ...
sample1     sample4 0.6807  0.0000  0.3193  0.3193 ...
sample1     sample5 0.1966  0.0000  0.8034  0.8034 ...
Parameters:
  • maf (str or None) – Expression for the minor allele frequency.
  • bounded (bool) – Forces the estimations for Z0, Z1, Z2, and PI_HAT to take on biologically meaningful values (in the range [0,1]).
  • min (float or None) – Sample pairs with a PI_HAT below this value will not be included in the output. Must be in [0,1].
  • max (float or None) – Sample pairs with a PI_HAT above this value will not be included in the output. Must be in [0,1].
Returns:

A KeyTable mapping pairs of samples to their IBD statistics

Return type:

KeyTable

ibd_prune(threshold, tiebreaking_expr=None, maf=None, bounded=True)[source]

Prune samples from the VariantDataset based on ibd() PI_HAT measures of relatedness.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Prune samples so that no two have a PI_HAT value greater than or equal to 0.6.

>>> pruned_vds = vds.ibd_prune(0.6)

Prune samples so that no two have a PI_HAT value greater than or equal to 0.5, with a tiebreaking expression that selects cases over controls:

>>> pruned_vds = vds.ibd_prune(0.5, tiebreaking_expr="if (sa1.isCase) 1 else 0")

Notes

The variant dataset returned may change in near future as a result of algorithmic improvements. The current algorithm is very efficient on datasets with many small families, less so on datasets with large families. Currently, the algorithm works by deleting the person from each family who has the highest number of relatives, and iterating until no two people have a PI_HAT value greater than that specified. If two people within a family have the same number of relatives, the tiebreaking_expr given will be used to determine which sample gets deleted.

The tiebreaking_expr namespace has the following variables available:

  • s1: The first sample id.
  • sa1: The annotations associated with s1.
  • s2: The second sample id.
  • sa2: The annotations associated with s2.

The tiebreaking_expr returns an integer expressing the preference for one sample over the other. Any negative integer expresses a preference for keeping s1. Any positive integer expresses a preference for keeping s2. A zero expresses no preference. This function must induce a preorder on the samples, in particular:

  • tiebreaking_expr(sample1, sample2) must equal -1 * tie breaking_expr(sample2, sample1), which evokes the common sense understanding that if x < y then y > x`.
  • tiebreaking_expr(sample1, sample1) must equal 0, i.e. x = x
  • if sample1 is preferred to sample2 and sample2 is preferred to sample3, then sample1 must also be preferred to sample3

The last requirement is only important if you have three related samples with the same number of relatives and all three are related to one another. In cases like this one, it is important that either:

  • one of the three is preferred to both other ones, or
  • there is no preference among the three samples
Parameters:
  • threshold – The desired maximum PI_HAT value between any pair of samples.
  • tiebreaking_expr – Expression used to choose between two samples with the same number of relatives.
  • maf – Expression for the minor allele frequency.
  • bounded – Forces the estimations for Z0, Z1, Z2, and PI_HAT to take on biologically meaningful values (in the range [0,1]).
Returns:

A VariantDataset containing no samples with a PI_HAT greater than threshold.

Return type:

VariantDataset

impute_sex(maf_threshold=0.0, include_par=False, female_threshold=0.2, male_threshold=0.8, pop_freq=None)[source]

Impute sex of samples by calculating inbreeding coefficient on the X chromosome.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Remove samples where imputed sex does not equal reported sex:

>>> imputed_sex_vds = (vds.impute_sex()
...     .annotate_samples_expr('sa.sexcheck = sa.pheno.isFemale == sa.imputesex.isFemale')
...     .filter_samples_expr('sa.sexcheck || isMissing(sa.sexcheck)'))

Notes

We have used the same implementation as PLINK v1.7.

  1. X chromosome variants are selected from the VDS: v.contig == "X" || v.contig == "23"
  2. Variants with a minor allele frequency less than the threshold given by maf-threshold are removed
  3. Variants in the pseudoautosomal region (X:60001-2699520) || (X:154931044-155260560) are included if the include_par optional parameter is set to true.
  4. The minor allele frequency (maf) per variant is calculated.
  5. For each variant and sample with a non-missing genotype call, \(E\), the expected number of homozygotes (from population MAF), is computed as \(1.0 - (2.0*maf*(1.0-maf))\).
  6. For each variant and sample with a non-missing genotype call, \(O\), the observed number of homozygotes, is computed as 0 = heterozygote; 1 = homozygote
  7. For each variant and sample with a non-missing genotype call, \(N\) is incremented by 1
  8. For each sample, \(E\), \(O\), and \(N\) are combined across variants
  9. \(F\) is calculated by \((O - E) / (N - E)\)
  10. A sex is assigned to each sample with the following criteria: F < 0.2 => Female; F > 0.8 => Male. Use female-threshold and male-threshold to change this behavior.

Annotations

The below annotations can be accessed with sa.imputesex.

  • isFemale (Boolean) – True if the imputed sex is female, false if male, missing if undetermined
  • Fstat (Double) – Inbreeding coefficient
  • nTotal (Long) – Total number of variants considered
  • nCalled (Long) – Number of variants with a genotype call
  • expectedHoms (Double) – Expected number of homozygotes
  • observedHoms (Long) – Observed number of homozygotes
Parameters:
  • maf_threshold (float) – Minimum minor allele frequency threshold.
  • include_par (bool) – Include pseudoautosomal regions.
  • female_threshold (float) – Samples are called females if F < femaleThreshold
  • male_threshold (float) – Samples are called males if F > maleThreshold
  • pop_freq (str) – Variant annotation for estimate of MAF. If None, MAF will be computed.
Returns:

Annotated dataset.

Return type:

VariantDataset

join(right)[source]

Join two variant datasets.

Notes

This method performs an inner join on variants, concatenates samples, and takes variant and global annotations from the left dataset (self).

The datasets must have distinct samples, the same sample schema, and the same split status (both split or both multi-allelic).

Parameters:right (VariantDataset) – right-hand variant dataset
Returns:Joined variant dataset
Return type:VariantDataset
ld_matrix(force_local=False)[source]

Computes the linkage disequilibrium (correlation) matrix for the variants in this VDS.

Examples

>>> ld_mat = vds.ld_matrix()

Notes

Each entry (i, j) in the LD matrix gives the \(r\) value between variants i and j, defined as Pearson’s correlation coefficient \(\rho_{x_i,x_j}\) between the two genotype vectors \(x_i\) and \(x_j\).

\[\rho_{x_i,x_j} = \frac{\mathrm{Cov}(X_i,X_j)}{\sigma_{X_i} \sigma_{X_j}}\]

Also note that variants with zero variance (\(\sigma = 0\)) will be dropped from the matrix.

Caution

The matrix returned by this function can easily be very large with most entries near zero (for example, entries between variants on different chromosomes in a homogenous population). Most likely you’ll want to reduce the number of variants with methods like sample_variants(), filter_variants_expr(), or ld_prune() before calling this unless your dataset is very small.

Parameters:force_local (bool) – If true, the LD matrix is computed using local matrix multiplication on the Spark driver. This may improve performance when the genotype matrix is small enough to easily fit in local memory. If false, the LD matrix is computed using distributed matrix multiplication if the number of genotypes exceeds \(5000^2\) and locally otherwise.
Returns:Matrix of r values between pairs of variants.
Return type:LDMatrix
ld_prune(r2=0.2, window=1000000, memory_per_core=256, num_cores=1)[source]

Prune variants in linkage disequilibrium (LD).

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Requires was_split equals True.

Examples

Export the set of common LD pruned variants to a file:

>>> vds_result = (vds.variant_qc()
...                  .filter_variants_expr("va.qc.AF >= 0.05 && va.qc.AF <= 0.95")
...                  .ld_prune()
...                  .export_variants("output/ldpruned.variants", "v"))

Notes

Variants are pruned in each contig from smallest to largest start position. The LD pruning algorithm is as follows:

pruned_set = []
for v1 in contig:
    keep = True
    for v2 in pruned_set:
        if ((v1.position - v2.position) <= window and correlation(v1, v2) >= r2):
            keep = False
    if keep:
        pruned_set.append(v1)

The parameter window defines the maximum distance in base pairs between two variants to check whether the variants are independent (\(R^2\) < r2) where r2 is the maximum \(R^2\) allowed. \(R^2\) is defined as the square of Pearson’s correlation coefficient \({\rho}_{x,y}\) between the two genotype vectors \({\mathbf{x}}\) and \({\mathbf{y}}\).

\[{\rho}_{x,y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}\]

ld_prune() with default arguments is equivalent to plink --indep-pairwise 1000kb 1 0.2. The list of pruned variants returned by Hail and PLINK will differ because Hail mean-imputes missing values and tests pairs of variants in a different order than PLINK.

Be sure to provide enough disk space per worker because ld_prune() persists up to 3 copies of the data to both memory and disk. The amount of disk space required will depend on the size and minor allele frequency of the input data and the prune parameters r2 and window. The number of bytes stored in memory per variant is about nSamples / 4 + 50.

Warning

The variants in the pruned set are not guaranteed to be identical each time ld_prune() is run. We recommend running ld_prune() once and exporting the list of LD pruned variants using export_variants() for future use.

Parameters:
  • r2 (float) – Maximum \(R^2\) threshold between two variants in the pruned set within a given window.
  • window (int) – Width of window in base-pairs for computing pair-wise \(R^2\) values.
  • memory_per_core (int) – Total amount of memory available for each core in MB. If unsure, use the default value.
  • num_cores (int) – The number of cores available. Equivalent to the total number of workers times the number of cores per worker.
Returns:

Variant dataset filtered to those variants which remain after LD pruning.

Return type:

VariantDataset

linreg(y, covariates=[], root='va.linreg', use_dosages=False, min_ac=1, min_af=0.0)[source]

Test each variant for association using linear regression.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Run linear regression per variant using a phenotype and two covariates stored in sample annotations:

>>> vds_result = vds.linreg('sa.pheno.height', covariates=['sa.pheno.age', 'sa.pheno.isFemale'])

Notes

The linreg() method computes, for each variant, statistics of the \(t\)-test for the genotype coefficient of the linear function of best fit from sample genotype and covariates to quantitative phenotype or case-control status. Hail only includes samples for which phenotype and all covariates are defined. For each variant, missing genotypes as the mean of called genotypes.

By default, genotypes values are given by hard call genotypes (g.gt). If use_dosages=True, then genotype values are defined by the dosage \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\). For Phred-scaled values, \(\mathrm{P}(\mathrm{Het})\) and \(\mathrm{P}(\mathrm{HomVar})\) are calculated by normalizing the PL likelihoods (converted from the Phred-scale) to sum to 1.

Assuming there are sample annotations sa.pheno.height, sa.pheno.age, sa.pheno.isFemale, and sa.cov.PC1, the code:

>>> vds_result = vds.linreg('sa.pheno.height', covariates=['sa.pheno.age', 'sa.pheno.isFemale', 'sa.cov.PC1'])

considers a model of the form

\[\mathrm{height} = \beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{isFemale} + \beta_4 \, \mathrm{PC1} + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]

where the genotype \(\mathrm{gt}\) is coded as \(0\) for HomRef, \(1\) for Het, and \(2\) for HomVar, and the Boolean covariate \(\mathrm{isFemale}\) is coded as \(1\) for true (female) and \(0\) for false (male). The null model sets \(\beta_1 = 0\).

Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations. One can further restrict computation to those variants with at least \(k\) observed alternate alleles (AC) or alternate allele frequency (AF) at least \(p\) in the included samples using the options min_ac=k or min_af=p, respectively. Unlike the filter_variants_expr() method, these filters do not remove variants from the underlying variant dataset; rather the linear regression annotations for variants with low AC or AF are set to missing. Adding both filters is equivalent to applying the more stringent of the two.

Phenotype and covariate sample annotations may also be specified using programmatic expressions without identifiers, such as:

>>> vds_result = vds.linreg('if (sa.pheno.isFemale) sa.pheno.age else (2 * sa.pheno.age + 10)')

For Boolean covariate types, true is coded as 1 and false as 0. In particular, for the sample annotation sa.fam.isCase added by importing a FAM file with case-control phenotype, case is 1 and control is 0.

The standard least-squares linear regression model is derived in Section 3.2 of The Elements of Statistical Learning, 2nd Edition. See equation 3.12 for the t-statistic which follows the t-distribution with \(n - k - 2\) degrees of freedom, under the null hypothesis of no effect, with \(n\) samples and \(k\) covariates in addition to genotype and intercept.

Annotations

With the default root, the following four variant annotations are added.

  • va.linreg.beta (Double) – fit genotype coefficient, \(\hat\beta_1\)
  • va.linreg.se (Double) – estimated standard error, \(\widehat{\mathrm{se}}\)
  • va.linreg.tstat (Double) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
  • va.linreg.pval (Double) – \(p\)-value
Parameters:
  • y (str) – Response expression
  • covariates (list of str) – list of covariate expressions
  • root (str) – Variant annotation path to store result of linear regression.
  • use_dosages (bool) – If true, use dosages genotypes rather than hard call genotypes.
  • min_ac (int) – Minimum alternate allele count.
  • min_af (float) – Minimum alternate allele frequency.
Returns:

Variant dataset with linear regression variant annotations.

Return type:

VariantDataset

linreg3(ys, covariates=[], root='va.linreg', use_dosages=False, variant_block_size=16)[source]

Test each variant for association with multiple phenotypes using linear regression.

This method runs linear regression for multiple phenotypes more efficiently than looping over linreg(). This method is more efficient than linreg_multi_pheno() but doesn’t implicitly filter on allele count or allele frequency.

Warning

linreg3() uses the same set of samples for each phenotype, namely the set of samples for which all phenotypes and covariates are defined.

Annotations

With the default root, the following four variant annotations are added. The indexing of the array annotations corresponds to that of y.

  • va.linreg.nCompleteSamples (Int) – number of samples used
  • va.linreg.AC (Double) – sum of the genotype values x
  • va.linreg.ytx (Array[Double]) – array of dot products of each phenotype vector y with the genotype vector x
  • va.linreg.beta (Array[Double]) – array of fit genotype coefficients, \(\hat\beta_1\)
  • va.linreg.se (Array[Double]) – array of estimated standard errors, \(\widehat{\mathrm{se}}\)
  • va.linreg.tstat (Array[Double]) – array of \(t\)-statistics, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
  • va.linreg.pval (Array[Double]) – array of \(p\)-values
Parameters:
  • ys – list of one or more response expressions.
  • covariates (list of str) – list of covariate expressions.
  • root (str) – Variant annotation path to store result of linear regression.
  • use_dosages (bool) – If true, use dosage genotypes rather than hard call genotypes.
  • variant_block_size (int) – Number of variant regressions to perform simultaneously. Larger block size requires more memmory.
Returns:

Variant dataset with linear regression variant annotations.

Return type:

VariantDataset

linreg_burden(key_name, variant_keys, single_key, agg_expr, y, covariates=[])[source]

Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the linear regression model.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Run a gene burden test using linear regression on the maximum genotype per gene. Here va.genes is a variant annotation of type Set[String] giving the set of genes containing the variant (see Extended example below for a deep dive):

>>> linreg_kt, sample_kt = (hc.read('data/example_burden.vds')
...     .linreg_burden(key_name='gene',
...                    variant_keys='va.genes',
...                    single_key=False,
...                    agg_expr='gs.map(g => g.gt).max()',
...                    y='sa.burden.pheno',
...                    covariates=['sa.burden.cov1', 'sa.burden.cov2']))

Run a gene burden test using linear regression on the weighted sum of genotypes per gene. Here va.gene is a variant annotation of type String giving a single gene per variant (or no gene if missing), and va.weight is a numeric variant annotation:

>>> linreg_kt, sample_kt = (hc.read('data/example_burden.vds')
...     .linreg_burden(key_name='gene',
...                    variant_keys='va.gene',
...                    single_key=True,
...                    agg_expr='gs.map(g => va.weight * g.gt).sum()',
...                    y='sa.burden.pheno',
...                    covariates=['sa.burden.cov1', 'sa.burden.cov2']))

To use a weighted sum of genotypes with missing genotypes mean-imputed rather than ignored, set agg_expr='gs.map(g => va.weight * orElse(g.gt.toDouble, 2 * va.qc.AF)).sum()' where va.qc.AF is the allele frequency over those samples that have no missing phenotype or covariates.

Caution

With single_key=False, variant_keys expects a variant annotation of Set or Array type, in order to allow each variant to have zero, one, or more keys (for example, the same variant may appear in multiple genes). Unlike with type Set, if the same key appears twice in a variant annotation of type Array, then that variant will be counted twice in that key’s group. With single_key=True, variant_keys expects a variant annotation whose value is itself the key of interest. In bose cases, variants with missing keys are ignored.

Notes

This method modifies linreg() by replacing the genotype covariate per variant and sample with an aggregated (i.e., collapsed) score per key and sample. This numeric score is computed from the sample’s genotypes and annotations over all variants with that key. Conceptually, the method proceeds as follows:

  1. Filter to the set of samples for which all phenotype and covariates are defined.

  2. For each key and sample, aggregate genotypes across variants with that key to produce a numeric score. agg_expr must be of numeric type and has the following symbols are in scope:

    • s (Sample): sample
    • sa: sample annotations
    • global: global annotations
    • gs (Aggregable[Genotype]): aggregable of Genotype for sample s

    Note that v, va, and g are accessible through Aggregable methods on gs.

    The resulting sample key table has key column key_name and a numeric column of scores for each sample named by the sample ID.

  3. For each key, fit the linear regression model using the supplied phenotype and covariates. The model is that of linreg() with sample genotype gt replaced by the score in the sample key table. For each key, missing scores are mean-imputed across all samples.

    The resulting linear regression key table has the following columns:

    • value of key_name (String) – descriptor of variant group key (key column)
    • beta (Double) – fit coefficient, \(\hat\beta_1\)
    • se (Double) – estimated standard error, \(\widehat{\mathrm{se}}\)
    • tstat (Double) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
    • pval (Double) – \(p\)-value

linreg_burden() returns both the linear regression key table and the sample key table.

Extended example

Let’s walk through these steps in the max() toy example above. There are six samples with the following annotations:

Sample pheno cov1 cov2
A 0 0 -1
B 0 2 3
C 1 1 5
D 1 -2 0
E 1 -2 -4
F 1 4 3

There are three variants with the following gt values:

Variant A B C D E F
1:1:A:C 0 1 0 0 0 1
1:2:C:T . 2 . 2 0 0
1:3:G:C 0 . 1 1 1 .

The va.genes annotation of type Set[String] on example_burden.vds was created using annotate_variants_table() with product=True on the interval list:

1	1	2	+	geneA
1	2	2	+	geneB
1	1	3	+	geneC

So there are three overlapping genes: gene A contains two variants, gene B contains one variant, and gene C contains all three variants.

gene 1:1:A:C 1:2:C:T 1:3:G:C
geneA X X  
geneB   X  
geneC X X X

Therefore annotate_variants_table() with product=True creates a variant annotation of type Set[String] with values Set('geneA', 'geneB'), Set('geneB'), and Set('geneA', 'geneB', 'geneC').

So the sample aggregation key table is:

gene A B C D E F
geneA 0 2 0 2 0 1
geneB NA 2 NA 2 0 0
geneC 0 2 1 2 1 1

Linear regression is done for each row using the supplied phenotype, covariates, and implicit intercept. The resulting linear regression key table is:

gene beta se tstat pval
geneA -0.084 0.368 -0.227 0.841
geneB -0.542 0.335 -1.617 0.247
geneC 0.075 0.515 0.145 0.898
Parameters:
  • key_name (str) – Name to assign to key column of returned key tables.
  • variant_keys (str) – Variant annotation path for the TArray or TSet of keys associated to each variant.
  • single_key (bool) – if true, variant_keys is interpreted as a single (or missing) key per variant, rather than as a collection of keys.
  • agg_expr (str) – Sample aggregation expression (per key).
  • y (str) – Response expression.
  • covariates (list of str) – list of covariate expressions.
Returns:

Tuple of linear regression key table and sample aggregation key table.

Return type:

(KeyTable, KeyTable)

linreg_multi_pheno(ys, covariates=[], root='va.linreg', use_dosages=False, min_ac=1, min_af=0.0)[source]

Test each variant for association with multiple phenotypes using linear regression.

This method runs linear regression for multiple phenotypes more efficiently than looping over linreg().

Warning

linreg_multi_pheno() uses the same set of samples for each phenotype, namely the set of samples for which all phenotypes and covariates are defined.

Annotations

With the default root, the following four variant annotations are added. The indexing of these annotations corresponds to that of y.

  • va.linreg.beta (Array[Double]) – array of fit genotype coefficients, \(\hat\beta_1\)
  • va.linreg.se (Array[Double]) – array of estimated standard errors, \(\widehat{\mathrm{se}}\)
  • va.linreg.tstat (Array[Double]) – array of \(t\)-statistics, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
  • va.linreg.pval (Array[Double]) – array of \(p\)-values
Parameters:
  • ys – list of one or more response expressions.
  • covariates (list of str) – list of covariate expressions.
  • root (str) – Variant annotation path to store result of linear regression.
  • use_dosages (bool) – If true, use dosage genotypes rather than hard call genotypes.
  • min_ac (int) – Minimum alternate allele count.
  • min_af (float) – Minimum alternate allele frequency.
Returns:

Variant dataset with linear regression variant annotations.

Return type:

VariantDataset

lmmreg(kinshipMatrix, y, covariates=[], global_root='global.lmmreg', va_root='va.lmmreg', run_assoc=True, use_ml=False, delta=None, sparsity_threshold=1.0, use_dosages=False, n_eigs=None, dropped_variance_fraction=None)[source]

Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Suppose the variant dataset saved at data/example_lmmreg.vds has a Boolean variant annotation va.useInKinship and numeric or Boolean sample annotations sa.pheno, sa.cov1, sa.cov2. Then the lmmreg() function in

>>> assoc_vds = hc.read("data/example_lmmreg.vds")
>>> kinship_matrix = assoc_vds.filter_variants_expr('va.useInKinship').rrm()
>>> lmm_vds = assoc_vds.lmmreg(kinship_matrix, 'sa.pheno', ['sa.cov1', 'sa.cov2'])

will execute the following four steps in order:

  1. filter to samples in given kinship matrix to those for which sa.pheno, sa.cov, and sa.cov2 are all defined
  2. compute the eigendecomposition \(K = USU^T\) of the kinship matrix
  3. fit covariate coefficients and variance parameters in the sample-covariates-only (global) model using restricted maximum likelihood (REML), storing results in global annotations under global.lmmreg
  4. test each variant for association, storing results under va.lmmreg in variant annotations

This plan can be modified as follows:

  • Set run_assoc=False to not test any variants for association, i.e. skip Step 5.
  • Set use_ml=True to use maximum likelihood instead of REML in Steps 4 and 5.
  • Set the delta argument to manually set the value of \(\delta\) rather that fitting \(\delta\) in Step 4.
  • Set the global_root argument to change the global annotation root in Step 4.
  • Set the va_root argument to change the variant annotation root in Step 5.

lmmreg() adds 9 or 13 global annotations in Step 4, depending on whether \(\delta\) is set or fit.

Annotation Type Value
global.lmmreg.useML Boolean true if fit by ML, false if fit by REML
global.lmmreg.beta Dict[String, Double] map from intercept and the given covariates expressions to the corresponding fit \(\beta\) coefficients
global.lmmreg.sigmaG2 Double fit coefficient of genetic variance, \(\hat{\sigma}_g^2\)
global.lmmreg.sigmaE2 Double fit coefficient of environmental variance \(\hat{\sigma}_e^2\)
global.lmmreg.delta Double fit ratio of variance component coefficients, \(\hat{\delta}\)
global.lmmreg.h2 Double fit narrow-sense heritability, \(\hat{h}^2\)
global.lmmreg.nEigs Int number of eigenvectors of kinship matrix used to fit model
global.lmmreg.dropped_variance_fraction Double specified value of dropped_variance_fraction
global.lmmreg.evals Array[Double] all eigenvalues of the kinship matrix in descending order
global.lmmreg.fit.seH2 Double standard error of \(\hat{h}^2\) under asymptotic normal approximation
global.lmmreg.fit.normLkhdH2 Array[Double] likelihood function of \(h^2\) normalized on the discrete grid 0.01, 0.02, ..., 0.99. Index i is the likelihood for percentage i.
global.lmmreg.fit.maxLogLkhd Double (restricted) maximum log likelihood corresponding to \(\hat{\delta}\)
global.lmmreg.fit.logDeltaGrid Array[Double] values of \(\mathrm{ln}(\delta)\) used in the grid search
global.lmmreg.fit.logLkhdVals Array[Double] (restricted) log likelihood of \(y\) given \(X\) and \(\mathrm{ln}(\delta)\) at the (RE)ML fit of \(\beta\) and \(\sigma_g^2\)

These global annotations are also added to hail.log, with the ranked evals and \(\delta\) grid with values in .tsv tabular form. Use grep 'lmmreg:' hail.log to find the lines just above each table.

If Step 5 is performed, lmmreg() also adds four linear regression variant annotations.

Annotation Type Value
va.lmmreg.beta Double fit genotype coefficient, \(\hat\beta_0\)
va.lmmreg.sigmaG2 Double fit coefficient of genetic variance component, \(\hat{\sigma}_g^2\)
va.lmmreg.chi2 Double \(\chi^2\) statistic of the likelihood ratio test
va.lmmreg.pval Double \(p\)-value

Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.

The simplest way to export all resulting annotations is:

>>> lmm_vds.export_variants('output/lmmreg.tsv.bgz', 'variant = v, va.lmmreg.*')
>>> lmmreg_results = lmm_vds.globals['lmmreg']

By default, genotypes values are given by hard call genotypes (g.gt). If use_dosages=True, then genotype values for per-variant association are defined by the dosage \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\). For Phred-scaled values, \(\mathrm{P}(\mathrm{Het})\) and \(\mathrm{P}(\mathrm{HomVar})\) are calculated by normalizing the PL likelihoods (converted from the Phred-scale) to sum to 1.

Performance

Hail’s initial version of lmmreg() scales beyond 15k samples and to an essentially unbounded number of variants, making it particularly well-suited to modern sequencing studies and complementary to tools designed for SNP arrays. Analysts have used lmmreg() in research to compute kinship from 100k common variants and test 32 million non-rare variants on 8k whole genomes in about 10 minutes on Google cloud.

While lmmreg() computes the kinship matrix \(K\) using distributed matrix multiplication (Step 2), the full eigendecomposition (Step 3) is currently run on a single core of master using the LAPACK routine DSYEVD, which we empirically find to be the most performant of the four available routines; laptop performance plots showing cubic complexity in \(n\) are available here. On Google cloud, eigendecomposition takes about 2 seconds for 2535 sampes and 1 minute for 8185 samples. If you see worse performance, check that LAPACK natives are being properly loaded (see “BLAS and LAPACK” in Getting Started).

Given the eigendecomposition, fitting the global model (Step 4) takes on the order of a few seconds on master. Association testing (Step 5) is fully distributed by variant with per-variant time complexity that is completely independent of the number of sample covariates and dominated by multiplication of the genotype vector \(v\) by the matrix of eigenvectors \(U^T\) as described below, which we accelerate with a sparse representation of \(v\). The matrix \(U^T\) has size about \(8n^2\) bytes and is currently broadcast to each Spark executor. For example, with 15k samples, storing \(U^T\) consumes about 3.6GB of memory on a 16-core worker node with two 8-core executors. So for large \(n\), we recommend using a high-memory configuration such as highmem workers.

Linear mixed model

lmmreg() estimates the genetic proportion of residual phenotypic variance (narrow-sense heritability) under a kinship-based linear mixed model, and then optionally tests each variant for association using the likelihood ratio test. Inference is exact.

We first describe the sample-covariates-only model used to estimate heritability, which we simply refer to as the global model. With \(n\) samples and \(c\) sample covariates, we define:

  • \(y = n \times 1\) vector of phenotypes
  • \(X = n \times c\) matrix of sample covariates and intercept column of ones
  • \(K = n \times n\) kinship matrix
  • \(I = n \times n\) identity matrix
  • \(\beta = c \times 1\) vector of covariate coefficients
  • \(\sigma_g^2 =\) coefficient of genetic variance component \(K\)
  • \(\sigma_e^2 =\) coefficient of environmental variance component \(I\)
  • \(\delta = \frac{\sigma_e^2}{\sigma_g^2} =\) ratio of environmental and genetic variance component coefficients
  • \(h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} = \frac{1}{1 + \delta} =\) genetic proportion of residual phenotypic variance

Under a linear mixed model, \(y\) is sampled from the \(n\)-dimensional multivariate normal distribution with mean \(X \beta\) and variance components that are scalar multiples of \(K\) and \(I\):

\[y \sim \mathrm{N}\left(X\beta, \sigma_g^2 K + \sigma_e^2 I\right)\]

Thus the model posits that the residuals \(y_i - X_{i,:}\beta\) and \(y_j - X_{j,:}\beta\) have covariance \(\sigma_g^2 K_{ij}\) and approximate correlation \(h^2 K_{ij}\). Informally: phenotype residuals are correlated as the product of overall heritability and pairwise kinship. By contrast, standard (unmixed) linear regression is equivalent to fixing \(\sigma_2\) (equivalently, \(h^2\)) at 0 above, so that all phenotype residuals are independent.

Caution: while it is tempting to interpret \(h^2\) as the narrow-sense heritability of the phenotype alone, note that its value depends not only the phenotype and genetic data, but also on the choice of sample covariates.

Fitting the global model

The core algorithm is essentially a distributed implementation of the spectral approach taken in FastLMM. Let \(K = USU^T\) be the eigendecomposition of the real symmetric matrix \(K\). That is:

  • \(U = n \times n\) orthonormal matrix whose columns are the eigenvectors of \(K\)
  • \(S = n \times n\) diagonal matrix of eigenvalues of \(K\) in descending order. \(S_{ii}\) is the eigenvalue of eigenvector \(U_{:,i}\)
  • \(U^T = n \times n\) orthonormal matrix, the transpose (and inverse) of \(U\)

A bit of matrix algebra on the multivariate normal density shows that the linear mixed model above is mathematically equivalent to the model

\[U^Ty \sim \mathrm{N}\left(U^TX\beta, \sigma_g^2 (S + \delta I)\right)\]

for which the covariance is diagonal (e.g., unmixed). That is, rotating the phenotype vector (\(y\)) and covariate vectors (columns of \(X\)) in \(\mathbb{R}^n\) by \(U^T\) transforms the model to one with independent residuals. For any particular value of \(\delta\), the restricted maximum likelihood (REML) solution for the latter model can be solved exactly in time complexity that is linear rather than cubic in \(n\). In particular, having rotated, we can run a very efficient 1-dimensional optimization procedure over \(\delta\) to find the REML estimate \((\hat{\delta}, \hat{\beta}, \hat{\sigma}_g^2)\) of the triple \((\delta, \beta, \sigma_g^2)\), which in turn determines \(\hat{\sigma}_e^2\) and \(\hat{h}^2\).

We first compute the maximum log likelihood on a \(\delta\)-grid that is uniform on the log scale, with \(\mathrm{ln}(\delta)\) running from -8 to 8 by 0.01, corresponding to \(h^2\) decreasing from 0.9995 to 0.0005. If \(h^2\) is maximized at the lower boundary then standard linear regression would be more appropriate and Hail will exit; more generally, consider using standard linear regression when \(\hat{h}^2\) is very small. A maximum at the upper boundary is highly suspicious and will also cause Hail to exit, with the hail.log recording all values over the grid for further inspection.

If the optimal grid point falls in the interior of the grid as expected, we then use Brent’s method to find the precise location of the maximum over the same range, with initial guess given by the optimal grid point and a tolerance on \(\mathrm{ln}(\delta)\) of 1e-6. If this location differs from the optimal grid point by more than 0.01, a warning will be displayed and logged, and one would be wise to investigate by plotting the values over the grid.

Note that \(h^2\) is related to \(\mathrm{ln}(\delta)\) through the sigmoid function. More precisely,

\[h^2 = 1 - \mathrm{sigmoid}(\mathrm{ln}(\delta)) = \mathrm{sigmoid}(-\mathrm{ln}(\delta))\]

Hence one can change variables to extract a high-resolution discretization of the likelihood function of \(h^2\) over \([0,1]\) at the corresponding REML estimators for \(\beta\) and \(\sigma_g^2\), as well as integrate over the normalized likelihood function using change of variables and the sigmoid differential equation.

For convenience, global.lmmreg.fit.normLkhdH2 records the the likelihood function of \(h^2\) normalized over the discrete grid 0.01, 0.02, ..., 0.98, 0.99. The length of the array is 101 so that index i contains the likelihood at percentage i. The values at indices 0 and 100 are left undefined.

By the theory of maximum likelihood estimation, this normalized likelihood function is approximately normally distributed near the maximum likelihood estimate. So we estimate the standard error of the estimator of \(h^2\) as follows. Let \(x_2\) be the maximum likelihood estimate of \(h^2\) and let \(x_ 1\) and \(x_3\) be just to the left and right of \(x_2\). Let \(y_1\), \(y_2\), and \(y_3\) be the corresponding values of the (unnormalized) log likelihood function. Setting equal the leading coefficient of the unique parabola through these points (as given by Lagrange interpolation) and the leading coefficient of the log of the normal distribution, we have:

\[\frac{x_3 (y_2 - y_1) + x_2 (y_1 - y_3) + x_1 (y_3 - y_2))}{(x_2 - x_1)(x_1 - x_3)(x_3 - x_2)} = -\frac{1}{2 \sigma^2}\]

The standard error \(\hat{\sigma}\) is then estimated by solving for \(\sigma\).

Note that the mean and standard deviation of the (discretized or continuous) distribution held in global.lmmreg.fit.normLkhdH2 will not coincide with \(\hat{h}^2\) and \(\hat{\sigma}\), since this distribution only becomes normal in the infinite sample limit. One can visually assess normality by plotting this distribution against a normal distribution with the same mean and standard deviation, or use this distribution to approximate credible intervals under a flat prior on \(h^2\).

Testing each variant for association

Fixing a single variant, we define:

  • \(v = n \times 1\) vector of genotypes, with missing genotypes imputed as the mean of called genotypes
  • \(X_v = \left[v | X \right] = n \times (1 + c)\) matrix concatenating \(v\) and \(X\)
  • \(\beta_v = (\beta^0_v, \beta^1_v, \ldots, \beta^c_v) = (1 + c) \times 1\) vector of covariate coefficients

Fixing \(\delta\) at the global REML estimate \(\hat{\delta}\), we find the REML estimate \((\hat{\beta}_v, \hat{\sigma}_{g,v}^2)\) via rotation of the model

\[y \sim \mathrm{N}\left(X_v\beta_v, \sigma_{g,v}^2 (K + \hat{\delta} I)\right)\]

Note that the only new rotation to compute here is \(U^T v\).

To test the null hypothesis that the genotype coefficient \(\beta^0_v\) is zero, we consider the restricted model with parameters \(((0, \beta^1_v, \ldots, \beta^c_v), \sigma_{g,v}^2)\) within the full model with parameters \((\beta^0_v, \beta^1_v, \ldots, \beta^c_v), \sigma_{g_v}^2)\), with \(\delta\) fixed at \(\hat\delta\) in both. The latter fit is simply that of the global model, \(((0, \hat{\beta}^1, \ldots, \hat{\beta}^c), \hat{\sigma}_g^2)\). The likelihood ratio test statistic is given by

\[\chi^2 = n \, \mathrm{ln}\left(\frac{\hat{\sigma}^2_g}{\hat{\sigma}_{g,v}^2}\right)\]

and follows a chi-squared distribution with one degree of freedom. Here the ratio \(\hat{\sigma}^2_g / \hat{\sigma}_{g,v}^2\) captures the degree to which adding the variant \(v\) to the global model reduces the residual phenotypic variance.

Kinship Matrix

FastLMM uses the Realized Relationship Matrix (RRM) for kinship. This can be computed with rrm(). However, any instance of KinshipMatrix may be used, so long as sample_list contains the complete samples of the caller variant dataset in the same order.

Low-rank approximation of kinship for improved performance

lmmreg() can implicitly use a low-rank approximation of the kinship matrix to more rapidly fit delta and the statistics for each variant. The computational complexity per variant is proportional to the number of eigenvectors used. This number can be specified in two ways. Specify the parameter n_eigs to use only the top n_eigs eigenvectors. Alternatively, specify dropped_variance_fraction to use as many eigenvectors as necessary to capture all but at most this fraction of the sample variance (also known as the trace, or the sum of the eigenvalues). For example, dropped_variance_fraction=0.01 will use the minimal number of eigenvectors to account for 99% of the sample variance. Specifying both parameters will apply the more stringent (fewest eigenvectors) of the two.

Further background

For the history and mathematics of linear mixed models in genetics, including FastLMM, see Christoph Lippert’s PhD thesis. For an investigation of various approaches to defining kinship, see Comparison of Methods to Account for Relatedness in Genome-Wide Association Studies with Family-Based Data.

Parameters:
  • kinshipMatrix (KinshipMatrix) – Kinship matrix to be used.
  • y (str) – Response sample annotation.
  • covariates (list of str) – List of covariate sample annotations.
  • global_root (str) – Global annotation root, a period-delimited path starting with global.
  • va_root (str) – Variant annotation root, a period-delimited path starting with va.
  • run_assoc (bool) – If true, run association testing in addition to fitting the global model.
  • use_ml (bool) – Use ML instead of REML throughout.
  • delta (float or None) – Fixed delta value to use in the global model, overrides fitting delta.
  • sparsity_threshold (float) – Genotype vector sparsity at or below which to use sparse genotype vector in rotation (advanced).
  • use_dosages (bool) – If true, use dosages rather than hard call genotypes.
  • n_eigs (int) – Number of eigenvectors of the kinship matrix used to fit the model.
  • dropped_variance_fraction (float) – Upper bound on fraction of sample variance lost by dropping eigenvectors with small eigenvalues.
Returns:

Variant dataset with linear mixed regression annotations.

Return type:

VariantDataset

logreg(test, y, covariates=[], root='va.logreg', use_dosages=False)[source]

Test each variant for association using logistic regression.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Run the logistic regression Wald test per variant using a Boolean phenotype and two covariates stored in sample annotations:

>>> vds_result = vds.logreg('wald', 'sa.pheno.isCase', covariates=['sa.pheno.age', 'sa.pheno.isFemale'])

Notes

The logreg() method performs, for each variant, a significance test of the genotype in predicting a binary (case-control) phenotype based on the logistic regression model. The phenotype type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.

Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’). Hail only includes samples for which the phenotype and all covariates are defined. For each variant, Hail imputes missing genotypes as the mean of called genotypes.

By default, genotypes values are given by hard call genotypes (g.gt). If use_dosages=True, then genotype values are defined by the dosage \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\). For Phred-scaled values, \(\mathrm{P}(\mathrm{Het})\) and \(\mathrm{P}(\mathrm{HomVar})\) are calculated by normalizing the PL likelihoods (converted from the Phred-scale) to sum to 1.

The example above considers a model of the form

\[\mathrm{Prob}(\mathrm{isCase}) = \mathrm{sigmoid}(\beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{isFemale} + \varepsilon), \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]

where \(\mathrm{sigmoid}\) is the sigmoid function, the genotype \(\mathrm{gt}\) is coded as 0 for HomRef, 1 for Het, and 2 for HomVar, and the Boolean covariate \(\mathrm{isFemale}\) is coded as 1 for true (female) and 0 for false (male). The null model sets \(\beta_1 = 0\).

The resulting variant annotations depend on the test statistic as shown in the tables below.

Test Annotation Type Value
Wald va.logreg.beta Double fit genotype coefficient, \(\hat\beta_1\)
Wald va.logreg.se Double estimated standard error, \(\widehat{\mathrm{se}}\)
Wald va.logreg.zstat Double Wald \(z\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
Wald va.logreg.pval Double Wald p-value testing \(\beta_1 = 0\)
LRT, Firth va.logreg.beta Double fit genotype coefficient, \(\hat\beta_1\)
LRT, Firth va.logreg.chi2 Double deviance statistic
LRT, Firth va.logreg.pval Double LRT / Firth p-value testing \(\beta_1 = 0\)
Score va.logreg.chi2 Double score statistic
Score va.logreg.pval Double score p-value testing \(\beta_1 = 0\)

For the Wald and likelihood ratio tests, Hail fits the logistic model for each variant using Newton iteration and only emits the above annotations when the maximum likelihood estimate of the coefficients converges. The Firth test uses a modified form of Newton iteration. To help diagnose convergence issues, Hail also emits three variant annotations which summarize the iterative fitting process:

Test Annotation Type Value
Wald, LRT, Firth va.logreg.fit.nIter Int number of iterations until convergence, explosion, or reaching the max (25 for Wald, LRT; 100 for Firth)
Wald, LRT, Firth va.logreg.fit.converged Boolean true if iteration converged
Wald, LRT, Firth va.logreg.fit.exploded Boolean true if iteration exploded

We consider iteration to have converged when every coordinate of \(\beta\) changes by less than \(10^{-6}\). For Wald and LRT, up to 25 iterations are attempted; in testing we find 4 or 5 iterations nearly always suffice. Convergence may also fail due to explosion, which refers to low-level numerical linear algebra exceptions caused by manipulating ill-conditioned matrices. Explosion may result from (nearly) linearly dependent covariates or complete separation.

A more common situation in genetics is quasi-complete seperation, e.g. variants that are observed only in cases (or controls). Such variants inevitably arise when testing millions of variants with very low minor allele count. The maximum likelihood estimate of \(\beta\) under logistic regression is then undefined but convergence may still occur after a large number of iterations due to a very flat likelihood surface. In testing, we find that such variants produce a secondary bump from 10 to 15 iterations in the histogram of number of iterations per variant. We also find that this faux convergence produces large standard errors and large (insignificant) p-values. To not miss such variants, consider using Firth logistic regression, linear regression, or group-based tests.

Here’s a concrete illustration of quasi-complete seperation in R. Suppose we have 2010 samples distributed as follows for a particular variant:

Status HomRef Het HomVar
Case 1000 10 0
Control 1000 0 0

The following R code fits the (standard) logistic, Firth logistic, and linear regression models to this data, where x is genotype, y is phenotype, and logistf is from the logistf package:

x <- c(rep(0,1000), rep(1,1000), rep(1,10)
y <- c(rep(0,1000), rep(0,1000), rep(1,10))
logfit <- glm(y ~ x, family=binomial())
firthfit <- logistf(y ~ x)
linfit <- lm(y ~ x)

The resulting p-values for the genotype coefficient are 0.991, 0.00085, and 0.0016, respectively. The erroneous value 0.991 is due to quasi-complete separation. Moving one of the 10 hets from case to control eliminates this quasi-complete separation; the p-values from R are then 0.0373, 0.0111, and 0.0116, respectively, as expected for a less significant association.

The Firth test reduces bias from small counts and resolves the issue of separation by penalizing maximum likelihood estimation by the Jeffrey’s invariant prior. This test is slower, as both the null and full model must be fit per variant, and convergence of the modified Newton method is linear rather than quadratic. For Firth, 100 iterations are attempted for the null model and, if that is successful, for the full model as well. In testing we find 20 iterations nearly always suffices. If the null model fails to converge, then the sa.lmmreg.fit annotations reflect the null model; otherwise, they reflect the full model.

See Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants for an empirical comparison of the logistic Wald, LRT, score, and Firth tests. The theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes Statistical Theory. Firth introduced his approach in Bias reduction of maximum likelihood estimates, 1993. Heinze and Schemper further analyze Firth’s approach in A solution to the problem of separation in logistic regression, 2002.

Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.

Phenotype and covariate sample annotations may also be specified using programmatic expressions without identifiers, such as:

if (sa.isFemale) sa.cov.age else (2 * sa.cov.age + 10)

For Boolean covariate types, true is coded as 1 and false as 0. In particular, for the sample annotation sa.fam.isCase added by importing a FAM file with case-control phenotype, case is 1 and control is 0.

Hail’s logistic regression tests correspond to the b.wald, b.lrt, and b.score tests in EPACTS. For each variant, Hail imputes missing genotypes as the mean of called genotypes, whereas EPACTS subsets to those samples with called genotypes. Hence, Hail and EPACTS results will currently only agree for variants with no missing genotypes.

Parameters:
  • test (str) – Statistical test, one of: ‘wald’, ‘lrt’, ‘score’, or ‘firth’.
  • y (str) – Response expression. Must evaluate to Boolean or numeric with all values 0 or 1.
  • covariates (list of str) – list of covariate expressions
  • root (str) – Variant annotation path to store result of logistic regression.
  • use_dosages (bool) – If true, use genotype dosage rather than hard call.
Returns:

Variant dataset with logistic regression variant annotations.

Return type:

VariantDataset

logreg_burden(key_name, variant_keys, single_key, agg_expr, test, y, covariates=[])[source]

Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the logistic regression model.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Run a gene burden test using the logistic Wald test on the maximum genotype per gene. Here va.genes is a variant annotation of type Set[String] giving the set of genes containing the variant (see Extended example in linreg_burden() for a deeper dive in the context of linear regression):

>>> logreg_kt, sample_kt = (hc.read('data/example_burden.vds')
...     .logreg_burden(key_name='gene',
...                    variant_keys='va.genes',
...                    single_key=False,
...                    agg_expr='gs.map(g => g.gt).max()',
...                    test='wald',
...                    y='sa.burden.pheno',
...                    covariates=['sa.burden.cov1', 'sa.burden.cov2']))

Run a gene burden test using the logistic score test on the weighted sum of genotypes per gene. Here va.gene is a variant annotation of type String giving a single gene per variant (or no gene if missing), and va.weight is a numeric variant annotation:

>>> logreg_kt, sample_kt = (hc.read('data/example_burden.vds')
...     .logreg_burden(key_name='gene',
...                    variant_keys='va.gene',
...                    single_key=True,
...                    agg_expr='gs.map(g => va.weight * g.gt).sum()',
...                    test='score',
...                    y='sa.burden.pheno',
...                    covariates=['sa.burden.cov1', 'sa.burden.cov2']))

To use a weighted sum of genotypes with missing genotypes mean-imputed rather than ignored, set agg_expr='gs.map(g => va.weight * orElse(g.gt.toDouble, 2 * va.qc.AF)).sum()' where va.qc.AF is the allele frequency over those samples that have no missing phenotype or covariates.

Caution

With single_key=False, variant_keys expects a variant annotation of Set or Array type, in order to allow each variant to have zero, one, or more keys (for example, the same variant may appear in multiple genes). Unlike with type Set, if the same key appears twice in a variant annotation of type Array, then that variant will be counted twice in that key’s group. With single_key=True, variant_keys expects a variant annotation whose value is itself the key of interest. In bose cases, variants with missing keys are ignored.

Notes

This method modifies logreg() by replacing the genotype covariate per variant and sample with an aggregated (i.e., collapsed) score per key and sample. This numeric score is computed from the sample’s genotypes and annotations over all variants with that key. The phenotype type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.

Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’) as the test parameter. Conceptually, the method proceeds as follows:

  1. Filter to the set of samples for which all phenotype and covariates are defined.

  2. For each key and sample, aggregate genotypes across variants with that key to produce a numeric score. agg_expr must be of numeric type and has the following symbols are in scope:

    • s (Sample): sample
    • sa: sample annotations
    • global: global annotations
    • gs (Aggregable[Genotype]): aggregable of Genotype for sample s

    Note that v, va, and g are accessible through Aggregable methods on gs.

    The resulting sample key table has key column key_name and a numeric column of scores for each sample named by the sample ID.

  3. For each key, fit the logistic regression model using the supplied phenotype, covariates, and test. The model and tests are those of logreg() with sample genotype gt replaced by the score in the sample key table. For each key, missing scores are mean-imputed across all samples.

    The resulting logistic regression key table has key column of type String given by the key_name parameter and additional columns corresponding to the fields of the va.logreg schema given for test in logreg().

logreg_burden() returns both the logistic regression key table and the sample key table.

Parameters:
  • key_name (str) – Name to assign to key column of returned key tables.
  • variant_keys (str) – Variant annotation path for the TArray or TSet of keys associated to each variant.
  • single_key (bool) – if true, variant_keys is interpreted as a single (or missing) key per variant, rather than as a collection of keys.
  • agg_expr (str) – Sample aggregation expression (per key).
  • test (str) – Statistical test, one of: ‘wald’, ‘lrt’, ‘score’, or ‘firth’.
  • y (str) – Response expression.
  • covariates (list of str) – list of covariate expressions.
Returns:

Tuple of logistic regression key table and sample aggregation key table.

Return type:

(KeyTable, KeyTable)

make_table(variant_expr, genotype_expr, key=[], separator='.')[source]

Produce a key with one row per variant and one or more columns per sample.

Examples

Consider a VariantDataset vds with 2 variants and 3 samples:

Variant       FORMAT  A       B       C
1:1:A:T       GT:GQ   0/1:99  ./.     0/0:99
1:2:G:C       GT:GQ   0/1:89  0/1:99  1/1:93

Then

>>> kt = vds.make_table('v = v', ['gt = g.gt', 'gq = g.gq'])

returns a KeyTable with schema

v: Variant
A.gt: Int
A.gq: Int
B.gt: Int
B.gq: Int
C.gt: Int
C.gq: Int

and values

v   A.gt    A.gq    B.gt    B.gq    C.gt    C.gq
1:1:A:T     1       99      NA      NA      0       99
1:2:G:C     1       89      1       99      2       93

The above table can be generated and exported as a TSV using KeyTable export().

Notes

Per sample field names in the result are formed by concatenating the sample ID with the genotype_expr left hand side with separator. If the left hand side is empty:

`` = expr

then the dot (.) is omitted.

Parameters:
  • variant_expr (str or list of str) – Variant annotation expressions.
  • genotype_expr (str or list of str) – Genotype annotation expressions.
  • key (str or list of str) – List of key columns.
  • separator (str) – Separator to use between sample IDs and genotype expression left-hand side identifiers.
Return type:

KeyTable

mendel_errors(pedigree)[source]

Find Mendel errors; count per variant, individual and nuclear family.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Find all violations of Mendelian inheritance in each (dad, mom, kid) trio in a pedigree and return four tables:

>>> ped = Pedigree.read('data/trios.fam')
>>> all, per_fam, per_sample, per_variant = vds.mendel_errors(ped)

Export all mendel errors to a text file:

>>> all.export('output/all_mendel_errors.tsv')

Annotate samples with the number of Mendel errors:

>>> annotated_vds = vds.annotate_samples_table(per_sample, root="sa.mendel")

Annotate variants with the number of Mendel errors:

>>> annotated_vds = vds.annotate_variants_table(per_variant, root="va.mendel")

Notes

This method assumes all contigs apart from X and Y are fully autosomal; mitochondria, decoys, etc. are not given special treatment.

The example above returns four tables, which contain Mendelian violations grouped in various ways. These tables are modeled after the PLINK mendel formats. The four tables contain the following columns:

First table: all Mendel errors. This table contains one row per Mendel error in the dataset; it is possible that a variant or sample may be found on more than one row. This table closely reflects the structure of the ”.mendel” PLINK format detailed below.

Columns:

  • fid (String) – Family ID.
  • s (String) – Proband ID.
  • v (Variant) – Variant in which the error was found.
  • code (Int) – Mendel error code, see below.
  • error (String) – Readable representation of Mendel error.

Second table: errors per nuclear family. This table contains one row per nuclear family in the dataset. This table closely reflects the structure of the ”.fmendel” PLINK format detailed below.

Columns:

  • fid (String) – Family ID.
  • father (String) – Paternal ID.
  • mother (String) – Maternal ID.
  • nChildren (Int) – Number of children in this nuclear family.
  • nErrors (Int) – Number of Mendel errors in this nuclear family.
  • nSNP (Int) – Number of Mendel errors at SNPs in this nuclear family.

Third table: errors per individual. This table contains one row per individual in the dataset, including founders. This table closely reflects the structure of the ”.imendel” PLINK format detailed below.

Columns:

  • s (String) – Sample ID (key column).
  • fid (String) – Family ID.
  • nErrors (Int) – Number of Mendel errors found involving this individual.
  • nSNP (Int) – Number of Mendel errors found involving this individual at SNPs.
  • error (String) – Readable representation of Mendel error.

Fourth table: errors per variant. This table contains one row per variant in the dataset.

Columns:

  • v (Variant) – Variant (key column).
  • nErrors (Int) – Number of Mendel errors in this variant.

PLINK Mendel error formats:

  • *.mendel – all mendel errors: FID KID CHR SNP CODE ERROR
  • *.fmendel – error count per nuclear family: FID PAT MAT CHLD N
  • *.imendel – error count per individual: FID IID N
  • *.lmendel – error count per variant: CHR SNP N

In the PLINK formats, FID, KID, PAT, MAT, and IID refer to family, kid, dad, mom, and individual ID, respectively, with missing values set to 0. SNP denotes the variant identifier chr:pos:ref:alt. N is the error count. CHLD is the number of children in a nuclear family.

The CODE of each Mendel error is determined by the table below, extending the Plink classification.

Those individuals implicated by each code are in bold.

The copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region (PAR).

  • HemiX – in non-PAR of X, male child
  • HemiY – in non-PAR of Y, male child
  • Auto – otherwise (in autosome or PAR, or female child)

Any refers to \(\{ HomRef, Het, HomVar, NoCall \}\) and ! denotes complement in this set.

Code Dad Mom Kid Copy State
1 HomVar HomVar Het Auto
2 HomRef HomRef Het Auto
3 HomRef ! HomRef HomVar Auto
4 ! HomRef HomRef HomVar Auto
5 HomRef HomRef HomVar Auto
6 HomVar ! HomVar HomRef Auto
7 ! HomVar HomVar HomRef Auto
8 HomVar HomVar HomRef Auto
9 Any HomVar HomRef HemiX
10 Any HomRef HomVar HemiX
11 HomVar Any HomRef HemiY
12 HomRef Any HomVar HemiY

This method only considers children with two parents and a defined sex.

PAR is currently defined with respect to reference GRCh37:

  • X: 60001 - 2699520, 154931044 - 155260560
  • Y: 10001 - 2649520, 59034050 - 59363566
Parameters:pedigree (Pedigree) – Sample pedigree.
Returns:Four tables with Mendel error statistics.
Return type:(KeyTable, KeyTable, KeyTable, KeyTable)
min_rep(max_shift=100)[source]

Gives minimal, left-aligned representation of alleles. Note that this can change the variant position.

Examples

1. Simple trimming of a multi-allelic site, no change in variant position 1:10000:TAA:TAA,AA => 1:10000:TA:T,A

2. Trimming of a bi-allelic site leading to a change in position 1:10000:AATAA,AAGAA => 1:10002:T:G

Parameters:max_shift (int) – maximum number of base pairs by which a split variant can move. Affects memory usage, and will cause Hail to throw an error if a variant that moves further is encountered.
Return type:VariantDataset
naive_coalesce(max_partitions)[source]

Naively descrease the number of partitions.

Warning

naive_coalesce() simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance, unlike repartition(), so it can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.

Parameters:max_partitions (int) – Desired number of partitions. If the current number of partitions is less than max_partitions, do nothing.
Returns:Variant dataset with the number of partitions equal to at most max_partitions
Return type:VariantDataset
num_partitions()[source]

Number of partitions.

Notes

The data in a variant dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. Partitions are a core concept of distributed computation in Spark, see here for details.

Return type:int
num_samples

Number of samples.

Return type:int
pc_relate(k, maf, block_size=512, min_kinship=-inf, statistics='all')[source]

Compute relatedness estimates between individuals using a variant of the PC-Relate method.

Danger

This method is experimental. We neither guarantee interface stability nor that the results are viable for any particular use.

Examples

Estimate kinship, identity-by-descent two, identity-by-descent one, and identity-by-descent zero for every pair of samples, using 5 prinicpal components to correct for ancestral populations, and a minimum minor allele frequency filter of 0.01:

>>> rel = vds.pc_relate(5, 0.01)

Calculate values as above, but when performing distributed matrix multiplications use a matrix-block-size of 1024 by 1024.

>>> rel = vds.pc_relate(5, 0.01, 1024)

Calculate values as above, excluding sample-pairs with kinship less than 0.1. This is more efficient than producing the full key table and filtering using filter().

>>> rel = vds.pc_relate(5, 0.01, min_kinship=0.1)

Method

The traditional estimator for kinship between a pair of individuals \(i\) and \(j\), sharing the set \(S_{ij}\) of single-nucleotide variants, from a population with allele frequencies \(p_s\), is given by:

\[\widehat{\phi_{ij}} := \frac{1}{|S_{ij}|}\sum_{s \in S_{ij}}\frac{(g_{is} - 2 p_s) (g_{js} - 2 p_s)}{4 * \sum_{s \in S_{ij} p_s (1 - p_s)}}\]

This estimator is true under the model that the sharing of common (relative to the population) alleles is not very informative to relatedness (because they’re common) and the sharing of rare alleles suggests a recent common ancestor from which the allele was inherited by descent.

When multiple ancestry groups are mixed in a sample, this model breaks down. Alleles that are rare in all but one ancestry group are treated as very informative to relatedness. However, these alleles are simply markers of the ancestry group. The PC-Relate method corrects for this situation and the related situation of admixed individuals.

PC-Relate slightly modifies the usual estimator for relatedness: occurences of population allele frequency are replaced with an “individual-specific allele frequency”. This modification allows the method to correctly weight an allele according to an individual’s unique ancestry profile.

The “individual-specific allele frequency” at a given genetic locus is modeled by PC-Relate as a linear function of their first k principal component coordinates. As such, the efficacy of this method rests on two assumptions:

  • an individual’s first k principal component coordinates fully describe their allele-frequency-relevant ancestry, and
  • the relationship between ancestry (as described by principal component coordinates) and population allele frequency is linear

The estimators for kinship, and identity-by-descent zero, one, and two follow. Let:

  • \(S_{ij}\) be the set of genetic loci at which both individuals \(i\) and \(j\) have a defined genotype
  • \(g_{is} \in {0, 1, 2}\) be the number of alternate alleles that individual \(i\) has at gentic locus \(s\)
  • \(\widehat{\mu_{is}} \in [0, 1]\) be the individual-specific allele frequency for individual \(i\) at genetic locus \(s\)
  • \({\widehat{\sigma^2_{is}}} := \widehat{\mu_{is}} (1 - \widehat{\mu_{is}})\), the binomial variance of \(\widehat{\mu_{is}}\)
  • \(\widehat{\sigma_{is}} := \sqrt{\widehat{\sigma^2_{is}}}\), the binomial standard deviation of \(\widehat{\mu_{is}}\)
  • \(\text{IBS}^{(0)}_{ij} := \sum_{s \in S_{ij}} \mathbb{1}_{||g_{is} - g_{js} = 2||}\), the number of genetic loci at which individuals \(i\) and \(j\) share no alleles
  • \(\widehat{f_i} := 2 \widehat{\phi_{ii}} - 1\), the inbreeding coefficient for individual \(i\)
  • \(g^D_{is}\) be a dominance encoding of the genotype matrix, and \(X_{is}\) be a normalized dominance-coded genotype matrix
\[ \begin{align}\begin{aligned}\begin{split}g^D_{is} := \begin{cases} \widehat{\mu_{is}} & g_{is} = 0 \\ 0 & g_{is} = 1 \\ 1 - \widehat{\mu_{is}} & g_{is} = 2 \end{cases}\end{split}\\X_{is} := g^D_{is} - \widehat{\sigma^2_{is}} (1 - \widehat{f_i})\end{aligned}\end{align} \]

The estimator for kinship is given by:

\[\widehat{\phi_{ij}} := \frac{\sum_{s \in S_{ij}}(g - 2 \mu)_{is} (g - 2 \mu)_{js}}{4 * \sum_{s \in S_{ij}}\widehat{\sigma_{is}} \widehat{\sigma_{js}}}\]

The estimator for identity-by-descent two is given by:

\[\widehat{k^{(2)}_{ij}} := \frac{\sum_{s \in S_{ij}}X_{is} X_{js}}{\sum_{s \in S_{ij}}\widehat{\sigma^2_{is}} \widehat{\sigma^2_{js}}}\]

The estimator for identity-by-descent zero is given by:

\[\begin{split}\widehat{k^{(0)}_{ij}} := \begin{cases} \frac{\text{IBS}^{(0)}_{ij}} {\sum_{s \in S_{ij}} \widehat{\mu_{is}}^2(1 - \widehat{\mu_{js}})^2 + (1 - \widehat{\mu_{is}})^2\widehat{\mu_{js}}^2} & \widehat{\phi_{ij}} > 2^{-5/2} \\ 1 - 4 \widehat{\phi_{ij}} + k^{(2)}_{ij} & \widehat{\phi_{ij}} \le 2^{-5/2} \end{cases}\end{split}\]

The estimator for identity-by-descent one is given by:

\[\widehat{k^{(1)}_{ij}} := 1 - \widehat{k^{(2)}_{ij}} - \widehat{k^{(0)}_{ij}}\]

Details

The PC-Relate method is described in “Model-free Estimation of Recent Genetic Relatedness”. Conomos MP, Reiner AP, Weir BS, Thornton TA. in American Journal of Human Genetics. 2016 Jan 7. The reference implementation is available in the GENESIS Bioconductor package .

pc_relate() differs from the reference implementation in a couple key ways:

  • the principal components analysis does not use an unrelated set of individuals
  • the estimators do not perform small sample correction
  • the algorithm does not provide an option to use population-wide allele frequency estimates
  • the algorithm does not provide an option to not use “overall standardization” (see R pcrelate documentation)

Notes

The block_size controls memory usage and parallelism. If it is large enough to hold an entire sample-by-sample matrix of 64-bit doubles in memory, then only one Spark worker node can be used to compute matrix operations. If it is too small, communication overhead will begin to dominate the computation’s time. The author has found that on Google Dataproc (where each core has about 3.75GB of memory), setting block_size larger than 512 tends to cause memory exhaustion errors.

The minimum allele frequency filter is applied per-pair: if either of the two individual’s individual-specific minor allele frequency is below the threshold, then the variant’s contribution to relatedness estimates is zero.

Under the PC-Relate model, kinship, [ phi_{ij} ], ranges from 0 to 0.5, and is precisely half of the fraction-of-genetic-material-shared. Listed below are the statistics for a few pairings:

  • Monozygotic twins share all their genetic material so their kinship statistic is 0.5 in expection.
  • Parent-child and sibling pairs both have kinship 0.25 in expectation and are separated by the identity-by-descent-zero, [ k^{(2)}_{ij} ], statistic which is zero for parent-child pairs and 0.25 for sibling pairs.
  • Avuncular pairs and grand-parent/-child pairs both have kinship 0.125 in expectation and both have identity-by-descent-zero 0.5 in expectation
  • “Third degree relatives” are those pairs sharing [ 2^{-3} = 12.5 % ] of their genetic material, the results of PCRelate are often too noisy to reliably distinguish these pairs from higher-degree-relative-pairs or unrelated pairs.

The resulting KeyTable entries have the type: { i: String, j: String, kin: Double, k2: Double, k1: Double, k0: Double }. The key list is: i: String, j: String.

Parameters:
  • k (int) – The number of principal components to use to distinguish ancestries.
  • maf (float) – The minimum individual-specific allele frequency for an allele used to measure relatedness.
  • block_size (int) – the side length of the blocks of the block- distributed matrices; this should be set such that at least three of these matrices fit in memory (in addition to all other objects necessary for Spark and Hail).
  • min_kinship (float) – Pairs of samples with kinship lower than min_kinship are excluded from the results.
  • statistics (str) – the set of statistics to compute, ‘phi’ will only compute the kinship statistic, ‘phik2’ will compute the kinship and identity-by-descent two statistics, ‘phik2k0’ will compute the kinship statistics and both identity-by-descent two and zero, ‘all’ computes the kinship statistic and all three identity-by-descent statistics.
Returns:

A KeyTable mapping pairs of samples to estimations of their kinship and identity-by-descent zero, one, and two.

Return type:

KeyTable

pca(scores, loadings=None, eigenvalues=None, k=10, as_array=False)[source]

Run Principal Component Analysis (PCA) on the matrix of genotypes.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Compute the top 5 principal component scores, stored as sample annotations sa.scores.PC1, ..., sa.scores.PC5 of type Double:

>>> vds_result = vds.pca('sa.scores', k=5)

Compute the top 5 principal component scores, loadings, and eigenvalues, stored as annotations sa.scores, va.loadings, and global.evals of type Array[Double]:

>>> vds_result = vds.pca('sa.scores', 'va.loadings', 'global.evals', 5, as_array=True)

Notes

Hail supports principal component analysis (PCA) of genotype data, a now-standard procedure Patterson, Price and Reich, 2006. This method expects a variant dataset with biallelic autosomal variants. Scores are computed and stored as sample annotations of type Struct by default; variant loadings and eigenvalues can optionally be computed and stored in variant and global annotations, respectively.

PCA is based on the singular value decomposition (SVD) of a standardized genotype matrix \(M\), computed as follows. An \(n \times m\) matrix \(C\) records raw genotypes, with rows indexed by \(n\) samples and columns indexed by \(m\) bialellic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as

\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{2p_j(1-p_j)m}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in Hardy-Weinberg equilibrium and is further motivated in the paper cited above. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the sample correlation or genetic relationship matrix (GRM) as simply \(MM^T\).

PCA then computes the SVD

\[M = USV^T\]

where columns of \(U\) are left singular vectors (orthonormal in \(\mathbb{R}^n\)), columns of \(V\) are right singular vectors (orthonormal in \(\mathbb{R}^m\)), and \(S=\mathrm{diag}(s_1, s_2, \ldots)\) with ordered singular values \(s_1 \ge s_2 \ge \cdots \ge 0\). Typically one computes only the first \(k\) singular vectors and values, yielding the best rank \(k\) approximation \(U_k S_k V_k^T\) of \(M\); the truncations \(U_k\), \(S_k\) and \(V_k\) are \(n \times k\), \(k \times k\) and \(m \times k\) respectively.

From the perspective of the samples or rows of \(M\) as data, \(V_k\) contains the variant loadings for the first \(k\) PCs while \(MV_k = U_k S_k\) contains the first \(k\) PC scores of each sample. The loadings represent a new basis of features while the scores represent the projected data on those features. The eigenvalues of the GRM \(MM^T\) are the squares of the singular values \(s_1^2, s_2^2, \ldots\), which represent the variances carried by the respective PCs. By default, Hail only computes the loadings if the loadings parameter is specified.

Note: In PLINK/GCTA the GRM is taken as the starting point and it is computed slightly differently with regard to missing data. Here the \(ij\) entry of \(MM^T\) is simply the dot product of rows \(i\) and \(j\) of \(M\); in terms of \(C\) it is

\[\frac{1}{m}\sum_{l\in\mathcal{C}_i\cap\mathcal{C}_j}\frac{(C_{il}-2p_l)(C_{jl} - 2p_l)}{2p_l(1-p_l)}\]

where \(\mathcal{C}_i = \{l \mid C_{il} \text{ is non-missing}\}\). In PLINK/GCTA the denominator \(m\) is replaced with the number of terms in the sum \(\lvert\mathcal{C}_i\cap\mathcal{C}_j\rvert\), i.e. the number of variants where both samples have non-missing genotypes. While this is arguably a better estimator of the true GRM (trading shrinkage for noise), it has the drawback that one loses the clean interpretation of the loadings and scores as features and projections.

Separately, for the PCs PLINK/GCTA output the eigenvectors of the GRM; even ignoring the above discrepancy that means the left singular vectors \(U_k\) instead of the component scores \(U_k S_k\). While this is just a matter of the scale on each PC, the scores have the advantage of representing true projections of the data onto features with the variance of a score reflecting the variance explained by the corresponding feature. (In PC bi-plots this amounts to a change in aspect ratio; for use of PCs as covariates in regression it is immaterial.)

Annotations

Given root scores='sa.scores' and as_array=False, pca() adds a Struct to sample annotations:

  • sa.scores (Struct) – Struct of sample scores

With k=3, the Struct has three field:

  • sa.scores.PC1 (Double) – Score from first PC
  • sa.scores.PC2 (Double) – Score from second PC
  • sa.scores.PC3 (Double) – Score from third PC

Analogous variant and global annotations of type Struct are added by specifying the loadings and eigenvalues arguments, respectively.

Given roots scores='sa.scores', loadings='va.loadings', and eigenvalues='global.evals', and as_array=True, pca() adds the following annotations:

  • sa.scores (Array[Double]) – Array of sample scores from the top k PCs
  • va.loadings (Array[Double]) – Array of variant loadings in the top k PCs
  • global.evals (Array[Double]) – Array of the top k eigenvalues
Parameters:
  • scores (str) – Sample annotation path to store scores.
  • loadings (str or None) – Variant annotation path to store site loadings.
  • eigenvalues (str or None) – Global annotation path to store eigenvalues.
  • k (bool or None) – Number of principal components.
  • as_array (bool) – Store annotations as type Array rather than Struct
Returns:

Dataset with new PCA annotations.

Return type:

VariantDataset

persist(storage_level='MEMORY_AND_DISK')[source]

Persist this variant dataset to memory and/or disk.

Examples

Persist the variant dataset to both memory and disk:

>>> vds_result = vds.persist()

Notes

The persist() and cache() methods allow you to store the current dataset on disk or in memory to avoid redundant computation and improve the performance of Hail pipelines.

cache() is an alias for persist("MEMORY_ONLY"). Most users will want “MEMORY_AND_DISK”. See the Spark documentation for a more in-depth discussion of persisting data.

Warning

Persist, like all other VariantDataset functions, is functional. Its output must be captured. This is wrong:

>>> vds = vds.linreg('sa.phenotype') 
>>> vds.persist() 

The above code does NOT persist vds. Instead, it copies vds and persists that result. The proper usage is this:

>>> vds = vds.pca().persist() 
Parameters:storage_level – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Return type:VariantDataset
query_genotypes(exprs)[source]

Performs aggregation queries over genotypes, and returns Python object(s).

Examples

Compute global GQ histogram

>>> gq_hist = vds.query_genotypes('gs.map(g => g.gq).hist(0, 100, 100)')

Compute call rate

>>> call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled)')

Compute GQ and DP histograms

>>> [gq_hist, dp_hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)',
...                                                     'gs.map(g => g.dp).hist(0, 60, 60)'])
Parameters:exprs (str or list of str) – query expressions
Return type:annotation or list of annotation
query_genotypes_typed(exprs)[source]

Performs aggregation queries over genotypes, and returns Python object(s) and type(s).

Examples

>>> gq_hist, t = vds.query_genotypes_typed('gs.map(g => g.gq).hist(0, 100, 100)')
>>> [gq_hist, dp_hist], [t1, t2] = vds.query_genotypes_typed(['gs.map(g => g.gq).hist(0, 100, 100)',
...                                                           'gs.map(g => g.dp).hist(0, 60, 60)'])

See query_genotypes() for more information.

This method evaluates Hail expressions over genotypes, along with all variant and sample metadata for that genotype. The exprs argument requires either a list of strings or a single string The method returns a list of results and a list of types (which each contain one element if the input parameter was a single str).

The namespace of the expressions includes:

  • global: global annotations
  • gs (Aggregable[Genotype]): aggregable of Genotype

Map and filter expressions on this aggregable have the following namespace:

  • global: global annotations
  • g: Genotype
  • v: Variant
  • va: variant annotations
  • s: sample
  • sa: sample annotations

Performance Note It is far faster to execute multiple queries in one method than to execute multiple query methods. This:

>>> result1 = vds.query_genotypes('gs.count()')
>>> result2 = vds.query_genotypes('gs.filter(g => v.altAllele.isSNP() && g.isHet).count()')

will be nearly twice as slow as this:

>>> exprs = ['gs.count()', 'gs.filter(g => v.altAllele.isSNP() && g.isHet).count()']
>>> [geno_count, snp_hets] = vds.query_genotypes(exprs)
Parameters:exprs (str or list of str) – query expressions
Return type:(annotation or list of annotation, Type or list of Type)
query_samples(exprs)[source]

Performs aggregation queries over samples and sample annotations, and returns Python object(s).

Examples

>>> low_callrate_samples = vds.query_samples('samples.filter(s => sa.qc.callRate < 0.95).collect()')

Notes

This method evaluates Hail expressions over samples and sample annotations. The exprs argument requires either a single string or a list of strings. If a single string was passed, then a single result is returned. If a list is passed, a list is returned.

The namespace of the expressions includes:

  • global: global annotations
  • samples (Aggregable[Sample]): aggregable of sample

Map and filter expressions on this aggregable have the additional namespace:

  • global: global annotations
  • s: sample
  • sa: sample annotations
Parameters:exprs (str or list of str) – query expressions
Return type:annotation or list of annotation
query_samples_typed(exprs)[source]

Performs aggregation queries over samples and sample annotations, and returns Python object(s) and type(s).

Examples

>>> low_callrate_samples, t = vds.query_samples_typed(
...    'samples.filter(s => sa.qc.callRate < 0.95).collect()')

See query_samples() for more information.

Parameters:exprs (str or list of str) – query expressions
Return type:(annotation or list of annotation, Type or list of Type)
query_variants(exprs)[source]

Performs aggregation queries over variants and variant annotations, and returns Python object(s).

Examples

>>> lof_variant_count = vds.query_variants('variants.filter(v => va.consequence == "LOF").count()')
>>> [lof_variant_count, missense_count] = vds.query_variants([
...     'variants.filter(v => va.consequence == "LOF").count()',
...     'variants.filter(v => va.consequence == "Missense").count()'])

Notes

This method evaluates Hail expressions over variants and variant annotations. The exprs argument requires either a single string or a list of strings. If a single string was passed, then a single result is returned. If a list is passed, a list is returned.

The namespace of the expressions includes:

  • global: global annotations
  • variants (Aggregable[Variant]): aggregable of Variant

Map and filter expressions on this aggregable have the additional namespace:

  • global: global annotations
  • v: Variant
  • va: variant annotations

Performance Note It is far faster to execute multiple queries in one method than to execute multiple query methods. The combined query:

>>> exprs = ['variants.count()', 'variants.filter(v => v.altAllele.isSNP()).count()']
>>> [num_variants, num_snps] = vds.query_variants(exprs)

will be nearly twice as fast as the split query:

>>> result1 = vds.query_variants('variants.count()')
>>> result2 = vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')
Parameters:exprs (str or list of str) – query expressions
Return type:annotation or list of annotation
query_variants_typed(exprs)[source]

Performs aggregation queries over variants and variant annotations, and returns Python object(s) and type(s).

Examples

>>> lof_variant_count, t = vds.query_variants_typed(
...     'variants.filter(v => va.consequence == "LOF").count()')
>>> [lof_variant_count, missense_count], [t1, t2] = vds.query_variants_typed([
...     'variants.filter(v => va.consequence == "LOF").count()',
...     'variants.filter(v => va.consequence == "Missense").count()'])

See query_variants() for more information.

Parameters:exprs (str or list of str) – query expressions
Return type:(annotation or list of annotation, Type or list of Type)
rename_samples(mapping)[source]

Rename samples.

Examples

>>> vds_result = vds.rename_samples({'ID1': 'id1', 'ID2': 'id2'})

Use a file with an “old_id” and “new_id” column to rename samples:

>>> mapping_table = hc.import_table('data/sample_mapping.txt')
>>> mapping_dict = {row.old_id: row.new_id for row in mapping_table.collect()}
>>> vds_result = vds.rename_samples(mapping_dict)
Parameters:mapping (dict) – Mapping from old to new sample IDs.
Returns:Dataset with remapped sample IDs.
Return type:VariantDataset
repartition(num_partitions, shuffle=True)[source]

Increase or decrease the number of variant dataset partitions.

Examples

Repartition the variant dataset to have 500 partitions:

>>> vds_result = vds.repartition(500)

Notes

Check the current number of partitions with num_partitions().

The data in a variant dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. When a variant dataset with \(M\) variants is first imported, each of the \(k\) partition will contain about \(M/k\) of the variants. Since each partition has some computational overhead, decreasing the number of partitions can improve performance after significant filtering. Since it’s recommended to have at least 2 - 4 partitions per core, increasing the number of partitions can allow one to take advantage of more cores.

Partitions are a core concept of distributed computation in Spark, see here for details. With shuffle=True, Hail does a full shuffle of the data and creates equal sized partitions. With shuffle=False, Hail combines existing partitions to avoid a full shuffle. These algorithms correspond to the repartition and coalesce commands in Spark, respectively. In particular, when shuffle=False, num_partitions cannot exceed current number of partitions.

Parameters:
  • num_partitions (int) – Desired number of partitions, must be less than the current number if shuffle=False
  • shuffle (bool) – If true, use full shuffle to repartition.
Returns:

Variant dataset with the number of partitions equal to at most num_partitions

Return type:

VariantDataset

rowkey_schema

Returns the signature of the row key (variant) contained in this VDS.

Examples

>>> print(vds.rowkey_schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(vds.rowkey_schema)
Return type:Type
rrm(force_block=False, force_gramian=False)[source]

Computes the Realized Relationship Matrix (RRM).

Examples

>>> kinship_matrix = vds.rrm()

Notes

The Realized Relationship Matrix is defined as follows. Consider the \(n \times m\) matrix \(C\) of raw genotypes, with rows indexed by \(n\) samples and columns indexed by the \(m\) bialellic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as

\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{\frac{m}{n} \sum_{k=1}^n (C_{ij}-2p_j)^2}},\]

with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes each variant column to have empirical variance \(1/m\), which gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the \(n \times n\) sample correlation or realized relationship matrix (RRM) \(K\) as simply

\[K = MM^T\]

Note that the only difference between the Realized Relationship Matrix and the Genetic Relationship Matrix (GRM) used in grm() is the variant (column) normalization: where RRM uses empirical variance, GRM uses expected variance under Hardy-Weinberg Equilibrium.

Parameters:
  • force_block (bool) – Force using Spark’s BlockMatrix to compute kinship (advanced).
  • force_gramian (bool) – Force using Spark’s RowMatrix.computeGramian to compute kinship (advanced).
Returns:

Realized Relationship Matrix for all samples.

Return type:

KinshipMatrix

same(other, tolerance=1e-06)[source]

True if the two variant datasets have the same variants, samples, genotypes, and annotation schemata and values.

Examples

This will return True:

>>> vds.same(vds)

Notes

The tolerance parameter sets the tolerance for equality when comparing floating-point fields. More precisely, \(x\) and \(y\) are equal if

\[bs{x - y} \leq tolerance * \max{bs{x}, bs{y}}\]
Parameters:
  • other (VariantDataset) – variant dataset to compare against
  • tolerance (float) – floating-point tolerance for equality
Return type:

bool

sample_annotations

Return a dict of sample annotations.

The keys of this dictionary are the sample IDs (strings). The values are sample annotations.

Returns:dict
sample_ids

Return sampleIDs.

Returns:List of sample IDs.
Return type:list of str
sample_qc(root='sa.qc', keep_star=False)[source]

Compute per-sample QC metrics.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Annotations

sample_qc() computes 20 sample statistics from the genotype data and stores the results as sample annotations that can be accessed with sa.qc.<identifier> (or <root>.<identifier> if a non-default root was passed):

Name Type Description
callRate Double Fraction of genotypes called
nHomRef Int Number of homozygous reference genotypes
nHet Int Number of heterozygous genotypes
nHomVar Int Number of homozygous alternate genotypes
nCalled Int Sum of nHomRef + nHet + nHomVar
nNotCalled Int Number of uncalled genotypes
nSNP Int Number of SNP alternate alleles
nInsertion Int Number of insertion alternate alleles
nDeletion Int Number of deletion alternate alleles
nSingleton Int Number of private alleles
nTransition Int Number of transition (A-G, C-T) alternate alleles
nTransversion Int Number of transversion alternate alleles
nNonRef Int Sum of nHet and nHomVar
rTiTv Double Transition/Transversion ratio
rHetHomVar Double Het/HomVar genotype ratio
rInsertionDeletion Double Insertion/Deletion ratio
dpMean Double Depth mean across all genotypes
dpStDev Double Depth standard deviation across all genotypes
gqMean Double The average genotype quality across all genotypes
gqStDev Double Genotype quality standard deviation across all genotypes

Missing values NA may result (for example, due to division by zero) and are handled properly in filtering and written as “NA” in export modules. The empirical standard deviation is computed with zero degrees of freedom.

Parameters:
  • root (str) – Sample annotation root for the computed struct.
  • keep_star (bool) – Count star alleles as non-reference alleles
Returns:

Annotated variant dataset with new sample qc annotations.

Return type:

VariantDataset

sample_schema

Returns the signature of the sample annotations contained in this VDS.

Examples

>>> print(vds.sample_schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(vds.sample_schema)
Return type:Type
sample_variants(fraction, seed=1)[source]

Downsample variants to a given fraction of the dataset.

Examples

>>> small_vds = vds.sample_variants(0.01)

Notes

This method may not sample exactly (fraction * n_variants) variants from the dataset.

Parameters:
  • fraction (float) – (Expected) fraction of variants to keep.
  • seed (int) – Random seed.
Returns:

Downsampled variant dataset.

Return type:

VariantDataset

samples_table()[source]

Convert samples and sample annotations to KeyTable.

The resulting KeyTable has schema:

Struct {
  s: Sample
  sa: sample annotations
}

with a single key s.

Returns:Key table with samples and sample annotations.
Return type:KeyTable
set_va_attributes(ann_path, attributes)[source]

Sets attributes for a variant annotation. Attributes are key/value pairs that can be attached to a variant annotation field.

The following attributes are read from the VCF header when importing a VCF and written to the VCF header when exporting a VCF:

  • INFO fields attributes (attached to (va.info.*)):
    • ‘Number’: The arity of the field. Can take values
      • 0 (Boolean flag),
      • 1 (single value),
      • R (one value per allele, including the reference),
      • A (one value per non-reference allele),
      • G (one value per genotype), and
      • . (any number of values)
        • When importing: The value in read from the VCF INFO field definition
        • When exporting: The default value is 0 for Boolean, . for Arrays and 1 for all other types
      • ‘Description’ (default is ‘’)
  • FILTER entries in the VCF header are generated based on the attributes of va.filters. Each key/value pair in the attributes will generate a FILTER entry in the VCF with ID = key and Description = value.

Examples

Consider the following command which adds a filter and an annotation to the VDS (we’re assuming a split VDS for simplicity):

  1. an INFO field AC_HC, which stores the allele count of high confidence genotypes (DP >= 10, GQ >= 20) for each non-reference allele,
  2. a filter HardFilter that filters all sites with the GATK suggested hard filters:
    • For SNVs: QD < 2.0 || FS < 60 || MQ < 40 || MQRankSum < -12.5 || ReadPosRankSum < -8.0
    • For Indels (and other complex): QD < 2.0 || FS < 200.0 || ReadPosRankSum < 20.0
>>> annotated_vds = vds.annotate_variants_expr([
... 'va.info.AC_HC = gs.filter(g => g.dp >= 10 && g.gq >= 20).callStats(g => v).AC[1:]',
... 'va.filters = if((v.altAllele.isSNP && (va.info.QD < 2.0 || va.info.FS < 60 || va.info.MQ < 40 || ' +
... 'va.info.MQRankSum < -12.5 || va.info.ReadPosRankSum < -8.0)) || ' +
... '(va.info.QD < 2.0 || va.info.FS < 200.0 || va.info.ReadPosRankSum < 20.0)) va.filters.add("HardFilter") else va.filters'])

If we now export this VDS as VCF, it would produce the following header (for these new fields):

##INFO=<ID=AC_HC,Number=.,Type=String,Description=""

This header doesn’t contain all information that should be present in an optimal VCF header: 1) There is no FILTER entry for HardFilter 2) Since AC_HC has one entry per non-reference allele, its Number should be A 3) AC_HC should have a Description

We can fix this by setting the attributes of these fields:

>>> annotated_vds = (annotated_vds
...     .set_va_attributes(
...         'va.info.AC_HC',
...         {'Description': 'Allele count for high quality genotypes (DP >= 10, GQ >= 20)',
...          'Number': 'A'})
...     .set_va_attributes(
...         'va.filters',
...         {'HardFilter': 'This site fails GATK suggested hard filters.'}))

Exporting the VDS with the attributes now prints the following header lines:

##INFO=<ID=test,Number=A,Type=String,Description="Allele count for high quality genotypes (DP >= 10, GQ >= 20)"
##FILTER=<ID=HardFilter,Description="This site fails GATK suggested hard filters.">
Parameters:
  • ann_path (str) – Path to variant annotation beginning with va.
  • attributes (dict) – A str-str dict containing the attributes to set
Returns:

Annotated dataset with the attribute added to the variant annotation.

Return type:

VariantDataset

split_multi(propagate_gq=False, keep_star_alleles=False, max_shift=100)[source]

Split multiallelic variants.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

>>> vds.split_multi().write('output/split.vds')

Notes

We will explain by example. Consider a hypothetical 3-allelic variant:

A   C,T 0/2:7,2,6:15:45:99,50,99,0,45,99

split_multi will create two biallelic variants (one for each alternate allele) at the same position

A   C   0/0:13,2:15:45:0,45,99
A   T   0/1:9,6:15:50:50,0,99

Each multiallelic GT field is downcoded once for each alternate allele. A call for an alternate allele maps to 1 in the biallelic variant corresponding to itself and 0 otherwise. For example, in the example above, 0/2 maps to 0/0 and 0/1. The genotype 1/2 maps to 0/1 and 0/1.

The biallelic alt AD entry is just the multiallelic AD entry corresponding to the alternate allele. The ref AD entry is the sum of the other multiallelic entries.

The biallelic DP is the same as the multiallelic DP.

The biallelic PL entry for for a genotype g is the minimum over PL entries for multiallelic genotypes that downcode to g. For example, the PL for (A, T) at 0/1 is the minimum of the PLs for 0/1 (50) and 1/2 (45), and thus 45.

Fixing an alternate allele and biallelic variant, downcoding gives a map from multiallelic to biallelic alleles and genotypes. The biallelic AD entry for an allele is just the sum of the multiallelic AD entries for alleles that map to that allele. Similarly, the biallelic PL entry for a genotype is the minimum over multiallelic PL entries for genotypes that map to that genotype.

By default, GQ is recomputed from PL. If propagate_gq=True is passed, the biallelic GQ field is simply the multiallelic GQ field, that is, genotype qualities are unchanged.

Here is a second example for a het non-ref

A   C,T 1/2:2,8,6:16:45:99,50,99,45,0,99

splits as

A   C   0/1:8,8:16:45:45,0,99
A   T   0/1:10,6:16:50:50,0,99

VCF Info Fields

Hail does not split annotations in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index annotation va.aIndex can be used to select the value corresponding to the split allele’s position:

>>> vds_result = (vds.split_multi()
...     .filter_variants_expr('va.info.AC[va.aIndex - 1] < 10', keep = False))

VCFs split by Hail and exported to new VCFs may be incompatible with other tools, if action is not taken first. Since the “Number” of the arrays in split multiallelic sites no longer matches the structure on import (“A” for 1 per allele, for example), Hail will export these fields with number ”.”.

If the desired output is one value per site, then it is possible to use annotate_variants_expr to remap these values. Here is an example:

>>> (vds.split_multi()
...     .annotate_variants_expr('va.info.AC = va.info.AC[va.aIndex - 1]')
...     .export_vcf('output/export.vcf'))

The info field AC in data/export.vcf will have Number=1.

Annotations

split_multi() adds the following annotations:

  • va.wasSplit (Boolean) – true if this variant was originally multiallelic, otherwise false.
  • va.aIndex (Int) – The original index of this alternate allele in the multiallelic representation (NB: 1 is the first alternate allele or the only alternate allele in a biallelic variant). For example, 1:100:A:T,C splits into two variants: 1:100:A:T with aIndex = 1 and 1:100:A:C with aIndex = 2.
Parameters:
  • propagate_gq (bool) – Set the GQ of output (split) genotypes to be the GQ of the input (multi-allelic) variants instead of recompute GQ as the difference between the two smallest PL values. Intended to be used in conjunction with import_vcf(store_gq=True). This option will be obviated in the future by generic genotype schemas. Experimental.
  • keep_star_alleles (bool) – Do not filter out * alleles.
  • max_shift (int) – maximum number of base pairs by which a split variant can move. Affects memory usage, and will cause Hail to throw an error if a variant that moves further is encountered.
Returns:

A biallelic variant dataset.

Return type:

VariantDataset

storage_level()[source]

Returns the storage (persistence) level of the variant dataset.

Notes

See the Spark documentation for details on persistence levels.

Return type:str
summarize()[source]

Returns a summary of useful information about the dataset.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

>>> s = vds.summarize()
>>> print(s.contigs)
>>> print('call rate is %.2f' % s.call_rate)
>>> s.report()

The following information is contained in the summary:

  • samples (int) - Number of samples.
  • variants (int) - Number of variants.
  • call_rate (float) - Fraction of all genotypes called.
  • contigs (list of str) - List of all unique contigs found in the dataset.
  • multiallelics (int) - Number of multiallelic variants.
  • snps (int) - Number of SNP alternate alleles.
  • mnps (int) - Number of MNP alternate alleles.
  • insertions (int) - Number of insertion alternate alleles.
  • deletions (int) - Number of deletions alternate alleles.
  • complex (int) - Number of complex alternate alleles.
  • star (int) - Number of star (upstream deletion) alternate alleles.
  • max_alleles (int) - The highest number of alleles at any variant.
Returns:Object containing summary information.
Return type:Summary
tdt(pedigree, root='va.tdt')[source]

Find transmitted and untransmitted variants; count per variant and nuclear family.

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

Compute TDT association results:

>>> pedigree = Pedigree.read('data/trios.fam')
>>> (vds.tdt(pedigree)
...     .export_variants("output/tdt_results.tsv", "Variant = v, va.tdt.*"))

Notes

The transmission disequilibrium test tracks the number of times the alternate allele is transmitted (t) or not transmitted (u) from a heterozgyous parent to an affected child under the null that the rate of such transmissions is 0.5. For variants where transmission is guaranteed (i.e., the Y chromosome, mitochondria, and paternal chromosome X variants outside of the PAR), the test cannot be used.

The TDT statistic is given by

\[(t-u)^2 \over (t+u)\]

and follows a 1 degree of freedom chi-squared distribution under the null hypothesis.

The number of transmissions and untransmissions for each possible set of genotypes is determined from the table below. The copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region (PAR).

  • HemiX – in non-PAR of X and child is male
  • Auto – otherwise (in autosome or PAR, or child is female)
Kid Dad Mom Copy State T U
HomRef Het Het Auto 0 2
HomRef HomRef Het Auto 0 1
HomRef Het HomRef Auto 0 1
Het Het Het Auto 1 1
Het HomRef Het Auto 1 0
Het Het HomRef Auto 1 0
Het HomVar Het Auto 0 1
Het Het HomVar Auto 0 1
HomVar Het Het Auto 2 0
HomVar Het HomVar Auto 1 0
HomVar HomVar Het Auto 1 0
HomRef HomRef Het HemiX 0 1
HomRef HomVar Het HemiX 0 1
HomVar HomRef Het HemiX 1 0
HomVar HomVar Het HemiX 1 0

tdt() only considers complete trios (two parents and a proband) with defined sex.

PAR is currently defined with respect to reference GRCh37:

  • X: 60001-2699520
  • X: 154931044-155260560
  • Y: 10001-2649520
  • Y: 59034050-59363566

tdt() assumes all contigs apart from X and Y are fully autosomal; decoys, etc. are not given special treatment.

Annotations

tdt() adds the following annotations:

  • tdt.nTransmitted (Int) – Number of transmitted alternate alleles.
  • va.tdt.nUntransmitted (Int) – Number of untransmitted alternate alleles.
  • va.tdt.chi2 (Double) – TDT statistic.
  • va.tdt.pval (Double) – p-value.
Parameters:
  • pedigree (Pedigree) – Sample pedigree.
  • root – Variant annotation root to store TDT result.
Returns:

Variant dataset with TDT association results added to variant annotations.

Return type:

VariantDataset

union(*datasets)[source]

Take the union of datasets vertically (include all variants).

Examples

Union two datasets:

>>> vds_union = vds_autosomal.union(vds_chromX)

Given a list of datasets, union them all:

>>> all_vds = [vds_autosomal, vds_chromX, vds_chromY]

The following three syntaxes are equivalent:

>>> vds_union1 = vds_autosomal.union(vds_chromX, vds_chromY)
>>> vds_union2 = all_vds[0].union(*all_vds[1:])
>>> vds_union3 = VariantDataset.union(*all_vds)

Notes

In order to combine two datasets, these requirements must be met:
  • the samples must match
  • the variant annotation schemas must match (field order within structs matters).
  • the cell (genotype) schemas must match (field order within structs matters).

The column annotations in the resulting dataset are simply the column annotations from the first dataset; the column annotation schemas do not need to match.

This method can trigger a shuffle, if partitions from two datasets overlap.

Parameters:vds_type (tuple of VariantDataset) – Datasets to combine.
Returns:Dataset with variants from all datasets.
Return type:VariantDataset
unpersist()[source]

Unpersists this VDS from memory/disk.

Notes This function will have no effect on a VDS that was not previously persisted.

There’s nothing stopping you from continuing to use a VDS that has been unpersisted, but doing so will result in all previous steps taken to compute the VDS being performed again since the VDS must be recomputed. Only unpersist a VDS when you are done with it.

variant_qc(root='va.qc')[source]

Compute common variant statistics (quality control metrics).

Important

The genotype_schema() must be of type TGenotype in order to use this method.

Examples

>>> vds_result = vds.variant_qc()

Annotations

variant_qc() computes 18 variant statistics from the genotype data and stores the results as variant annotations that can be accessed with va.qc.<identifier> (or <root>.<identifier> if a non-default root was passed):

Name Type Description
callRate Double Fraction of samples with called genotypes
AF Double Calculated alternate allele frequency (q)
AC Int Count of alternate alleles
rHeterozygosity Double Proportion of heterozygotes
rHetHomVar Double Ratio of heterozygotes to homozygous alternates
rExpectedHetFrequency Double Expected rHeterozygosity based on HWE
pHWE Double p-value from Hardy Weinberg Equilibrium null model
nHomRef Int Number of homozygous reference samples
nHet Int Number of heterozygous samples
nHomVar Int Number of homozygous alternate samples
nCalled Int Sum of nHomRef, nHet, and nHomVar
nNotCalled Int Number of uncalled samples
nNonRef Int Sum of nHet and nHomVar
rHetHomVar Double Het/HomVar ratio across all samples
dpMean Double Depth mean across all samples
dpStDev Double Depth standard deviation across all samples
gqMean Double The average genotype quality across all samples
gqStDev Double Genotype quality standard deviation across all samples

Missing values NA may result (for example, due to division by zero) and are handled properly in filtering and written as “NA” in export modules. The empirical standard deviation is computed with zero degrees of freedom.

Parameters:root (str) – Variant annotation root for computed struct.
Returns:Annotated variant dataset with new variant QC annotations.
Return type:VariantDataset
variant_schema

Returns the signature of the variant annotations contained in this VDS.

Examples

>>> print(vds.variant_schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(vds.variant_schema)
Return type:Type
variants_table()[source]

Convert variants and variant annotations to a KeyTable.

The resulting KeyTable has schema:

Struct {
  v: Variant
  va: variant annotations
}

with a single key v.

Returns:Key table with variants and variant annotations.
Return type:KeyTable
vep(config, block_size=1000, root='va.vep', csq=False)[source]

Annotate variants with VEP.

vep() runs Variant Effect Predictor with the LOFTEE plugin on the current variant dataset and adds the result as a variant annotation.

Examples

Add VEP annotations to the dataset:

>>> vds_result = vds.vep("data/vep.properties") 

Configuration

vep() needs a configuration file to tell it how to run VEP. The format is a .properties file. Roughly, each line defines a property as a key-value pair of the form key = value. vep supports the following properties:

  • hail.vep.perl – Location of Perl. Optional, default: perl.
  • hail.vep.perl5lib – Value for the PERL5LIB environment variable when invoking VEP. Optional, by default PERL5LIB is not set.
  • hail.vep.path – Value of the PATH environment variable when invoking VEP. Optional, by default PATH is not set.
  • hail.vep.location – Location of the VEP Perl script. Required.
  • hail.vep.cache_dir – Location of the VEP cache dir, passed to VEP with the –dir option. Required.
  • hail.vep.fasta – Location of the FASTA file to use to look up the reference sequence, passed to VEP with the –fasta option. Required.
  • hail.vep.assembly – Genome assembly version to use. Optional, default: GRCh37
  • hail.vep.plugin – VEP plugin, passed to VEP with the –plugin option. Optional. Overrides hail.vep.lof.human_ancestor and hail.vep.lof.conservation_file.
  • hail.vep.lof.human_ancestor – Location of the human ancestor file for the LOFTEE plugin. Ignored if hail.vep.plugin is set. Required otherwise.
  • hail.vep.lof.conservation_file – Location of the conservation file for the LOFTEE plugin. Ignored if hail.vep.plugin is set. Required otherwise.

Here is an example vep.properties configuration file

hail.vep.perl = /usr/bin/perl
hail.vep.path = /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
hail.vep.location = /path/to/vep/ensembl-tools-release-81/scripts/variant_effect_predictor/variant_effect_predictor.pl
hail.vep.cache_dir = /path/to/vep
hail.vep.lof.human_ancestor = /path/to/loftee_data/human_ancestor.fa.gz
hail.vep.lof.conservation_file = /path/to/loftee_data//phylocsf.sql

VEP Invocation

<hail.vep.perl>
<hail.vep.location>
--format vcf
--json
--everything
--allele_number
--no_stats
--cache --offline
--dir <hail.vep.cache_dir>
--fasta <hail.vep.fasta>
--minimal
--assembly <hail.vep.assembly>
--plugin LoF,human_ancestor_fa:$<hail.vep.lof.human_ancestor>,filter_position:0.05,min_intron_size:15,conservation_file:<hail.vep.lof.conservation_file>
-o STDOUT

Annotations

Annotations with the following schema are placed in the location specified by root. The full resulting dataset schema can be queried with variant_schema.

Struct{
  assembly_name: String,
  allele_string: String,
  colocated_variants: Array[Struct{
    aa_allele: String,
    aa_maf: Double,
    afr_allele: String,
    afr_maf: Double,
    allele_string: String,
    amr_allele: String,
    amr_maf: Double,
    clin_sig: Array[String],
    end: Int,
    eas_allele: String,
    eas_maf: Double,
    ea_allele: String,,
    ea_maf: Double,
    eur_allele: String,
    eur_maf: Double,
    exac_adj_allele: String,
    exac_adj_maf: Double,
    exac_allele: String,
    exac_afr_allele: String,
    exac_afr_maf: Double,
    exac_amr_allele: String,
    exac_amr_maf: Double,
    exac_eas_allele: String,
    exac_eas_maf: Double,
    exac_fin_allele: String,
    exac_fin_maf: Double,
    exac_maf: Double,
    exac_nfe_allele: String,
    exac_nfe_maf: Double,
    exac_oth_allele: String,
    exac_oth_maf: Double,
    exac_sas_allele: String,
    exac_sas_maf: Double,
    id: String,
    minor_allele: String,
    minor_allele_freq: Double,
    phenotype_or_disease: Int,
    pubmed: Array[Int],
    sas_allele: String,
    sas_maf: Double,
    somatic: Int,
    start: Int,
    strand: Int
  }],
  end: Int,
  id: String,
  input: String,
  intergenic_consequences: Array[Struct{
    allele_num: Int,
    consequence_terms: Array[String],
    impact: String,
    minimised: Int,
    variant_allele: String
  }],
  most_severe_consequence: String,
  motif_feature_consequences: Array[Struct{
    allele_num: Int,
    consequence_terms: Array[String],
    high_inf_pos: String,
    impact: String,
    minimised: Int,
    motif_feature_id: String,
    motif_name: String,
    motif_pos: Int,
    motif_score_change: Double,
    strand: Int,
    variant_allele: String
  }],
  regulatory_feature_consequences: Array[Struct{
    allele_num: Int,
    biotype: String,
    consequence_terms: Array[String],
    impact: String,
    minimised: Int,
    regulatory_feature_id: String,
    variant_allele: String
  }],
  seq_region_name: String,
  start: Int,
  strand: Int,
  transcript_consequences: Array[Struct{
    allele_num: Int,
    amino_acids: String,
    biotype: String,
    canonical: Int,
    ccds: String,
    cdna_start: Int,
    cdna_end: Int,
    cds_end: Int,
    cds_start: Int,
    codons: String,
    consequence_terms: Array[String],
    distance: Int,
    domains: Array[Struct{
      db: String
      name: String
    }],
    exon: String,
    gene_id: String,
    gene_pheno: Int,
    gene_symbol: String,
    gene_symbol_source: String,
    hgnc_id: String,
    hgvsc: String,
    hgvsp: String,
    hgvs_offset: Int,
    impact: String,
    intron: String,
    lof: String,
    lof_flags: String,
    lof_filter: String,
    lof_info: String,
    minimised: Int,
    polyphen_prediction: String,
    polyphen_score: Double,
    protein_end: Int,
    protein_start: Int,
    protein_id: String,
    sift_prediction: String,
    sift_score: Double,
    strand: Int,
    swissprot: String,
    transcript_id: String,
    trembl: String,
    uniparc: String,
    variant_allele: String
  }],
  variant_class: String
}
Parameters:
  • config (str) – Path to VEP configuration file.
  • block_size (int) – Number of variants to annotate per VEP invocation.
  • root (str) – Variant annotation path to store VEP output.
  • csq (bool) – If True, annotates VCF CSQ field as a String. If False, annotates with the full nested struct schema
Returns:

An annotated with variant annotations from VEP.

Return type:

VariantDataset

was_split()[source]

True if multiallelic variants have been split into multiple biallelic variants.

Result is True if split_multi() or filter_multi() has been called on this variant dataset, or if the variant dataset was imported with import_plink(), import_gen(), or import_bgen(), or if the variant dataset was simulated with balding_nichols_model().

Return type:bool
write(output, overwrite=False, parquet_genotypes=False)[source]

Write variant dataset as VDS file.

Examples

Import data from a VCF file and then write the data to a VDS file:

>>> vds.write("output/sample.vds")
Parameters:
  • output (str) – Path of VDS file to write.
  • overwrite (bool) – If true, overwrite any existing VDS file. Cannot be used to read from and write to the same path.
  • parquet_genotypes (bool) – If true, store genotypes as Parquet rather than Hail’s serialization. The resulting VDS will be larger and slower in Hail but the genotypes will be accessible from other tools that support Parquet.