VariantDataset
class hail.VariantDataset(hc, jvds)
Hail’s primary representation of genomic data, a matrix keyed by sample and variant.
Variant datasets may be generated from other formats using the HailContext import methods, constructed from a variant-keyed KeyTable using VariantDataset.from_table(), and simulated using balding_nichols_model().
Once a variant dataset has been written to disk with write(), use read() to load the variant dataset into the environment.
>>> vds = hc.read("data/example.vds")
Variables: hc (HailContext) – Hail Context.
Attributes
colkey_schema – Returns the signature of the column key (sample) contained in this VDS.
genotype_schema – Returns the signature of the genotypes contained in this VDS.
global_schema – Returns the signature of the global annotations contained in this VDS.
globals – Return global annotations as a Python object.
num_samples – Number of samples.
rowkey_schema – Returns the signature of the row key (variant) contained in this VDS.
sample_annotations – Return a dict of sample annotations.
sample_ids – Return sample IDs.
sample_schema – Returns the signature of the sample annotations contained in this VDS.
variant_schema – Returns the signature of the variant annotations contained in this VDS.
Methods
__init__ – x.__init__(…) initializes x; see help(type(x)) for signature.
aggregate_by_key – Aggregate by user-defined key and aggregation expressions to produce a KeyTable.
annotate_alleles_expr – Annotate alleles with expression.
annotate_genotypes_expr – Annotate genotypes with expression.
annotate_global – Add global annotations from Python objects.
annotate_global_expr – Annotate global with expression.
annotate_samples_expr – Annotate samples with expression.
annotate_samples_table – Annotate samples with a key table.
annotate_variants_db – Annotate variants using the Hail annotation database.
annotate_variants_expr – Annotate variants with expression.
annotate_variants_table – Annotate variants with a key table.
annotate_variants_vds – Annotate variants with variant annotations from a .vds file.
cache – Mark this variant dataset to be cached in memory.
concordance – Calculate call concordance with another variant dataset.
count – Returns number of samples and variants in the dataset.
count_variants – Count number of variants in variant dataset.
deduplicate – Remove duplicate variants.
delete_va_attribute – Removes an attribute from a variant annotation field.
drop_samples – Removes all samples from variant dataset.
drop_variants – Discard all variants, variant annotations and genotypes.
export_gen – Export variant dataset as GEN and SAMPLE file.
export_genotypes – Export genotype-level information to delimited text file.
export_plink – Export variant dataset as PLINK2 BED, BIM and FAM.
export_samples – Export sample information to delimited text file.
export_variants – Export variant information to delimited text file.
export_vcf – Export variant dataset as a .vcf or .vcf.bgz file.
file_version – File version of variant dataset.
filter_alleles – Filter a user-defined set of alternate alleles for each variant.
filter_genotypes – Filter genotypes based on expression.
filter_intervals – Filter variants with an interval or list of intervals.
filter_multi – Filter out multi-allelic sites.
filter_samples_expr – Filter samples with the expression language.
filter_samples_list – Filter samples with a list of samples.
filter_samples_table – Filter samples with a table keyed by sample ID.
filter_variants_expr – Filter variants with the expression language.
filter_variants_list – Filter variants with a list of variants.
filter_variants_table – Filter variants with a Variant-keyed key table.
from_table – Construct a sites-only variant dataset from a key table.
genotypes_table – Generate a fully expanded genotype table.
grm – Compute the Genetic Relatedness Matrix (GRM).
hardcalls – Drop all genotype fields except the GT field.
ibd – Compute matrix of identity-by-descent estimations.
ibd_prune – Prune samples from the VariantDataset based on ibd() PI_HAT measures of relatedness.
impute_sex – Impute sex of samples by calculating inbreeding coefficient on the X chromosome.
join – Join two variant datasets.
ld_matrix – Computes the linkage disequilibrium (correlation) matrix for the variants in this VDS.
ld_prune – Prune variants in linkage disequilibrium (LD).
linreg – Test each variant for association using linear regression.
linreg3 – Test each variant for association with multiple phenotypes using linear regression.
linreg_burden – Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the linear regression model.
linreg_multi_pheno – Test each variant for association with multiple phenotypes using linear regression.
lmmreg – Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.
logreg – Test each variant for association using logistic regression.
logreg_burden – Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the logistic regression model.
make_table – Produce a key table with one row per variant and one or more columns per sample.
mendel_errors – Find Mendel errors; count per variant, individual and nuclear family.
min_rep – Gives minimal, left-aligned representation of alleles.
naive_coalesce – Naively decrease the number of partitions.
num_partitions – Number of partitions.
pc_relate – Compute relatedness estimates between individuals using a variant of the PC-Relate method.
pca – Run Principal Component Analysis (PCA) on the matrix of genotypes.
persist – Persist this variant dataset to memory and/or disk.
query_genotypes – Performs aggregation queries over genotypes, and returns Python object(s).
query_genotypes_typed – Performs aggregation queries over genotypes, and returns Python object(s) and type(s).
query_samples – Performs aggregation queries over samples and sample annotations, and returns Python object(s).
query_samples_typed – Performs aggregation queries over samples and sample annotations, and returns Python object(s) and type(s).
query_variants – Performs aggregation queries over variants and variant annotations, and returns Python object(s).
query_variants_typed – Performs aggregation queries over variants and variant annotations, and returns Python object(s) and type(s).
rename_samples – Rename samples.
repartition – Increase or decrease the number of variant dataset partitions.
rrm – Computes the Realized Relationship Matrix (RRM).
same – True if the two variant datasets have the same variants, samples, genotypes, and annotation schemata and values.
sample_qc – Compute per-sample QC metrics.
sample_variants – Downsample variants to a given fraction of the dataset.
samples_table – Convert samples and sample annotations to a KeyTable.
set_va_attributes – Sets attributes for a variant annotation.
split_multi – Split multiallelic variants.
storage_level – Returns the storage (persistence) level of the variant dataset.
summarize – Returns a summary of useful information about the dataset.
tdt – Find transmitted and untransmitted variants; count per variant and nuclear family.
union – Take the union of datasets vertically (include all variants).
unpersist – Unpersists this VDS from memory/disk.
variant_qc – Compute common variant statistics (quality control metrics).
variants_table – Convert variants and variant annotations to a KeyTable.
vep – Annotate variants with VEP.
was_split – True if multiallelic variants have been split into multiple biallelic variants.
write – Write variant dataset as VDS file.
aggregate_by_key(key_exprs, agg_exprs)
Aggregate by user-defined key and aggregation expressions to produce a KeyTable. Equivalent to a group-by operation in SQL.
Examples
Compute the number of LOF heterozygote calls per gene per sample:
>>> kt_result = (vds ... .aggregate_by_key(['Sample = s', 'Gene = va.gene'], ... 'nHet = g.filter(g => g.isHet() && va.consequence == "LOF").count()') ... .export("test.tsv"))
This will produce a KeyTable with 3 columns (Sample, Gene, nHet).
Parameters: - key_exprs (str or list of str) – Named expression(s) for which fields are keys.
- agg_exprs (str or list of str) – Named aggregation expression(s).
Return type: KeyTable
annotate_alleles_expr(expr, propagate_gq=False)
Annotate alleles with expression.
Important
The genotype_schema() must be of type TGenotype in order to use this method.
Examples
To create a variant annotation va.nNonRefSamples: Array[Int] where the ith entry of the array is the number of samples carrying the ith alternate allele:
>>> vds_result = vds.annotate_alleles_expr('va.nNonRefSamples = gs.filter(g => g.isCalledNonRef()).count()')
Notes
This method is similar to annotate_variants_expr(). annotate_alleles_expr() dynamically splits multi-allelic sites, evaluates each expression on each split allele separately, and for each expression annotates with an array with one element per alternate allele. In the splitting, genotypes are downcoded and each alternate allele is represented using its minimal representation (see split_multi() for more details).
Parameters: - expr (str or list of str) – Annotation expression.
- propagate_gq (bool) – Propagate GQ instead of computing from (split) PL.
Returns: Annotated variant dataset.
Return type: VariantDataset
annotate_genotypes_expr(expr)
Annotate genotypes with expression.
Examples
Convert the genotype schema to a TStruct with two fields GT and CASE_HET:
>>> vds_result = vds.annotate_genotypes_expr('g = {GT: g.gt, CASE_HET: sa.pheno.isCase && g.isHet()}')
Assume a VCF is imported with generic=True and the resulting genotype schema is a Struct and the field GTA is a Call type. Use the .toGenotype() method in the expression language to convert a Call to a Genotype. vds_gta will have a genotype schema equal to TGenotype:
>>> vds_gta = (hc.import_vcf('data/example3.vcf.bgz', generic=True, call_fields=['GTA']) ... .annotate_genotypes_expr('g = g.GTA.toGenotype()'))
Notes
annotate_genotypes_expr() evaluates the expression given by expr and assigns the result of the right hand side to the annotation path specified by the left-hand side (must begin with g). This is analogous to annotate_variants_expr() and annotate_samples_expr() where the annotation paths are va and sa respectively.
expr is in genotype context so the following symbols are in scope:
g : genotype annotation
v (Variant): Variant
va : variant annotations
s (Sample): sample
sa : sample annotations
global : global annotations
For more information, see the documentation on writing expressions and using the Hail Expression Language.
Warning
- If the resulting genotype schema is not TGenotype, subsequent function calls on the annotated variant dataset may not work, such as pca() and linreg().
- Hail performance may be significantly slower if the annotated variant dataset does not have a genotype schema equal to TGenotype.
- Genotypes are immutable. For example, if g is initially of type Genotype, the expression g.gt = g.gt + 1 will return a Struct with one field gt of type Int and NOT a Genotype with the gt incremented by 1 (see the sketch after this list).
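A short sketch of that caveat, using the expression from the warning above; after this call the genotype schema is a Struct with a single Int field gt, not a Genotype:
>>> vds_struct = vds.annotate_genotypes_expr('g.gt = g.gt + 1')
>>> print(vds_struct.genotype_schema)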
Parameters: expr (str or list of str) – Annotation expression. Returns: Annotated variant dataset. Return type: VariantDataset
annotate_global(path, annotation, annotation_type)
Add global annotations from Python objects.
Examples
Add populations as a global annotation:
>>> vds_result = vds.annotate_global('global.populations', ... ['EAS', 'AFR', 'EUR', 'SAS', 'AMR'], ... TArray(TString()))
Notes
This method registers new global annotations in a VDS. These annotations can then be accessed through expressions in downstream operations. The Hail data type must be provided and must match the given annotation parameter.
Parameters: - path (str) – annotation path starting in ‘global’
- annotation – annotation to add to global
- annotation_type (Type) – Hail type of annotation
Returns: Annotated variant dataset.
Return type: VariantDataset
annotate_global_expr(expr)
Annotate global with expression.
Example
Annotate global with an array of populations:
>>> vds = vds.annotate_global_expr('global.pops = ["FIN", "AFR", "EAS", "NFE"]')
Create, then overwrite, then drop a global annotation:
>>> vds = vds.annotate_global_expr('global.pops = ["FIN", "AFR", "EAS"]') >>> vds = vds.annotate_global_expr('global.pops = ["FIN", "AFR", "EAS", "NFE"]') >>> vds = vds.annotate_global_expr('global.pops = drop(global, pops)')
The expression namespace contains only one variable:
global : global annotations
Parameters: expr (str or list of str) – Annotation expression Returns: Annotated variant dataset. Return type: VariantDataset
annotate_samples_expr(expr)
Annotate samples with expression.
Examples
Compute per-sample GQ statistics for hets:
>>> vds_result = (vds.annotate_samples_expr('sa.gqHetStats = gs.filter(g => g.isHet()).map(g => g.gq).stats()') ... .export_samples('output/samples.txt', 'sample = s, het_gq_mean = sa.gqHetStats.mean'))
Compute the list of genes with a singleton LOF per sample:
>>> variant_annotations_table = hc.import_table('data/consequence.tsv', impute=True).key_by('Variant') >>> vds_result = (vds.annotate_variants_table(variant_annotations_table, root='va.consequence') ... .annotate_variants_expr('va.isSingleton = gs.map(g => g.nNonRefAlleles()).sum() == 1') ... .annotate_samples_expr('sa.LOF_genes = gs.filter(g => va.isSingleton && g.isHet() && va.consequence == "LOF").map(g => va.gene).collect()'))
To create an annotation for only a subset of samples based on an existing annotation:
>>> vds_result = vds.annotate_samples_expr('sa.newpheno = if (sa.pheno.cohortName == "cohort1") sa.pheno.bloodPressure else NA: Double')
Note
For optimal performance, be sure to explicitly give the alternative (NA) the same type as the consequent (sa.pheno.bloodPressure).
Notes
expr is in sample context so the following symbols are in scope:
s (Sample): sample
sa : sample annotations
global : global annotations
gs (Aggregable[Genotype]): aggregable of Genotype for samples
Parameters: expr (str or list of str) – Annotation expression. Returns: Annotated variant dataset. Return type: VariantDataset
annotate_samples_table(table, root=None, expr=None, vds_key=None, product=False)
Annotate samples with a key table.
Examples
To annotate samples using samples1.tsv with type imputation:
>>> table = hc.import_table('data/samples1.tsv', impute=True).key_by('Sample') >>> vds_result = vds.annotate_samples_table(table, root='sa.pheno')
Given this file:
$ cat data/samples1.tsv
Sample  Height  Status  Age
PT-1234 154.1   ADHD    24
PT-1236 160.9   Control 19
PT-1238 NA      ADHD    89
PT-1239 170.3   Control 55
the three new sample annotations are sa.pheno.Height: Double, sa.pheno.Status: String, and sa.pheno.Age: Int.
To annotate without type imputation, resulting in all String types:
>>> annotations = hc.import_table('data/samples1.tsv').key_by('Sample') >>> vds_result = vds.annotate_samples_table(annotations, root='sa.phenotypes')
Detailed examples
Let’s import annotations from a CSV file with missing data and special characters:
$ cat data/samples2.tsv
Batch,PT-ID
1kg,PT-0001
1kg,PT-0002
study1,PT-0003
study3,PT-0003
.,PT-0004
1kg,PT-0005
.,PT-0006
1kg,PT-0007
In this case, we should:
- Pass the non-default delimiter ,
- Pass the non-default missing value .
>>> annotations = hc.import_table('data/samples2.tsv', delimiter=',', missing='.').key_by('PT-ID') >>> vds_result = vds.annotate_samples_table(annotations, root='sa.batch')
Let’s import annotations from a file with no header and sample IDs that need to be transformed. Suppose the vds sample IDs are of the form NA#####. This file has no header line, and the sample ID is hidden in a field with other information.
$ cat data/samples3.tsv
1kg_NA12345 female
1kg_NA12346 male
1kg_NA12348 female
pgc_NA23415 male
pgc_NA23418 male
To import it:
>>> annotations = (hc.import_table('data/samples3.tsv', no_header=True) ... .annotate('sample = f0.split("_")[1]') ... .key_by('sample')) >>> vds_result = vds.annotate_samples_table(annotations, ... expr='sa.sex = table.f1, sa.batch = table.f0.split("_")[0]')
Notes
This method takes as an argument a KeyTable object. Hail has a default join strategy for tables keyed by String, which is to join by sample ID. If the table is keyed by something else, like population or cohort, then the vds_key argument must be passed to describe the key in the dataset to use for the join. This argument expects a list of Hail expressions whose types match, in order, the table’s key types (see the sketch after this list).
Each expression in the list vds_key has the following symbols in scope:
s (String): sample ID
sa : sample annotations
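A hypothetical sketch of a vds_key join, assuming the dataset carries a sa.cohort annotation and a table data/cohort_info.tsv keyed by a Cohort column (neither is part of the example data on this page):
>>> cohort_table = hc.import_table('data/cohort_info.tsv', impute=True).key_by('Cohort')
>>> vds_result = vds.annotate_samples_table(cohort_table, root='sa.cohortInfo', vds_key=['sa.cohort'])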
The root and expr arguments
Note
One of root or expr is required, but not both.
The expr parameter expects an annotation expression involving sa (the existing sample annotations in the dataset) and table (a struct containing the columns in the table), like sa.col1 = table.col1, sa.col2 = table.col2 or sa = merge(sa, table). The root parameter expects an annotation path beginning in sa, like sa.annotations. Passing root='sa.annotations' is exactly the same as passing expr='sa.annotations = table'.
expr has the following symbols in scope:
sa : sample annotations
table : See note.
Note
The value of table inside root/expr depends on the number of values in the key table, as well as the product argument. There are three behaviors based on the number of values and one branch for product being true and false, for a total of six modes:
- More than 1 value column, product=False: table is a Struct with an element for each column.
- 1 value column, product=False: table is T, the value column.
- 0 value columns, product=False: table is a Boolean indicating existence of any matching key.
- More than 1 value column, product=True: table is an Array[Struct] with a struct for each matching key.
- 1 value column, product=True: table is an Array[T] with a value for each matching key.
- 0 value columns, product=True: table is an Int, the number of matching keys.
Common uses for the expr argument
Put annotations on the top level under sa:
expr='sa = merge(sa, table)'
Annotate only specific annotations from the table:
expr='sa.annotations = select(table, toKeep1, toKeep2, toKeep3)'
The above is equivalent to:
expr='''sa.annotations.toKeep1 = table.toKeep1, sa.annotations.toKeep2 = table.toKeep2, sa.annotations.toKeep3 = table.toKeep3'''
Finally, for more information about importing key tables from text, see the documentation for HailContext.import_table().
Parameters: - table (KeyTable) – Key table.
- root (str or None) – Sample annotation path to store text table. (This or expr required).
- expr (str or None) – Annotation expression. (This or root required).
- vds_key (str, list of str, or None) – Join key for the dataset, if not sample ID.
- product (bool) – Join with all matching keys (see note).
Returns: Annotated variant dataset.
Return type: VariantDataset
annotate_variants_db(annotations, gene_key=None)
Annotate variants using the Hail annotation database.
Warning
Experimental. Supported only while running Hail on the Google Cloud Platform.
Documentation describing the annotations that are accessible through this method can be found here.
Examples
Annotate variants with CADD raw and PHRED scores:
>>> vds = vds.annotate_variants_db(['va.cadd.RawScore', 'va.cadd.PHRED'])
Annotate variants with gene-level PLI score, using the VEP-generated gene symbol to map variants to genes:
>>> pli_vds = vds.annotate_variants_db('va.gene.constraint.pli')
Again annotate variants with gene-level PLI score, this time using the existing va.gene_symbol annotation to map variants to genes:
>>> vds = vds.annotate_variants_db('va.gene.constraint.pli', gene_key='va.gene_symbol')
Notes
Annotations in the database are bi-allelic, so splitting multi-allelic variants in the VDS before using this method is recommended to capture all appropriate annotations from the database. To do this, run split_multi() prior to annotating variants with this method:
>>> vds = vds.split_multi().annotate_variants_db(['va.cadd.RawScore', 'va.cadd.PHRED'])
To add VEP annotations, or to add gene-level annotations without a predefined gene symbol for each variant, the annotate_variants_db() method runs Hail’s vep() method on the VDS. This means that your cluster must be properly initialized to run VEP.
Warning
If you want to add VEP annotations to your VDS, make sure to add the initialization action gs://hail-common/vep/vep/vep85-init.sh when starting your cluster.
Parameters: - annotations (str or list of str) – List of annotations to import from the database.
- gene_key (str) – Existing variant annotation used to map variants to gene symbols if importing gene-level annotations. If not provided, the method will add VEP annotations and parse them as described in the database documentation to obtain one gene symbol per variant.
Returns: Annotated variant dataset.
Return type: VariantDataset
annotate_variants_expr(expr)
Annotate variants with expression.
Examples
Compute GQ statistics about heterozygotes per variant:
>>> vds_result = vds.annotate_variants_expr('va.gqHetStats = gs.filter(g => g.isHet()).map(g => g.gq).stats()')
Collect a list of sample IDs with non-ref calls in LOF variants:
>>> vds_result = vds.annotate_variants_expr('va.nonRefSamples = gs.filter(g => g.isCalledNonRef()).map(g => s).collect()')
Substitute a custom string for the rsID field:
>>> vds_result = vds.annotate_variants_expr('va.rsid = str(v)')
Notes
expr is in variant context so the following symbols are in scope:
v (Variant): Variant
va : variant annotations
global : global annotations
gs (Aggregable[Genotype]): aggregable of Genotype for variant v
For more information, see the documentation on writing expressions and using the Hail Expression Language.
Parameters: expr (str or list of str) – Annotation expression or list of annotation expressions. Returns: Annotated variant dataset. Return type: VariantDataset
annotate_variants_table(table, root=None, expr=None, vds_key=None, product=False)
Annotate variants with a key table.
Examples
Add annotations from a variant-keyed tab separated file:
>>> table = hc.import_table('data/variant-lof.tsv', impute=True).key_by('v') >>> vds_result = vds.annotate_variants_table(table, root='va.lof')
Add annotations from a locus-keyed TSV:
>>> kt = hc.import_table('data/locus-table.tsv', impute=True).key_by('Locus') >>> vds_result = vds.annotate_variants_table(table, root='va.scores')
Add annotations from a gene-and-type-keyed TSV:
>>> table = hc.import_table('data/locus-metadata.tsv', impute=True).key_by(['gene', 'type']) >>> vds_result = (vds.annotate_variants_table(table, ... root='va.foo', ... vds_key=['va.gene', 'if (va.score > 10) "Type1" else "Type2"']))
Annotate variants with the target in a GATK interval list file:
>>> intervals = KeyTable.import_interval_list('data/exons2.interval_list') >>> vds_result = vds.annotate_variants_table(intervals, root='va.exon')
Annotate variants with all targets from matching intervals in a GATK interval list file:
>>> intervals = KeyTable.import_interval_list('data/exons2.interval_list') >>> vds_result = vds.annotate_variants_table(intervals, root='va.exons', product=True)
Annotate variants using a UCSC BED file, marking each variant true/false for an overlap with any interval:
>>> intervals = KeyTable.import_bed('data/file2.bed') >>> vds_result = vds.annotate_variants_table(intervals, root='va.bed')
Notes
This method takes as an argument a KeyTable object. Hail has default join strategies for tables keyed by Variant, Locus, or Interval.
Join strategies:
If the key is a Variant, then a variant in the dataset will match a variant in the table that is equivalent. Be careful, however: 1:1:A:T does not match 1:1:A:T,C, and vice versa.
If the key is a Locus, then a variant in the dataset will match any locus in the table which is equivalent to v.locus (same chromosome and position).
If the key is an Interval, then a variant in the dataset will match any interval in the table that contains the variant’s locus (chromosome and position).
If the key is not one of the above three types (a String representing gene ID, for instance), or if another join strategy should be used for a key of one of these three types (joining on a locus object stored in the variant annotations, for instance), then the vds_key argument should be passed. This argument expects a list of expressions whose types match, in order, the table’s key types. Note that using vds_key is slower than annotation with a standard key type.
Each expression in the list vds_key has the following symbols in scope:
v (Variant): Variant
va : variant annotations
The root and expr arguments
Note
One of root or expr is required, but not both.
The expr parameter expects an annotation assignment involving va (the existing variant annotations in the dataset) and table (the value(s) in the table), like va.col1 = table.col1, va.col2 = table.col2 or va = merge(va, table). The root parameter expects an annotation path beginning in va, like va.annotations. Passing root='va.annotations' is the same as passing expr='va.annotations = table'.
expr has the following symbols in scope:
va : variant annotations
table : See note.
Note
The value of table inside root/expr depends on the number of values in the key table, as well as the product argument. There are three behaviors based on the number of values and one branch for product being true and false, for a total of six modes:
- More than 1 value column, product=False: table is a Struct with an element for each column.
- 1 value column, product=False: table is T, the value column.
- 0 value columns, product=False: table is a Boolean indicating existence of any matching key.
- More than 1 value column, product=True: table is an Array[Struct] with a struct for each matching key.
- 1 value column, product=True: table is an Array[T] with a value for each matching key.
- 0 value columns, product=True: table is an Int, the number of matching keys.
Common uses for the expr argument
Put annotations on the top level under va:
expr='va = merge(va, table)'
Annotate only specific annotations from the table:
expr='va.annotations = select(table, toKeep1, toKeep2, toKeep3)'
The above is roughly equivalent to:
expr='''va.annotations.toKeep1 = table.toKeep1, va.annotations.toKeep2 = table.toKeep2, va.annotations.toKeep3 = table.toKeep3'''
Finally, for more information about importing key tables from text, see the documentation for HailContext.import_table().
Parameters: - table (KeyTable) – Key table.
- root (str or None) – Variant annotation path to store text table. (This or expr required).
- expr (str or None) – Annotation expression. (This or root required).
- vds_key (str, list of str, or None) – Join key for the dataset. Much slower than default joins.
- product (bool) – Join with all matching keys (see note).
Returns: Annotated variant dataset.
Return type: VariantDataset
annotate_variants_vds(other, expr=None, root=None)
Annotate variants with variant annotations from a .vds file.
Examples
Import a second variant dataset with annotations to merge into vds:
>>> vds1 = vds.annotate_variants_expr('va = drop(va, anno1)')
>>> vds2 = (hc.read("data/example2.vds") ... .annotate_variants_expr('va = select(va, anno1, toKeep1, toKeep2, toKeep3)'))
Copy the anno1 annotation from other to va.annot:
>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va.annot = vds.anno1')
Merge the variant annotations from the two VDSes together and place them at va:
>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va = merge(va, vds)')
Select a subset of the annotations from other:
>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va.annotations = select(vds, toKeep1, toKeep2, toKeep3)')
The previous expression is equivalent to:
>>> vds_result = vds1.annotate_variants_vds(vds2, expr='va.annotations.toKeep1 = vds.toKeep1, ' + ... 'va.annotations.toKeep2 = vds.toKeep2, ' + ... 'va.annotations.toKeep3 = vds.toKeep3')
Notes
Using this method requires one of the two optional arguments: expr and root. They specify how to insert the annotations from other into this VDS’s variant annotations.
The root argument copies all the variant annotations from other to the specified annotation path.
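A minimal sketch of the root form, assuming va.other is an otherwise unused annotation path; by analogy with annotate_variants_table, this should behave like expr='va.other = vds':
>>> vds_result = vds1.annotate_variants_vds(vds2, root='va.other')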
The expr argument expects an annotation assignment whose scope includes va, the variant annotations in the current VDS, and vds, the variant annotations in other.
VDSes with multi-allelic variants may produce surprising results because all alternate alleles are considered part of the variant key. For example:
- The variant 22:140012:A:T,TTT will not be annotated by 22:140012:A:T or 22:140012:A:TTT
- The variant 22:140012:A:T will not be annotated by 22:140012:A:T,TTT
It is possible that an unsplit variant dataset contains no multiallelic variants, so ignore any warnings Hail prints if you know that to be the case. Otherwise, run split_multi() before annotate_variants_vds().
Parameters: - other (VariantDataset) – Variant dataset to annotate with.
- root (str) – Variant annotation path to add variant annotations.
- expr (str) – Annotation expression.
Returns: Annotated variant dataset.
Return type: VariantDataset
cache()
Mark this variant dataset to be cached in memory.
cache() is the same as persist("MEMORY_ONLY").
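A minimal usage sketch (cache() takes no arguments and returns the dataset, so it can be chained):
>>> vds_cached = vds.cache()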
Return type: VariantDataset
colkey_schema
Returns the signature of the column key (sample) contained in this VDS.
Examples
>>> print(vds.colkey_schema)
The pprint module can be used to print the schema in a more human-readable format:
>>> from pprint import pprint
>>> pprint(vds.colkey_schema)
Return type: Type
concordance(right)
Calculate call concordance with another variant dataset.
Important
The genotype_schema() must be of type TGenotype in order to use this method.
Example
>>> comparison_vds = hc.read('data/example2.vds') >>> summary, samples, variants = vds.concordance(comparison_vds)
Notes
This method computes the genotype call concordance between two biallelic variant datasets. It performs an inner join on samples (only samples in both datasets will be considered), and an outer join on variants. If a variant is only in one dataset, then each genotype is treated as “no data” in the other. This method returns a tuple of three objects: a nested list of list of int with global concordance summary statistics, a key table with sample concordance statistics, and a key table with variant concordance statistics.
Using the global summary result
The global summary is a list of list of int (conceptually a 5 by 5 matrix), where the indices have special meaning:
- 0: No Data (missing variant)
- 1: No Call (missing genotype call)
- 2: Hom Ref
- 3: Heterozygous
- 4: Hom Var
The first index is the state in the left dataset (the one on which concordance was called), and the second index is the state in the right dataset (the argument to the concordance method call). Typical uses of the summary list are shown below.
>>> summary, samples, variants = vds.concordance(hc.read('data/example2.vds')) >>> left_homref_right_homvar = summary[2][4] >>> left_het_right_missing = summary[3][1] >>> left_het_right_something_else = sum(summary[3][:]) - summary[3][3] >>> total_concordant = summary[2][2] + summary[3][3] + summary[4][4] >>> total_discordant = sum([sum(s[2:]) for s in summary[2:]]) - total_concordant
Using the key table results
Columns of the sample key table:
- s (String) – Sample ID, key column.
- nDiscordant (Long) – Count of discordant calls (see below for full definition).
- concordance (Array[Array[Long]]) – Array of concordance per state on left and right, matching the structure of the global summary defined above.
Columns of the variant key table:
- v (Variant) – Key column.
- nDiscordant (Long) – Count of discordant calls (see below for full definition).
- concordance (Array[Array[Long]]) – Array of concordance per state on left and right, matches the structure of the global summary defined above.
The two key tables produced by the concordance method can be queried with KeyTable.query(), exported to text with KeyTable.export(), and used to annotate a variant dataset with VariantDataset.annotate_variants_table(), among other things.
In these tables, the column nDiscordant is provided as a convenience, because this is often one of the most useful concordance statistics. This value is the number of genotypes which were called (homozygous reference, heterozygous, or homozygous variant) in both datasets, but where the call did not match between the two.
The column concordance matches the structure of the global summary, which is detailed above. Once again, the first index into this array is the state on the left, and the second index is the state on the right. For example, concordance[1][4] is the number of “no call” genotypes on the left that were called homozygous variant on the right.
Parameters: right (VariantDataset) – right hand variant dataset for concordance
Returns: The global concordance statistics, a key table with sample concordance statistics, and a key table with variant concordance statistics.
Return type: (list of list of int, KeyTable, KeyTable)
count()
Returns number of samples and variants in the dataset.
Examples
>>> samples, variants = vds.count()
Notes
This is also the fastest way to force evaluation of a Hail pipeline.
Returns: The sample and variant counts. Return type: (int, int)
deduplicate()
Remove duplicate variants.
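A minimal usage sketch (no arguments):
>>> vds_result = vds.deduplicate()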
Returns: Deduplicated variant dataset. Return type: VariantDataset
delete_va_attribute(ann_path, attribute)
Removes an attribute from a variant annotation field. Attributes are key/value pairs that can be attached to a variant annotation field.
The following attributes are read from the VCF header when importing a VCF and written to the VCF header when exporting a VCF:
- INFO fields attributes (attached to (va.info.*)):
- ‘Number’: The arity of the field. Can take values
- 0 (Boolean flag),
- 1 (single value),
- R (one value per allele, including the reference),
- A (one value per non-reference allele),
- G (one value per genotype), and
- . (any number of values)
- When importing: The value is read from the VCF INFO field definition
- When exporting: The default value is 0 for Boolean, . for Arrays and 1 for all other types
- ‘Description’ (default is ‘’)
- FILTER entries in the VCF header are generated based on the attributes of va.filters. Each key/value pair in the attributes will generate a FILTER entry in the VCF with ID = key and Description = value.
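A hypothetical usage sketch, assuming the dataset was imported from a VCF whose INFO field DP carries a Description attribute (the field and attribute names are illustrative):
>>> vds_result = vds.delete_va_attribute('va.info.DP', 'Description')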
Parameters: - ann_path (str) – Variant annotation path starting with ‘va’, period-delimited.
- attribute (str) – The attribute to remove (key).
Returns: Annotated dataset with the updated variant annotation without the attribute.
Return type: VariantDataset
drop_samples()
Removes all samples from variant dataset.
The variants, variant annotations, and global annotations will remain, producing a sites-only variant dataset.
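A minimal usage sketch (no arguments):
>>> sites_only_vds = vds.drop_samples()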
Returns: Sites-only variant dataset. Return type: VariantDataset
drop_variants()
Discard all variants, variant annotations and genotypes.
Samples, sample annotations and global annotations are retained. This is the same as filter_variants_expr('false'), but much faster.
Examples
>>> vds_result = vds.drop_variants()
Returns: Samples-only variant dataset. Return type: VariantDataset
export_gen(output, precision=4)
Export variant dataset as GEN and SAMPLE file.
Important
The genotype_schema() must be of type TGenotype in order to use this method.
Examples
Import genotype probability data, filter variants based on INFO score, and export data to a GEN and SAMPLE file:
>>> vds3 = hc.import_bgen("data/example3.bgen", sample_file="data/example3.sample")
>>> (vds3.filter_variants_expr("gs.infoScore().score >= 0.9") ... .export_gen("output/infoscore_filtered"))
Notes
Writes out the internal VDS to a GEN and SAMPLE fileset in the Oxford spec.
The first 6 columns of the resulting GEN file are the following:
- Chromosome (v.contig)
- Variant ID (va.varid if defined, else Chromosome:Position:Ref:Alt)
- rsID (va.rsid if defined, else “.”)
- position (v.start)
- reference allele (v.ref)
- alternate allele (v.alt)
Genotype probabilities:
- 3 probabilities per sample (pHomRef, pHet, pHomVar).
- Any filtered genotypes will be output as (0.0, 0.0, 0.0).
- If the input data contained Phred-scaled likelihoods, the probabilities in the GEN file will be the normalized genotype probabilities assuming a uniform prior.
- If the input data did not have genotype probabilities, such as data imported using import_plink(), all genotype probabilities will be (0.0, 0.0, 0.0).
The sample file has 3 columns:
- ID_1 and ID_2 are identical and set to the sample ID (s).
Parameters: - output (str) – Output file base. Will write GEN and SAMPLE files.
- precision (int) – Number of digits after the decimal point each probability is truncated to.
- Chromosome (
export_genotypes(output, expr, types=False, export_ref=False, export_missing=False, parallel=False)
Export genotype-level information to delimited text file.
Examples
Export genotype information with identifiers that form the header:
>>> vds.export_genotypes('output/genotypes.tsv', 'SAMPLE=s, VARIANT=v, GQ=g.gq, DP=g.dp, ANNO1=va.anno1, ANNO2=va.anno2')
Export the same information without identifiers, resulting in a file with no header:
>>> vds.export_genotypes('output/genotypes.tsv', 's, v, g.gq, g.dp, va.anno1, va.anno2')
Notes
export_genotypes() outputs one line per cell (genotype) in the data set, though HomRef and missing genotypes are not output by default if the genotype schema is equal to TGenotype. Use the export_ref and export_missing parameters to force export of HomRef and missing genotypes, respectively.
The expr argument is a comma-separated list of fields or expressions, all of which must be of the form IDENTIFIER = <expression>, or else of the form <expression>. If some fields have identifiers and some do not, Hail will throw an exception. The accessible namespace includes g, s, sa, v, va, and global.
Warning
If the genotype schema does not have the type TGenotype, all genotypes will be exported unless the value of g is missing. Use filter_genotypes() to filter out genotypes based on an expression before exporting.
Parameters: - output (str) – Output path.
- expr (str) – Export expression for values to export.
- types (bool) – Write types of exported columns to a file at (output + “.types”)
- export_ref (bool) – If true, export reference genotypes. Only applicable if the genotype schema is TGenotype.
- export_missing (bool) – If true, export missing genotypes.
- parallel (bool) – If true, writes a set of files (one per partition) rather than serially concatenating these files.
export_plink(output, fam_expr='id = s', parallel=False)
Export variant dataset as PLINK2 BED, BIM and FAM.
Important
The genotype_schema() must be of type TGenotype in order to use this method.
Examples
Import data from a VCF file, split multi-allelic variants, and export to a PLINK binary file:
>>> vds.split_multi().export_plink('output/plink')
Notes
fam_expr can be used to set the fields in the FAM file. The following fields can be assigned:
famID: String
id: String
matID: String
patID: String
isFemale: Boolean
isCase: Boolean or qPheno: Double
If no assignment is given, the value is missing and the missing value is used: 0 for IDs and sex, and -9 for phenotype. Only one of isCase or qPheno can be assigned.
fam_expr is in sample context only and the following symbols are in scope (see the sketch below):
s (Sample): sample
sa : sample annotations
global : global annotations
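A hedged sketch of a fuller fam_expr, assuming the dataset carries sa.pheno.isCase and sa.pheno.isFemale sample annotations (these annotation names are illustrative and not guaranteed to exist in the example data):
>>> vds.split_multi().export_plink('output/plink',
...                                fam_expr='id = s, isCase = sa.pheno.isCase, isFemale = sa.pheno.isFemale')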
The BIM file ID field is set to CHR:POS:REF:ALT.
This code:
>>> vds.split_multi().export_plink('output/plink')
will behave similarly to the PLINK VCF conversion command
plink --vcf /path/to/file.vcf --make-bed --out sample --const-fid --keep-allele-order
except:
- The order among split multi-allelic alternatives in the BED file may disagree.
- PLINK uses the rsID for the BIM file ID.
Parameters: - output (str) – Output file base. Will write BED, BIM, and FAM files.
- fam_expr (str) – Expression for FAM file fields.
- parallel (bool) – If true, return a set of BED and BIM files (one per partition) rather than serially concatenating these files.
export_samples(output, expr, types=False)
Export sample information to delimited text file.
Examples
Export some sample QC metrics:
>>> (vds.sample_qc() ... .export_samples('output/samples.tsv', 'SAMPLE = s, CALL_RATE = sa.qc.callRate, NHET = sa.qc.nHet'))
This will produce a file with a header and three columns. To produce a file with no header, just leave off the assignment to the column identifier:
>>> (vds.sample_qc() ... .export_samples('output/samples.tsv', 's, sa.qc.rTiTv'))
Notes
One line per sample will be exported. As export_samples() runs in sample context, the following symbols are in scope:
s (Sample): sample
sa : sample annotations
global : global annotations
gs (Aggregable[Genotype]): aggregable of Genotype for samples
Parameters: - output (str) – Output file.
- expr (str) – Export expression for values to export.
- types (bool) – Write types of exported columns to a file at (output + “.types”).
export_variants(output, expr, types=False, parallel=False)
Export variant information to delimited text file.
Examples
Export a four column TSV with v, va.pass, va.filters, and one computed field: 1 - va.qc.callRate.
>>> vds.export_variants('output/file.tsv', ... 'VARIANT = v, PASS = va.pass, FILTERS = va.filters, MISSINGNESS = 1 - va.qc.callRate')
It is also possible to export without identifiers, which will result in a file with no header. In this case, the expressions should look like the examples below:
>>> vds.export_variants('output/file.tsv', 'v, va.pass, va.qc.AF')
Note
If any field is named, all fields must be named.
In the common case that a group of annotations needs to be exported (for example, the annotations produced by variant_qc()), one can use the struct.* syntax. This syntax produces one column per field in the struct, and names them according to the struct field name.
For example, the following invocation (assuming va.qc was generated by variant_qc()):
>>> vds.export_variants('output/file.tsv', 'variant = v, va.qc.*')
will produce the following set of columns:
variant callRate AC AF nCalled ...
Note that using the .* syntax always results in named arguments, so it is not possible to export header-less files in this manner. However, naming the “splatted” struct will apply the name in front of each column like so:
>>> vds.export_variants('output/file.tsv', 'variant = v, QC = va.qc.*')
which produces these columns:
variant QC.callRate QC.AC QC.AF QC.nCalled ...
Notes
This module takes a comma-delimited list of fields or expressions to print. These fields will be printed in the order they appear in the expression in the header and on each line.
One line per variant in the VDS will be printed. The accessible namespace includes:
v (Variant): Variant
va : variant annotations
global : global annotations
gs (Aggregable[Genotype]): aggregable of Genotype for variant v
Designating output with an expression
Much like the filtering methods, this method uses the Hail expression language. While the filtering methods expect an expression that evaluates to true or false, this method expects a comma-separated list of fields to print. These fields take the form IDENTIFIER = <expression>.
Parameters: - output (str) – Output file.
- expr (str) – Export expression for values to export.
- types (bool) – Write types of exported columns to a file at (output + “.types”)
- parallel (bool) – If true, writes a set of files (one per partition) rather than serially concatenating these files.
export_vcf(output, append_to_header=None, export_pp=False, parallel=False)
Export variant dataset as a .vcf or .vcf.bgz file.
Examples
Export to VCF as a block-compressed file:
>>> vds.export_vcf('output/example.vcf.bgz')
Notes
export_vcf() writes the VDS to disk in VCF format as described in the VCF 4.2 spec.
Use the .vcf.bgz extension rather than .vcf in the output file name for blocked GZIP compression.
Note
We strongly recommend compressed (.bgz extension) and parallel output (parallel=True) when exporting large VCFs.
Consider the workflow of importing VCF to VDS and immediately exporting VDS to VCF:
>>> vds.export_vcf('output/example_out.vcf')
The example_out.vcf header will contain the FORMAT, FILTER, and INFO lines present in example.vcf. However, it will not contain CONTIG lines or lines added by external tools (such as bcftools and GATK) unless they are explicitly inserted using the append_to_header option.
Hail only exports the contents of va.info to the INFO field. No other annotations besides va.info are exported.
The genotype schema must have the type TGenotype or TStruct. If the type is TGenotype, then the FORMAT fields will be GT, AD, DP, GQ, and PL (or PP if export_pp is True). If the type is TStruct, then the exported FORMAT fields will be the names of each field of the Struct. Each field must have a type of String, Char, Int, Double, or Call. Arrays and Sets are also allowed as long as they are not nested. For example, a field with type Array[Int] can be exported but not a field with type Array[Array[Int]]. Nested Structs are also not allowed.
Caution
If samples or genotypes are filtered after import, the value stored in va.info.AC may no longer reflect the number of called alternate alleles in the filtered VDS. If the filtered VDS is then exported to VCF, downstream tools may produce erroneous results. The solution is to create new annotations in va.info or overwrite existing annotations. For example, in order to produce an accurate AC field, one can run variant_qc() and copy the va.qc.AC field to va.info.AC:
>>> (vds.filter_genotypes('g.gq >= 20') ... .variant_qc() ... .annotate_variants_expr('va.info.AC = va.qc.AC') ... .export_vcf('output/example.vcf.bgz'))
Parameters: - output (str) – Path of .vcf file to write.
- append_to_header (str or None) – Path of file to append to VCF header.
- export_pp (bool) – If true, export linear-scaled probabilities (Hail’s pp field on genotype) as the VCF PP FORMAT field.
- parallel (bool) – If true, return a set of VCF files (one per partition) rather than serially concatenating these files.
filter_alleles(expr, annotation='va = va', subset=True, keep=True, filter_altered_genotypes=False, max_shift=100, keep_star=False)
Filter a user-defined set of alternate alleles for each variant. If all alternate alleles of a variant are filtered, the variant itself is filtered. The expr expression is evaluated for each alternate allele, but not for the reference allele (i.e. aIndex will never be zero).
Important
The genotype_schema() must be of type TGenotype in order to use this method.
Examples
To remove alternate alleles with zero allele count and update the alternate allele count annotation with the new indices:
>>> vds_result = vds.filter_alleles('va.info.AC[aIndex - 1] == 0', ... annotation='va.info.AC = aIndices[1:].map(i => va.info.AC[i - 1])', ... keep=False)
Note that we skip the first element of aIndices because we are mapping between the old and new allele indices, not the alternate allele indices.
Notes
If filter_altered_genotypes is true, genotypes that contain filtered-out alleles are set to missing.
filter_alleles() implements two algorithms for filtering alleles: subset and downcode. We will illustrate their behavior on the example genotype below when filtering the first alternate allele (allele 1) at a site with 1 reference allele and 2 alternate alleles.
GT: 1/2
GQ: 10
AD: 0,50,35
0 | 1000
1 | 1000  10
2 | 1000  0     20
  +-----------------
    0     1     2
Subset algorithm
The subset algorithm (the default, subset=True) subsets the AD and PL arrays (i.e. removes entries corresponding to filtered alleles) and then sets GT to the genotype with the minimum PL. Note that if the genotype changes (as in the example), the PLs are re-normalized (shifted) so that the most likely genotype has a PL of 0. Qualitatively, subsetting corresponds to the belief that the filtered alleles are not real so we should discard any probability mass associated with them.
The subset algorithm would produce the following:
GT: 1/1
GQ: 980
AD: 0,50
0 | 980
1 | 980  0
  +-----------
    0    1
In summary:
- GT: Set to most likely genotype based on the PLs ignoring the filtered allele(s).
- AD: The filtered alleles’ columns are eliminated, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 25,20.
- DP: No change.
- PL: The filtered alleles’ columns are eliminated and the remaining columns shifted so the minimum value is 0.
- GQ: The second-lowest PL (after shifting).
Downcode algorithm
The downcode algorithm (subset=False) recodes occurrences of filtered alleles to occurrences of the reference allele (e.g. 1 -> 0 in our example). So the depths of filtered alleles in the AD field are added to the depth of the reference allele. Where downcoding filtered alleles merges distinct genotypes, the minimum PL is used (since PL is on a log scale, this roughly corresponds to adding probabilities). The PLs are then re-normalized (shifted) so that the most likely genotype has a PL of 0, and GT is set to this genotype. If an allele is filtered, this algorithm acts similarly to split_multi().
The downcoding algorithm would produce the following:
GT: 0/1
GQ: 10
AD: 35,50
0 | 20
1 | 0   10
  +-----------
    0   1
In summary:
- GT: Downcode filtered alleles to reference.
- AD: The filtered alleles’ columns are eliminated and their value is added to the reference, e.g., filtering alleles 1 and 2 transforms 25,5,10,20 to 40,20.
- DP: No change.
- PL: Downcode filtered alleles to reference, combine PLs using minimum for each overloaded genotype, and shift so the overall minimum PL is 0.
- GQ: The second-lowest PL (after shifting).
Expression Variables
The following symbols are in scope for expr:
v (Variant): Variant
va : variant annotations
aIndex (Int): the index of the allele being tested
The following symbols are in scope for annotation:
v (Variant): Variant
va : variant annotations
aIndices (Array[Int]): the array of old indices (such that aIndices[newIndex] = oldIndex and aIndices[0] = 0)
Parameters: - expr (str) – Boolean filter expression involving v (variant), va (variant annotations), and aIndex (allele index)
- annotation (str) – Annotation modifying expression involving v (new variant), va (old variant annotations), and aIndices (maps from new to old indices)
- subset (bool) – If true, subsets PL and AD, otherwise downcodes the PL and AD. Genotype and GQ are set based on the resulting PLs.
- keep (bool) – If true, keep variants matching expr
- filter_altered_genotypes (bool) – If true, genotypes that contain filtered-out alleles are set to missing.
- max_shift (int) – maximum number of base pairs by which a split variant can move. Affects memory usage, and will cause Hail to throw an error if a variant that moves further is encountered.
- keep_star (bool) – If true, keep variants where the only allele left is a * allele.
Returns: Filtered variant dataset.
Return type: VariantDataset
filter_genotypes(expr, keep=True)
Filter genotypes based on expression.
Examples
Filter genotypes by allele balance dependent on genotype call:
>>> vds_result = vds.filter_genotypes('let ab = g.ad[1] / g.ad.sum() in ' + ... '((g.isHomRef() && ab <= 0.1) || ' + ... '(g.isHet() && ab >= 0.25 && ab <= 0.75) || ' + ... '(g.isHomVar() && ab >= 0.9))')
Notes
expr is in genotype context so the following symbols are in scope:
g (Genotype): genotype
s (Sample): sample
v (Variant): Variant
sa : sample annotations
va : variant annotations
global : global annotations
For more information, see the documentation on data representation, annotations, and the expression language.
Caution
When expr evaluates to missing, the genotype will be removed regardless of whether keep=True or keep=False.
Parameters: - expr (str) – Boolean filter expression.
- keep (bool) – Keep genotypes where expr evaluates to true.
Returns: Filtered variant dataset.
Return type: VariantDataset
filter_intervals(intervals, keep=True)
Filter variants with an interval or list of intervals.
Examples
Filter to one interval:
>>> vds_result = vds.filter_intervals(Interval.parse('17:38449840-38530994'))
Another way of writing this same query:
>>> vds_result = vds.filter_intervals(Interval(Locus('17', 38449840), Locus('17', 38530994)))
Two identical ways of parsing a list of intervals:
>>> intervals = map(Interval.parse, ['1:50M-75M', '2:START-400000', '3-22']) >>> intervals = [Interval.parse(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
Use this interval list to filter:
>>> vds_result = vds.filter_intervals(intervals)
Notes
This method takes an argument of Interval or list of Interval.
Based on the keep argument, this method will either restrict to variants in the supplied interval ranges, or remove all variants in those ranges. Note that intervals are left-inclusive and right-exclusive. The below interval includes the locus 15:100000 but not 15:101000.
>>> interval = Interval.parse('15:100000-101000')
This method performs predicate pushdown when keep=True, meaning that data shards that don’t overlap any supplied interval will not be loaded at all. This property enables filter_intervals to be used for reasonably low-latency queries of small ranges of the genome, even on large datasets. Suppose we are interested in variants on chromosome 15 between 100000 and 200000. This implementation with filter_variants_expr() may come to mind first:
>>> vds_filtered = vds.filter_variants_expr('v.contig == "15" && v.start >= 100000 && v.start < 200000')
However, it is much faster (and easier!) to use this method:
>>> vds_filtered = vds.filter_intervals(Interval.parse('15:100000-200000'))
Note
A KeyTable keyed by interval can be used to filter a dataset efficiently as well. See the documentation for filter_variants_table() for an example. This is useful for using interval files to filter a dataset.
Parameters: - intervals (Interval or list of Interval) – Interval(s) to keep or remove.
- keep (bool) – If true, keep variants in the interval(s), otherwise remove them.
Returns: Filtered variant dataset.
filter_multi()
Filter out multi-allelic sites.
Important
The genotype_schema() must be of type TGenotype in order to use this method.
This method is much less computationally expensive than split_multi(), and can also be used to produce a variant dataset that can be used with methods that do not support multiallelic variants.
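A minimal usage sketch (no arguments; multi-allelic sites are dropped rather than split):
>>> vds_biallelic = vds.filter_multi()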
Returns: Dataset with no multiallelic sites, which can be used for biallelic-only methods.
Return type: VariantDataset
filter_samples_expr(expr, keep=True)
Filter samples with the expression language.
Examples
Filter samples by phenotype (assumes sample annotation sa.isCase exists and is a Boolean variable):
>>> vds_result = vds.filter_samples_expr("sa.isCase")
Remove samples with an ID that matches a regular expression:
>>> vds_result = vds.filter_samples_expr('"^NA" ~ s' , keep=False)
Filter samples from sample QC metrics and write output to a new variant dataset:
>>> (vds.sample_qc() ... .filter_samples_expr('sa.qc.callRate >= 0.99 && sa.qc.dpMean >= 10') ... .write("output/filter_samples.vds"))
Notes
expr is in sample context so the following symbols are in scope:
s (Sample): sample
sa : sample annotations
global : global annotations
gs (Aggregable[Genotype]): aggregable of Genotype for samples
For more information, see the documentation on data representation, annotations, and the expression language.
Caution
When expr evaluates to missing, the sample will be removed regardless of whether keep=True or keep=False.
Parameters: - expr (str) – Boolean filter expression.
- keep (bool) – Keep samples where expr evaluates to true.
Returns: Filtered variant dataset.
Return type: VariantDataset
filter_samples_list(samples, keep=True)
Filter samples with a list of samples.
Examples
>>> to_remove = ['NA12878', 'NA12891', 'NA12892'] >>> vds_result = vds.filter_samples_list(to_remove, keep=False)
Read list from a file:
>>> to_remove = [s.strip() for s in open('data/exclude_samples.txt')] >>> vds_result = vds.filter_samples_list(to_remove, keep=False)
Parameters: - samples (list of str) – List of samples to keep or remove.
- keep (bool) – If true, keep samples in samples, otherwise remove them.
Returns: Filtered variant dataset.
Return type: VariantDataset
filter_samples_table(table, keep=True)
Filter samples with a table keyed by sample ID.
Examples
Keep samples in a text file:
>>> table = hc.import_table('data/samples1.tsv').key_by('Sample') >>> vds_filtered = vds.filter_samples_table(table, keep=True)
Remove samples in a text file with 1 field, and no header:
>>> to_remove = hc.import_table('data/exclude_samples.txt', no_header=True).key_by('f0') >>> vds_filtered = vds.filter_samples_table(to_remove, keep=False)
Notes
This method filters out or filters to the keys of a table. The table must have a key of type String.
Parameters: - table (KeyTable) – Key table.
- keep (bool) – If true, keep only the keys in table, otherwise remove them.
Return type: - table (
filter_variants_expr(expr, keep=True)
Filter variants with the expression language.
Examples
Keep variants in the gene CHD8 (assumes the variant annotation va.gene exists):
>>> vds_result = vds.filter_variants_expr('va.gene == "CHD8"')
Remove all variants on chromosome 1:
>>> vds_result = vds.filter_variants_expr('v.contig == "1"', keep=False)
Caution
The double quotes on
"1"
are necessary becausev.contig
is of type String.Notes
The following symbols are in scope for
expr
:v
(Variant): Variantva
: variant annotationsglobal
: global annotationsgs
(Aggregable[Genotype]): aggregable of Genotype for variantv
For more information, see the Overview and the Expression Language.
Caution
When
expr
evaluates to missing, the variant will be removed regardless of whetherkeep=True
orkeep=False
.Parameters: - expr (str) – Boolean filter expression.
- keep (bool) – Keep variants where
expr
evaluates to true.
Returns: Filtered variant dataset.
Return type: VariantDataset
-
filter_variants_list
(variants, keep=True)[source]¶ Filter variants with a list of variants.
Examples
Filter VDS down to a list of variants:
>>> vds_filtered = vds.filter_variants_list([Variant.parse('20:10626633:G:GC'), ... Variant.parse('20:10019093:A:G')], keep=True)
Notes
This method performs predicate pushdown when
keep=True
, meaning that data shards that don’t overlap with any supplied variant will not be loaded at all. This property enablesfilter_variants_list
to be used for reasonably low-latency queries of one or more variants, even on large datasets.Parameters: - variants (list of
Variant
) – List of variants to keep or remove. - keep (bool) – If true, keep variants in
variants
, otherwise remove them.
Returns: Filtered variant dataset.
Return type: VariantDataset
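As noted above, keep=True enables predicate pushdown, so this method can serve as a reasonably low-latency point query. A minimal sketch (the variant string is hypothetical; count() simply forces evaluation):
>>> point_query_vds = vds.filter_variants_list([Variant.parse('20:10626633:G:GC')])
>>> point_query_vds.count()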
-
filter_variants_table
(table, keep=True)[source]¶ Filter variants with a Variant keyed key table.
Example
Filter variants of a VDS to those appearing in a text file:
>>> kt = hc.import_table('data/sample_variants.txt', key='Variant', impute=True) >>> filtered_vds = vds.filter_variants_table(kt, keep=True)
Keep all variants whose chromosome and position (locus) appear in a file with a chromosome:position column:
>>> kt = hc.import_table('data/locus-table.tsv', impute=True).key_by('Locus') >>> filtered_vds = vds.filter_variants_table(kt, keep=True)
Remove all variants which overlap an interval in a UCSC BED file:
>>> kt = KeyTable.import_bed('data/file2.bed') >>> filtered_vds = vds.filter_variants_table(kt, keep=False)
Notes
This method takes a key table as an argument, which must be keyed by one of the following:
Interval
Locus
Variant
If the key is a
Variant
, then a variant in the dataset will be kept or removed based on finding a complete match in the table. Be careful, however:1:1:A:T
does not match1:1:A:T,C
, and vice versa.If the key is a
Locus
, then a variant in the dataset will be kept or removed based on finding a locus in the table that matches by chromosome and position.If the key is an
Interval
, then a variant in the dataset will be kept or removed based on finding an interval in the table that contains the variant’s chromosome and position.Parameters: - table (
KeyTable
) – Key table object. - keep (bool) – If true, keep only matches in
table
, otherwise remove them.
Returns: Filtered variant dataset.
Return type: VariantDataset
-
static
from_table
(table)[source]¶ Construct a sites-only variant dataset from a key table.
Examples
Import a text table and construct a sites-only VDS:
>>> table = hc.import_table('data/variant-lof.tsv', types={'v': TVariant()}).key_by('v') >>> sites_vds = VariantDataset.from_table(table)
Notes
The key table must be keyed by one column of type
TVariant
.All columns in the key table become variant annotations in the result. For example, a key table with key column
v
(Variant) and columngene
(String) will produce a sites-only variant dataset with ava.gene
variant annotation.Parameters: table ( KeyTable
) – Variant-keyed table.Returns: Sites-only variant dataset. Return type: VariantDataset
-
genotype_schema
¶ Returns the signature of the genotypes contained in this VDS.
Examples
>>> print(vds.genotype_schema)
The
pprint
module can be used to print the schema in a more human-readable format:>>> from pprint import pprint >>> pprint(vds.genotype_schema)
Return type: Type
-
genotypes_table
()[source]¶ Generate a fully expanded genotype table.
Examples
>>> gs = vds.genotypes_table()
Notes
This produces a (massive) flat table from all the genotypes in the dataset. The table has columns:
- v (Variant) - Variant (key column).
- va (Variant annotation schema) - Variant annotations.
- s (String) - Sample ID (key column).
- sa (Sample annotation schema) - Sample annotations.
- g (Genotype schema) - Genotype or generic genotype.
Caution
This table has a row for each variant/sample pair. The genotype key table for a dataset with 10M variants and 10K samples will contain 100 billion rows. Writing or exporting this table will produce a file much larger than the equivalent VDS.
Returns: Key table with a row for each genotype. Return type: KeyTable
-
global_schema
¶ Returns the signature of the global annotations contained in this VDS.
Examples
>>> print(vds.global_schema)
The
pprint
module can be used to print the schema in a more human-readable format:>>> from pprint import pprint >>> pprint(vds.global_schema)
Return type: Type
-
globals
¶ Return global annotations as a Python object.
Returns: Dataset global annotations. Return type: Struct
-
grm
()[source]¶ Compute the Genetic Relatedness Matrix (GRM).
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
>>> km = vds.grm()
Notes
The genetic relationship matrix (GRM) \(G\) encodes genetic correlation between each pair of samples. It is defined by \(G = MM^T\) where \(M\) is a standardized version of the genotype matrix, computed as follows. Let \(C\) be the \(n \times m\) matrix of raw genotypes in the variant dataset, with rows indexed by \(n\) samples and columns indexed by \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as
\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{2p_j(1-p_j)m}},\]with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in Hardy-Weinberg equilibrium and is further motivated in the paper Patterson, Price and Reich, 2006. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) so that the diagonal entries of the GRM are approximately 1. Equivalently,
\[G_{ik} = \frac{1}{m} \sum_{j=1}^m \frac{(C_{ij}-2p_j)(C_{kj}-2p_j)}{2 p_j (1-p_j)} \]Returns: Genetic Relatedness Matrix for all samples. Return type: KinshipMatrix
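To make the construction above concrete, here is a minimal NumPy sketch of the same GRM formula on a toy genotype matrix; it illustrates the math only and is not Hail's implementation:

import numpy as np
C = np.array([[0., 1., 2.],
              [1., np.nan, 0.],
              [2., 1., 1.]])            # n samples x m variants; NaN marks a missing call
n, m = C.shape
p = np.nanmean(C, axis=0) / 2.0         # sample alternate allele frequency per variant
M = (C - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p) * m)
M = np.nan_to_num(M)                    # mean-impute missing genotypes: M_ij = 0
G = M @ M.T                             # genetic relatedness matrix (diagonal approximately 1)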
-
hardcalls
()[source]¶ Drop all genotype fields except the GT field.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.A hard-called variant dataset is about two orders of magnitude smaller than a standard sequencing dataset. Use this method to create a smaller, faster representation for downstream processing that only requires the GT field.
Returns: Variant dataset with no genotype metadata. Return type: VariantDataset
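A minimal usage sketch (the method takes no arguments):
>>> vds_gt_only = vds.hardcalls()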
-
ibd
(maf=None, bounded=True, min=None, max=None)[source]¶ Compute matrix of identity-by-descent estimations.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
To calculate a full IBD matrix, using minor allele frequencies computed from the variant dataset itself:
>>> vds.ibd()
To calculate an IBD matrix containing only pairs of samples with
PI_HAT
in [0.2, 0.9], using minor allele frequencies stored inva.panel_maf
:>>> vds.ibd(maf='va.panel_maf', min=0.2, max=0.9)
Notes
The implementation is based on the IBD algorithm described in the PLINK paper.
ibd()
requires the dataset to be bi-allelic (otherwise run split_multi() or filter_multi() first) and does not perform LD pruning. Linkage disequilibrium may bias the result, so consider filtering variants first.The resulting
KeyTable
entries have the type: { i: String, j: String, ibd: { Z0: Double, Z1: Double, Z2: Double, PI_HAT: Double }, ibs0: Long, ibs1: Long, ibs2: Long }. The key list is: i: String, j: String.Conceptually, the output is a symmetric, sample-by-sample matrix. The output key table has the following form:

i        j        ibd.Z0  ibd.Z1  ibd.Z2  ibd.PI_HAT  ibs0  ibs1  ibs2
sample1  sample2  1.0000  0.0000  0.0000  0.0000      ...
sample1  sample3  1.0000  0.0000  0.0000  0.0000      ...
sample1  sample4  0.6807  0.0000  0.3193  0.3193      ...
sample1  sample5  0.1966  0.0000  0.8034  0.8034      ...
Parameters: - maf (str or None) – Expression for the minor allele frequency.
- bounded (bool) – Forces the estimations for Z0, Z1, Z2, and PI_HAT to take on biologically meaningful values (in the range [0,1]).
- min (float or None) – Sample pairs with a PI_HAT below this value will not be included in the output. Must be in [0,1].
- max (float or None) – Sample pairs with a PI_HAT above this value will not be included in the output. Must be in [0,1].
Returns: A
KeyTable
mapping pairs of samples to their IBD statistics.Return type: KeyTable
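As a usage note, the resulting key table can be written out like any other table; a sketch assuming the KeyTable export method and a hypothetical output path:
>>> ibd_kt = vds.ibd(min=0.2)
>>> ibd_kt.export('output/ibd_results.tsv')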
-
ibd_prune
(threshold, tiebreaking_expr=None, maf=None, bounded=True)[source]¶ Prune samples from the
VariantDataset
based onibd()
PI_HAT measures of relatedness.Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Prune samples so that no two have a PI_HAT value greater than or equal to 0.6.
>>> pruned_vds = vds.ibd_prune(0.6)
Prune samples so that no two have a PI_HAT value greater than or equal to 0.5, with a tiebreaking expression that selects cases over controls:
>>> pruned_vds = vds.ibd_prune( ... 0.5, ... tiebreaking_expr="if (sa1.isCase && !sa2.isCase) -1 else if (!sa1.isCase && sa2.isCase) 1 else 0")
Notes
The variant dataset returned may change in near future as a result of algorithmic improvements. The current algorithm is very efficient on datasets with many small families, less so on datasets with large families. Currently, the algorithm works by deleting the person from each family who has the highest number of relatives, and iterating until no two people have a PI_HAT value greater than that specified. If two people within a family have the same number of relatives, the tiebreaking_expr given will be used to determine which sample gets deleted.
The tiebreaking_expr namespace has the following variables available:
s1
: The first sample id.sa1
: The annotations associated with s1.s2
: The second sample id.sa2
: The annotations associated with s2.
The tiebreaking_expr returns an integer expressing the preference for one sample over the other. Any negative integer expresses a preference for keeping
s1
. Any positive integer expresses a preference for keepings2
. A zero expresses no preference. This function must induce a preorder on the samples, in particular:tiebreaking_expr(sample1, sample2)
must equal -1 * tiebreaking_expr(sample2, sample1), which evokes the common sense understanding that if x < y then y > x.tiebreaking_expr(sample1, sample1)
must equal 0, i.e.x = x
- if sample1 is preferred to sample2 and sample2 is preferred to sample3, then sample1 must also be preferred to sample3
The last requirement is only important if you have three related samples with the same number of relatives and all three are related to one another. In cases like this one, it is important that either:
- one of the three is preferred to both other ones, or
- there is no preference among the three samples
Parameters: - threshold – The desired maximum PI_HAT value between any pair of samples.
- tiebreaking_expr – Expression used to choose between two samples with the same number of relatives.
- maf – Expression for the minor allele frequency.
- bounded – Forces the estimations for Z0, Z1, Z2, and PI_HAT to take on biologically meaningful values (in the range [0,1]).
Returns: A
VariantDataset
containing no samples with a PI_HAT greater than threshold.Return type: VariantDataset
-
impute_sex
(maf_threshold=0.0, include_par=False, female_threshold=0.2, male_threshold=0.8, pop_freq=None)[source]¶ Impute sex of samples by calculating inbreeding coefficient on the X chromosome.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Remove samples where imputed sex does not equal reported sex:
>>> imputed_sex_vds = (vds.impute_sex() ... .annotate_samples_expr('sa.sexcheck = sa.pheno.isFemale == sa.imputesex.isFemale') ... .filter_samples_expr('sa.sexcheck || isMissing(sa.sexcheck)'))
Notes
We have used the same implementation as PLINK v1.7.
- X chromosome variants are selected from the VDS:
v.contig == "X" || v.contig == "23"
- Variants with a minor allele frequency less than the threshold given by
maf_threshold
are removed - Variants in the pseudoautosomal region (X:60001-2699520) || (X:154931044-155260560) are included if the
include_par
optional parameter is set to true. - The minor allele frequency (maf) per variant is calculated.
- For each variant and sample with a non-missing genotype call, \(E\), the expected number of homozygotes (from population MAF), is computed as \(1.0 - (2.0*maf*(1.0-maf))\).
- For each variant and sample with a non-missing genotype call, \(O\), the observed number of homozygotes, is computed as 0 = heterozygote; 1 = homozygote
- For each variant and sample with a non-missing genotype call, \(N\) is incremented by 1
- For each sample, \(E\), \(O\), and \(N\) are combined across variants
- \(F\) is calculated by \((O - E) / (N - E)\)
- A sex is assigned to each sample with the following criteria: F < 0.2 => Female; F > 0.8 => Male. Use female_threshold and male_threshold to change this behavior (a toy illustration of the F calculation follows this list).
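A toy illustration (made-up numbers) of the F calculation described in the steps above:

mafs = [0.10, 0.25, 0.40]    # per-variant minor allele frequencies on the X chromosome
obs_hom = [1, 0, 1]          # per-variant observed homozygosity: 1 = homozygote, 0 = heterozygote
E = sum(1.0 - 2.0 * p * (1.0 - p) for p in mafs)   # expected homozygotes
O = sum(obs_hom)                                   # observed homozygotes
N = len(mafs)                                      # variants with a genotype call
F = (O - E) / (N - E)                              # about 0.03 here, so called female under the defaults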
Annotations
The below annotations can be accessed with
sa.imputesex
.- isFemale (Boolean) – True if the imputed sex is female, false if male, missing if undetermined
- Fstat (Double) – Inbreeding coefficient
- nTotal (Long) – Total number of variants considered
- nCalled (Long) – Number of variants with a genotype call
- expectedHoms (Double) – Expected number of homozygotes
- observedHoms (Long) – Observed number of homozygotes
Parameters: - maf_threshold (float) – Minimum minor allele frequency threshold.
- include_par (bool) – Include pseudoautosomal regions.
- female_threshold (float) – Samples are called females if F < female_threshold
- male_threshold (float) – Samples are called males if F > male_threshold
- pop_freq (str) – Variant annotation for estimate of MAF. If None, MAF will be computed.
Returns: Annotated dataset.
Return type: VariantDataset
-
join
(right)[source]¶ Join two variant datasets.
Notes
This method performs an inner join on variants, concatenates samples, and takes variant and global annotations from the left dataset (self).
The datasets must have distinct samples, the same sample schema, and the same split status (both split or both multi-allelic).
Parameters: right ( VariantDataset
) – right-hand variant datasetReturns: Joined variant dataset Return type: VariantDataset
-
ld_matrix
(force_local=False)[source]¶ Computes the linkage disequilibrium (correlation) matrix for the variants in this VDS.
Examples
>>> ld_mat = vds.ld_matrix()
Notes
Each entry (i, j) in the LD matrix gives the \(r\) value between variants i and j, defined as Pearson’s correlation coefficient \(\rho_{x_i,x_j}\) between the two genotype vectors \(x_i\) and \(x_j\).
\[\rho_{x_i,x_j} = \frac{\mathrm{Cov}(X_i,X_j)}{\sigma_{X_i} \sigma_{X_j}}\]Also note that variants with zero variance (\(\sigma = 0\)) will be dropped from the matrix.
Caution
The matrix returned by this function can easily be very large with most entries near zero (for example, entries between variants on different chromosomes in a homogeneous population). Most likely you’ll want to reduce the number of variants with methods like
sample_variants()
,filter_variants_expr()
, orld_prune()
before calling this unless your dataset is very small.Parameters: force_local (bool) – If true, the LD matrix is computed using local matrix multiplication on the Spark driver. This may improve performance when the genotype matrix is small enough to easily fit in local memory. If false, the LD matrix is computed using distributed matrix multiplication if the number of genotypes exceeds \(5000^2\) and locally otherwise. Returns: Matrix of r values between pairs of variants. Return type: LDMatrix
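Each entry is simply the Pearson correlation of two genotype vectors; a minimal NumPy sketch on toy data (an illustration of the formula, not Hail's distributed implementation):

import numpy as np
x_i = np.array([0., 1., 2., 1., 0.])    # genotypes of variant i across samples
x_j = np.array([0., 1., 1., 2., 0.])    # genotypes of variant j across samples
r = np.corrcoef(x_i, x_j)[0, 1]         # the (i, j) entry of the LD matrix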
-
ld_prune
(r2=0.2, window=1000000, memory_per_core=256, num_cores=1)[source]¶ Prune variants in linkage disequilibrium (LD).
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Requires
was_split
to be True.Examples
Export the set of common LD pruned variants to a file:
>>> vds_result = (vds.variant_qc() ... .filter_variants_expr("va.qc.AF >= 0.05 && va.qc.AF <= 0.95") ... .ld_prune() ... .export_variants("output/ldpruned.variants", "v"))
Notes
Variants are pruned in each contig from smallest to largest start position. The LD pruning algorithm is as follows:
pruned_set = []
for v1 in contig:
    keep = True
    for v2 in pruned_set:
        if ((v1.position - v2.position) <= window
                and correlation(v1, v2) >= r2):
            keep = False
    if keep:
        pruned_set.append(v1)
The parameter
window
defines the maximum distance in base pairs between two variants to check whether the variants are independent (\(R^2\) <r2
) wherer2
is the maximum \(R^2\) allowed. \(R^2\) is defined as the square of Pearson’s correlation coefficient \({\rho}_{x,y}\) between the two genotype vectors \({\mathbf{x}}\) and \({\mathbf{y}}\).\[{\rho}_{x,y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}\]ld_prune()
with default arguments is equivalent toplink --indep-pairwise 1000kb 1 0.2
. The list of pruned variants returned by Hail and PLINK will differ because Hail mean-imputes missing values and tests pairs of variants in a different order than PLINK.Be sure to provide enough disk space per worker because
ld_prune()
persists up to 3 copies of the data to both memory and disk. The amount of disk space required will depend on the size and minor allele frequency of the input data and the prune parametersr2
andwindow
. The number of bytes stored in memory per variant is aboutnSamples / 4 + 50
.Warning
The variants in the pruned set are not guaranteed to be identical each time
ld_prune()
is run. We recommend runningld_prune()
once and exporting the list of LD pruned variants usingexport_variants()
for future use.Parameters: - r2 (float) – Maximum \(R^2\) threshold between two variants in the pruned set within a given window.
- window (int) – Width of window in base-pairs for computing pair-wise \(R^2\) values.
- memory_per_core (int) – Total amount of memory available for each core in MB. If unsure, use the default value.
- num_cores (int) – The number of cores available. Equivalent to the total number of workers times the number of cores per worker.
Returns: Variant dataset filtered to those variants which remain after LD pruning.
Return type: VariantDataset
-
linreg
(y, covariates=[], root='va.linreg', use_dosages=False, min_ac=1, min_af=0.0)[source]¶ Test each variant for association using linear regression.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Run linear regression per variant using a phenotype and two covariates stored in sample annotations:
>>> vds_result = vds.linreg('sa.pheno.height', covariates=['sa.pheno.age', 'sa.pheno.isFemale'])
Notes
The
linreg()
method computes, for each variant, statistics of the \(t\)-test for the genotype coefficient of the linear function of best fit from sample genotype and covariates to quantitative phenotype or case-control status. Hail only includes samples for which phenotype and all covariates are defined. For each variant, missing genotypes are imputed as the mean of called genotypes.By default, genotype values are given by hard call genotypes (
g.gt
). Ifuse_dosages=True
, then genotype values are defined by the dosage \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\). For Phred-scaled values, \(\mathrm{P}(\mathrm{Het})\) and \(\mathrm{P}(\mathrm{HomVar})\) are calculated by normalizing the PL likelihoods (converted from the Phred-scale) to sum to 1.Assuming there are sample annotations
sa.pheno.height
,sa.pheno.age
,sa.pheno.isFemale
, andsa.cov.PC1
, the code:>>> vds_result = vds.linreg('sa.pheno.height', covariates=['sa.pheno.age', 'sa.pheno.isFemale', 'sa.cov.PC1'])
considers a model of the form
\[\mathrm{height} = \beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{isFemale} + \beta_4 \, \mathrm{PC1} + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]where the genotype \(\mathrm{gt}\) is coded as \(0\) for HomRef, \(1\) for Het, and \(2\) for HomVar, and the Boolean covariate \(\mathrm{isFemale}\) is coded as \(1\) for true (female) and \(0\) for false (male). The null model sets \(\beta_1 = 0\).
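For the use_dosages=True case described above, the genotype value is \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\) with the probabilities obtained by normalizing the Phred-scaled PLs; a toy sketch with made-up PL values:

pl = [0, 20, 100]                                  # PLs for HomRef, Het, HomVar
likelihoods = [10 ** (-x / 10.0) for x in pl]      # undo the Phred scaling
total = sum(likelihoods)
p_homref, p_het, p_homvar = [l / total for l in likelihoods]
dosage = p_het + 2.0 * p_homvar                    # about 0.01 for these PLs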
Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations. One can further restrict computation to those variants with at least \(k\) observed alternate alleles (AC) or alternate allele frequency (AF) at least \(p\) in the included samples using the options
min_ac=k
ormin_af=p
, respectively. Unlike thefilter_variants_expr()
method, these filters do not remove variants from the underlying variant dataset; rather the linear regression annotations for variants with low AC or AF are set to missing. Adding both filters is equivalent to applying the more stringent of the two.Phenotype and covariate sample annotations may also be specified using programmatic expressions without identifiers, such as:
>>> vds_result = vds.linreg('if (sa.pheno.isFemale) sa.pheno.age else (2 * sa.pheno.age + 10)')
For Boolean covariate types, true is coded as 1 and false as 0. In particular, for the sample annotation
sa.fam.isCase
added by importing a FAM file with case-control phenotype, case is 1 and control is 0.The standard least-squares linear regression model is derived in Section 3.2 of The Elements of Statistical Learning, 2nd Edition. See equation 3.12 for the t-statistic which follows the t-distribution with \(n - k - 2\) degrees of freedom, under the null hypothesis of no effect, with \(n\) samples and \(k\) covariates in addition to genotype and intercept.
Annotations
With the default root, the following four variant annotations are added.
- va.linreg.beta (Double) – fit genotype coefficient, \(\hat\beta_1\)
- va.linreg.se (Double) – estimated standard error, \(\widehat{\mathrm{se}}\)
- va.linreg.tstat (Double) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
- va.linreg.pval (Double) – \(p\)-value
Parameters: - y (str) – Response expression
- covariates (list of str) – list of covariate expressions
- root (str) – Variant annotation path to store result of linear regression.
- use_dosages (bool) – If true, use dosages genotypes rather than hard call genotypes.
- min_ac (int) – Minimum alternate allele count.
- min_af (float) – Minimum alternate allele frequency.
Returns: Variant dataset with linear regression variant annotations.
Return type: VariantDataset
-
linreg3
(ys, covariates=[], root='va.linreg', use_dosages=False, variant_block_size=16)[source]¶ Test each variant for association with multiple phenotypes using linear regression.
This method runs linear regression for multiple phenotypes more efficiently than looping over
linreg()
. This method is more efficient thanlinreg_multi_pheno()
but doesn’t implicitly filter on allele count or allele frequency.Warning
linreg3()
uses the same set of samples for each phenotype, namely the set of samples for which all phenotypes and covariates are defined.Annotations
With the default root, the following variant annotations are added. The indexing of the array annotations corresponds to that of
y
.- va.linreg.nCompleteSamples (Int) – number of samples used
- va.linreg.AC (Double) – sum of the genotype values
x
- va.linreg.ytx (Array[Double]) – array of dot products of each phenotype vector
y
with the genotype vectorx
- va.linreg.beta (Array[Double]) – array of fit genotype coefficients, \(\hat\beta_1\)
- va.linreg.se (Array[Double]) – array of estimated standard errors, \(\widehat{\mathrm{se}}\)
- va.linreg.tstat (Array[Double]) – array of \(t\)-statistics, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
- va.linreg.pval (Array[Double]) – array of \(p\)-values
Parameters: - ys – list of one or more response expressions.
- covariates (list of str) – list of covariate expressions.
- root (str) – Variant annotation path to store result of linear regression.
- use_dosages (bool) – If true, use dosage genotypes rather than hard call genotypes.
- variant_block_size (int) – Number of variant regressions to perform simultaneously. Larger block size requires more memory.
Returns: Variant dataset with linear regression variant annotations.
Return type: VariantDataset
-
linreg_burden
(key_name, variant_keys, single_key, agg_expr, y, covariates=[])[source]¶ Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the linear regression model.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Run a gene burden test using linear regression on the maximum genotype per gene. Here
va.genes
is a variant annotation of type Set[String] giving the set of genes containing the variant (see Extended example below for a deep dive):>>> linreg_kt, sample_kt = (hc.read('data/example_burden.vds') ... .linreg_burden(key_name='gene', ... variant_keys='va.genes', ... single_key=False, ... agg_expr='gs.map(g => g.gt).max()', ... y='sa.burden.pheno', ... covariates=['sa.burden.cov1', 'sa.burden.cov2']))
Run a gene burden test using linear regression on the weighted sum of genotypes per gene. Here
va.gene
is a variant annotation of type String giving a single gene per variant (or no gene if missing), andva.weight
is a numeric variant annotation:>>> linreg_kt, sample_kt = (hc.read('data/example_burden.vds') ... .linreg_burden(key_name='gene', ... variant_keys='va.gene', ... single_key=True, ... agg_expr='gs.map(g => va.weight * g.gt).sum()', ... y='sa.burden.pheno', ... covariates=['sa.burden.cov1', 'sa.burden.cov2']))
To use a weighted sum of genotypes with missing genotypes mean-imputed rather than ignored, set
agg_expr='gs.map(g => va.weight * orElse(g.gt.toDouble, 2 * va.qc.AF)).sum()'
whereva.qc.AF
is the allele frequency over those samples that have no missing phenotype or covariates.Caution
With
single_key=False
,variant_keys
expects a variant annotation of Set or Array type, in order to allow each variant to have zero, one, or more keys (for example, the same variant may appear in multiple genes). Unlike with type Set, if the same key appears twice in a variant annotation of type Array, then that variant will be counted twice in that key’s group. Withsingle_key=True
,variant_keys
expects a variant annotation whose value is itself the key of interest. In both cases, variants with missing keys are ignored.Notes
This method modifies
linreg()
by replacing the genotype covariate per variant and sample with an aggregated (i.e., collapsed) score per key and sample. This numeric score is computed from the sample’s genotypes and annotations over all variants with that key. Conceptually, the method proceeds as follows:Filter to the set of samples for which all phenotype and covariates are defined.
For each key and sample, aggregate genotypes across variants with that key to produce a numeric score.
agg_expr
must be of numeric type, and the following symbols are in scope:s
(Sample): samplesa
: sample annotationsglobal
: global annotationsgs
(Aggregable[Genotype]): aggregable of Genotype for samples
Note that
v
,va
, andg
are accessible through Aggregable methods ongs
.The resulting sample key table has key column
key_name
and a numeric column of scores for each sample named by the sample ID.For each key, fit the linear regression model using the supplied phenotype and covariates. The model is that of
linreg()
with sample genotypegt
replaced by the score in the sample key table. For each key, missing scores are mean-imputed across all samples.The resulting linear regression key table has the following columns:
- value of
key_name
(String) – descriptor of variant group key (key column) - beta (Double) – fit coefficient, \(\hat\beta_1\)
- se (Double) – estimated standard error, \(\widehat{\mathrm{se}}\)
- tstat (Double) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
- pval (Double) – \(p\)-value
- value of
linreg_burden()
returns both the linear regression key table and the sample key table.Extended example
Let’s walk through these steps in the
max()
toy example above. There are six samples with the following annotations:

Sample  pheno  cov1  cov2
A       0      0     -1
B       0      2     3
C       1      1     5
D       1      -2    0
E       1      -2    -4
F       1      4     3

There are three variants with the following gt values:

Variant  A  B  C  D  E  F
1:1:A:C  0  1  0  0  0  1
1:2:C:T  .  2  .  2  0  0
1:3:G:C  0  .  1  1  1  .

The
va.genes
annotation of type Set[String] onexample_burden.vds
was created usingannotate_variants_table()
withproduct=True
on the interval list:

1  1  2  +  geneA
1  2  2  +  geneB
1  1  3  +  geneC
So there are three overlapping genes: gene A contains two variants, gene B contains one variant, and gene C contains all three variants.
gene   1:1:A:C  1:2:C:T  1:3:G:C
geneA  X        X
geneB           X
geneC  X        X        X

Therefore
annotate_variants_table()
withproduct=True
creates a variant annotation of type Set[String] with values Set('geneA', 'geneC'), Set('geneA', 'geneB', 'geneC'), and Set('geneC'). So the sample aggregation key table is:
gene   A   B   C   D   E   F
geneA  0   2   0   2   0   1
geneB  NA  2   NA  2   0   0
geneC  0   2   1   2   1   1

Linear regression is done for each row using the supplied phenotype, covariates, and implicit intercept. The resulting linear regression key table is:
gene   beta    se     tstat   pval
geneA  -0.084  0.368  -0.227  0.841
geneB  -0.542  0.335  -1.617  0.247
geneC  0.075   0.515  0.145   0.898

Parameters: - key_name (str) – Name to assign to key column of returned key tables.
- variant_keys (str) – Variant annotation path for the TArray or TSet of keys associated to each variant.
- single_key (bool) – if true,
variant_keys
is interpreted as a single (or missing) key per variant, rather than as a collection of keys. - agg_expr (str) – Sample aggregation expression (per key).
- y (str) – Response expression.
- covariates (list of str) – list of covariate expressions.
Returns: Tuple of linear regression key table and sample aggregation key table.
Return type: (KeyTable, KeyTable)
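As a cross-check of the extended example above, here is a toy recomputation in plain Python of the sample aggregation table (maximum gt per gene per sample); the dictionaries simply restate the tables shown above:

gt = {'1:1:A:C': {'A': 0, 'B': 1, 'C': 0, 'D': 0, 'E': 0, 'F': 1},
      '1:2:C:T': {'A': None, 'B': 2, 'C': None, 'D': 2, 'E': 0, 'F': 0},
      '1:3:G:C': {'A': 0, 'B': None, 'C': 1, 'D': 1, 'E': 1, 'F': None}}
genes = {'geneA': ['1:1:A:C', '1:2:C:T'],
         'geneB': ['1:2:C:T'],
         'geneC': ['1:1:A:C', '1:2:C:T', '1:3:G:C']}
agg = {}
for gene, variants in genes.items():
    for s in 'ABCDEF':
        calls = [gt[v][s] for v in variants if gt[v][s] is not None]
        agg[(gene, s)] = max(calls) if calls else None   # None corresponds to NA in the table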
-
linreg_multi_pheno
(ys, covariates=[], root='va.linreg', use_dosages=False, min_ac=1, min_af=0.0)[source]¶ Test each variant for association with multiple phenotypes using linear regression.
This method runs linear regression for multiple phenotypes more efficiently than looping over
linreg()
.Warning
linreg_multi_pheno()
uses the same set of samples for each phenotype, namely the set of samples for which all phenotypes and covariates are defined.Annotations
With the default root, the following four variant annotations are added. The indexing of these annotations corresponds to that of
y
.- va.linreg.beta (Array[Double]) – array of fit genotype coefficients, \(\hat\beta_1\)
- va.linreg.se (Array[Double]) – array of estimated standard errors, \(\widehat{\mathrm{se}}\)
- va.linreg.tstat (Array[Double]) – array of \(t\)-statistics, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
- va.linreg.pval (Array[Double]) – array of \(p\)-values
Parameters: - ys – list of one or more response expressions.
- covariates (list of str) – list of covariate expressions.
- root (str) – Variant annotation path to store result of linear regression.
- use_dosages (bool) – If true, use dosage genotypes rather than hard call genotypes.
- min_ac (int) – Minimum alternate allele count.
- min_af (float) – Minimum alternate allele frequency.
Returns: Variant dataset with linear regression variant annotations.
Return type: VariantDataset
-
lmmreg
(kinshipMatrix, y, covariates=[], global_root='global.lmmreg', va_root='va.lmmreg', run_assoc=True, use_ml=False, delta=None, sparsity_threshold=1.0, use_dosages=False, n_eigs=None, dropped_variance_fraction=None)[source]¶ Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Suppose the variant dataset saved at data/example_lmmreg.vds has a Boolean variant annotation
va.useInKinship
and numeric or Boolean sample annotationssa.pheno
,sa.cov1
,sa.cov2
. Then thelmmreg()
function in>>> assoc_vds = hc.read("data/example_lmmreg.vds") >>> kinship_matrix = assoc_vds.filter_variants_expr('va.useInKinship').rrm() >>> lmm_vds = assoc_vds.lmmreg(kinship_matrix, 'sa.pheno', ['sa.cov1', 'sa.cov2'])
will execute the following four steps in order:
- filter the samples in the given kinship matrix to those for which sa.pheno, sa.cov1, and sa.cov2 are all defined
- fit covariate coefficients and variance parameters in the sample-covariates-only (global) model using restricted maximum likelihood (REML), storing results in global annotations under
global.lmmreg
- test each variant for association, storing results under
va.lmmreg
in variant annotations
This plan can be modified as follows:
- Set
run_assoc=False
to not test any variants for association, i.e. skip Step 5. - Set
use_ml=True
to use maximum likelihood instead of REML in Steps 4 and 5. - Set the
delta
argument to manually set the value of \(\delta\) rather than fitting \(\delta\) in Step 4.
global_root
argument to change the global annotation root in Step 4. - Set the
va_root
argument to change the variant annotation root in Step 5.
lmmreg()
adds 9 or 13 global annotations in Step 4, depending on whether \(\delta\) is set or fit.
- global.lmmreg.useML (Boolean) – true if fit by ML, false if fit by REML
- global.lmmreg.beta (Dict[String, Double]) – map from intercept and the given covariates expressions to the corresponding fit \(\beta\) coefficients
- global.lmmreg.sigmaG2 (Double) – fit coefficient of genetic variance, \(\hat{\sigma}_g^2\)
- global.lmmreg.sigmaE2 (Double) – fit coefficient of environmental variance, \(\hat{\sigma}_e^2\)
- global.lmmreg.delta (Double) – fit ratio of variance component coefficients, \(\hat{\delta}\)
- global.lmmreg.h2 (Double) – fit narrow-sense heritability, \(\hat{h}^2\)
- global.lmmreg.nEigs (Int) – number of eigenvectors of kinship matrix used to fit model
- global.lmmreg.dropped_variance_fraction (Double) – specified value of dropped_variance_fraction
- global.lmmreg.evals (Array[Double]) – all eigenvalues of the kinship matrix in descending order
- global.lmmreg.fit.seH2 (Double) – standard error of \(\hat{h}^2\) under asymptotic normal approximation
- global.lmmreg.fit.normLkhdH2 (Array[Double]) – likelihood function of \(h^2\) normalized on the discrete grid 0.01, 0.02, ..., 0.99; index i is the likelihood for percentage i
- global.lmmreg.fit.maxLogLkhd (Double) – (restricted) maximum log likelihood corresponding to \(\hat{\delta}\)
- global.lmmreg.fit.logDeltaGrid (Array[Double]) – values of \(\mathrm{ln}(\delta)\) used in the grid search
- global.lmmreg.fit.logLkhdVals (Array[Double]) – (restricted) log likelihood of \(y\) given \(X\) and \(\mathrm{ln}(\delta)\) at the (RE)ML fit of \(\beta\) and \(\sigma_g^2\)
These global annotations are also added to
hail.log
, with the ranked evals and \(\delta\) grid values in .tsv tabular form. Use
to find the lines just above each table.If Step 5 is performed,
lmmreg()
also adds four linear regression variant annotations.Annotation Type Value va.lmmreg.beta
Double fit genotype coefficient, \(\hat\beta_0\) va.lmmreg.sigmaG2
Double fit coefficient of genetic variance component, \(\hat{\sigma}_g^2\) va.lmmreg.chi2
Double \(\chi^2\) statistic of the likelihood ratio test va.lmmreg.pval
Double \(p\)-value Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.
The simplest way to export all resulting annotations is:
>>> lmm_vds.export_variants('output/lmmreg.tsv.bgz', 'variant = v, va.lmmreg.*') >>> lmmreg_results = lmm_vds.globals['lmmreg']
By default, genotypes values are given by hard call genotypes (
g.gt
). Ifuse_dosages=True
, then genotype values for per-variant association are defined by the dosage \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\). For Phred-scaled values, \(\mathrm{P}(\mathrm{Het})\) and \(\mathrm{P}(\mathrm{HomVar})\) are calculated by normalizing the PL likelihoods (converted from the Phred-scale) to sum to 1.Performance
Hail’s initial version of
lmmreg()
scales beyond 15k samples and to an essentially unbounded number of variants, making it particularly well-suited to modern sequencing studies and complementary to tools designed for SNP arrays. Analysts have usedlmmreg()
in research to compute kinship from 100k common variants and test 32 million non-rare variants on 8k whole genomes in about 10 minutes on Google cloud.While
lmmreg()
computes the kinship matrix \(K\) using distributed matrix multiplication (Step 2), the full eigendecomposition (Step 3) is currently run on a single core of the master using the LAPACK routine DSYEVD, which we empirically find to be the most performant of the four available routines; laptop performance plots showing cubic complexity in \(n\) are available here. On Google cloud, eigendecomposition takes about 2 seconds for 2535 samples and 1 minute for 8185 samples. If you see worse performance, check that LAPACK natives are being properly loaded (see “BLAS and LAPACK” in Getting Started).Given the eigendecomposition, fitting the global model (Step 4) takes on the order of a few seconds on the master. Association testing (Step 5) is fully distributed by variant with per-variant time complexity that is completely independent of the number of sample covariates and dominated by multiplication of the genotype vector \(v\) by the matrix of eigenvectors \(U^T\) as described below, which we accelerate with a sparse representation of \(v\). The matrix \(U^T\) has size about \(8n^2\) bytes and is currently broadcast to each Spark executor. For example, with 15k samples, storing \(U^T\) consumes about 3.6GB of memory on a 16-core worker node with two 8-core executors. So for large \(n\), we recommend using a high-memory configuration such as
highmem
workers.Linear mixed model
lmmreg()
estimates the genetic proportion of residual phenotypic variance (narrow-sense heritability) under a kinship-based linear mixed model, and then optionally tests each variant for association using the likelihood ratio test. Inference is exact.We first describe the sample-covariates-only model used to estimate heritability, which we simply refer to as the global model. With \(n\) samples and \(c\) sample covariates, we define:
- \(y = n \times 1\) vector of phenotypes
- \(X = n \times c\) matrix of sample covariates and intercept column of ones
- \(K = n \times n\) kinship matrix
- \(I = n \times n\) identity matrix
- \(\beta = c \times 1\) vector of covariate coefficients
- \(\sigma_g^2 =\) coefficient of genetic variance component \(K\)
- \(\sigma_e^2 =\) coefficient of environmental variance component \(I\)
- \(\delta = \frac{\sigma_e^2}{\sigma_g^2} =\) ratio of environmental and genetic variance component coefficients
- \(h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} = \frac{1}{1 + \delta} =\) genetic proportion of residual phenotypic variance
Under a linear mixed model, \(y\) is sampled from the \(n\)-dimensional multivariate normal distribution with mean \(X \beta\) and variance components that are scalar multiples of \(K\) and \(I\):
\[y \sim \mathrm{N}\left(X\beta, \sigma_g^2 K + \sigma_e^2 I\right)\]Thus the model posits that the residuals \(y_i - X_{i,:}\beta\) and \(y_j - X_{j,:}\beta\) have covariance \(\sigma_g^2 K_{ij}\) and approximate correlation \(h^2 K_{ij}\). Informally: phenotype residuals are correlated as the product of overall heritability and pairwise kinship. By contrast, standard (unmixed) linear regression is equivalent to fixing \(\sigma_g^2\) (equivalently, \(h^2\)) at 0 above, so that all phenotype residuals are independent.
Caution: while it is tempting to interpret \(h^2\) as the narrow-sense heritability of the phenotype alone, note that its value depends not only on the phenotype and genetic data, but also on the choice of sample covariates.
Fitting the global model
The core algorithm is essentially a distributed implementation of the spectral approach taken in FastLMM. Let \(K = USU^T\) be the eigendecomposition of the real symmetric matrix \(K\). That is:
- \(U = n \times n\) orthonormal matrix whose columns are the eigenvectors of \(K\)
- \(S = n \times n\) diagonal matrix of eigenvalues of \(K\) in descending order. \(S_{ii}\) is the eigenvalue of eigenvector \(U_{:,i}\)
- \(U^T = n \times n\) orthonormal matrix, the transpose (and inverse) of \(U\)
A bit of matrix algebra on the multivariate normal density shows that the linear mixed model above is mathematically equivalent to the model
\[U^Ty \sim \mathrm{N}\left(U^TX\beta, \sigma_g^2 (S + \delta I)\right)\]for which the covariance is diagonal (e.g., unmixed). That is, rotating the phenotype vector (\(y\)) and covariate vectors (columns of \(X\)) in \(\mathbb{R}^n\) by \(U^T\) transforms the model to one with independent residuals. For any particular value of \(\delta\), the restricted maximum likelihood (REML) solution for the latter model can be solved exactly in time complexity that is linear rather than cubic in \(n\). In particular, having rotated, we can run a very efficient 1-dimensional optimization procedure over \(\delta\) to find the REML estimate \((\hat{\delta}, \hat{\beta}, \hat{\sigma}_g^2)\) of the triple \((\delta, \beta, \sigma_g^2)\), which in turn determines \(\hat{\sigma}_e^2\) and \(\hat{h}^2\).
We first compute the maximum log likelihood on a \(\delta\)-grid that is uniform on the log scale, with \(\mathrm{ln}(\delta)\) running from -8 to 8 by 0.01, corresponding to \(h^2\) decreasing from 0.9995 to 0.0005. If \(h^2\) is maximized at the lower boundary then standard linear regression would be more appropriate and Hail will exit; more generally, consider using standard linear regression when \(\hat{h}^2\) is very small. A maximum at the upper boundary is highly suspicious and will also cause Hail to exit. In any case, the log file records the table of grid values for further inspection, beginning under the info line containing “lmmreg: table of delta”.
If the optimal grid point falls in the interior of the grid as expected, we then use Brent’s method to find the precise location of the maximum over the same range, with initial guess given by the optimal grid point and a tolerance on \(\mathrm{ln}(\delta)\) of 1e-6. If this location differs from the optimal grid point by more than 0.01, a warning will be displayed and logged, and one would be wise to investigate by plotting the values over the grid.
Note that \(h^2\) is related to \(\mathrm{ln}(\delta)\) through the sigmoid function. More precisely,
\[h^2 = 1 - \mathrm{sigmoid}(\mathrm{ln}(\delta)) = \mathrm{sigmoid}(-\mathrm{ln}(\delta))\]Hence one can change variables to extract a high-resolution discretization of the likelihood function of \(h^2\) over \([0,1]\) at the corresponding REML estimators for \(\beta\) and \(\sigma_g^2\), as well as integrate over the normalized likelihood function using change of variables and the sigmoid differential equation.
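A one-line Python check of the relation above between \(h^2\) and \(\mathrm{ln}(\delta)\):

import math
log_delta = 1.5
h2 = 1.0 / (1.0 + math.exp(log_delta))   # equals sigmoid(-ln(delta)) = 1 - sigmoid(ln(delta))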
For convenience,
global.lmmreg.fit.normLkhdH2
records the likelihood function of \(h^2\) normalized over the discrete grid 0.01, 0.02, ..., 0.98, 0.99
. The length of the array is 101 so that indexi
contains the likelihood at percentagei
. The values at indices 0 and 100 are left undefined.By the theory of maximum likelihood estimation, this normalized likelihood function is approximately normally distributed near the maximum likelihood estimate. So we estimate the standard error of the estimator of \(h^2\) as follows. Let \(x_2\) be the maximum likelihood estimate of \(h^2\) and let \(x_ 1\) and \(x_3\) be just to the left and right of \(x_2\). Let \(y_1\), \(y_2\), and \(y_3\) be the corresponding values of the (unnormalized) log likelihood function. Setting equal the leading coefficient of the unique parabola through these points (as given by Lagrange interpolation) and the leading coefficient of the log of the normal distribution, we have:
\[\frac{x_3 (y_2 - y_1) + x_2 (y_1 - y_3) + x_1 (y_3 - y_2)}{(x_2 - x_1)(x_1 - x_3)(x_3 - x_2)} = -\frac{1}{2 \sigma^2}\]The standard error \(\hat{\sigma}\) is then estimated by solving for \(\sigma\).
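A toy sketch of that estimate with made-up log-likelihood values: compute the parabola's leading coefficient and solve \(-1/(2\sigma^2)\) for \(\sigma\):

x1, x2, x3 = 0.48, 0.50, 0.52            # h^2 grid points around the maximum
y1, y2, y3 = -10.30, -10.00, -10.28      # unnormalized log likelihoods (illustrative values)
a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / ((x2 - x1) * (x1 - x3) * (x3 - x2))
sigma = (-1.0 / (2.0 * a)) ** 0.5        # estimated standard error of the h^2 estimate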
Note that the mean and standard deviation of the (discretized or continuous) distribution held in
global.lmmreg.fit.normLkhdH2
will not coincide with \(\hat{h}^2\) and \(\hat{\sigma}\), since this distribution only becomes normal in the infinite sample limit. One can visually assess normality by plotting this distribution against a normal distribution with the same mean and standard deviation, or use this distribution to approximate credible intervals under a flat prior on \(h^2\).Testing each variant for association
Fixing a single variant, we define:
- \(v = n \times 1\) vector of genotypes, with missing genotypes imputed as the mean of called genotypes
- \(X_v = \left[v | X \right] = n \times (1 + c)\) matrix concatenating \(v\) and \(X\)
- \(\beta_v = (\beta^0_v, \beta^1_v, \ldots, \beta^c_v) = (1 + c) \times 1\) vector of covariate coefficients
Fixing \(\delta\) at the global REML estimate \(\hat{\delta}\), we find the REML estimate \((\hat{\beta}_v, \hat{\sigma}_{g,v}^2)\) via rotation of the model
\[y \sim \mathrm{N}\left(X_v\beta_v, \sigma_{g,v}^2 (K + \hat{\delta} I)\right)\]Note that the only new rotation to compute here is \(U^T v\).
To test the null hypothesis that the genotype coefficient \(\beta^0_v\) is zero, we consider the restricted model with parameters \(((0, \beta^1_v, \ldots, \beta^c_v), \sigma_{g,v}^2)\) within the full model with parameters \(((\beta^0_v, \beta^1_v, \ldots, \beta^c_v), \sigma_{g,v}^2)\), with \(\delta\) fixed at \(\hat\delta\) in both. The fit of the restricted model is simply that of the global model, \(((0, \hat{\beta}^1, \ldots, \hat{\beta}^c), \hat{\sigma}_g^2)\). The likelihood ratio test statistic is given by
\[\chi^2 = n \, \mathrm{ln}\left(\frac{\hat{\sigma}^2_g}{\hat{\sigma}_{g,v}^2}\right)\]and follows a chi-squared distribution with one degree of freedom. Here the ratio \(\hat{\sigma}^2_g / \hat{\sigma}_{g,v}^2\) captures the degree to which adding the variant \(v\) to the global model reduces the residual phenotypic variance.
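A toy evaluation of this statistic (illustration only; assumes SciPy is available for the chi-squared tail probability):

import math
from scipy.stats import chi2
n = 1000                                     # number of samples
sigma_g2_global, sigma_g2_variant = 0.80, 0.78
stat = n * math.log(sigma_g2_global / sigma_g2_variant)
pval = chi2.sf(stat, df=1)                   # one degree of freedom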
Kinship Matrix
FastLMM uses the Realized Relationship Matrix (RRM) for kinship. This can be computed with
rrm()
. However, any instance ofKinshipMatrix
may be used, so long assample_list
contains all samples of the caller variant dataset, in the same order.Low-rank approximation of kinship for improved performance
lmmreg()
can implicitly use a low-rank approximation of the kinship matrix to more rapidly fit delta and the statistics for each variant. The computational complexity per variant is proportional to the number of eigenvectors used. This number can be specified in two ways. Specify the parametern_eigs
to use only the topn_eigs
eigenvectors. Alternatively, specifydropped_variance_fraction
to use as many eigenvectors as necessary to capture all but at most this fraction of the sample variance (also known as the trace, or the sum of the eigenvalues). For example,dropped_variance_fraction=0.01
will use the minimal number of eigenvectors to account for 99% of the sample variance. Specifying both parameters will apply the more stringent (fewest eigenvectors) of the two.Further background
For the history and mathematics of linear mixed models in genetics, including FastLMM, see Christoph Lippert’s PhD thesis. For an investigation of various approaches to defining kinship, see Comparison of Methods to Account for Relatedness in Genome-Wide Association Studies with Family-Based Data.
Parameters: - kinshipMatrix (
KinshipMatrix
) – Kinship matrix to be used. - y (str) – Response sample annotation.
- covariates (list of str) – List of covariate sample annotations.
- global_root (str) – Global annotation root, a period-delimited path starting with global.
- va_root (str) – Variant annotation root, a period-delimited path starting with va.
- run_assoc (bool) – If true, run association testing in addition to fitting the global model.
- use_ml (bool) – Use ML instead of REML throughout.
- delta (float or None) – Fixed delta value to use in the global model, overrides fitting delta.
- sparsity_threshold (float) – Genotype vector sparsity at or below which to use sparse genotype vector in rotation (advanced).
- use_dosages (bool) – If true, use dosages rather than hard call genotypes.
- n_eigs (int) – Number of eigenvectors of the kinship matrix used to fit the model.
- dropped_variance_fraction (float) – Upper bound on fraction of sample variance lost by dropping eigenvectors with small eigenvalues.
Returns: Variant dataset with linear mixed regression annotations.
Return type: VariantDataset
-
logreg
(test, y, covariates=[], root='va.logreg', use_dosages=False)[source]¶ Test each variant for association using logistic regression.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Run the logistic regression Wald test per variant using a Boolean phenotype and two covariates stored in sample annotations:
>>> vds_result = vds.logreg('wald', 'sa.pheno.isCase', covariates=['sa.pheno.age', 'sa.pheno.isFemale'])
Notes
The
logreg()
method performs, for each variant, a significance test of the genotype in predicting a binary (case-control) phenotype based on the logistic regression model. The phenotype type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’). Hail only includes samples for which the phenotype and all covariates are defined. For each variant, Hail imputes missing genotypes as the mean of called genotypes.
By default, genotypes values are given by hard call genotypes (
g.gt
). Ifuse_dosages=True
, then genotype values are defined by the dosage \(\mathrm{P}(\mathrm{Het}) + 2 \cdot \mathrm{P}(\mathrm{HomVar})\). For Phred-scaled values, \(\mathrm{P}(\mathrm{Het})\) and \(\mathrm{P}(\mathrm{HomVar})\) are calculated by normalizing the PL likelihoods (converted from the Phred-scale) to sum to 1.The example above considers a model of the form
\[\mathrm{Prob}(\mathrm{isCase}) = \mathrm{sigmoid}(\beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{isFemale} + \varepsilon), \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]where \(\mathrm{sigmoid}\) is the sigmoid function, the genotype \(\mathrm{gt}\) is coded as 0 for HomRef, 1 for Het, and 2 for HomVar, and the Boolean covariate \(\mathrm{isFemale}\) is coded as 1 for true (female) and 0 for false (male). The null model sets \(\beta_1 = 0\).
The resulting variant annotations depend on the test statistic as shown in the tables below.
- Wald: va.logreg.beta (Double) – fit genotype coefficient, \(\hat\beta_1\)
- Wald: va.logreg.se (Double) – estimated standard error, \(\widehat{\mathrm{se}}\)
- Wald: va.logreg.zstat (Double) – Wald \(z\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
- Wald: va.logreg.pval (Double) – Wald p-value testing \(\beta_1 = 0\)
- LRT, Firth: va.logreg.beta (Double) – fit genotype coefficient, \(\hat\beta_1\)
- LRT, Firth: va.logreg.chi2 (Double) – deviance statistic
- LRT, Firth: va.logreg.pval (Double) – LRT / Firth p-value testing \(\beta_1 = 0\)
- Score: va.logreg.chi2 (Double) – score statistic
- Score: va.logreg.pval (Double) – score p-value testing \(\beta_1 = 0\)
For the Wald and likelihood ratio tests, Hail fits the logistic model for each variant using Newton iteration and only emits the above annotations when the maximum likelihood estimate of the coefficients converges. The Firth test uses a modified form of Newton iteration. To help diagnose convergence issues, Hail also emits three variant annotations which summarize the iterative fitting process:
- Wald, LRT, Firth: va.logreg.fit.nIter (Int) – number of iterations until convergence, explosion, or reaching the max (25 for Wald, LRT; 100 for Firth)
- Wald, LRT, Firth: va.logreg.fit.converged (Boolean) – true if iteration converged
- Wald, LRT, Firth: va.logreg.fit.exploded (Boolean) – true if iteration exploded
We consider iteration to have converged when every coordinate of \(\beta\) changes by less than \(10^{-6}\). For Wald and LRT, up to 25 iterations are attempted; in testing we find 4 or 5 iterations nearly always suffice. Convergence may also fail due to explosion, which refers to low-level numerical linear algebra exceptions caused by manipulating ill-conditioned matrices. Explosion may result from (nearly) linearly dependent covariates or complete separation.
A more common situation in genetics is quasi-complete separation, e.g. variants that are observed only in cases (or controls). Such variants inevitably arise when testing millions of variants with very low minor allele count. The maximum likelihood estimate of \(\beta\) under logistic regression is then undefined but convergence may still occur after a large number of iterations due to a very flat likelihood surface. In testing, we find that such variants produce a secondary bump from 10 to 15 iterations in the histogram of number of iterations per variant. We also find that this faux convergence produces large standard errors and large (insignificant) p-values. To avoid missing such variants, consider using Firth logistic regression, linear regression, or group-based tests.
Here’s a concrete illustration of quasi-complete separation in R. Suppose we have 2010 samples distributed as follows for a particular variant:
Status   HomRef  Het  HomVar
Case     1000    10   0
Control  1000    0    0

The following R code fits the (standard) logistic, Firth logistic, and linear regression models to this data, where
x
is genotype,y
is phenotype, andlogistf
is from the logistf package:

x <- c(rep(0,1000), rep(0,1000), rep(1,10))
y <- c(rep(0,1000), rep(1,1000), rep(1,10))
logfit <- glm(y ~ x, family=binomial())
firthfit <- logistf(y ~ x)
linfit <- lm(y ~ x)
The resulting p-values for the genotype coefficient are 0.991, 0.00085, and 0.0016, respectively. The erroneous value 0.991 is due to quasi-complete separation. Moving one of the 10 hets from case to control eliminates this quasi-complete separation; the p-values from R are then 0.0373, 0.0111, and 0.0116, respectively, as expected for a less significant association.
The Firth test reduces bias from small counts and resolves the issue of separation by penalizing maximum likelihood estimation by the Jeffreys invariant prior. This test is slower, as both the null and full model must be fit per variant, and convergence of the modified Newton method is linear rather than quadratic. For Firth, 100 iterations are attempted for the null model and, if that is successful, for the full model as well. In testing we find 20 iterations nearly always suffice. If the null model fails to converge, then the
va.logreg.fit
annotations reflect the null model; otherwise, they reflect the full model.See Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants for an empirical comparison of the logistic Wald, LRT, score, and Firth tests. The theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes Statistical Theory. Firth introduced his approach in Bias reduction of maximum likelihood estimates, 1993. Heinze and Schemper further analyze Firth’s approach in A solution to the problem of separation in logistic regression, 2002.
Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.
Phenotype and covariate sample annotations may also be specified using programmatic expressions without identifiers, such as:
if (sa.isFemale) sa.cov.age else (2 * sa.cov.age + 10)
For Boolean covariate types, true is coded as 1 and false as 0. In particular, for the sample annotation
sa.fam.isCase
added by importing a FAM file with case-control phenotype, case is 1 and control is 0.Hail’s logistic regression tests correspond to the
b.wald
,b.lrt
, andb.score
tests in EPACTS. For each variant, Hail imputes missing genotypes as the mean of called genotypes, whereas EPACTS subsets to those samples with called genotypes. Hence, Hail and EPACTS results will currently only agree for variants with no missing genotypes.Parameters: - test (str) – Statistical test, one of: ‘wald’, ‘lrt’, ‘score’, or ‘firth’.
- y (str) – Response expression. Must evaluate to Boolean or numeric with all values 0 or 1.
- covariates (list of str) – list of covariate expressions
- root (str) – Variant annotation path to store result of logistic regression.
- use_dosages (bool) – If true, use genotype dosage rather than hard call.
Returns: Variant dataset with logistic regression variant annotations.
Return type: VariantDataset
-
logreg_burden
(key_name, variant_keys, single_key, agg_expr, test, y, covariates=[])[source]¶ Test each keyed group of variants for association by aggregating (collapsing) genotypes and applying the logistic regression model.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Run a gene burden test using the logistic Wald test on the maximum genotype per gene. Here
va.genes
is a variant annotation of type Set[String] giving the set of genes containing the variant (see Extended example inlinreg_burden()
for a deeper dive in the context of linear regression):>>> logreg_kt, sample_kt = (hc.read('data/example_burden.vds') ... .logreg_burden(key_name='gene', ... variant_keys='va.genes', ... single_key=False, ... agg_expr='gs.map(g => g.gt).max()', ... test='wald', ... y='sa.burden.pheno', ... covariates=['sa.burden.cov1', 'sa.burden.cov2']))
Run a gene burden test using the logistic score test on the weighted sum of genotypes per gene. Here
va.gene
is a variant annotation of type String giving a single gene per variant (or no gene if missing), andva.weight
is a numeric variant annotation:>>> logreg_kt, sample_kt = (hc.read('data/example_burden.vds') ... .logreg_burden(key_name='gene', ... variant_keys='va.gene', ... single_key=True, ... agg_expr='gs.map(g => va.weight * g.gt).sum()', ... test='score', ... y='sa.burden.pheno', ... covariates=['sa.burden.cov1', 'sa.burden.cov2']))
To use a weighted sum of genotypes with missing genotypes mean-imputed rather than ignored, set
agg_expr='gs.map(g => va.weight * orElse(g.gt.toDouble, 2 * va.qc.AF)).sum()'
whereva.qc.AF
is the allele frequency over those samples that have no missing phenotype or covariates.Caution
With
single_key=False
,variant_keys
expects a variant annotation of Set or Array type, in order to allow each variant to have zero, one, or more keys (for example, the same variant may appear in multiple genes). Unlike with type Set, if the same key appears twice in a variant annotation of type Array, then that variant will be counted twice in that key’s group. Withsingle_key=True
,variant_keys
expects a variant annotation whose value is itself the key of interest. In both cases, variants with missing keys are ignored.Notes
This method modifies
logreg()
by replacing the genotype covariate per variant and sample with an aggregated (i.e., collapsed) score per key and sample. This numeric score is computed from the sample’s genotypes and annotations over all variants with that key. The phenotype type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’) as the
test
parameter. Conceptually, the method proceeds as follows:Filter to the set of samples for which all phenotype and covariates are defined.
For each key and sample, aggregate genotypes across variants with that key to produce a numeric score.
agg_expr
must be of numeric type and has the following symbols in scope:
s
(Sample): samplesa
: sample annotationsglobal
: global annotationsgs
(Aggregable[Genotype]): aggregable of Genotype for samples
Note that
v
,va
, andg
are accessible through Aggregable methods ongs
.The resulting sample key table has key column
key_name
and a numeric column of scores for each sample named by the sample ID.For each key, fit the logistic regression model using the supplied phenotype, covariates, and test. The model and tests are those of
logreg()
with sample genotypegt
replaced by the score in the sample key table. For each key, missing scores are mean-imputed across all samples.The resulting logistic regression key table has key column of type String given by the
key_name
parameter and additional columns corresponding to the fields of theva.logreg
schema given fortest
inlogreg()
.
logreg_burden()
returns both the logistic regression key table and the sample key table.Parameters: - key_name (str) – Name to assign to key column of returned key tables.
- variant_keys (str) – Variant annotation path for the TArray or TSet of keys associated to each variant.
- single_key (bool) – if true,
variant_keys
is interpreted as a single (or missing) key per variant, rather than as a collection of keys. - agg_expr (str) – Sample aggregation expression (per key).
- test (str) – Statistical test, one of: ‘wald’, ‘lrt’, ‘score’, or ‘firth’.
- y (str) – Response expression.
- covariates (list of str) – list of covariate expressions.
Returns: Tuple of logistic regression key table and sample aggregation key table.
Return type: (KeyTable, KeyTable)
-
make_table
(variant_expr, genotype_expr, key=[], separator='.')[source]¶ Produce a key table with one row per variant and one or more columns per sample.
Examples
Consider a
VariantDataset
vds
with 2 variants and 3 samples:Variant FORMAT A B C 1:1:A:T GT:GQ 0/1:99 ./. 0/0:99 1:2:G:C GT:GQ 0/1:89 0/1:99 1/1:93
Then
>>> kt = vds.make_table('v = v', ['gt = g.gt', 'gq = g.gq'])
returns a
KeyTable
with schema
v: Variant
A.gt: Int
A.gq: Int
B.gt: Int
B.gq: Int
C.gt: Int
C.gq: Int
and values
v        A.gt  A.gq  B.gt  B.gq  C.gt  C.gq
1:1:A:T  1     99    NA    NA    0     99
1:2:G:C  1     89    1     99    2     93
The above table can be generated and exported as a TSV using
KeyTable
export()
.Notes
Per-sample field names in the result are formed by concatenating the sample ID, the separator, and the genotype_expr left-hand side. If the left-hand side is empty, as in `` = expr``, then the separator (a dot by default) is omitted.
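For example, one way to write the table above to disk (a sketch; the output path is illustrative):
>>> kt = vds.make_table('v = v', ['gt = g.gt', 'gq = g.gq'])
>>> kt.export('output/genotype_matrix.tsv')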
Parameters: - variant_expr (str or list of str) – Variant annotation expressions.
- genotype_expr (str or list of str) – Genotype annotation expressions.
- key (str or list of str) – List of key columns.
- separator (str) – Separator to use between sample IDs and genotype expression left-hand side identifiers.
Return type: KeyTable
-
mendel_errors
(pedigree)[source]¶ Find Mendel errors; count per variant, individual and nuclear family.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Find all violations of Mendelian inheritance in each (dad, mom, kid) trio in a pedigree and return four tables:
>>> ped = Pedigree.read('data/trios.fam') >>> all, per_fam, per_sample, per_variant = vds.mendel_errors(ped)
Export all mendel errors to a text file:
>>> all.export('output/all_mendel_errors.tsv')
Annotate samples with the number of Mendel errors:
>>> annotated_vds = vds.annotate_samples_table(per_sample, root="sa.mendel")
Annotate variants with the number of Mendel errors:
>>> annotated_vds = vds.annotate_variants_table(per_variant, root="va.mendel")
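As a further sketch, the per-variant table can be restricted to variants with many violations using KeyTable filter(); the threshold of 5 errors is arbitrary:
>>> worst_variants = per_variant.filter('nErrors > 5', keep=True)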
Notes
This method assumes all contigs apart from X and Y are fully autosomal; mitochondria, decoys, etc. are not given special treatment.
The example above returns four tables, which contain Mendelian violations grouped in various ways. These tables are modeled after the PLINK mendel formats. The four tables contain the following columns:
First table: all Mendel errors. This table contains one row per Mendel error in the dataset; it is possible that a variant or sample may be found on more than one row. This table closely reflects the structure of the “.mendel” PLINK format detailed below.
Columns:
- fid (String) – Family ID.
- s (String) – Proband ID.
- v (Variant) – Variant in which the error was found.
- code (Int) – Mendel error code, see below.
- error (String) – Readable representation of Mendel error.
Second table: errors per nuclear family. This table contains one row per nuclear family in the dataset. This table closely reflects the structure of the “.fmendel” PLINK format detailed below.
Columns:
- fid (String) – Family ID.
- father (String) – Paternal ID.
- mother (String) – Maternal ID.
- nChildren (Int) – Number of children in this nuclear family.
- nErrors (Int) – Number of Mendel errors in this nuclear family.
- nSNP (Int) – Number of Mendel errors at SNPs in this nuclear family.
Third table: errors per individual. This table contains one row per individual in the dataset, including founders. This table closely reflects the structure of the “.imendel” PLINK format detailed below.
Columns:
- s (String) – Sample ID (key column).
- fid (String) – Family ID.
- nErrors (Int) – Number of Mendel errors found involving this individual.
- nSNP (Int) – Number of Mendel errors found involving this individual at SNPs.
- error (String) – Readable representation of Mendel error.
Fourth table: errors per variant. This table contains one row per variant in the dataset.
Columns:
- v (Variant) – Variant (key column).
- nErrors (Int) – Number of Mendel errors in this variant.
PLINK Mendel error formats:
*.mendel
– all mendel errors: FID KID CHR SNP CODE ERROR*.fmendel
– error count per nuclear family: FID PAT MAT CHLD N*.imendel
– error count per individual: FID IID N*.lmendel
– error count per variant: CHR SNP N
In the PLINK formats, FID, KID, PAT, MAT, and IID refer to family, kid, dad, mom, and individual ID, respectively, with missing values set to
0
. SNP denotes the variant identifierchr:pos:ref:alt
. N is the error count. CHLD is the number of children in a nuclear family.The CODE of each Mendel error is determined by the table below, extending the Plink classification.
Those individuals implicated by each code are in bold.
The copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region.
- HemiX – in non-PAR of X, male child
- HemiY – in non-PAR of Y, male child
- Auto – otherwise (in autosome or PAR, or female child)
Any refers to \(\{ HomRef, Het, HomVar, NoCall \}\) and ! denotes complement in this set.
Code  Dad      Mom      Kid     Copy State
1     HomVar   HomVar   Het     Auto
2     HomRef   HomRef   Het     Auto
3     HomRef   !HomRef  HomVar  Auto
4     !HomRef  HomRef   HomVar  Auto
5     HomRef   HomRef   HomVar  Auto
6     HomVar   !HomVar  HomRef  Auto
7     !HomVar  HomVar   HomRef  Auto
8     HomVar   HomVar   HomRef  Auto
9     Any      HomVar   HomRef  HemiX
10    Any      HomRef   HomVar  HemiX
11    HomVar   Any      HomRef  HemiY
12    HomRef   Any      HomVar  HemiY

This method only considers children with two parents and a defined sex.
PAR is currently defined with respect to reference GRCh37:
- X: 60001 - 2699520, 154931044 - 155260560
- Y: 10001 - 2649520, 59034050 - 59363566
Parameters: pedigree ( Pedigree
) – Sample pedigree.Returns: Four tables with Mendel error statistics. Return type: ( KeyTable
,KeyTable
,KeyTable
,KeyTable
)
-
min_rep
(max_shift=100)[source]¶ Gives minimal, left-aligned representation of alleles. Note that this can change the variant position.
Examples
1. Simple trimming of a multi-allelic site, no change in variant position: 1:10000:TAA:TAA,AA => 1:10000:TA:T,A
2. Trimming of a bi-allelic site leading to a change in position: 1:10000:AATAA:AAGAA => 1:10002:T:G
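A minimal usage sketch:
>>> minrep_vds = vds.min_rep()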
Parameters: max_shift (int) – maximum number of base pairs by which a split variant can move. Affects memory usage, and will cause Hail to throw an error if a variant that moves further is encountered. Return type: VariantDataset
-
naive_coalesce
(max_partitions)[source]¶ Naively decrease the number of partitions.
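For example, to reduce a dataset to at most 100 partitions (the target of 100 is illustrative):
>>> vds_coalesced = vds.naive_coalesce(100)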
Warning
naive_coalesce()
simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance, unlikerepartition()
, so it can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.Parameters: max_partitions (int) – Desired number of partitions. If the current number of partitions is less than max_partitions
, do nothing.Returns: Variant dataset with the number of partitions equal to at most max_partitions
Return type: VariantDataset
-
num_partitions
()[source]¶ Number of partitions.
Notes
The data in a variant dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. Partitions are a core concept of distributed computation in Spark, see here for details.
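For example, to inspect the current partitioning before calling repartition() or naive_coalesce():
>>> print(vds.num_partitions())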
Return type: int
-
num_samples
¶ Number of samples.
Return type: int
-
pc_relate
(k, maf, block_size=512, min_kinship=-inf, statistics='all')[source]¶ Compute relatedness estimates between individuals using a variant of the PC-Relate method.
Danger
This method is experimental. We neither guarantee interface stability nor that the results are viable for any particular use.
Examples
Estimate kinship, identity-by-descent two, identity-by-descent one, and identity-by-descent zero for every pair of samples, using 5 principal components to correct for ancestral populations, and a minimum minor allele frequency filter of 0.01:
>>> rel = vds.pc_relate(5, 0.01)
Calculate values as above, but when performing distributed matrix multiplications use a matrix-block-size of 1024 by 1024.
>>> rel = vds.pc_relate(5, 0.01, 1024)
Calculate values as above, excluding sample-pairs with kinship less than 0.1. This is more efficient than producing the full key table and filtering using
filter()
.>>> rel = vds.pc_relate(5, 0.01, min_kinship=0.1)
Method
The traditional estimator for kinship between a pair of individuals \(i\) and \(j\), sharing the set \(S_{ij}\) of single-nucleotide variants, from a population with allele frequencies \(p_s\), is given by:
\[\widehat{\phi_{ij}} := \frac{1}{|S_{ij}|}\sum_{s \in S_{ij}}\frac{(g_{is} - 2 p_s) (g_{js} - 2 p_s)}{4 \, p_s (1 - p_s)}\]This estimator is derived under the model that the sharing of common (relative to the population) alleles is not very informative about relatedness (because they’re common) and the sharing of rare alleles suggests a recent common ancestor from which the allele was inherited by descent.
When multiple ancestry groups are mixed in a sample, this model breaks down. Alleles that are rare in all but one ancestry group are treated as very informative to relatedness. However, these alleles are simply markers of the ancestry group. The PC-Relate method corrects for this situation and the related situation of admixed individuals.
PC-Relate slightly modifies the usual estimator for relatedness: occurrences of the population allele frequency are replaced with an “individual-specific allele frequency”. This modification allows the method to correctly weight an allele according to an individual’s unique ancestry profile.
The “individual-specific allele frequency” at a given genetic locus is modeled by PC-Relate as a linear function of their first
k
principal component coordinates. As such, the efficacy of this method rests on two assumptions:- an individual’s first
k
principal component coordinates fully describe their allele-frequency-relevant ancestry, and - the relationship between ancestry (as described by principal component coordinates) and population allele frequency is linear
The estimators for kinship, and identity-by-descent zero, one, and two follow. Let:
- \(S_{ij}\) be the set of genetic loci at which both individuals \(i\) and \(j\) have a defined genotype
- \(g_{is} \in \{0, 1, 2\}\) be the number of alternate alleles that individual \(i\) has at genetic locus \(s\)
- \(\widehat{\mu_{is}} \in [0, 1]\) be the individual-specific allele frequency for individual \(i\) at genetic locus \(s\)
- \({\widehat{\sigma^2_{is}}} := \widehat{\mu_{is}} (1 - \widehat{\mu_{is}})\), the binomial variance of \(\widehat{\mu_{is}}\)
- \(\widehat{\sigma_{is}} := \sqrt{\widehat{\sigma^2_{is}}}\), the binomial standard deviation of \(\widehat{\mu_{is}}\)
- \(\text{IBS}^{(0)}_{ij} := \sum_{s \in S_{ij}} \mathbb{1}_{|g_{is} - g_{js}| = 2}\), the number of genetic loci at which individuals \(i\) and \(j\) share no alleles
- \(\widehat{f_i} := 2 \widehat{\phi_{ii}} - 1\), the inbreeding coefficient for individual \(i\)
- \(g^D_{is}\) be a dominance encoding of the genotype matrix, and \(X_{is}\) be a normalized dominance-coded genotype matrix
\[g^D_{is} := \begin{cases} \widehat{\mu_{is}} & g_{is} = 0 \\ 0 & g_{is} = 1 \\ 1 - \widehat{\mu_{is}} & g_{is} = 2 \end{cases}\]\[X_{is} := g^D_{is} - \widehat{\sigma^2_{is}} (1 - \widehat{f_i})\]The estimator for kinship is given by:
\[\widehat{\phi_{ij}} := \frac{\sum_{s \in S_{ij}}(g - 2 \mu)_{is} (g - 2 \mu)_{js}}{4 * \sum_{s \in S_{ij}}\widehat{\sigma_{is}} \widehat{\sigma_{js}}}\]The estimator for identity-by-descent two is given by:
\[\widehat{k^{(2)}_{ij}} := \frac{\sum_{s \in S_{ij}}X_{is} X_{js}}{\sum_{s \in S_{ij}}\widehat{\sigma^2_{is}} \widehat{\sigma^2_{js}}}\]The estimator for identity-by-descent zero is given by:
\[\begin{split}\widehat{k^{(0)}_{ij}} := \begin{cases} \frac{\text{IBS}^{(0)}_{ij}} {\sum_{s \in S_{ij}} \widehat{\mu_{is}}^2(1 - \widehat{\mu_{js}})^2 + (1 - \widehat{\mu_{is}})^2\widehat{\mu_{js}}^2} & \widehat{\phi_{ij}} > 2^{-5/2} \\ 1 - 4 \widehat{\phi_{ij}} + k^{(2)}_{ij} & \widehat{\phi_{ij}} \le 2^{-5/2} \end{cases}\end{split}\]The estimator for identity-by-descent one is given by:
\[\widehat{k^{(1)}_{ij}} := 1 - \widehat{k^{(2)}_{ij}} - \widehat{k^{(0)}_{ij}}\]Details
The PC-Relate method is described in “Model-free Estimation of Recent Genetic Relatedness”. Conomos MP, Reiner AP, Weir BS, Thornton TA. in American Journal of Human Genetics. 2016 Jan 7. The reference implementation is available in the GENESIS Bioconductor package .
pc_relate()
differs from the reference implementation in a couple key ways:- the principal components analysis does not use an unrelated set of individuals
- the estimators do not perform small sample correction
- the algorithm does not provide an option to use population-wide allele frequency estimates
- the algorithm does not provide an option to not use “overall
standardization” (see R
pcrelate
documentation)
Notes
The
block_size
controls memory usage and parallelism. If it is large enough to hold an entire sample-by-sample matrix of 64-bit doubles in memory, then only one Spark worker node can be used to compute matrix operations. If it is too small, communication overhead will begin to dominate the computation’s time. The author has found that on Google Dataproc (where each core has about 3.75GB of memory), settingblock_size
larger than 512 tends to cause memory exhaustion errors.The minimum allele frequency filter is applied per-pair: if either of the two individual’s individual-specific minor allele frequency is below the threshold, then the variant’s contribution to relatedness estimates is zero.
Under the PC-Relate model, kinship, \(\phi_{ij}\), ranges from 0 to 0.5, and is precisely half the fraction of genetic material shared. Listed below are the statistics for a few pairings:
- Monozygotic twins share all their genetic material, so their kinship statistic is 0.5 in expectation.
- Parent-child and sibling pairs both have kinship 0.25 in expectation and are separated by the identity-by-descent-zero statistic, \(k^{(0)}_{ij}\), which is zero for parent-child pairs and 0.25 for sibling pairs.
- Avuncular pairs and grand-parent/-child pairs both have kinship 0.125 in expectation and both have identity-by-descent-zero 0.5 in expectation
- “Third degree relatives” are those pairs sharing \(2^{-3} = 12.5\%\) of their genetic material; the results of PC-Relate are often too noisy to reliably distinguish these pairs from higher-degree-relative pairs or unrelated pairs.
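For example, one might keep only pairs estimated to be third-degree relatives or closer by filtering the resulting key table on a kinship cutoff of \(2^{-9/2} \approx 0.044\) (a sketch; the cutoff is a judgment call):
>>> rel = vds.pc_relate(5, 0.01)
>>> close_pairs = rel.filter('kin > 0.044', keep=True)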
The resulting
KeyTable
entries have the type: { i: String, j: String, kin: Double, k2: Double, k1: Double, k0: Double }. The key list is: i: String, j: String.Parameters: - k (int) – The number of principal components to use to distinguish ancestries.
- maf (float) – The minimum individual-specific allele frequency for an allele used to measure relatedness.
- block_size (int) – the side length of the blocks of the block-distributed matrices; this should be set such that at least three of these matrices fit in memory (in addition to all other objects necessary for Spark and Hail).
- min_kinship (float) – Pairs of samples with kinship lower than
min_kinship
are excluded from the results. - statistics (str) – the set of statistics to compute, ‘phi’ will only compute the kinship statistic, ‘phik2’ will compute the kinship and identity-by-descent two statistics, ‘phik2k0’ will compute the kinship statistics and both identity-by-descent two and zero, ‘all’ computes the kinship statistic and all three identity-by-descent statistics.
Returns: A
KeyTable
mapping pairs of samples to estimates of their kinship and identity-by-descent zero, one, and two.Return type: KeyTable
-
pca
(scores, loadings=None, eigenvalues=None, k=10, as_array=False)[source]¶ Run Principal Component Analysis (PCA) on the matrix of genotypes.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Compute the top 5 principal component scores, stored as sample annotations
sa.scores.PC1
, …,sa.scores.PC5
of type Double:>>> vds_result = vds.pca('sa.scores', k=5)
Compute the top 5 principal component scores, loadings, and eigenvalues, stored as annotations
sa.scores
,va.loadings
, andglobal.evals
of type Array[Double]:>>> vds_result = vds.pca('sa.scores', 'va.loadings', 'global.evals', 5, as_array=True)
Notes
Hail supports principal component analysis (PCA) of genotype data, a now-standard procedure Patterson, Price and Reich, 2006. This method expects a variant dataset with biallelic autosomal variants. Scores are computed and stored as sample annotations of type Struct by default; variant loadings and eigenvalues can optionally be computed and stored in variant and global annotations, respectively.
PCA is based on the singular value decomposition (SVD) of a standardized genotype matrix \(M\), computed as follows. An \(n \times m\) matrix \(C\) records raw genotypes, with rows indexed by \(n\) samples and columns indexed by \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as
\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{2p_j(1-p_j)m}},\]with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes genotype variances to a common value \(1/m\) for variants in Hardy-Weinberg equilibrium and is further motivated in the paper cited above. (The resulting amplification of signal from the low end of the allele frequency spectrum will also introduce noise for rare variants; common practice is to filter out variants with minor allele frequency below some cutoff.) The factor \(1/m\) gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the sample correlation or genetic relationship matrix (GRM) as simply \(MM^T\).
PCA then computes the SVD
\[M = USV^T\]where columns of \(U\) are left singular vectors (orthonormal in \(\mathbb{R}^n\)), columns of \(V\) are right singular vectors (orthonormal in \(\mathbb{R}^m\)), and \(S=\mathrm{diag}(s_1, s_2, \ldots)\) with ordered singular values \(s_1 \ge s_2 \ge \cdots \ge 0\). Typically one computes only the first \(k\) singular vectors and values, yielding the best rank \(k\) approximation \(U_k S_k V_k^T\) of \(M\); the truncations \(U_k\), \(S_k\) and \(V_k\) are \(n \times k\), \(k \times k\) and \(m \times k\) respectively.
From the perspective of the samples or rows of \(M\) as data, \(V_k\) contains the variant loadings for the first \(k\) PCs while \(MV_k = U_k S_k\) contains the first \(k\) PC scores of each sample. The loadings represent a new basis of features while the scores represent the projected data on those features. The eigenvalues of the GRM \(MM^T\) are the squares of the singular values \(s_1^2, s_2^2, \ldots\), which represent the variances carried by the respective PCs. Hail computes the loadings only if the
loadings
parameter is specified.Note: In PLINK/GCTA the GRM is taken as the starting point and it is computed slightly differently with regard to missing data. Here the \(ij\) entry of \(MM^T\) is simply the dot product of rows \(i\) and \(j\) of \(M\); in terms of \(C\) it is
\[\frac{1}{m}\sum_{l\in\mathcal{C}_i\cap\mathcal{C}_j}\frac{(C_{il}-2p_l)(C_{jl} - 2p_l)}{2p_l(1-p_l)}\]where \(\mathcal{C}_i = \{l \mid C_{il} \text{ is non-missing}\}\). In PLINK/GCTA the denominator \(m\) is replaced with the number of terms in the sum \(\lvert\mathcal{C}_i\cap\mathcal{C}_j\rvert\), i.e. the number of variants where both samples have non-missing genotypes. While this is arguably a better estimator of the true GRM (trading shrinkage for noise), it has the drawback that one loses the clean interpretation of the loadings and scores as features and projections.
Separately, for the PCs PLINK/GCTA output the eigenvectors of the GRM; even ignoring the above discrepancy that means the left singular vectors \(U_k\) instead of the component scores \(U_k S_k\). While this is just a matter of the scale on each PC, the scores have the advantage of representing true projections of the data onto features with the variance of a score reflecting the variance explained by the corresponding feature. (In PC bi-plots this amounts to a change in aspect ratio; for use of PCs as covariates in regression it is immaterial.)
Annotations
Given root
scores='sa.scores'
andas_array=False
,pca()
adds a Struct to sample annotations:- sa.scores (Struct) – Struct of sample scores
With
k=3
, the Struct has three fields:- sa.scores.PC1 (Double) – Score from first PC
- sa.scores.PC2 (Double) – Score from second PC
- sa.scores.PC3 (Double) – Score from third PC
Analogous variant and global annotations of type Struct are added by specifying the
loadings
andeigenvalues
arguments, respectively.Given roots
scores='sa.scores'
,loadings='va.loadings'
, andeigenvalues='global.evals'
, andas_array=True
,pca()
adds the following annotations:- sa.scores (Array[Double]) – Array of sample scores from the top k PCs
- va.loadings (Array[Double]) – Array of variant loadings in the top k PCs
- global.evals (Array[Double]) – Array of the top k eigenvalues
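A common pattern (a sketch; sa.pheno is a hypothetical sample annotation with values 0 and 1) is to use the array-typed scores as covariates in an association test:
>>> vds_pca = vds.pca('sa.scores', k=5, as_array=True)
>>> vds_assoc = vds_pca.logreg('wald', 'sa.pheno',
...                            covariates=['sa.scores[0]', 'sa.scores[1]', 'sa.scores[2]'])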
Parameters: - scores (str) – Sample annotation path to store scores.
- loadings (str or None) – Variant annotation path to store site loadings.
- eigenvalues (str or None) – Global annotation path to store eigenvalues.
- k (int) – Number of principal components.
- as_array (bool) – Store annotations as type Array rather than Struct
Returns: Dataset with new PCA annotations.
Return type: VariantDataset
-
persist
(storage_level='MEMORY_AND_DISK')[source]¶ Persist this variant dataset to memory and/or disk.
Examples
Persist the variant dataset to both memory and disk:
>>> vds_result = vds.persist()
Notes
The
persist()
andcache()
methods allow you to store the current dataset on disk or in memory to avoid redundant computation and improve the performance of Hail pipelines.cache()
is an alias forpersist("MEMORY_ONLY")
. Most users will want “MEMORY_AND_DISK”. See the Spark documentation for a more in-depth discussion of persisting data.Warning
Persist, like all other
VariantDataset
functions, is functional. Its output must be captured. This is wrong:>>> vds = vds.linreg('sa.phenotype') >>> vds.persist()
The above code does NOT persist
vds
. Instead, it copiesvds
and persists that result. The proper usage is this:>>> vds = vds.pca().persist()
Parameters: storage_level – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP Return type: VariantDataset
-
query_genotypes
(exprs)[source]¶ Performs aggregation queries over genotypes, and returns Python object(s).
Examples
Compute global GQ histogram
>>> gq_hist = vds.query_genotypes('gs.map(g => g.gq).hist(0, 100, 100)')
Compute call rate
>>> call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled)')
Compute GQ and DP histograms
>>> [gq_hist, dp_hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)', ... 'gs.map(g => g.dp).hist(0, 60, 60)'])
Parameters: exprs (str or list of str) – query expressions Return type: annotation or list of annotation
-
query_genotypes_typed
(exprs)[source]¶ Performs aggregation queries over genotypes, and returns Python object(s) and type(s).
Examples
>>> gq_hist, t = vds.query_genotypes_typed('gs.map(g => g.gq).hist(0, 100, 100)')
>>> [gq_hist, dp_hist], [t1, t2] = vds.query_genotypes_typed(['gs.map(g => g.gq).hist(0, 100, 100)', ... 'gs.map(g => g.dp).hist(0, 60, 60)'])
See
query_genotypes()
for more information.This method evaluates Hail expressions over genotypes, along with all variant and sample metadata for that genotype. The
exprs
argument requires either a list of strings or a single string. The method returns a list of results and a list of types (which each contain one element if the input parameter was a single str).The namespace of the expressions includes:
global
: global annotationsgs
(Aggregable[Genotype]): aggregable of Genotype
Map and filter expressions on this aggregable have the following namespace:
global
: global annotationsg
: Genotypev
: Variantva
: variant annotationss
: samplesa
: sample annotations
Performance Note It is far faster to execute multiple queries in one method than to execute multiple query methods. This:
>>> result1 = vds.query_genotypes('gs.count()') >>> result2 = vds.query_genotypes('gs.filter(g => v.altAllele.isSNP() && g.isHet).count()')
will be nearly twice as slow as this:
>>> exprs = ['gs.count()', 'gs.filter(g => v.altAllele.isSNP() && g.isHet).count()'] >>> [geno_count, snp_hets] = vds.query_genotypes(exprs)
Parameters: exprs (str or list of str) – query expressions Return type: (annotation or list of annotation, Type
or list ofType
)
-
query_samples
(exprs)[source]¶ Performs aggregation queries over samples and sample annotations, and returns Python object(s).
Examples
>>> low_callrate_samples = vds.query_samples('samples.filter(s => sa.qc.callRate < 0.95).collect()')
Notes
This method evaluates Hail expressions over samples and sample annotations. The
exprs
argument requires either a single string or a list of strings. If a single string was passed, then a single result is returned. If a list is passed, a list is returned.The namespace of the expressions includes:
global
: global annotationssamples
(Aggregable[Sample]): aggregable of sample
Map and filter expressions on this aggregable have the additional namespace:
global
: global annotationss
: samplesa
: sample annotations
Parameters: exprs (str or list of str) – query expressions Return type: annotation or list of annotation
-
query_samples_typed
(exprs)[source]¶ Performs aggregation queries over samples and sample annotations, and returns Python object(s) and type(s).
Examples
>>> low_callrate_samples, t = vds.query_samples_typed( ... 'samples.filter(s => sa.qc.callRate < 0.95).collect()')
See
query_samples()
for more information.Parameters: exprs (str or list of str) – query expressions Return type: (annotation or list of annotation, Type
or list ofType
)
-
query_variants
(exprs)[source]¶ Performs aggregation queries over variants and variant annotations, and returns Python object(s).
Examples
>>> lof_variant_count = vds.query_variants('variants.filter(v => va.consequence == "LOF").count()')
>>> [lof_variant_count, missense_count] = vds.query_variants([ ... 'variants.filter(v => va.consequence == "LOF").count()', ... 'variants.filter(v => va.consequence == "Missense").count()'])
Notes
This method evaluates Hail expressions over variants and variant annotations. The
exprs
argument requires either a single string or a list of strings. If a single string was passed, then a single result is returned. If a list is passed, a list is returned.The namespace of the expressions includes:
global
: global annotationsvariants
(Aggregable[Variant]): aggregable of Variant
Map and filter expressions on this aggregable have the additional namespace:
global
: global annotationsv
: Variantva
: variant annotations
Performance Note It is far faster to execute multiple queries in one method than to execute multiple query methods. The combined query:
>>> exprs = ['variants.count()', 'variants.filter(v => v.altAllele.isSNP()).count()'] >>> [num_variants, num_snps] = vds.query_variants(exprs)
will be nearly twice as fast as the split query:
>>> result1 = vds.query_variants('variants.count()') >>> result2 = vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')
Parameters: exprs (str or list of str) – query expressions Return type: annotation or list of annotation
-
query_variants_typed
(exprs)[source]¶ Performs aggregation queries over variants and variant annotations, and returns Python object(s) and type(s).
Examples
>>> lof_variant_count, t = vds.query_variants_typed( ... 'variants.filter(v => va.consequence == "LOF").count()')
>>> [lof_variant_count, missense_count], [t1, t2] = vds.query_variants_typed([ ... 'variants.filter(v => va.consequence == "LOF").count()', ... 'variants.filter(v => va.consequence == "Missense").count()'])
See
query_variants()
for more information.Parameters: exprs (str or list of str) – query expressions Return type: (annotation or list of annotation, Type
or list ofType
)
-
rename_samples
(mapping)[source]¶ Rename samples.
Examples
>>> vds_result = vds.rename_samples({'ID1': 'id1', 'ID2': 'id2'})
Use a file with an “old_id” and “new_id” column to rename samples:
>>> mapping_table = hc.import_table('data/sample_mapping.txt') >>> mapping_dict = {row.old_id: row.new_id for row in mapping_table.collect()} >>> vds_result = vds.rename_samples(mapping_dict)
Parameters: mapping (dict) – Mapping from old to new sample IDs. Returns: Dataset with remapped sample IDs. Return type: VariantDataset
-
repartition
(num_partitions, shuffle=True)[source]¶ Increase or decrease the number of variant dataset partitions.
Examples
Repartition the variant dataset to have 500 partitions:
>>> vds_result = vds.repartition(500)
Notes
Check the current number of partitions with
num_partitions()
The data in a variant dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. When a variant dataset with \(M\) variants is first imported, each of the \(k\) partitions will contain about \(M/k\) of the variants. Since each partition has some computational overhead, decreasing the number of partitions can improve performance after significant filtering. Since it’s recommended to have at least 2 - 4 partitions per core, increasing the number of partitions can allow one to take advantage of more cores.
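For example, after heavy filtering one might shrink the number of partitions without a full shuffle (shuffle=False, described below, merges existing partitions); this is a sketch that assumes the filtered dataset currently has more than 200 partitions:
>>> vds_filtered = vds.filter_variants_expr('v.altAllele.isSNP()', keep=True)
>>> vds_smaller = vds_filtered.repartition(200, shuffle=False)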
Partitions are a core concept of distributed computation in Spark, see here for details. With
shuffle=True
, Hail does a full shuffle of the data and creates equal sized partitions. Withshuffle=False
, Hail combines existing partitions to avoid a full shuffle. These algorithms correspond to therepartition
andcoalesce
commands in Spark, respectively. In particular, whenshuffle=False
,num_partitions
cannot exceed current number of partitions.Parameters: - num_partitions (int) – Desired number of partitions, must be less than the current number if
shuffle=False
- shuffle (bool) – If true, use full shuffle to repartition.
Returns: Variant dataset with the number of partitions equal to at most
num_partitions
Return type: - num_partitions (int) – Desired number of partitions, must be less than the current number if
-
rowkey_schema
¶ Returns the signature of the row key (variant) contained in this VDS.
Examples
>>> print(vds.rowkey_schema)
The
pprint
module can be used to print the schema in a more human-readable format:>>> from pprint import pprint >>> pprint(vds.rowkey_schema)
Return type: Type
-
rrm
(force_block=False, force_gramian=False)[source]¶ Computes the Realized Relationship Matrix (RRM).
Examples
>>> kinship_matrix = vds.rrm()
Notes
The Realized Relationship Matrix is defined as follows. Consider the \(n \times m\) matrix \(C\) of raw genotypes, with rows indexed by \(n\) samples and columns indexed by the \(m\) biallelic autosomal variants; \(C_{ij}\) is the number of alternate alleles of variant \(j\) carried by sample \(i\), which can be 0, 1, 2, or missing. For each variant \(j\), the sample alternate allele frequency \(p_j\) is computed as half the mean of the non-missing entries of column \(j\). Entries of \(M\) are then mean-centered and variance-normalized as
\[M_{ij} = \frac{C_{ij}-2p_j}{\sqrt{\frac{m}{n} \sum_{k=1}^n (C_{kj}-2p_j)^2}},\]with \(M_{ij} = 0\) for \(C_{ij}\) missing (i.e. mean genotype imputation). This scaling normalizes each variant column to have empirical variance \(1/m\), which gives each sample row approximately unit total variance (assuming linkage equilibrium) and yields the \(n \times n\) sample correlation or realized relationship matrix (RRM) \(K\) as simply
\[K = MM^T\]Note that the only difference between the Realized Relationship Matrix and the Genetic Relationship Matrix (GRM) used in
grm()
is the variant (column) normalization: where RRM uses empirical variance, GRM uses expected variance under Hardy-Weinberg Equilibrium.Parameters: - force_block (bool) – Force using Spark’s BlockMatrix to compute kinship (advanced).
- force_gramian (bool) – Force using Spark’s RowMatrix.computeGramian to compute kinship (advanced).
Returns: Realized Relationship Matrix for all samples.
Return type: KinshipMatrix
-
same
(other, tolerance=1e-06)[source]¶ True if the two variant datasets have the same variants, samples, genotypes, and annotation schemata and values.
Examples
This will return True:
>>> vds.same(vds)
Notes
The
tolerance
parameter sets the tolerance for equality when comparing floating-point fields. More precisely, \(x\) and \(y\) are equal if\[|x - y| \leq \text{tolerance} \cdot \max(|x|, |y|)\]Parameters: - other (
VariantDataset
) – variant dataset to compare against - tolerance (float) – floating-point tolerance for equality
Return type: bool
- other (
-
sample_annotations
¶ Return a dict of sample annotations.
The keys of this dictionary are the sample IDs (strings). The values are sample annotations.
Returns: dict
-
sample_ids
¶ Return sampleIDs.
Returns: List of sample IDs. Return type: list of str
-
sample_qc
(root='sa.qc', keep_star=False)[source]¶ Compute per-sample QC metrics.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Annotations
sample_qc()
computes 20 sample statistics from the genotype data and stores the results as sample annotations that can be accessed withsa.qc.<identifier>
(or<root>.<identifier>
if a non-default root was passed):

Name                Type    Description
callRate            Double  Fraction of genotypes called
nHomRef             Int     Number of homozygous reference genotypes
nHet                Int     Number of heterozygous genotypes
nHomVar             Int     Number of homozygous alternate genotypes
nCalled             Int     Sum of nHomRef + nHet + nHomVar
nNotCalled          Int     Number of uncalled genotypes
nSNP                Int     Number of SNP alternate alleles
nInsertion          Int     Number of insertion alternate alleles
nDeletion           Int     Number of deletion alternate alleles
nSingleton          Int     Number of private alleles
nTransition         Int     Number of transition (A-G, C-T) alternate alleles
nTransversion       Int     Number of transversion alternate alleles
nNonRef             Int     Sum of nHet and nHomVar
rTiTv               Double  Transition/Transversion ratio
rHetHomVar          Double  Het/HomVar genotype ratio
rInsertionDeletion  Double  Insertion/Deletion ratio
dpMean              Double  Depth mean across all genotypes
dpStDev             Double  Depth standard deviation across all genotypes
gqMean              Double  The average genotype quality across all genotypes
gqStDev             Double  Genotype quality standard deviation across all genotypes

Missing values
NA
may result (for example, due to division by zero) and are handled properly in filtering and written as “NA” in export modules. The empirical standard deviation is computed with zero degrees of freedom.Parameters: - root (str) – Sample annotation root for the computed struct.
- keep_star (bool) – Count star alleles as non-reference alleles
Returns: Annotated variant dataset with new sample qc annotations.
Return type: VariantDataset
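A common follow-up (a sketch; the call-rate cutoff of 0.95 is an arbitrary choice) is to filter samples on the computed annotations:
>>> filtered_vds = (vds.sample_qc()
...                 .filter_samples_expr('sa.qc.callRate >= 0.95', keep=True))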
-
sample_schema
¶ Returns the signature of the sample annotations contained in this VDS.
Examples
>>> print(vds.sample_schema)
The
pprint
module can be used to print the schema in a more human-readable format:>>> from pprint import pprint >>> pprint(vds.sample_schema)
Return type: Type
-
sample_variants
(fraction, seed=1)[source]¶ Downsample variants to a given fraction of the dataset.
Examples
>>> small_vds = vds.sample_variants(0.01)
Notes
This method may not sample exactly
(fraction * n_variants)
variants from the dataset.Parameters: - fraction (float) – (Expected) fraction of variants to keep.
- seed (int) – Random seed.
Returns: Downsampled variant dataset.
Return type: VariantDataset
-
samples_table
()[source]¶ Convert samples and sample annotations to KeyTable.
The resulting KeyTable has schema:
Struct { s: Sample sa: sample annotations }
with a single key
s
.Returns: Key table with samples and sample annotations. Return type: KeyTable
-
set_va_attributes
(ann_path, attributes)[source]¶ Sets attributes for a variant annotation. Attributes are key/value pairs that can be attached to a variant annotation field.
The following attributes are read from the VCF header when importing a VCF and written to the VCF header when exporting a VCF:
- INFO fields attributes (attached to (va.info.*)):
- ‘Number’: The arity of the field. Can take values
- 0 (Boolean flag),
- 1 (single value),
- R (one value per allele, including the reference),
- A (one value per non-reference allele),
- G (one value per genotype), and
- . (any number of values)
- When importing: The value is read from the VCF INFO field definition
- When exporting: The default value is 0 for Boolean, . for Arrays and 1 for all other types
- ‘Description’ (default is ‘’)
- ‘Number’: The arity of the field. Can take values
- FILTER entries in the VCF header are generated based on the attributes of va.filters. Each key/value pair in the attributes will generate a FILTER entry in the VCF with ID = key and Description = value.
Examples
Consider the following command which adds a filter and an annotation to the VDS (we’re assuming a split VDS for simplicity):
- an INFO field AC_HC, which stores the allele count of high confidence genotypes (DP >= 10, GQ >= 20) for each non-reference allele,
- a filter HardFilter that filters all sites with the GATK suggested hard filters:
- For SNVs: QD < 2.0 || FS < 60 || MQ < 40 || MQRankSum < -12.5 || ReadPosRankSum < -8.0
- For Indels (and other complex): QD < 2.0 || FS < 200.0 || ReadPosRankSum < 20.0
>>> annotated_vds = vds.annotate_variants_expr([ ... 'va.info.AC_HC = gs.filter(g => g.dp >= 10 && g.gq >= 20).callStats(g => v).AC[1:]', ... 'va.filters = if((v.altAllele.isSNP && (va.info.QD < 2.0 || va.info.FS < 60 || va.info.MQ < 40 || ' + ... 'va.info.MQRankSum < -12.5 || va.info.ReadPosRankSum < -8.0)) || ' + ... '(va.info.QD < 2.0 || va.info.FS < 200.0 || va.info.ReadPosRankSum < 20.0)) va.filters.add("HardFilter") else va.filters'])
If we now export this VDS as VCF, it would produce the following header (for these new fields):
##INFO=<ID=AC_HC,Number=.,Type=String,Description="">
This header doesn’t contain all information that should be present in an optimal VCF header: 1) There is no FILTER entry for HardFilter 2) Since AC_HC has one entry per non-reference allele, its Number should be A 3) AC_HC should have a Description
We can fix this by setting the attributes of these fields:
>>> annotated_vds = (annotated_vds ... .set_va_attributes( ... 'va.info.AC_HC', ... {'Description': 'Allele count for high quality genotypes (DP >= 10, GQ >= 20)', ... 'Number': 'A'}) ... .set_va_attributes( ... 'va.filters', ... {'HardFilter': 'This site fails GATK suggested hard filters.'}))
Exporting the VDS with the attributes now prints the following header lines:
##INFO=<ID=AC_HC,Number=A,Type=String,Description="Allele count for high quality genotypes (DP >= 10, GQ >= 20)">
##FILTER=<ID=HardFilter,Description="This site fails GATK suggested hard filters.">
Parameters: - ann_path (str) – Path to variant annotation beginning with va.
- attributes (dict) – A str-str dict containing the attributes to set
Returns: Annotated dataset with the attribute added to the variant annotation.
Return type: VariantDataset
-
split_multi
(propagate_gq=False, keep_star_alleles=False, max_shift=100)[source]¶ Split multiallelic variants.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
>>> vds.split_multi().write('output/split.vds')
Notes
We will explain by example. Consider a hypothetical 3-allelic variant:
A C,T 0/2:7,2,6:15:45:99,50,99,0,45,99
split_multi will create two biallelic variants (one for each alternate allele) at the same position
A C 0/0:13,2:15:45:0,45,99 A T 0/1:9,6:15:50:50,0,99
Each multiallelic GT field is downcoded once for each alternate allele. A call for an alternate allele maps to 1 in the biallelic variant corresponding to itself and 0 otherwise. For example, in the example above, 0/2 maps to 0/0 and 0/1. The genotype 1/2 maps to 0/1 and 0/1.
The biallelic alt AD entry is just the multiallelic AD entry corresponding to the alternate allele. The ref AD entry is the sum of the other multiallelic entries.
The biallelic DP is the same as the multiallelic DP.
The biallelic PL entry for for a genotype g is the minimum over PL entries for multiallelic genotypes that downcode to g. For example, the PL for (A, T) at 0/1 is the minimum of the PLs for 0/1 (50) and 1/2 (45), and thus 45.
Fixing an alternate allele and biallelic variant, downcoding gives a map from multiallelic to biallelic alleles and genotypes. The biallelic AD entry for an allele is just the sum of the multiallelic AD entries for alleles that map to that allele. Similarly, the biallelic PL entry for a genotype is the minimum over multiallelic PL entries for genotypes that map to that genotype.
By default, GQ is recomputed from PL. If
propagate_gq=True
is passed, the biallelic GQ field is simply the multiallelic GQ field, that is, genotype qualities are unchanged.Here is a second example for a het non-ref
A C,T 1/2:2,8,6:16:45:99,50,99,45,0,99
splits as
A C 0/1:8,8:16:45:45,0,99 A T 0/1:10,6:16:50:50,0,99
VCF Info Fields
Hail does not split annotations in the info field. This means that if a multiallelic site with
info.AC
value[10, 2]
is split, each split site will contain the same array[10, 2]
. The provided allele index annotationva.aIndex
can be used to select the value corresponding to the split allele’s position:>>> vds_result = (vds.split_multi() ... .filter_variants_expr('va.info.AC[va.aIndex - 1] < 10', keep = False))
VCFs split by Hail and exported to new VCFs may be incompatible with other tools, if action is not taken first. Since the “Number” of the arrays in split multiallelic sites no longer matches the structure on import (“A” for 1 per allele, for example), Hail will export these fields with number “.”.
If the desired output is one value per site, then it is possible to use annotate_variants_expr to remap these values. Here is an example:
>>> (vds.split_multi() ... .annotate_variants_expr('va.info.AC = va.info.AC[va.aIndex - 1]') ... .export_vcf('output/export.vcf'))
The info field AC in output/export.vcf will have
Number=1
.Annotations
split_multi()
adds the following annotations:- va.wasSplit (Boolean) – true if this variant was originally multiallelic, otherwise false.
- va.aIndex (Int) – The original index of this
alternate allele in the multiallelic representation (NB: 1
is the first alternate allele or the only alternate allele
in a biallelic variant). For example, 1:100:A:T,C splits
into two variants: 1:100:A:T with
aIndex = 1
and 1:100:A:C withaIndex = 2
.
Parameters: - propagate_gq (bool) – Set the GQ of output (split)
genotypes to be the GQ of the input (multi-allelic) variants
instead of recomputing GQ as the difference between the two
smallest PL values. Intended to be used in conjunction with
import_vcf(store_gq=True)
. This option will be obviated in the future by generic genotype schemas. Experimental. - keep_star_alleles (bool) – Do not filter out * alleles.
- max_shift (int) – maximum number of base pairs by which a split variant can move. Affects memory usage, and will cause Hail to throw an error if a variant that moves further is encountered.
Returns: A biallelic variant dataset.
Return type: VariantDataset
-
storage_level
()[source]¶ Returns the storage (persistence) level of the variant dataset.
Notes
See the Spark documentation for details on persistence levels.
Return type: str
-
summarize
()[source]¶ Returns a summary of useful information about the dataset.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
>>> s = vds.summarize() >>> print(s.contigs) >>> print('call rate is %.2f' % s.call_rate) >>> s.report()
The following information is contained in the summary:
- samples (int) - Number of samples.
- variants (int) - Number of variants.
- call_rate (float) - Fraction of all genotypes called.
- contigs (list of str) - List of all unique contigs found in the dataset.
- multiallelics (int) - Number of multiallelic variants.
- snps (int) - Number of SNP alternate alleles.
- mnps (int) - Number of MNP alternate alleles.
- insertions (int) - Number of insertion alternate alleles.
- deletions (int) - Number of deletion alternate alleles.
- complex (int) - Number of complex alternate alleles.
- star (int) - Number of star (upstream deletion) alternate alleles.
- max_alleles (int) - The highest number of alleles at any variant.
Returns: Object containing summary information. Return type: Summary
-
tdt
(pedigree, root='va.tdt')[source]¶ Find transmitted and untransmitted variants; count per variant and nuclear family.
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
Compute TDT association results:
>>> pedigree = Pedigree.read('data/trios.fam') >>> (vds.tdt(pedigree) ... .export_variants("output/tdt_results.tsv", "Variant = v, va.tdt.*"))
Notes
The transmission disequilibrium test tracks the number of times the alternate allele is transmitted (t) or not transmitted (u) from a heterozygous parent to an affected child under the null that the rate of such transmissions is 0.5. For variants where transmission is guaranteed (i.e., the Y chromosome, mitochondria, and paternal chromosome X variants outside of the PAR), the test cannot be used.
The TDT statistic is given by
\[(t-u)^2 \over (t+u)\]and follows a 1 degree of freedom chi-squared distribution under the null hypothesis.
The number of transmissions and untransmissions for each possible set of genotypes is determined from the table below. The copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region.
- HemiX – in non-PAR of X and child is male
- Auto – otherwise (in autosome or PAR, or child is female)
Kid     Dad     Mom     Copy State  T  U
HomRef  Het     Het     Auto        0  2
HomRef  HomRef  Het     Auto        0  1
HomRef  Het     HomRef  Auto        0  1
Het     Het     Het     Auto        1  1
Het     HomRef  Het     Auto        1  0
Het     Het     HomRef  Auto        1  0
Het     HomVar  Het     Auto        0  1
Het     Het     HomVar  Auto        0  1
HomVar  Het     Het     Auto        2  0
HomVar  Het     HomVar  Auto        1  0
HomVar  HomVar  Het     Auto        1  0
HomRef  HomRef  Het     HemiX       0  1
HomRef  HomVar  Het     HemiX       0  1
HomVar  HomRef  Het     HemiX       1  0
HomVar  HomVar  Het     HemiX       1  0

tdt()
only considers complete trios (two parents and a proband) with defined sex.PAR is currently defined with respect to reference GRCh37:
- X: 60001-2699520
- X: 154931044-155260560
- Y: 10001-2649520
- Y: 59034050-59363566
tdt()
assumes all contigs apart from X and Y are fully autosomal; decoys, etc. are not given special treatment.Annotations
tdt()
adds the following annotations:- tdt.nTransmitted (Int) – Number of transmitted alternate alleles.
- va.tdt.nUntransmitted (Int) – Number of untransmitted alternate alleles.
- va.tdt.chi2 (Double) – TDT statistic.
- va.tdt.pval (Double) – p-value.
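For example, to keep only variants reaching an (arbitrary) significance cutoff:
>>> pedigree = Pedigree.read('data/trios.fam')
>>> tdt_hits = vds.tdt(pedigree).filter_variants_expr('va.tdt.pval < 1e-6', keep=True)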
Parameters: - pedigree (
Pedigree
) – Sample pedigree. - root – Variant annotation root to store TDT result.
Returns: Variant dataset with TDT association results added to variant annotations.
Return type: VariantDataset
-
union
()[source]¶ Take the union of datasets vertically (include all variants).
Examples
Union two datasets:
>>> vds_union = vds_autosomal.union(vds_chromX)
Given a list of datasets, union them all:
>>> all_vds = [vds_autosomal, vds_chromX, vds_chromY]
The following three syntaxes are equivalent:
>>> vds_union1 = vds_autosomal.union(vds_chromX, vds_chromY) >>> vds_union2 = all_vds[0].union(*all_vds[1:]) >>> vds_union3 = VariantDataset.union(*all_vds)
Notes
- In order to combine two datasets, these requirements must be met:
- the samples must match
- the variant annotation schemas must match (field order within structs matters).
- the cell (genotype) schemas must match (field order within structs matters).
The column annotations in the resulting dataset are simply the column annotations from the first dataset; the column annotation schemas do not need to match.
This method can trigger a shuffle, if partitions from two datasets overlap.
Parameters: vds_type (tuple of VariantDataset
) – Datasets to combine.Returns: Dataset with variants from all datasets. Return type: VariantDataset
-
unpersist
()[source]¶ Unpersists this VDS from memory/disk.
Notes This function will have no effect on a VDS that was not previously persisted.
There’s nothing stopping you from continuing to use a VDS that has been unpersisted, but doing so will result in all previous steps taken to compute the VDS being performed again since the VDS must be recomputed. Only unpersist a VDS when you are done with it.
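A minimal usage sketch:
>>> cached_vds = vds.sample_qc().persist()
>>> call_rates = cached_vds.query_samples('samples.map(s => sa.qc.callRate).collect()')
>>> cached_vds.unpersist()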
-
variant_qc
(root='va.qc')[source]¶ Compute common variant statistics (quality control metrics).
Important
The
genotype_schema()
must be of typeTGenotype
in order to use this method.Examples
>>> vds_result = vds.variant_qc()
Annotations
variant_qc()
computes 18 variant statistics from the genotype data and stores the results as variant annotations that can be accessed withva.qc.<identifier>
(or<root>.<identifier>
if a non-default root was passed):

Name                   Type    Description
callRate               Double  Fraction of samples with called genotypes
AF                     Double  Calculated alternate allele frequency (q)
AC                     Int     Count of alternate alleles
rHeterozygosity        Double  Proportion of heterozygotes
rHetHomVar             Double  Ratio of heterozygotes to homozygous alternates
rExpectedHetFrequency  Double  Expected rHeterozygosity based on HWE
pHWE                   Double  p-value from Hardy Weinberg Equilibrium null model
nHomRef                Int     Number of homozygous reference samples
nHet                   Int     Number of heterozygous samples
nHomVar                Int     Number of homozygous alternate samples
nCalled                Int     Sum of nHomRef, nHet, and nHomVar
nNotCalled             Int     Number of uncalled samples
nNonRef                Int     Sum of nHet and nHomVar
rHetHomVar             Double  Het/HomVar ratio across all samples
dpMean                 Double  Depth mean across all samples
dpStDev                Double  Depth standard deviation across all samples
gqMean                 Double  The average genotype quality across all samples
gqStDev                Double  Genotype quality standard deviation across all samples

Missing values
NA
may result (for example, due to division by zero) and are handled properly in filtering and written as “NA” in export modules. The empirical standard deviation is computed with zero degrees of freedom.Parameters: root (str) – Variant annotation root for computed struct. Returns: Annotated variant dataset with new variant QC annotations. Return type: VariantDataset
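As with sample_qc(), a common follow-up (a sketch; the thresholds are arbitrary) is to filter variants on the computed annotations:
>>> common_vds = (vds.variant_qc()
...               .filter_variants_expr('va.qc.AF >= 0.01 && va.qc.pHWE > 1e-6', keep=True))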
-
variant_schema
¶ Returns the signature of the variant annotations contained in this VDS.
Examples
>>> print(vds.variant_schema)
The
pprint
module can be used to print the schema in a more human-readable format:>>> from pprint import pprint >>> pprint(vds.variant_schema)
Return type: Type
-
variants_table
()[source]¶ Convert variants and variant annotations to a KeyTable.
The resulting KeyTable has schema:
Struct {
  v: Variant
  va: variant annotations
}

with a single key v.

Returns: Key table with variants and variant annotations.
Return type: KeyTable
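For instance (a minimal sketch; the output path is hypothetical), the resulting key table can be written to a TSV file with KeyTable.export():

>>> variants_kt = vds.variants_table()
>>> variants_kt.export('output/variants.tsv')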
-
vep
(config, block_size=1000, root='va.vep', csq=False)[source]¶ Annotate variants with VEP.
vep() runs Variant Effect Predictor with the LOFTEE plugin on the current variant dataset and adds the result as a variant annotation.

Examples
Add VEP annotations to the dataset:
>>> vds_result = vds.vep("data/vep.properties")
Configuration

vep() needs a configuration file to tell it how to run VEP. The format is a .properties file. Roughly, each line defines a property as a key-value pair of the form key = value. vep() supports the following properties:

- hail.vep.perl – Location of Perl. Optional, default: perl.
- hail.vep.perl5lib – Value for the PERL5LIB environment variable when invoking VEP. Optional, by default PERL5LIB is not set.
- hail.vep.path – Value of the PATH environment variable when invoking VEP. Optional, by default PATH is not set.
- hail.vep.location – Location of the VEP Perl script. Required.
- hail.vep.cache_dir – Location of the VEP cache dir, passed to VEP with the --dir option. Required.
- hail.vep.fasta – Location of the FASTA file to use to look up the reference sequence, passed to VEP with the --fasta option. Required.
- hail.vep.assembly – Genome assembly version to use. Optional, default: GRCh37.
- hail.vep.plugin – VEP plugin, passed to VEP with the --plugin option. Optional. Overrides hail.vep.lof.human_ancestor and hail.vep.lof.conservation_file.
- hail.vep.lof.human_ancestor – Location of the human ancestor file for the LOFTEE plugin. Ignored if hail.vep.plugin is set. Required otherwise.
- hail.vep.lof.conservation_file – Location of the conservation file for the LOFTEE plugin. Ignored if hail.vep.plugin is set. Required otherwise.
Here is an example vep.properties configuration file:

hail.vep.perl = /usr/bin/perl
hail.vep.path = /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
hail.vep.location = /path/to/vep/ensembl-tools-release-81/scripts/variant_effect_predictor/variant_effect_predictor.pl
hail.vep.cache_dir = /path/to/vep
hail.vep.lof.human_ancestor = /path/to/loftee_data/human_ancestor.fa.gz
hail.vep.lof.conservation_file = /path/to/loftee_data/phylocsf.sql
VEP Invocation
<hail.vep.perl> <hail.vep.location> --format vcf --json --everything --allele_number --no_stats --cache --offline --dir <hail.vep.cache_dir> --fasta <hail.vep.fasta> --minimal --assembly <hail.vep.assembly> --plugin LoF,human_ancestor_fa:$<hail.vep.lof.human_ancestor>,filter_position:0.05,min_intron_size:15,conservation_file:<hail.vep.lof.conservation_file> -o STDOUT
Annotations
Annotations with the following schema are placed in the location specified by root. The full resulting dataset schema can be queried with variant_schema.

Struct{
  assembly_name: String,
  allele_string: String,
  colocated_variants: Array[Struct{ aa_allele: String, aa_maf: Double, afr_allele: String, afr_maf: Double, allele_string: String, amr_allele: String, amr_maf: Double, clin_sig: Array[String], end: Int, eas_allele: String, eas_maf: Double, ea_allele: String, ea_maf: Double, eur_allele: String, eur_maf: Double, exac_adj_allele: String, exac_adj_maf: Double, exac_allele: String, exac_afr_allele: String, exac_afr_maf: Double, exac_amr_allele: String, exac_amr_maf: Double, exac_eas_allele: String, exac_eas_maf: Double, exac_fin_allele: String, exac_fin_maf: Double, exac_maf: Double, exac_nfe_allele: String, exac_nfe_maf: Double, exac_oth_allele: String, exac_oth_maf: Double, exac_sas_allele: String, exac_sas_maf: Double, id: String, minor_allele: String, minor_allele_freq: Double, phenotype_or_disease: Int, pubmed: Array[Int], sas_allele: String, sas_maf: Double, somatic: Int, start: Int, strand: Int }],
  end: Int,
  id: String,
  input: String,
  intergenic_consequences: Array[Struct{ allele_num: Int, consequence_terms: Array[String], impact: String, minimised: Int, variant_allele: String }],
  most_severe_consequence: String,
  motif_feature_consequences: Array[Struct{ allele_num: Int, consequence_terms: Array[String], high_inf_pos: String, impact: String, minimised: Int, motif_feature_id: String, motif_name: String, motif_pos: Int, motif_score_change: Double, strand: Int, variant_allele: String }],
  regulatory_feature_consequences: Array[Struct{ allele_num: Int, biotype: String, consequence_terms: Array[String], impact: String, minimised: Int, regulatory_feature_id: String, variant_allele: String }],
  seq_region_name: String,
  start: Int,
  strand: Int,
  transcript_consequences: Array[Struct{ allele_num: Int, amino_acids: String, biotype: String, canonical: Int, ccds: String, cdna_start: Int, cdna_end: Int, cds_end: Int, cds_start: Int, codons: String, consequence_terms: Array[String], distance: Int, domains: Array[Struct{ db: String, name: String }], exon: String, gene_id: String, gene_pheno: Int, gene_symbol: String, gene_symbol_source: String, hgnc_id: String, hgvsc: String, hgvsp: String, hgvs_offset: Int, impact: String, intron: String, lof: String, lof_flags: String, lof_filter: String, lof_info: String, minimised: Int, polyphen_prediction: String, polyphen_score: Double, protein_end: Int, protein_start: Int, protein_id: String, sift_prediction: String, sift_score: Double, strand: Int, swissprot: String, transcript_id: String, trembl: String, uniparc: String, variant_allele: String }],
  variant_class: String
}
Parameters: - config (str) – Path to VEP configuration file.
- block_size (int) – Number of variants to annotate per VEP invocation.
- root (str) – Variant annotation path to store VEP output.
- csq (bool) – If True, annotates the VCF CSQ field as a String. If False, annotates with the full nested struct schema.

Returns: An annotated variant dataset with variant annotations from VEP.
Return type: VariantDataset
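As a brief sketch of working with the output (the annotation name va.consequence is an arbitrary choice, not part of the API), the nested VEP struct can be queried with annotate_variants_expr() after annotation:

>>> vds_vep = vds.vep("data/vep.properties")
>>> vds_result = vds_vep.annotate_variants_expr(
...     'va.consequence = va.vep.most_severe_consequence')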
-
was_split
()[source]¶ True if multiallelic variants have been split into multiple biallelic variants.
Result is True if split_multi() or filter_multi() has been called on this variant dataset, or if the variant dataset was imported with import_plink(), import_gen(), or import_bgen(), or if the variant dataset was simulated with balding_nichols_model().

Return type: bool
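A typical guard, shown here as a short sketch, is to split multiallelic variants only when the dataset has not already been split:

>>> if not vds.was_split():
...     vds = vds.split_multi()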
-
write
(output, overwrite=False, parquet_genotypes=False)[source]¶ Write variant dataset as VDS file.
Examples
Import data from a VCF file and then write the data to a VDS file:
>>> vds.write("output/sample.vds")
Parameters: - output (str) – Path of VDS file to write.
- overwrite (bool) – If true, overwrite any existing VDS file. Cannot be used to read from and write to the same path.
- parquet_genotypes (bool) – If true, store genotypes as Parquet rather than Hail’s serialization. The resulting VDS will be larger and slower in Hail but the genotypes will be accessible from other tools that support Parquet.
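For example (a brief sketch; overwrite=True is only needed when the path already exists), a written VDS can be loaded back with read():

>>> vds.write("output/sample.vds", overwrite=True)
>>> vds_reloaded = hc.read("output/sample.vds")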
-