Genetics
This page tailored how-to guides for small but commonly-used patterns appearing in genetics pipelines. For documentation on the suite of genetics functions built into Hail, see the genetics methods page.
Formatting
Convert variants in string format to separate locus and allele fields
- code:
>>> ht = ht.key_by(**hl.parse_variant(ht.variant))
- dependencies:
- understanding:
If your variants are strings of the format ‘chr:pos:ref:alt’, you may want to convert them to separate locus and allele fields. This is useful if you have imported a table with variants in string format and you would like to join this table with other Hail tables that are keyed by locus and alleles.
hl.parse_variant(ht.variant)
constructs aStructExpression
containing two nested fields for the locus and alleles. The ** syntax unpacks this struct so that the resulting table has two new fields,locus
andalleles
.
Liftover variants from one coordinate system to another
- tags:
liftover
- description:
Liftover a Table or MatrixTable from one reference genome to another.
- code:
First, we need to set up the two reference genomes (source and destination):
>>> rg37 = hl.get_reference('GRCh37') >>> rg38 = hl.get_reference('GRCh38') >>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38)
Then we can liftover the locus coordinates in a Table or MatrixTable (here, ht) from reference genome
'GRCh37'
to'GRCh38'
:>>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38')) >>> ht = ht.filter(hl.is_defined(ht.new_locus)) >>> ht = ht.key_by(locus=ht.new_locus)
Note that this approach does not retain the old locus, nor does it verify that the allele has not changed strand. We can keep the old one for reference and filter out any liftover that changed strands using:
>>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38', include_strand=True), ... old_locus=ht.locus) >>> ht = ht.filter(hl.is_defined(ht.new_locus) & ~ht.new_locus.is_negative_strand) >>> ht = ht.key_by(locus=ht.new_locus.result)
- dependencies:
Filtering and Pruning
Filter loci by a list of locus intervals
From a table of intervals
- tags:
genomic region, genomic range
- description:
Import a text file of locus intervals as a table, then use this table to filter the loci in a matrix table.
- code:
>>> interval_table = hl.import_locus_intervals('data/gene.interval_list', reference_genome='GRCh37') >>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))
- dependencies:
- understanding:
We have a matrix table
mt
containing the loci we would like to filter, and a list of locus intervals stored in a file. We can import the intervals into a table withimport_locus_intervals()
.Hail supports implicit joins between locus intervals and loci, so we can filter our dataset to the rows defined in the join between the interval table and our matrix table.
interval_table[mt.locus]
joins the matrix table with the table of intervals based on locus and interval<locus> matches. This is a StructExpression, which will be defined if the locus was found in any interval, or missing if the locus is outside all intervals.To do our filtering, we can filter to the rows of our matrix table where the struct expression
interval_table[mt.locus]
is defined.This method will also work to filter a table of loci, as well as a matrix table.
From a UCSC BED file
- description:
Import a UCSC BED file as a table of intervals, then use this table to filter the loci in a matrix table.
- code:
>>> interval_table = hl.import_bed('data/file1.bed', reference_genome='GRCh37') >>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))
- dependencies:
Using hl.filter_intervals
- description:
Filter using an interval table, suitable for a small list of intervals.
- code:
>>> filtered_mt = hl.filter_intervals(mt, interval_table['interval'].collect())
- dependencies:
Declaring intervals with hl.parse_locus_interval
- description:
Filter to declared intervals.
- code:
>>> intervals = ['1:100M-200M', '16:29.1M-30.2M', 'X'] >>> filtered_mt = hl.filter_intervals( ... mt, ... [hl.parse_locus_interval(x, reference_genome='GRCh37') for x in intervals])
- dependencies:
Pruning Variants in Linkage Disequilibrium
- tags:
LD Prune
- description:
Remove correlated variants from a matrix table.
- code:
>>> biallelic_mt = mt.filter_rows(hl.len(mt.alleles) == 2) >>> pruned_variant_table = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000) >>> filtered_mt = mt.filter_rows( ... hl.is_defined(pruned_variant_table[mt.row_key]))
- dependencies:
- understanding:
Hail’s
ld_prune()
method takes a matrix table and returns a table with a subset of variants which are uncorrelated with each other. The method requires a biallelic dataset, so we first filter our dataset to biallelic variants. Next, we get a table of independent variants usingld_prune()
, which we can use to filter the rows of our original dataset.Note that it is more efficient to do the final filtering step on the original dataset, rather than on the biallelic dataset, so that the biallelic dataset does not need to be recomputed.
Analysis
Linear Regression
Single Phenotype
- tags:
Linear Regression
- description:
Compute linear regression statistics for a single phenotype.
- code:
Approach #1: Use the
linear_regression_rows()
method>>> ht = hl.linear_regression_rows(y=mt.pheno.height, ... x=mt.GT.n_alt_alleles(), ... covariates=[1])
Approach #2: Use the
aggregators.linreg()
aggregator>>> mt_linreg = mt.annotate_rows(linreg=hl.agg.linreg(y=mt.pheno.height, ... x=[1, mt.GT.n_alt_alleles()]))
- dependencies:
- understanding:
The
linear_regression_rows()
method is more efficient than using theaggregators.linreg()
aggregator. However, theaggregators.linreg()
aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics.
Multiple Phenotypes
- tags:
Linear Regression
- description:
Compute linear regression statistics for multiple phenotypes.
- code:
Approach #1: Use the
linear_regression_rows()
method for all phenotypes simultaneously>>> ht_result = hl.linear_regression_rows(y=[mt.pheno.height, mt.pheno.blood_pressure], ... x=mt.GT.n_alt_alleles(), ... covariates=[1])
Approach #2: Use the
linear_regression_rows()
method for each phenotype sequentially>>> ht1 = hl.linear_regression_rows(y=mt.pheno.height, ... x=mt.GT.n_alt_alleles(), ... covariates=[1])
>>> ht2 = hl.linear_regression_rows(y=mt.pheno.blood_pressure, ... x=mt.GT.n_alt_alleles(), ... covariates=[1])
Approach #3: Use the
aggregators.linreg()
aggregator>>> mt_linreg = mt.annotate_rows( ... linreg_height=hl.agg.linreg(y=mt.pheno.height, ... x=[1, mt.GT.n_alt_alleles()]), ... linreg_bp=hl.agg.linreg(y=mt.pheno.blood_pressure, ... x=[1, mt.GT.n_alt_alleles()]))
- dependencies:
- understanding:
The
linear_regression_rows()
method is more efficient than using theaggregators.linreg()
aggregator, especially when analyzing many phenotypes. However, theaggregators.linreg()
aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics. Thelinear_regression_rows()
method drops samples that have a missing value for any of the phenotypes. Therefore, Approach #1 may not be suitable for phenotypes with differential patterns of missingness. Approach #2 will do two passes over the data while Approaches #1 and #3 will do one pass over the data and compute the regression statistics for each phenotype simultaneously.
Using Variants (SNPs) as Covariates
- tags:
sample genotypes covariate
- description:
Use sample genotype dosage at specific variant(s) as covariates in regression routines.
- code:
Create a sample annotation from the genotype dosage for each variant of interest by combining the filter and collect aggregators:
>>> mt_annot = mt.annotate_cols( ... snp1 = hl.agg.filter(hl.parse_variant('20:13714384:A:C') == mt.row_key, ... hl.agg.collect(mt.GT.n_alt_alleles()))[0], ... snp2 = hl.agg.filter(hl.parse_variant('20:17479730:T:C') == mt.row_key, ... hl.agg.collect(mt.GT.n_alt_alleles()))[0])
Run the GWAS with
linear_regression_rows()
using variant dosages as covariates:>>> gwas = hl.linear_regression_rows( ... x=mt_annot.GT.n_alt_alleles(), ... y=mt_annot.pheno.blood_pressure, ... covariates=[1, mt_annot.pheno.age, mt_annot.snp1, mt_annot.snp2])
- dependencies:
linear_regression_rows()
,aggregators.collect()
,parse_variant()
,variant_str()
Stratified by Group
- tags:
Linear Regression
- description:
Compute linear regression statistics for a single phenotype stratified by group.
- code:
Approach #1: Use the
linear_regression_rows()
method for each group>>> female_pheno = (hl.case() ... .when(mt.pheno.is_female, mt.pheno.height) ... .or_missing())
>>> linreg_female = hl.linear_regression_rows(y=female_pheno, ... x=mt.GT.n_alt_alleles(), ... covariates=[1])
>>> male_pheno = (hl.case() ... .when(~mt.pheno.is_female, mt.pheno.height) ... .or_missing())
>>> linreg_male = hl.linear_regression_rows(y=male_pheno, ... x=mt.GT.n_alt_alleles(), ... covariates=[1])
Approach #2: Use the
aggregators.group_by()
andaggregators.linreg()
aggregators>>> mt_linreg = mt.annotate_rows( ... linreg=hl.agg.group_by(mt.pheno.is_female, ... hl.agg.linreg(y=mt.pheno.height, ... x=[1, mt.GT.n_alt_alleles()])))
- dependencies:
linear_regression_rows()
,aggregators.group_by()
,aggregators.linreg()
- understanding:
We have presented two ways to compute linear regression statistics for each value of a grouping variable. The first approach utilizes the
linear_regression_rows()
method and must be called separately for each group even though it can compute statistics for multiple phenotypes simultaneously. This is because thelinear_regression_rows()
method drops samples that have a missing value for any of the phenotypes. When the groups are mutually exclusive, such as ‘Male’ and ‘Female’, no samples remain! Note that we cannot define male_pheno = ~female_pheno because we subsequently need male_pheno to be an expression on the mt_linreg matrix table rather than mt. Lastly, the argument to root must be specified for both cases – otherwise the ‘Male’ output will overwrite the ‘Female’ output.The second approach uses the
aggregators.group_by()
andaggregators.linreg()
aggregators. The aggregation expression generates a dictionary where a key is a group (value of the grouping variable) and the corresponding value is the linear regression statistics for those samples in the group. The result of the aggregation expression is then used to annotate the matrix table.The
linear_regression_rows()
method is more efficient than theaggregators.linreg()
aggregator and can be extended to multiple phenotypes, but theaggregators.linreg()
aggregator is more flexible (multiple covariates can be vary by entry) and returns a richer set of statistics.
PLINK Conversions
Polygenic Score Calculation
- plink:
>>> plink --bfile data --score scores.txt sum
- tags:
PRS
- description:
This command is analogous to plink’s –score command with the sum option. Biallelic variants are required.
- code:
>>> mt = hl.import_plink( ... bed="data/ldsc.bed", bim="data/ldsc.bim", fam="data/ldsc.fam", ... quant_pheno=True, missing='-9') >>> mt = hl.variant_qc(mt) >>> scores = hl.import_table('data/scores.txt', delimiter=' ', key='rsid', ... types={'score': hl.tfloat32}) >>> mt = mt.annotate_rows(**scores[mt.rsid]) >>> flip = hl.case().when(mt.allele == mt.alleles[0], True).when( ... mt.allele == mt.alleles[1], False).or_missing() >>> mt = mt.annotate_rows(flip=flip) >>> mt = mt.annotate_rows( ... prior=2 * hl.if_else(mt.flip, mt.variant_qc.AF[0], mt.variant_qc.AF[1])) >>> mt = mt.annotate_cols( ... prs=hl.agg.sum( ... mt.score * hl.coalesce( ... hl.if_else(mt.flip, 2 - mt.GT.n_alt_alleles(), ... mt.GT.n_alt_alleles()), mt.prior)))
- dependencies:
import_plink()
,variant_qc()
,import_table()
,coalesce()
,case()
,cond()
,Call.n_alt_alleles()