# Genetics¶

## Formatting¶

### Convert variants in string format to separate locus and allele fields¶

code
>>> ht = ht.key_by(**hl.parse_variant(ht.variant))

dependencies
understanding

If your variants are strings of the format ‘chr:pos:ref:alt’, you may want to convert them to separate locus and allele fields. This is useful if you have imported a table with variants in string format and you would like to join this table with other Hail tables that are keyed by locus and alleles.

hl.parse_variant(ht.variant) constructs a StructExpression containing two nested fields for the locus and alleles. The ** syntax unpacks this struct so that the resulting table has two new fields, locus and alleles.

### Liftover variants from one coordinate system to another¶

tags

liftover

description

Liftover a Table or MatrixTable from one reference genome to another.

code

First, we need to set up the two reference genomes (source and destination):

>>> rg37 = hl.get_reference('GRCh37')  # doctest: +SKIP
>>> rg38 = hl.get_reference('GRCh38')  # doctest: +SKIP
>>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38)  # doctest: +SKIP


Then we can liftover the locus coordinates in a Table or MatrixTable (here, ht) from reference genome 'GRCh37' to 'GRCh38':

>>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38'))  # doctest: +SKIP
>>> ht = ht.filter(hl.is_defined(ht.new_locus))  # doctest: +SKIP
>>> ht = ht.key_by(locus=ht.new_locus)  # doctest: +SKIP


Note that this approach does not retain the old locus, nor does it verify that the allele has not changed strand. We can keep the old one for reference and filter out any liftover that changed strands using:

>>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38', include_strand=True),
...                  old_locus=ht.locus)  # doctest: +SKIP
>>> ht = ht.filter(hl.is_defined(ht.new_locus) & ~ht.new_locus.is_negative_strand)  # doctest: +SKIP
>>> ht = ht.key_by(locus=ht.new_locus.result)  # doctest: +SKIP

dependencies

## Filtering and Pruning¶

### Filter loci by a list of locus intervals¶

#### From a table of intervals¶

tags

genomic region, genomic range

description

Import a text file of locus intervals as a table, then use this table to filter the loci in a matrix table.

code
>>> interval_table = hl.import_locus_intervals('data/gene.interval_list', reference_genome='GRCh37')
>>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))

dependencies
understanding

We have a matrix table mt containing the loci we would like to filter, and a list of locus intervals stored in a file. We can import the intervals into a table with import_locus_intervals().

Hail supports implicit joins between locus intervals and loci, so we can filter our dataset to the rows defined in the join between the interval table and our matrix table.

interval_table[mt.locus] joins the matrix table with the table of intervals based on locus and interval<locus> matches. This is a StructExpression, which will be defined if the locus was found in any interval, or missing if the locus is outside all intervals.

To do our filtering, we can filter to the rows of our matrix table where the struct expression interval_table[mt.locus] is defined.

This method will also work to filter a table of loci, as well as a matrix table.

#### From a UCSC BED file¶

description

Import a UCSC BED file as a table of intervals, then use this table to filter the loci in a matrix table.

code
>>> interval_table = hl.import_bed('data/file1.bed', reference_genome='GRCh37')
>>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))

dependencies

#### Using hl.filter_intervals¶

description

Filter using an interval table, suitable for a small list of intervals.

code
>>> filtered_mt = hl.filter_intervals(mt, interval_table['interval'].collect())

dependencies

filter_intervals()

#### Declaring intervals with hl.parse_locus_interval¶

description

Filter to declared intervals.

code
>>> intervals = ['1:100M-200M', '16:29.1M-30.2M', 'X']
>>> filtered_mt = hl.filter_intervals(
...     mt,
...     [hl.parse_locus_interval(x, reference_genome='GRCh37') for x in intervals])

dependencies

### Pruning Variants in Linkage Disequilibrium¶

tags

LD Prune

description

Remove correlated variants from a matrix table.

code
>>> biallelic_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> pruned_variant_table = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000)
>>> filtered_mt = mt.filter_rows(
...     hl.is_defined(pruned_variant_table[mt.row_key]))

dependencies

ld_prune()

understanding

Hail’s ld_prune() method takes a matrix table and returns a table with a subset of variants which are uncorrelated with each other. The method requires a biallelic dataset, so we first filter our dataset to biallelic variants. Next, we get a table of independent variants using ld_prune(), which we can use to filter the rows of our original dataset.

Note that it is more efficient to do the final filtering step on the original dataset, rather than on the biallelic dataset, so that the biallelic dataset does not need to be recomputed.

## Analysis¶

### Linear Regression¶

#### Single Phenotype¶

tags

Linear Regression

description

Compute linear regression statistics for a single phenotype.

code

Approach #1: Use the linear_regression_rows() method

>>> ht = hl.linear_regression_rows(y=mt.pheno.height,
...                                x=mt.GT.n_alt_alleles(),
...                                covariates=[1])


Approach #2: Use the aggregators.linreg() aggregator

>>> mt_linreg = mt.annotate_rows(linreg=hl.agg.linreg(y=mt.pheno.height,
...                                                   x=[1, mt.GT.n_alt_alleles()]))

dependencies
understanding

The linear_regression_rows() method is more efficient than using the aggregators.linreg() aggregator. However, the aggregators.linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics.

#### Multiple Phenotypes¶

tags

Linear Regression

description

Compute linear regression statistics for multiple phenotypes.

code

Approach #1: Use the linear_regression_rows() method for all phenotypes simultaneously

>>> ht_result = hl.linear_regression_rows(y=[mt.pheno.height, mt.pheno.blood_pressure],
...                                       x=mt.GT.n_alt_alleles(),
...                                       covariates=[1])


Approach #2: Use the linear_regression_rows() method for each phenotype sequentially

>>> ht1 = hl.linear_regression_rows(y=mt.pheno.height,
...                                 x=mt.GT.n_alt_alleles(),
...                                 covariates=[1])

>>> ht2 = hl.linear_regression_rows(y=mt.pheno.blood_pressure,
...                                 x=mt.GT.n_alt_alleles(),
...                                 covariates=[1])


Approach #3: Use the aggregators.linreg() aggregator

>>> mt_linreg = mt.annotate_rows(
...     linreg_height=hl.agg.linreg(y=mt.pheno.height,
...                                 x=[1, mt.GT.n_alt_alleles()]),
...     linreg_bp=hl.agg.linreg(y=mt.pheno.blood_pressure,
...                             x=[1, mt.GT.n_alt_alleles()]))

dependencies
understanding

The linear_regression_rows() method is more efficient than using the aggregators.linreg() aggregator, especially when analyzing many phenotypes. However, the aggregators.linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics. The linear_regression_rows() method drops samples that have a missing value for any of the phenotypes. Therefore, Approach #1 may not be suitable for phenotypes with differential patterns of missingness. Approach #2 will do two passes over the data while Approaches #1 and #3 will do one pass over the data and compute the regression statistics for each phenotype simultaneously.

#### Using Variants (SNPs) as Covariates¶

tags

sample genotypes covariate

description

Use sample genotype dosage at specific variant(s) as covariates in regression routines.

code

List the variants of interest:

>>> my_snps = ['20:13714384:A:C', '20:17479730:T:C']


Annotate the variants as a global field:

>>> mt_filt = mt.annotate_globals(
...     snps = hl.set([hl.parse_variant(x) for x in my_snps]))


Filter rows to these variants:

>>> mt_filt = mt_filt.filter_rows(mt_filt.snps.contains(mt_filt.row_key))


Aggregate to collect a dictionary per sample of SNP to allele dosage:

>>> sample_genos = mt_filt.annotate_cols(
...     genotypes = hl.dict(hl.agg.collect( (hl.variant_str(mt_filt.row_key), mt_filt.GT.n_alt_alleles()) )))
>>> mt_annot = mt.annotate_cols(snp_covs = sample_genos.cols()[mt.s].genotypes)


Run the GWAS with linear_regression_rows() using variant dosages as covariates:

>>> gwas = hl.linear_regression_rows(
...     x=mt_annot.GT.n_alt_alleles(),
...     y=mt_annot.pheno.blood_pressure,
...     covariates=[1, mt_annot.pheno.age, *(mt_annot.snp_covs.get(x) for x in my_snps)])

dependencies

#### Stratified by Group¶

tags

Linear Regression

description

Compute linear regression statistics for a single phenotype stratified by group.

code

Approach #1: Use the linear_regression_rows() method for each group

>>> female_pheno = (hl.case()
...                   .when(mt.pheno.is_female, mt.pheno.height)
...                   .or_missing())

>>> linreg_female = hl.linear_regression_rows(y=female_pheno,
...                                           x=mt.GT.n_alt_alleles(),
...                                           covariates=[1])

>>> male_pheno = (hl.case()
...                 .when(~mt.pheno.is_female, mt.pheno.height)
...                 .or_missing())

>>> linreg_male = hl.linear_regression_rows(y=male_pheno,
...                                         x=mt.GT.n_alt_alleles(),
...                                         covariates=[1])


Approach #2: Use the aggregators.group_by() and aggregators.linreg() aggregators

>>> mt_linreg = mt.annotate_rows(
...     linreg=hl.agg.group_by(mt.pheno.is_female,
...                            hl.agg.linreg(y=mt.pheno.height,
...                                          x=[1, mt.GT.n_alt_alleles()])))

dependencies
understanding

We have presented two ways to compute linear regression statistics for each value of a grouping variable. The first approach utilizes the linear_regression_rows() method and must be called separately for each group even though it can compute statistics for multiple phenotypes simultaneously. This is because the linear_regression_rows() method drops samples that have a missing value for any of the phenotypes. When the groups are mutually exclusive, such as ‘Male’ and ‘Female’, no samples remain! Note that we cannot define male_pheno = ~female_pheno because we subsequently need male_pheno to be an expression on the mt_linreg matrix table rather than mt. Lastly, the argument to root must be specified for both cases – otherwise the ‘Male’ output will overwrite the ‘Female’ output.

The second approach uses the aggregators.group_by() and aggregators.linreg() aggregators. The aggregation expression generates a dictionary where a key is a group (value of the grouping variable) and the corresponding value is the linear regression statistics for those samples in the group. The result of the aggregation expression is then used to annotate the matrix table.

The linear_regression_rows() method is more efficient than the aggregators.linreg() aggregator and can be extended to multiple phenotypes, but the aggregators.linreg() aggregator is more flexible (multiple covariates can be vary by entry) and returns a richer set of statistics.