Input Unification
Import standard formats such as VCF, BGEN, and PLINK (bed/bim/fam), and manipulate them all through a common dataframe-like interface.
import hail as hl
mt1 = hl.import_vcf('/path/to/my.vcf.bgz')
mt2 = hl.import_bgen('/path/to/my.bgen', entry_fields=['GT'])
mt3 = hl.import_plink(bed='/path/to/my.bed',
                      bim='/path/to/my.bim',
                      fam='/path/to/my.fam')
Genomic Dataframes
For large and dense structured matrices, like sequencing data, coordinate representations are both hard to work with and computationally inefficient. A core piece of Hail functionality is the MatrixTable, a 2-dimensional generalization of Table. The MatrixTable makes it possible to filter, annotate, and aggregate symmetrically over rows and columns.
mt = hl.import_vcf('gs://bucket/path/myVCF.vcf.bgz')
mt.write('gs://bucket/path/dataset.mt', overwrite=True)
# read the matrix table back into the environment
mt = hl.read_matrix_table('gs://bucket/path/dataset.mt')
# What is a MatrixTable?
mt.describe(widget=True)
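The row/column symmetry means the same verbs work along either axis. A minimal sketch (the is_female column field here is a hypothetical annotation, not something import_vcf provides):
# filter samples (columns) and variants (rows) with the same API
mt_f = mt.filter_cols(mt.is_female)  # hypothetical column annotation
# aggregate entries per row while annotating
mt = mt.annotate_rows(n_het = hl.agg.count_where(mt.GT.is_het()))
# project either axis out as an ordinary Table
rows_tbl = mt.rows()
cols_tbl = mt.cols()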
Simplified Analysis
Hail makes it easy to analyze your data. Let's start by filtering a dataset by variant and sample quality metrics, like call rate and allele frequency.
# filter to rare, loss-of-function variants
mt = mt.filter_rows(mt.variant_qc.AF[1] < 0.005)
mt = mt.filter_rows(mt.csq == 'LOF')
# run sample QC and save into matrix table
mt = hl.sample_qc(mt)
# filter for samples with call rate >= 95%
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)
# run variant QC and save into matrix table
mt = hl.variant_qc(mt)
# filter for variants with >95% call rate and >1% frequency
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
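As a quick sanity check on these thresholds, here is a minimal sketch using Hail's aggregators (it assumes sample_qc has been annotated as above):
# distribution of per-sample call rate
print(mt.aggregate_cols(hl.agg.stats(mt.sample_qc.call_rate)))
# (variants, samples) remaining after filtering
print(mt.count())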
Quality Control Procedures
Quality control procedures, like sex check, are made easy using Hail's declarative syntax.
imputed_sex = hl.impute_sex(mt.GT)
mt = mt.annotate_cols(
    sex_check = imputed_sex[mt.s].is_female == mt.reported_female
)
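From there the flag can be tabulated or acted on directly; a minimal sketch (dropping mismatched samples is one common choice, not the only one):
# count concordant vs. discordant sex calls
print(mt.aggregate_cols(hl.agg.counter(mt.sex_check)))
# keep only samples whose imputed and reported sex agree
mt = mt.filter_cols(mt.sex_check)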
Variant Effect Predictor
Annotating variants with the Variant Effect Predictor (VEP) has never been easier.
# annotate with VEP
# (requires a cluster configured with VEP, e.g. Google Cloud Dataproc)
mt = hl.vep(mt)
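The resulting vep row field follows VEP's JSON output schema, so downstream filtering is an ordinary row filter. A sketch assuming the standard schema, which includes most_severe_consequence:
# keep variants whose most severe consequence is missense
mt = mt.filter_rows(mt.vep.most_severe_consequence == 'missense_variant')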
Rare-Variant Association Testing
Perform gene burden tests on sequencing data with just a few lines of Python.
gene_intervals = hl.read_table('gs://my_bucket/gene_intervals.t')
mt = mt.annotate_rows(
    gene = gene_intervals.index(mt.locus, all_matches=True).gene_name
)
mt = mt.explode_rows(mt.gene)
mt = (mt.group_rows_by(mt.gene)
        .aggregate(burden = hl.agg.count_where(mt.GT.is_non_ref())))
result = hl.linear_regression_rows(
    y=mt.phenotype, x=mt.burden, covariates=[1.0])
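linear_regression_rows returns a Table, so the strongest associations can be pulled out directly; a minimal sketch (p_value is among its standard output fields):
# show the most significant genes first
result.order_by(result.p_value).show(10)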
Principal Component Analysis (PCA)
Adjusting GWAS models with principal components as covariates has never been easier.
# compute PC scores
eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(mt.GT, k=4)
# run linear regression adjusted for the first 4 PCs
mt = mt.annotate_cols(scores = pca_scores[mt.s].scores)
results = hl.linear_regression_rows(
    y=mt.phenotype,
    x=mt.GT.n_alt_alleles(),
    covariates=[
        1.0, mt.scores[0], mt.scores[1], mt.scores[2], mt.scores[3]]
)
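To eyeball the output, Hail ships bokeh-backed plotting helpers; a minimal sketch assuming a Jupyter-style notebook session:
from hail.plot import show
hl.plot.output_notebook()
# Manhattan plot of the GWAS p-values
show(hl.plot.manhattan(results.p_value))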