Import, prototype, scale

Perform analyses with distributed dataframe-like collections
import hail as hl
mt = hl.import_vcf('gs://bucket/path/myVCF.vcf.bgz')
mt.write('gs://bucket/path/', overwrite=True)
# read matrix into env
mt = hl.read_matrix_table('gs://bucket/path/')
mt1 = hl.import_vcf('/path/to/my.vcf.bgz')
mt2 = hl.import_bgen('/path/to/my.bgen')
mt3 = hl.import_plink(bed='/path/to/my.bed',

Input Unification

Import formats such as bed, bgen, plink, or vcf, and manipulate them using a common dataframe-like interface.

Genomic Dataframes

For large and dense structured matrices, like sequencing data, coordinate representations are both hard to work with and computationally inefficient. A core piece of Hail functionality is the MatrixTable, a 2-dimensional generalization of Table. The MatrixTable makes it possible to filter, annotate, and aggregate symmetrically over rows and columns.

# What is a MatrixTable?

# filter to rare, loss-of-function variants
mt = mt.filter_rows(mt.variant_qc.AF[1] < 0.005)
mt = mt.filter_rows(mt.csq == 'LOF')
# run sample QC and save into matrix table
mt = hl.sample_qc(mt)

# filter for samples that are > 95% call rate
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95) 

# run variant QC and save into matrix table
mt = hl.variant_qc(mt)

# filter for variants that are >95% call rate and >1% frequency
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)
mt = mt.filter_rows(mt.variant_qc_.AF[1] > 0.01)

Simplified Analysis

Hail makes it easy to analyze your data. Let's start by filtering a dataset by variant and sample quality metrics, like call rate and allele frequency.

Quality Control Procedures

Quality control procedures, like sex check, are made easy using Hail's declarative syntax

imputed_sex = hl.impute_sex(mt.GT)
mt = mt.annotate_cols(
    sex_check = imputed_sex[mt.s].is_female == mt.reported_female
# must use Google cloud platform for this to work 
# annotation with vep
mt = hl.vep(mt)

Variant Effect Predictor

Annotating variants with Variant effect predictor has never been easier.

Rare-Variant Association Testing

Perform Gene Burden Tests on sequencing data with just a few lines of Python.

gene_intervals = hl.read_table("gs://my_bucket/gene_intervals.t")
mt = mt.annotate_rows(
    gene = gene_intervals.index(, all_matches=True).gene_name

mt = mt.explode_rows(mt.gene)
mt = (mt.group_rows_by(mt.gene)
        .aggregate(burden = hl.agg.count_where(mt.GT.is_non_ref())))

result = hl.linear_regression_rows(y=mt.phenotype, x=mt.burden)
# generate and save PC scores
eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(mt.GT, k=4)

# run linear regression for the first 4 PCs
mt = mt.annotate_cols(scores = pca_scores[mt.sample_id].scores)
results = hl.linear_regression_rows(
        1, mt.scores[0], mt.scores[1], mt.scores[2], mt.scores[3]]

Principal Component Analysis (PCA)

Adjusting GWAS models with principal components as covariates has never been easier.