Powering genomic analysis, at every scale

Cloud-native genomic dataframes and batch computing

          
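import hail as hl
from bokeh.io import show

# Load the post-QC dataset and keep variants whose first alternate allele
# has a frequency above 1%.
mt = hl.read_matrix_table('resources/post_qc.mt')
mt = mt.filter_rows(hl.agg.call_stats(mt.GT, mt.alleles).AF[1] > 0.01)

# hwe_normalized_pca returns (eigenvalues, scores, loadings); keep the scores.
pca_scores = hl.hwe_normalized_pca(mt.GT, k=5, compute_loadings=True)[1]
mt = mt.annotate_cols(pca=pca_scores[mt.s])

# Regress the phenotype on alternate allele count, with an intercept, sex,
# and the first three principal components as covariates.
gwas = hl.linear_regression_rows(
    y=mt.pheno.caffeine_consumption,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.is_female,
                mt.pca.scores[0], mt.pca.scores[1],
                mt.pca.scores[2]])

p = hl.plot.manhattan(gwas.p_value)
show(p)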

GWAS with Hail
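
The first filtering step in the snippet above keeps variants whose first alternate allele has a frequency above 1%. As a rough, Hail-free illustration of that arithmetic, the sketch below counts alternate alleles across diploid genotype calls in plain Python. The function name `alt_allele_frequency` and the toy genotypes are assumptions made for this example. Hail's `hl.agg.call_stats` performs the aggregation at scale and reports more than a single frequency.

```python
from typing import Optional, Sequence, Tuple

# A diploid call is a pair of allele indices (0 = reference, 1 = first
# alternate); None stands for a missing call.
Call = Optional[Tuple[int, int]]


def alt_allele_frequency(calls: Sequence[Call], allele: int = 1) -> Optional[float]:
    """Fraction of called alleles equal to `allele`; None if nothing was called."""
    called = [c for c in calls if c is not None]
    total_alleles = 2 * len(called)
    if total_alleles == 0:
        return None
    alt_count = sum(a == allele for call in called for a in call)
    return alt_count / total_alleles


# Hypothetical toy data: variant name -> calls for four samples.
variants = {
    "common": [(0, 1), (1, 1), (0, 0), None],
    "absent": [(0, 0), (0, 0), (0, 0), (0, 0)],
}

# Keep variants whose alternate allele frequency exceeds 1%.
kept = []
for name, calls in variants.items():
    af = alt_allele_frequency(calls)
    if af is not None and af > 0.01:
        kept.append(name)
```

Missing calls are excluded from the denominator here, so a variant with no called genotypes has an undefined frequency and is dropped by the filter.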

          pip install hail

Hail requires Python 3 and the Java 11 JRE.

On GNU/Linux, you will also need the C and C++ standard libraries if they are not already installed.

Simplified Analysis

Hail Query provides powerful, easy-to-use data science tools. Interrogate data at every scale, from small datasets on a laptop to biobank-scale datasets (e.g. UK Biobank, gnomAD, TOPMed, FinnGen, and Biobank Japan) in the cloud.

Genomic Dataframes

Modern data science is driven by numeric matrices (see NumPy) and tables (see R dataframes and pandas). While sufficient for many tasks, none of these tools adequately captures the structure of genetic data. Genetic data combines the multiple axes of a matrix (e.g. variants and samples) with the structured data of tables (e.g. genotypes). To support genomic analysis, Hail introduces MatrixTable, a powerful, distributed data structure that combines features of matrices and dataframes.
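
To make the "matrix plus tables" idea concrete, here is a minimal pure-Python sketch of a structure with per-variant records, per-sample records, and a grid of per-genotype records. `ToyMatrixTable`, its field names, and its two methods are hypothetical names invented for illustration. This is not Hail's implementation or API, and it ignores distribution, keys, and typing entirely.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class ToyMatrixTable:
    """Toy stand-in for the idea behind MatrixTable; not Hail's implementation."""

    rows: List[Record]           # one record per variant
    cols: List[Record]           # one record per sample
    entries: List[List[Record]]  # entries[i][j] belongs to variant i, sample j

    def filter_rows(self, keep: Callable[[Record], bool]) -> "ToyMatrixTable":
        """Drop variants together with their row of entries."""
        kept = [i for i, row in enumerate(self.rows) if keep(row)]
        return ToyMatrixTable(
            rows=[self.rows[i] for i in kept],
            cols=self.cols,
            entries=[self.entries[i] for i in kept],
        )

    def annotate_cols(
        self, name: str, fn: Callable[[Record, List[Record]], Any]
    ) -> "ToyMatrixTable":
        """Add a per-sample field computed from that sample's column of entries."""
        new_cols = []
        for j, col in enumerate(self.cols):
            column_entries = [row_entries[j] for row_entries in self.entries]
            new_cols.append({**col, name: fn(col, column_entries)})
        return ToyMatrixTable(self.rows, new_cols, self.entries)


mt = ToyMatrixTable(
    rows=[{"locus": "1:1000", "alleles": ["A", "T"]},
          {"locus": "1:2000", "alleles": ["G", "C"]}],
    cols=[{"s": "sample_a"}, {"s": "sample_b"}],
    entries=[[{"n_alt": 0}, {"n_alt": 2}],
             [{"n_alt": 1}, {"n_alt": 1}]],
)

# Aggregate down each sample's column, then filter along the variant axis.
mt = mt.annotate_cols("total_alt", lambda col, es: sum(e["n_alt"] for e in es))
first_only = mt.filter_rows(lambda row: row["locus"] == "1:1000")
```

The point of the sketch is that row metadata, column metadata, and entries stay aligned under filtering and annotation, which is awkward to guarantee with a bare matrix alongside separate tables.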

Input Unification

The Hail MatrixTable unifies a wide range of input formats (e.g. VCF, BGEN, PLINK, TSV, GTF, and BED files) and supports scalable queries, even on petabyte-size datasets. Hail's MatrixTable abstraction provides an integrated and scalable analysis platform for science.

Arbitrary Tools

Hail Batch enables massively parallel execution and composition of arbitrary GNU/Linux tools like PLINK, SAIGE, sed, and even Python scripts that use Hail Query!

Cost-efficiency and Ease-of-use

Hail Batch is cost-efficient and easy to use because it automatically and cooperatively manages cloud resources for all users. As an end user, you need only describe which programs to run, with what arguments, and the dependencies between programs.
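
The description a user provides (programs, arguments, and dependencies) amounts to a directed acyclic graph of jobs. The sketch below models that idea with the Python standard library (`graphlib`, Python 3.9+) and derives which jobs could run in parallel. The `Job` class, the `parallel_waves` helper, the job names, and the placeholder `echo` commands are all assumptions made for illustration. This is not the Hail Batch API, and nothing is actually executed.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List


@dataclass
class Job:
    """Hypothetical job description: a program, its arguments, its dependencies."""

    name: str
    command: List[str]
    depends_on: List[str] = field(default_factory=list)


def parallel_waves(jobs: List[Job]) -> List[List[str]]:
    """Group job names into waves; jobs in one wave have no unmet dependencies."""
    sorter = TopologicalSorter({job.name: set(job.depends_on) for job in jobs})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Placeholder commands; a real pipeline would name actual tools and arguments.
jobs = [
    Job("qc", ["echo", "quality control"]),
    Job("assoc_chr1", ["echo", "association test, chr1"], depends_on=["qc"]),
    Job("assoc_chr2", ["echo", "association test, chr2"], depends_on=["qc"]),
    Job("merge", ["echo", "merge results"], depends_on=["assoc_chr1", "assoc_chr2"]),
]

waves = parallel_waves(jobs)
```

In this toy pipeline the two per-chromosome jobs share a wave because neither depends on the other, which is the kind of parallelism a scheduler can exploit once dependencies are declared explicitly.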

Scalability and Cost Control

Hail Batch automatically scales to fit the needs of your job. Instead of queueing for limited resources on a fixed-size cluster, your jobs queue only while the service requests more cores from the cloud. Hail Batch can also enforce spending limits that protect users from cost overruns.

The Hail team has several sources of funding at the Broad Institute:

The Stanley Center for Psychiatric Research, which together with Neale Lab has provided an incredibly supportive and stimulating home.
Principal Investigator Benjamin Neale, whose scientific leadership has been essential for solving the right problems.
Principal Investigator Daniel MacArthur and the other members of the gnomAD council.
Jeremy Wertheimer, whose strategic advice and generous philanthropy have been essential for growing the impact of Hail.

We are grateful for generous support from:

The National Institute of Diabetes and Digestive and Kidney Diseases
The National Institute of Mental Health
The National Human Genome Research Institute

We are grateful for generous past support from:

The Chan Zuckerberg Initiative

We would like to thank Zulip for supporting open source by providing free hosting, and YourKit, LLC for generously providing free licenses for YourKit Java Profiler for open-source development.
