Hail is an open-source, scalable framework for exploring and analyzing genomic data. Starting from genetic data in VCF, BGEN or PLINK format, Hail can, for example:

This functionality and more is exposed through Python and backed by distributed algorithms built on Apache Spark, so that users can efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on a cluster without manually chopping up data or managing job failures. Users can script pipelines or explore data interactively in Jupyter notebooks that move between Hail for genomics methods, PySpark for scalable SQL and machine learning algorithms, and pandas with scikit-learn and Matplotlib for results that fit on one machine. Hail also provides a flexible domain language for expressing complex quality control and analysis pipelines in concise, readable code.

The Hail project began in fall 2015 with the goal of empowering the worldwide genetics community to harness the flood of genomes and discover the biology of human disease. Hail has been used for dozens of major studies and is the core analysis platform of large-scale genomics efforts such as gnomAD.

Hail talk at Spark Summit West 2017

Getting Started

To get started using Hail on your data or public data:

Hail Team

The Hail team is embedded in the Neale lab at the Stanley Center for Psychiatric Research of the Broad Institute of MIT and Harvard and the Analytic and Translational Genetics Unit of Massachusetts General Hospital.

Citing Hail

