Hail is an open-source, scalable framework for exploring and analyzing genomic data. Starting from sequencing or microarray data in VCF and other formats, Hail can, for example:

All this functionality is exposed through Python and backed by distributed algorithms built on Apache Spark, so Hail can efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on an on-premises cluster or in the cloud.

Hail is used in published research and as the core analysis platform of large-scale genomics efforts such as gnomAD. The project began in fall 2015 to harness the flood of genomic data. It is under very active development as we work toward a stable release, so we do not guarantee forward compatibility of formats and interfaces.

Want to get involved in development? Check out the GitHub repo, chat with us in the Gitter dev room, view our keynote at Spark Summit East 2017, or connect with us June 6-7 at Spark Summit West 2017.

Or come join us full-time at the Broad Institute of MIT and Harvard! We are founding a new Initiative in Scalable Analytics and recruiting software engineers at multiple levels of experience. Details here.

Getting Started

To get started using Hail on your data or public data:

We encourage use of the Discussion Forum for user and dev support, feature requests, and sharing your Hail-powered science. Follow Hail on Twitter @hailgenetics. Please report any suspected bugs to GitHub issues.

Hail Team

The Hail team is based in the Neale lab at the Stanley Center for Psychiatric Research of the Broad Institute of MIT and Harvard and the Analytic and Translational Genetics Unit of Massachusetts General Hospital.

Contact the Hail team at hail@broadinstitute.org.

Citing Hail

If you use Hail for published work, please cite the software:

and either the forthcoming manuscript describing Hail (if possible):

or the following paper which includes a brief introduction to Hail in the online methods:

We'd also love to hear about your work in the Science category of the Discussion Forum!