Powering biobank-scale genomics

An open-source library for scalable genomic data exploration

Features

Simplified Analysis

Hail is an open-source Python library that simplifies genomic data analysis. It provides powerful, easy-to-use data science tools that can be used to interrogate even biobank-scale genomic data (e.g., UK Biobank, TOPMed, FinnGen, and Biobank Japan).

Genomic Dataframes

Modern data science is driven by table-like data structures, often called dataframes (see pandas). While convenient, they don't capture the structure of genetic data, which has row (variant) and column (sample) groups, with a genotype for each variant-sample pair. To remedy this, Hail introduces a distributed, dataframe-like structure called MatrixTable.
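To make the shape of such data concrete, here is a minimal pure-Python sketch. It is a toy, in-memory model and not Hail's implementation or API: the class name ToyMatrixTable, its fields, and the example variants and samples are all hypothetical. It only illustrates the idea that rows are variants, columns are samples, and each cell holds a genotype, so the data can be summarized along either axis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyMatrixTable:
    """Toy variant-by-sample matrix (hypothetical; not Hail's MatrixTable)."""

    row_keys: List[str]  # one key per variant
    col_keys: List[str]  # one key per sample
    # (variant, sample) -> number of alternate alleles (0, 1, or 2)
    entries: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def alt_count_per_variant(self) -> Dict[str, int]:
        """Aggregate over columns: total alternate alleles seen at each variant."""
        return {
            v: sum(self.entries.get((v, s), 0) for s in self.col_keys)
            for v in self.row_keys
        }

    def alt_count_per_sample(self) -> Dict[str, int]:
        """Aggregate over rows: total alternate alleles carried by each sample."""
        return {
            s: sum(self.entries.get((v, s), 0) for v in self.row_keys)
            for s in self.col_keys
        }


# Made-up example data: two variants, three samples.
toy = ToyMatrixTable(
    row_keys=["1:1000:A:T", "1:2000:G:C"],
    col_keys=["s1", "s2", "s3"],
    entries={
        ("1:1000:A:T", "s1"): 0,
        ("1:1000:A:T", "s2"): 1,
        ("1:1000:A:T", "s3"): 2,
        ("1:2000:G:C", "s1"): 1,
        ("1:2000:G:C", "s2"): 0,
        ("1:2000:G:C", "s3"): 0,
    },
)

print(toy.alt_count_per_variant())
print(toy.alt_count_per_sample())
```

A flat dataframe would need either one column per sample or one row per variant-sample pair to hold the same data; keeping the two axes explicit is what lets per-variant and per-sample summaries be written symmetrically.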

Input Unification

The Hail MatrixTable unifies a wide range of input formats (e.g. VCF, BGEN, PLINK, TSV, GTF, and BED files) and supports scalable queries, even on petabyte-size datasets. By leveraging MatrixTable, Hail provides an integrated, scalable analysis platform for science.
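As a rough orientation, the formats listed above correspond to import functions in Hail's Python package. The sketch below is only a lookup table and does not call Hail itself: the dictionary HAIL_IMPORTERS and the helper importer_for are hypothetical names introduced here for illustration, while the function names in the values refer to Hail's own importers (hl being the conventional alias for the hail package). Note that some importers produce a Hail Table rather than a MatrixTable, and each takes its own format-specific arguments.

```python
from typing import Dict, Tuple

# Format name -> (Hail import function, type of object it returns).
# The mapping itself is an illustrative helper, not part of Hail.
HAIL_IMPORTERS: Dict[str, Tuple[str, str]] = {
    "vcf": ("hl.import_vcf", "MatrixTable"),
    "bgen": ("hl.import_bgen", "MatrixTable"),
    "plink": ("hl.import_plink", "MatrixTable"),
    # Matrix-shaped text files can use hl.import_matrix_table instead.
    "tsv": ("hl.import_table", "Table"),
    "gtf": ("hl.experimental.import_gtf", "Table"),
    "bed": ("hl.import_bed", "Table"),
}


def importer_for(fmt: str) -> Tuple[str, str]:
    """Return (function name, result type) for a format name, case-insensitively."""
    key = fmt.strip().lower()
    if key not in HAIL_IMPORTERS:
        raise ValueError(f"no importer listed for format: {fmt!r}")
    return HAIL_IMPORTERS[key]


for name in ("VCF", "plink", "bed"):
    function, result_type = importer_for(name)
    print(f"{name}: {function} -> {result_type}")
```

The table is keyed by format name rather than file suffix because PLINK binary genotype files and UCSC BED interval files both use the .bed extension.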

Acknowledgments

The Hail team has several sources of funding at the Broad Institute:

We are grateful for generous support from:

We would like to thank Zulip for supporting open source by providing free hosting, and YourKit, LLC for generously providing free licenses for the YourKit Java Profiler for open-source development.