The Hail Project

Cotton Seed and Jon Bloom co-founded the Hail project in the Neale lab in fall 2015 to help the genetics community harness the flood of sequenced genomes and unravel the genetic architecture of disease. Our open-source framework is already used to analyze the largest genetic datasets in existence, to power dozens of major academic studies, and to meet the exploding needs of hospitals, diagnostic labs, and industry.

The Hail Team is embedded inside one of the world’s leading biomedical and genomics research institutes, anchoring the global heart of biotech right across from MIT. We implement distributed algorithms on top of our custom-built language, compiler, and run-time system to support querying, aggregation, and linear algebra on hundreds of thousands of human genomes. We thrive on diverse challenges: language and compiler design, low-level performance optimization, architecture of distributed systems, scaling of established methods and invention of new ones, visualization, interoperability with other powerful tools, and close collaboration with the scientists around us.

Why Hail? Why now?

Like particle physics, astronomy, and tech before it, biology has firmly entered the fourth paradigm, data-intensive science, in which we measure everything and run computational experiments on the data. Genetic datasets for disease association studies now run in the tens of terabytes, doubling every eight months. RNA-sequencing datasets that capture gene expression at single-cell resolution are still measured in gigabytes but are doubling far faster in the quest for a Human Cell Atlas. Much of this data comes from the Broad Genomics Platform, the largest producer of human genomics information in the world.

With such staggering advances and investment in high-throughput perturbation and measurement, technical barriers to discovery are rapidly shifting from biological to computational. We believe there is a unique opportunity to transform the practice of computational biology by applying deep ideas from computer science and mathematics to build the next generation of modular, scalable tools for analyzing massive genetic and biological data. These tools will drive the development of new treatments and biotechnologies and fundamentally advance our understanding of life itself.

Models, Inference & Algorithms Initiative

We run the Models, Inference & Algorithms Initiative to foster community and pedagogy in Greater Boston at the interface of computational biology, mathematical theory, and computer science.