Installing Hail


Regardless of installation method, you will need:

  • Java 8 JDK. Note: it must be version 8. Hail does not support Java 9, 10, or 11 because of our dependency on Spark.
  • Python 3.6 or later. We recommend Anaconda’s Python 3.

For all methods other than using pip, you will additionally need Spark 2.2.x.
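The Java and Python requirements above can be checked before installing anything. The following is a minimal sketch, not part of Hail: the helper names java_major_version and python_ok are made up for illustration, and the parser assumes the usual `java -version` output, where Java 8 reports a version such as "1.8.0_181" and Java 9 and later report "9.0.4" or "11.0.2".

```python
import re
import sys


def java_major_version(version_output: str) -> int:
    """Return the major Java version found in `java -version` output.

    Hypothetical helper: Java 8 and earlier use the legacy "1.8.0_181"
    scheme, while Java 9+ report "9.0.4", "11.0.2", and so on.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in: %r" % version_output)
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def python_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter is Python 3.6 or later."""
    return tuple(version_info[:2]) >= (3, 6)
```

Note that `java -version` writes to standard error, so capture stderr (for example with subprocess.run(["java", "-version"], stderr=subprocess.PIPE)) and pass the decoded text to java_major_version; a result other than 8 means the installed JDK is not supported.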


Installing Hail on Mac OS X or GNU/Linux with pip

If you have Mac OS X, this is the recommended installation method for running Hail locally (i.e. not on a cluster).

Create a conda environment named hail and install the Hail Python library in that environment. The version specifier is quoted so that the shell does not treat > as an output redirection:

conda create --name hail "python>=3.6"
conda activate hail
pip install hail

Building your own JAR

To use Hail with other versions of Spark 2, you’ll need to build your own JAR instead of using a pre-compiled distribution. To build against a different version, such as Spark 2.3.0, run the following command inside the directory where Hail is located:

./gradlew -Dspark.version=2.3.0 shadowJar

The Spark version in this command should match whichever version of Spark you would like to build against.

The SPARK_HOME environment variable should point to an installation of the desired version of Spark, such as spark-2.3.0-bin-hadoop2.7.
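To keep the build version and SPARK_HOME in agreement, the version can be read from the Spark directory name and fed into the gradlew command shown above. This is a minimal sketch under one stated assumption: the Spark installation keeps its default directory name (for example spark-2.3.0-bin-hadoop2.7). The function names are hypothetical and not part of Hail or Spark.

```python
import os
import re


def spark_version_from_home(spark_home: str) -> str:
    """Extract "2.3.0" from a path such as /opt/spark-2.3.0-bin-hadoop2.7.

    Assumes the directory keeps the default Spark distribution name.
    """
    name = os.path.basename(os.path.normpath(spark_home))
    match = re.match(r"spark-(\d+\.\d+\.\d+)", name)
    if match is None:
        raise ValueError("cannot infer Spark version from %r" % spark_home)
    return match.group(1)


def gradle_build_command(spark_version: str) -> list:
    """Build the gradlew invocation for the given Spark version."""
    return ["./gradlew", "-Dspark.version=" + spark_version, "shadowJar"]
```

If the directory has been renamed, spark_version_from_home raises ValueError instead of guessing, and the version should be passed to gradle_build_command by hand.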

The version of the Py4J ZIP file in the hail alias must match the version in $SPARK_HOME/python/lib in your version of Spark.
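One way to avoid a Py4J mismatch is to look up the ZIP that ships with your Spark installation rather than typing its version by hand. The sketch below assumes that Py4J is supplied through PYTHONPATH and that the bundled file follows the py4j-*-src.zip naming pattern; both the pattern and the function names are assumptions for illustration.

```python
import glob
import os


def find_py4j_zips(spark_home: str) -> list:
    """List Py4J source ZIPs bundled under $SPARK_HOME/python/lib.

    The "py4j-*-src.zip" pattern is an assumption about the file name.
    """
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    return sorted(glob.glob(pattern))


def py4j_on_pythonpath(spark_home: str, pythonpath: str) -> bool:
    """True when a Py4J ZIP bundled with this Spark is listed on PYTHONPATH."""
    entries = {os.path.normpath(p) for p in pythonpath.split(os.pathsep) if p}
    return any(os.path.normpath(z) in entries for z in find_py4j_zips(spark_home))
```

Calling py4j_on_pythonpath(os.environ["SPARK_HOME"], os.environ.get("PYTHONPATH", "")) returns False when PYTHONPATH names a Py4J ZIP from a different Spark version, or none at all.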

Running on a Spark cluster

Hail can run on any Spark 2.2 cluster. For example, Google and Amazon offer optimized Spark performance and exceptional scalability to thousands of cores without the overhead of installing and managing an on-premises cluster.

On Google Cloud Dataproc, we provide pre-built JARs and a Python package cloudtools to simplify running Hail, whether through an interactive Jupyter notebook or by submitting Python scripts.

For Cloudera-specific instructions, see Running on a Cloudera cluster.

For all other Spark clusters, you will need to build Hail from the source code.

Hail should be built on the master node of the Spark cluster with the following command, replacing 2.2.0 with the version of Spark available on your cluster:

./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

Python and IPython need a few environment variables to correctly find Spark and the Hail jar. We recommend you set these environment variables in the relevant profile file for your shell (e.g. ~/.bash_profile).

export SPARK_HOME=/path/to/spark-2.2.0/
export HAIL_HOME=/path/to/hail/
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HAIL_HOME/build/distributions/"
export PYTHONPATH="$PYTHONPATH:$SPARK_HOME/python/lib/py4j-*"
## PYSPARK_SUBMIT_ARGS is used by ipython and jupyter
export PYSPARK_SUBMIT_ARGS="\
  --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=\"$HAIL_HOME/build/libs/hail-all-spark.jar\" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  pyspark-shell"
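A misspelled or missing variable usually surfaces later as a confusing import or class-loading error, so it can help to check the environment first. The following is a minimal sketch, not part of Hail: check_hail_env is a hypothetical helper that only verifies the variables named above are set and that the JAR produced by the build exists under HAIL_HOME.

```python
import os

# Location of the Hail JAR relative to HAIL_HOME, as used in the settings above.
JAR_RELATIVE_PATH = os.path.join("build", "libs", "hail-all-spark.jar")


def check_hail_env(env=None) -> list:
    """Return human-readable problems with the Hail/Spark environment.

    An empty list means every check passed. Pass a dict to test a
    candidate environment; by default os.environ is inspected.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("SPARK_HOME", "HAIL_HOME", "PYSPARK_SUBMIT_ARGS"):
        if not env.get(name):
            problems.append(name + " is not set")
    hail_home = env.get("HAIL_HOME")
    if hail_home:
        jar = os.path.join(hail_home, JAR_RELATIVE_PATH)
        if not os.path.isfile(jar):
            problems.append("Hail JAR not found at " + jar)
    return problems
```

Running print("\n".join(check_hail_env())) in a fresh shell shows whether the profile file was actually sourced.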

If the previous environment variables are set correctly, an IPython shell which can run Hail backed by the cluster can be started with the following command:

ipython

When using ipython, you can import hail and start interacting directly:

>>> import hail as hl
>>> mt = hl.balding_nichols_model(3, 100, 100)
>>> mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))

You can also interact with Hail via a pyspark session, but you will need to pass the configuration from PYSPARK_SUBMIT_ARGS directly, along with extra configuration parameters specific to running Hail through pyspark:

pyspark \
  --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=$HAIL_HOME/build/libs/hail-all-spark.jar \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator

Moreover, unlike ipython, pyspark provides a Spark Context via the global variable sc. For Hail to interact properly with the Spark cluster, you must tell Hail about this Spark Context:

>>> import hail as hl
>>> hl.init(sc)

After this initialization step, you can interact as you would in ipython:

>>> mt = hl.balding_nichols_model(3, 100, 100)
>>> mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))

It is also possible to run Hail non-interactively by passing a Python script to spark-submit. Again, you will need to explicitly pass several configuration parameters to spark-submit:

spark-submit \
  --jars "$HAIL_HOME/build/libs/hail-all-spark.jar" \
  --py-files "$HAIL_HOME/build/distributions/" \
  --conf spark.driver.extraClassPath="$HAIL_HOME/build/libs/hail-all-spark.jar" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  /path/to/your-script.py
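Because the same Hail options recur for pyspark and spark-submit, it can be convenient to assemble them in one place. The sketch below builds a spark-submit argument list from the options shown above; the function names are hypothetical, and the value for --py-files is left as a parameter so you can supply the path used in your own build.

```python
import os


def hail_spark_conf(hail_home: str) -> list:
    """Return the --jars and --conf options Hail needs, as an argv list."""
    jar = os.path.join(hail_home, "build", "libs", "hail-all-spark.jar")
    return [
        "--jars", jar,
        "--conf", "spark.driver.extraClassPath=" + jar,
        "--conf", "spark.executor.extraClassPath=./hail-all-spark.jar",
        "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--conf", "spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator",
    ]


def spark_submit_command(hail_home: str, script: str, py_files: str) -> list:
    """Build a full spark-submit command; the script must come last."""
    return (
        ["spark-submit"]
        + hail_spark_conf(hail_home)
        + ["--py-files", py_files, script]
    )
```

The resulting list can be passed to subprocess.run(..., check=True). Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.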

Running on a Cloudera cluster

These instructions explain how to install Spark 2 on a Cloudera cluster. You should work on a gateway node on the cluster that has the Hadoop and Spark packages installed on it.

Once Spark is installed, building and running Hail on a Cloudera cluster is exactly the same as above, except:

  • On a Cloudera cluster, when building a Hail JAR, you must specify a Cloudera version of Spark. The following example builds a Hail JAR for Cloudera’s 2.2.0 version of Spark:

    ./gradlew shadowJar -Dspark.version=2.2.0.cloudera

  • On a Cloudera cluster, SPARK_HOME should be set as: SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2

  • On Cloudera, you can create an interactive Python shell using pyspark:

    pyspark --jars build/libs/hail-all-spark.jar \
            --py-files build/distributions/ \
            --conf spark.driver.extraClassPath="build/libs/hail-all-spark.jar" \
            --conf spark.executor.extraClassPath=./hail-all-spark.jar \
            --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
            --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator

Common Installation Issues


Hail uses BLAS and LAPACK optimized linear algebra libraries. These should load automatically on recent versions of Mac OS X and Google Dataproc. On Linux, these must be explicitly installed; on Ubuntu 14.04, run:

apt-get install libatlas-base-dev

If the native libraries are not found, hail.log will contain the warnings:

Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS

See netlib-java for more information.
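To check for this problem without reading the log by eye, the warnings can be matched with a regular expression. This is a minimal sketch: missing_native_libraries is a hypothetical helper, and it assumes the warning text appears in hail.log exactly as quoted above.

```python
import re

# Matches the warning lines quoted above and captures the class name.
_NETLIB_WARNING = re.compile(
    r"Failed to load implementation from: (com\.github\.fommil\.netlib\.\w+)"
)


def missing_native_libraries(log_text: str) -> list:
    """Return netlib classes that the log reports as failing to load.

    Names are returned once each, in order of first appearance.
    """
    seen = []
    for match in _NETLIB_WARNING.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen
```

For example, missing_native_libraries(open("hail.log").read()) returns an empty list when the native BLAS and LAPACK libraries loaded without these warnings.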