Install Hail on a Spark Cluster

If you are using Google Dataproc, please see these simpler instructions.

Hail should work with any Spark 2.4.x cluster built with Scala 2.11.

Hail needs to be built from source on the leader node. Building Hail from source requires:

  • Java 8 JDK.
  • Python 3.6 or 3.7.
  • A recent C and C++ compiler. GCC 5.0, LLVM 3.4, or any later version of either suffices.
  • BLAS and LAPACK.

On a Debian-like system, the following should suffice:

apt-get install \
    openjdk-8-jdk-headless \
    g++ \
    python3 python3-pip \
    libopenblas-dev liblapack-dev
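
Before building, it can help to confirm that the prerequisites listed above are actually present on the leader node. The following is a minimal sketch, not part of Hail: the function names and the tool list are assumptions made for illustration (make and git are included only because the build commands in this guide invoke them), and it does not check for BLAS or LAPACK.

```python
import shutil
import sys

# Python versions this guide names as supported for building Hail.
SUPPORTED_PYTHONS = {(3, 6), (3, 7)}

# Assumed executable names: the JDK and compiler from the requirements list,
# plus make and git, which the build commands in this guide rely on.
REQUIRED_TOOLS = ("java", "javac", "g++", "make", "git")


def python_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is 3.6 or 3.7."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHONS


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling python_supported() and missing_tools() with no arguments checks the running interpreter and the current PATH. An empty list from missing_tools() means every listed executable was found.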

The next block of commands downloads, builds, and installs Hail from source.

git clone https://github.com/hail-is/hail.git
cd hail/hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5

On every worker node of the cluster, you must install a BLAS and LAPACK library such as the Intel MKL or OpenBLAS. On a Debian-like system you might try the following on every worker node.

apt-get install libopenblas-base liblapack3

Hail is now installed! You can use ipython, python, and jupyter notebook without any further configuration. We recommend against using the pyspark command.
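
One way to confirm the installation is to check that the Python package and its bundled JAR are where this guide expects them. The sketch below assumes the JAR sits directly inside the installed hail package directory, as the spark-submit example in this guide implies. The helper names are hypothetical and are not part of Hail.

```python
import importlib.util
from pathlib import Path
from typing import Optional

# JAR file name used by the spark-submit example in this guide.
JAR_NAME = "hail-all-spark.jar"


def hail_package_dir() -> Optional[Path]:
    """Return the installed hail package directory, or None if not found.

    find_spec locates a top-level package without importing it.
    """
    spec = importlib.util.find_spec("hail")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def find_hail_jar(package_dir: Optional[Path]) -> Optional[Path]:
    """Return the path to the Hail JAR inside package_dir, or None."""
    if package_dir is None:
        return None
    jar = package_dir / JAR_NAME
    return jar if jar.is_file() else None
```

If find_hail_jar(hail_package_dir()) returns None on the leader node, the install step most likely did not complete. Note that running this from inside the cloned repository may pick up the local hail directory instead of the installed package.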

Let’s take Hail for a spin! Create a file called “hail-script.py” and place in it the following analysis of a randomly generated dataset with five hundred samples and half a million variants.

import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=500,
                              n_variants=500_000,
                              n_partitions=32)
mt = mt.annotate_cols(drinks_coffee = hl.rand_bool(0.33))
gwas = hl.linear_regression_rows(y=mt.drinks_coffee,
                                 x=mt.GT.n_alt_alleles(),
                                 covariates=[1.0])
gwas.order_by(gwas.p_value).show(25)

Run the script and wait for the results. You should not have to wait more than a minute.

python3 hail-script.py

Slightly more configuration is necessary to spark-submit a Hail script:

HAIL_HOME=$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')
spark-submit \
  --jars $HAIL_HOME/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=$HAIL_HOME/hail-all-spark.jar \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  hail-script.py
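
If the job is launched from Python rather than from a shell, the same invocation can be assembled as an argument list. This is a sketch under the assumption that the caller already knows the Hail package directory (the HAIL_HOME value computed above). The function names are hypothetical, and the flags are copied from the command above unchanged.

```python
import shlex


def spark_submit_args(hail_home, script):
    """Build the spark-submit argument list shown in this guide.

    hail_home is the directory containing hail-all-spark.jar;
    script is the Hail Python script to submit.
    """
    jar = f"{hail_home}/hail-all-spark.jar"
    return [
        "spark-submit",
        "--jars", jar,
        "--conf", f"spark.driver.extraClassPath={jar}",
        # Executors see the JAR in their working directory via --jars.
        "--conf", "spark.executor.extraClassPath=./hail-all-spark.jar",
        "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--conf", "spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator",
        script,
    ]


def as_shell_command(args):
    """Render an argument list as a single, safely quoted shell string."""
    return " ".join(shlex.quote(arg) for arg in args)
```

The list can be passed directly to subprocess.run(args, check=True), while as_shell_command is useful for logging the exact command that was run.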

Next Steps