Install Hail on a Spark Cluster¶
If you are using Google Dataproc, please see these simpler instructions.
Hail should work with any Spark 2.4.x cluster built with Scala 2.11.
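If you are unsure which Spark and Scala versions your cluster provides, one way to check from the leader node is to ask spark-submit directly:
spark-submit --version
The output reports the Spark version and the Scala version it was built against.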
Hail needs to be built from source on the leader node. Building Hail from source requires:
Java 8 JDK.
Python 3.6 or 3.7.
Recent C and C++ compilers; GCC 5.0, LLVM 3.4, or later versions of either suffice.
BLAS and LAPACK.
On a Debian-like system, the following should suffice:
apt-get install \
openjdk-8-jdk-headless \
g++ \
python3 python3-pip \
libopenblas-dev liblapack-dev
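Before building, you can confirm that the expected toolchain versions are on your PATH, for example:
java -version        # should report 1.8
python3 --version    # should report 3.6.x or 3.7.x
g++ --version        # should report 5.0 or later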
The next block of commands downloads, builds, and installs Hail from source.
git clone https://github.com/hail-is/hail.git
cd hail/hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
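To confirm that the package installed correctly on the leader node, one quick sanity check is to inspect the installed package and print the Hail version:
pip3 show hail
python3 -c 'import hail as hl; print(hl.version())'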
On every worker node of the cluster, you must install a BLAS and a LAPACK library, such as Intel MKL or OpenBLAS. On a Debian-like system, the following should suffice:
apt-get install libopenblas-base liblapack3
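One way to verify that the shared libraries are discoverable on a worker node is to query the dynamic linker cache:
ldconfig -p | grep -E 'openblas|lapack'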
Hail is now installed! You can use ipython, python, and jupyter notebook without any further configuration. We recommend against using the pyspark command.
Let’s take Hail for a spin! Create a file called “hail-script.py” containing the following analysis of a randomly generated dataset with five hundred samples and half a million variants:
import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=500,
                              n_variants=500_000,
                              n_partitions=32)
mt = mt.annotate_cols(drinks_coffee = hl.rand_bool(0.33))
gwas = hl.linear_regression_rows(y=mt.drinks_coffee,
                                 x=mt.GT.n_alt_alleles(),
                                 covariates=[1.0])
gwas.order_by(gwas.p_value).show(25)
Run the script and wait for the results. You should not have to wait more than a minute.
python3 hail-script.py
Slightly more configuration is necessary to spark-submit a Hail script:
HAIL_HOME=$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')
spark-submit \
--jars $HAIL_HOME/hail-all-spark.jar \
--conf spark.driver.extraClassPath=$HAIL_HOME/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
hail-script.py
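If spark-submit fails to find the jar, it is worth confirming that the HAIL_HOME expansion points where you expect:
echo $HAIL_HOME
ls $HAIL_HOME/hail-all-spark.jar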
Next Steps¶
Get the Hail cheatsheets
Follow the Hail GWAS Tutorial