- Java 8 JDK.
- Spark 2.2.0. Hail should work with other versions of Spark 2; see below.
- Anaconda for Python 3.
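A quick way to confirm the Python prerequisite before proceeding (this checks only the Python side; confirm Java 8 separately with `java -version`):

```python
# Sanity check: Hail requires Python 3 (the Anaconda distribution is recommended).
import sys

assert sys.version_info.major == 3, "Hail requires Python 3"
print("Python", sys.version.split()[0], "OK")
```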
Running Hail locally with a pre-compiled distribution¶
Hail uploads distributions to Google Storage as part of our continuous integration suite. You can download a pre-built distribution from the below links. Make sure you download the distribution that matches your Spark version!
Unzip the distribution after you download it. Next, edit and copy the below bash
commands to set up the Hail environment variables. We recommend adding the
export lines to the appropriate dot-file (e.g. ~/.bashrc or ~/.bash_profile) so
that you don’t need to rerun these commands in each new session.
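For example, assuming a bash shell, the export lines can be appended to ~/.bashrc (fill in the placeholder paths before running):

```shell
# Append the Hail environment variables to your bash dot-file (placeholder paths).
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=<path to spark>
export HAIL_HOME=<path to hail>
export PATH=$PATH:$HAIL_HOME/bin/
EOF
```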
Un-tar the Spark distribution.
tar xvf <path to spark.tgz>
Here, fill in the path to the un-tarred Spark package.
export SPARK_HOME=<path to spark>
Unzip the Hail distribution.
unzip <path to hail.zip>
Here, fill in the path to the unzipped Hail distribution.
export HAIL_HOME=<path to hail>
export PATH=$PATH:$HAIL_HOME/bin/
To install Python dependencies, create a conda environment for Hail:
conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail
Once you’ve set up Hail, we recommend that you run the Python tutorials to get an overview of Hail functionality and learn about the powerful query language. To try Hail out, run the below commands to start a Jupyter Notebook server in the tutorials directory.
cd $HAIL_HOME/tutorials
jhail
You can now click on the “01-genome-wide-association-study” notebook to get started!
In the future, if you want to run:
- Hail in Python use hail
- Hail in IPython use ihail
- Hail in a Jupyter Notebook use jhail
Hail will not import correctly from a normal Python interpreter, a normal IPython interpreter, nor a normal Jupyter Notebook.
Running on a Spark cluster¶
Hail can run on any cluster that has Spark 2 installed. The Hail team publishes ready-to-use JARs for Google Cloud Dataproc; see Running in the cloud. For Cloudera-specific instructions, see Running on a Cloudera Cluster.
For all other Spark clusters, you will need to build Hail from the source code.
Hail should be built on the master node of the Spark cluster with the following
command, replacing 2.2.0 with the version of Spark available on your
cluster:

./gradlew -Dspark.version=2.2.0 shadowJar archiveZip
An IPython shell which can run Hail backed by the cluster can be started with
the following command; it is important that the Spark located at SPARK_HOME
has the exact same version as provided to the previous command:

SPARK_HOME=/path/to/spark/ \
HAIL_HOME=/path/to/hail/ \
PYTHONPATH="$PYTHONPATH:$HAIL_HOME/build/distributions/hail-python.zip:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-*-src.zip" \
ipython
Within the interactive shell, check that you can initialize Hail by running the
following commands. Note that you must pass in the existing SparkContext
sc to the hl.init function:

>>> import hail as hl
>>> hl.init(sc)
Files can be accessed from both Hadoop and Google Storage. If you’re running on Google’s Dataproc, you’ll want to store your files in Google Storage. On most on-premises clusters, you’ll want to store your files in Hadoop.
To convert sample.vcf stored in Google Storage into Hail’s .vds format, run:

hl.import_vcf('gs://<bucket>/sample.vcf').write('gs://<bucket>/sample.vds')

To convert sample.vcf stored in Hadoop into Hail’s .vds format, run:

hl.import_vcf('/path/to/sample.vcf').write('/path/to/sample.vds')
It is also possible to run Hail non-interactively, by passing a Python script to
spark-submit. In this case, it is not necessary to set any environment
variables. For example, the invocation

spark-submit --jars build/libs/hail-all-spark.jar \
  --py-files build/distributions/hail-python.zip \
  hailscript.py
runs the script hailscript.py (which reads and writes files from Hadoop):
import hail as hl
hl.import_vcf('/path/to/sample.vcf').write('/output/path/sample.vds')
Running on a Cloudera Cluster¶
These instructions explain how to install Spark 2 on a Cloudera cluster. You should work on a gateway node on the cluster that has the Hadoop and Spark packages installed on it.
Once Spark is installed, building and running Hail on a Cloudera cluster is exactly the same as above, except:
On a Cloudera cluster, when building a Hail JAR, you must specify a Cloudera version of Spark. The following example builds a Hail JAR for Cloudera’s 2.2.0 version of Spark:

./gradlew shadowJar -Dspark.version=2.2.0.cloudera
On a Cloudera cluster, SPARK_HOME should be set as:
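On clusters that installed Spark 2 via Cloudera’s SPARK2 parcel, the path is typically the following (the parcel location is an assumption; verify it on your cluster):

```shell
# Typical SPARK2 parcel location on Cloudera clusters (verify on your cluster).
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
```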
On Cloudera, you can create an interactive Python shell using pyspark:

pyspark --jars build/libs/hail-all-spark.jar \
  --py-files build/distributions/hail-python.zip \
  --conf spark.sql.files.openCostInBytes=1099511627776 \
  --conf spark.sql.files.maxPartitionBytes=1099511627776 \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  --conf spark.hadoop.parquet.block.size=1099511627776
Running in the cloud¶
Hail publishes pre-built JARs for Google Cloud Platform’s Dataproc Spark clusters. We recommend running Hail on GCP via an interactive Jupyter notebook, as described in Liam’s forum post. If you prefer to submit your own JARs or python files rather than use a Jupyter notebook, see Laurent’s forum post.
Building with other versions of Spark 2¶
Hail should work with other versions of Spark 2. To build against a different version, such as Spark 2.3.0, modify the above instructions as follows:
Set the Spark version in the gradle command:

./gradlew -Dspark.version=2.3.0 shadowJar
SPARK_HOME should point to an installation of the desired version of Spark, such as spark-2.3.0-bin-hadoop2.7
The version of the Py4J ZIP file in the hail alias must match the version in
$SPARK_HOME/python/lib in your version of Spark.
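To find the exact Py4J version shipped with your Spark installation, one approach is a sketch like the following (it assumes SPARK_HOME is set in the environment, and falls back to a placeholder path otherwise):

```python
# List the Py4J source ZIP bundled with Spark; its version must match the one
# referenced on PYTHONPATH. Falls back to a placeholder if SPARK_HOME is unset.
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/path/to/spark")
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
print(py4j_zips)  # e.g. ['/path/to/spark/python/lib/py4j-0.10.7-src.zip']
```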
BLAS and LAPACK¶
Hail uses BLAS and LAPACK optimized linear algebra libraries. These should load automatically on recent versions of Mac OS X and Google Dataproc. On Linux, these must be explicitly installed; on Ubuntu 14.04, run
apt-get install libatlas-base-dev
If natives are not found,
hail.log will contain the warnings
Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
See netlib-java for more information.