The Java 8 JDK.
Spark 2.0.2. Hail should work with other versions of Spark 2; see below.
Python 2.7 and IPython. We recommend the free Anaconda distribution.
CMake and a C++ compiler that supports -std=c++11 (we recommend at least GCC 4.7 or Clang 3.3).
On a Debian-based Linux OS like Ubuntu, run:
$ sudo apt-get install g++ cmake
On Mac OS X with Homebrew, run:
$ brew install cmake
To clone the Hail source code using Git, run:
$ git clone https://github.com/broadinstitute/hail.git
$ cd hail
You can also download the source code directly from GitHub.
You may also want to install Seaborn, a Python library for statistical data visualization, using conda install seaborn or pip install seaborn. While not technically necessary, Seaborn is used in the tutorial to make prettier plots.
To install all dependencies for running locally on a fresh Ubuntu installation, use this script.
The following commands are relative to the
Building and running Hail
Hail may be built to run locally or on a Spark cluster. Running locally is useful for getting started, analyzing or experimenting with small datasets, and Hail development.
The single command
$ ./gradlew shadowJar
creates a Hail JAR file at build/libs/hail-all-spark.jar. The initial build takes time as Gradle installs all Hail dependencies.
Add the following environment variables by filling in the paths to SPARK_HOME and HAIL_HOME below and exporting all four of them (consider adding them to your .bashrc):
$ export SPARK_HOME=/path/to/spark
$ export HAIL_HOME=/path/to/hail
$ export PYTHONPATH="$PYTHONPATH:$HAIL_HOME/python:$SPARK_HOME/python:`echo $SPARK_HOME/python/lib/py4j*-src.zip`"
$ export SPARK_CLASSPATH=$HAIL_HOME/build/libs/hail-all-spark.jar
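Before starting Python, it can help to confirm that the four variables are actually visible in the current shell. The following is a minimal sketch, not part of Hail: the function name check_hail_env is hypothetical, and the checks only mirror the export commands above (all four names set, $HAIL_HOME/python on PYTHONPATH, and the JAR named by SPARK_CLASSPATH present on disk).

```python
import os

# The four variables exported in the instructions above.
REQUIRED = ("SPARK_HOME", "HAIL_HOME", "PYTHONPATH", "SPARK_CLASSPATH")


def check_hail_env(env):
    """Return a list of problems found in `env`; an empty list means none found.

    Hypothetical helper for illustration; `env` is any dict-like mapping,
    for example os.environ.
    """
    problems = ["%s is not set" % name for name in REQUIRED if not env.get(name)]

    hail_home = env.get("HAIL_HOME")
    if hail_home:
        # The PYTHONPATH export adds $HAIL_HOME/python; compare normalized paths
        # so that a trailing slash in HAIL_HOME does not cause a false alarm.
        hail_python = os.path.normpath(os.path.join(hail_home, "python"))
        entries = [os.path.normpath(p)
                   for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
        if hail_python not in entries:
            problems.append("PYTHONPATH lacks %s" % hail_python)

    jar = env.get("SPARK_CLASSPATH")
    if jar and not os.path.isfile(jar):
        # The JAR only exists after ./gradlew shadowJar has completed.
        problems.append("SPARK_CLASSPATH points to a missing file: %s" % jar)
    return problems


if __name__ == "__main__":
    for problem in check_hail_env(os.environ):
        print(problem)
```

Run as a script, it prints one line per problem and nothing when the basic setup looks consistent. It does not verify that Spark itself is installed correctly.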
Running ipython on the command line will open an interactive Python shell.
Here are a few simple things to try in order. To import the hail module and start a HailContext, run:
>>> from hail import *
>>> hc = HailContext()
To import the included sample.vcf into Hail’s .vds format, run:
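To split multiallelic variants, compute sample and variant quality control statistics, export the variant statistics to a file, and save the annotated dataset, run:
>>> vds = (hc.read('sample.vds')
...     .split_multi()
...     .sample_qc()
...     .variant_qc())
>>> vds.export_variants('variantqc.tsv', 'Variant = v, va.qc.*')
>>> vds.write('sample.qc.vds')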
To count the number of samples, variants, and genotypes, run:
Now let’s get a feel for Hail’s powerful objects, annotation system, and expression language. To print the current annotation schema and use these annotations to filter variants, samples, and genotypes, run:
>>> print('sample annotation schema:')
>>> print(vds.sample_schema)
>>> print('\nvariant annotation schema:')
>>> print(vds.variant_schema)
>>> (vds.filter_variants_expr('v.altAllele().isSNP() && va.qc.gqMean >= 20')
...     .filter_samples_expr('sa.qc.callRate >= 0.97 && sa.qc.dpMean >= 15')
...     .filter_genotypes('let ab = g.ad[1] / g.ad.sum() in '
...                       '((g.isHomRef() && ab <= 0.1) || '
...                       ' (g.isHet() && ab >= 0.25 && ab <= 0.75) || '
...                       ' (g.isHomVar() && ab >= 0.9))')
...     .write('sample.filtered.vds'))
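The genotype filter above keeps a call only when its allele balance is consistent with the called genotype. The same thresholds can be sketched in plain Python. This is an illustration rather than Hail code: it assumes ab is the fraction of reads supporting the alternate allele (ad[1] divided by the total depth) and encodes the genotype as the number of alternate alleles. Rejecting zero-depth calls is also an assumption; how Hail handles missing or zero depth is not shown here.

```python
def passes_allele_balance(n_alt, ad):
    """Return True if allelic depths `ad` fit a biallelic call with `n_alt` alt alleles.

    n_alt: 0 (hom-ref), 1 (het) or 2 (hom-var).
    ad:    [ref_reads, alt_reads].
    Thresholds are copied from the filter expression above.
    """
    depth = sum(ad)
    if depth == 0:
        return False  # assumption: no reads means no evidence for the call
    ab = ad[1] / float(depth)  # float() keeps the division exact on Python 2.7
    if n_alt == 0:
        return ab <= 0.1
    if n_alt == 1:
        return 0.25 <= ab <= 0.75
    if n_alt == 2:
        return ab >= 0.9
    return False
```

For example, a heterozygous call with depths [12, 10] has an allele balance of about 0.45 and is kept, while a homozygous-reference call with depths [18, 4] has about 0.18 and is removed.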
Note that during each run Hail writes a hail.log file in the current directory; this is useful to developers for debugging.
Building with other versions of Spark 2
Hail should work with other versions of Spark 2. To build against a different version, such as Spark 2.1.0, modify the above instructions as follows:
- Set the Spark version in the Gradle command:
$ ./gradlew -Dspark.version=2.1.0 shadowJar
- SPARK_HOME should point to an installation of the desired version of Spark, such as spark-2.1.0-bin-hadoop2.7.
- The version of the Py4J ZIP file in the hail alias must match the version in $SPARK_HOME/python/lib in your version of Spark.
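To see which Py4J version a given Spark installation bundles, one can list the ZIP files in $SPARK_HOME/python/lib. A minimal sketch follows. It assumes the archive is named py4j-<version>-src.zip, as the py4j*-src.zip glob in the PYTHONPATH export suggests; the function name bundled_py4j_versions is hypothetical.

```python
import glob
import os
import re


def bundled_py4j_versions(spark_home):
    """List Py4J versions found under <spark_home>/python/lib.

    Assumes file names of the form py4j-<version>-src.zip.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    versions = []
    for path in sorted(glob.glob(os.path.join(lib_dir, "py4j*-src.zip"))):
        match = re.match(r"py4j-(.+)-src\.zip$", os.path.basename(path))
        if match:
            versions.append(match.group(1))
    return versions


if __name__ == "__main__":
    print(bundled_py4j_versions(os.environ.get("SPARK_HOME", "/path/to/spark")))
```

An empty list means no matching archive was found, which usually indicates that SPARK_HOME points at the wrong directory.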
Running on a Spark cluster and in the cloud
build/libs/hail-all-spark.jar can be submitted using spark-submit. See the Spark documentation for details.
Google and Amazon offer optimized Spark performance and exceptional scalability to tens of thousands of cores without the overhead of installing and managing an on-prem cluster. To get started running Hail on the Google Cloud Platform, see this forum post.
BLAS and LAPACK
Hail uses BLAS and LAPACK optimized linear algebra libraries. These should load automatically on recent versions of Mac OS X and Google Dataproc. On Linux, these must be explicitly installed; on Ubuntu 14.04, run
$ apt-get install libatlas-base-dev
If natives are not found, hail.log will contain the warnings
Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
See netlib-java for more information.
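A quick way to check for these warnings is to scan hail.log for the message prefix. The sketch below assumes the log is in the current directory, where Hail writes it, and that each warning sits on a single line containing the text shown above. The function name missing_native_libs is hypothetical.

```python
import io

WARNING_PREFIX = "Failed to load implementation from: "


def missing_native_libs(log_path="hail.log"):
    """Return the netlib-java implementation classes reported as failing to load."""
    failed = []
    with io.open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            _, found, rest = line.partition(WARNING_PREFIX)
            if found:
                # Whatever follows the prefix is the class name.
                name = rest.strip()
                if name and name not in failed:
                    failed.append(name)
    return failed
```

An empty result means no such warning was logged in that run; it does not by itself confirm which BLAS or LAPACK implementation was loaded.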
Running the tests
Several Hail tests have additional dependencies:
Other recent versions of QCTOOL and R should suffice, but PLINK 1.7 will not.
To execute all Hail tests, run
$ ./gradlew -Dspark.home=$SPARK_HOME test