Google Cloud Platform

If you’re new to Google Cloud in general and would like an overview, the document linked here was written to onboard new users within our lab to cloud computing.

hailctl dataproc

As of version 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool has a submodule called dataproc for working with Google Dataproc clusters configured for Hail.

This tool requires the Google Cloud SDK.

Until full documentation for the command-line interface is written, we encourage you to run the following command to see the list of modules:

hailctl dataproc

It is possible to print help for a specific command using the help flag:

hailctl dataproc start --help

To start a cluster, use:

hailctl dataproc start CLUSTER_NAME [optional args...]

To submit a Python job to that cluster, use:

hailctl dataproc submit CLUSTER_NAME SCRIPT [optional args to your python script...]

To connect to a Jupyter notebook running on that cluster, use:

hailctl dataproc connect CLUSTER_NAME notebook [optional args...]

To list active clusters, use:

hailctl dataproc list

Importantly, to shut down a cluster when done with it, use:

hailctl dataproc stop CLUSTER_NAME
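
Because a running cluster keeps incurring charges until it is stopped, it can help to script the start, submit, stop sequence so that the cluster is shut down even when the job fails. The following is a minimal Python sketch under stated assumptions, not part of hailctl: the function name run_on_cluster and its run parameter are hypothetical, and it assumes hailctl is on your PATH and the Google Cloud SDK is already configured.

```python
import subprocess


def run_on_cluster(cluster_name, script, start_args=(), run=subprocess.run):
    """Start a cluster, submit one script, and always stop the cluster.

    `run` defaults to subprocess.run; it is a parameter only so the
    sequence can be dry-run without touching Google Cloud (hypothetical
    helper, not a hailctl API).
    """
    base = ["hailctl", "dataproc"]
    # check=True raises CalledProcessError if hailctl exits non-zero.
    run(base + ["start", cluster_name, *start_args], check=True)
    try:
        run(base + ["submit", cluster_name, script], check=True)
    finally:
        # Runs whether or not the submit step succeeded.
        run(base + ["stop", cluster_name], check=True)
```

For example, run_on_cluster("my-cluster", "analysis.py") would issue the three commands in order; my-cluster and analysis.py are placeholder names. If the start step itself fails, the sketch does not attempt to stop anything, so it is still worth checking hailctl dataproc list afterwards.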

Reading from Google Cloud Storage

A Dataproc cluster created through hailctl dataproc is automatically configured to allow Hail to read files from Google Cloud Storage (GCS). To allow Hail to read from GCS when running locally, you need to install the Cloud Storage Connector. The easiest way to do that is to run the following script from your command line:

curl -sSL | python3

After this is installed, you’ll be able to read from paths beginning with gs:// directly from your laptop.

Requester Pays

Some Google Cloud buckets are Requester Pays, meaning that accessing them incurs charges on the requester. Google breaks down the charges in the linked document, but the most important class of charges to be aware of is Network Charges, specifically the egress charges. You should always be careful when reading data from a bucket in a different region than your own project, as it is easy to rack up a large bill. For this reason, you must specifically enable requester pays on your hailctl dataproc cluster if you’d like to use it.

To allow your cluster to read from any requester pays bucket, use:

hailctl dataproc start CLUSTER_NAME --requester-pays-allow-all

To make it easier to avoid accidentally reading from a requester pays bucket, we also have --requester-pays-allow-buckets. If you’d like to enable only reading from buckets named hail-bucket and big-data, you can specify the following:

hailctl dataproc start my-cluster --requester-pays-allow-buckets hail-bucket,big-data

Users of the Annotation Database will find that many of the files are stored in requester pays buckets. In order to allow the dataproc cluster to read from them, you can either use --requester-pays-allow-all from above or use the special --requester-pays-allow-annotation-db to enable the specific list of buckets that the annotation database relies on.
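
When cluster startup is scripted, the three requester pays options above can be assembled from a small helper. This is only a sketch: requester_pays_args is a hypothetical name, not a hailctl function, and my-cluster is a placeholder. It uses only the flags described above. Whether hailctl accepts --requester-pays-allow-buckets together with --requester-pays-allow-annotation-db is not stated here, so confirm with hailctl dataproc start --help before combining them.

```python
def requester_pays_args(buckets=None, allow_all=False, annotation_db=False):
    """Return extra arguments for `hailctl dataproc start` (hypothetical helper)."""
    if allow_all:
        # Every requester pays bucket becomes readable, so a bucket
        # allow-list would add nothing.
        return ["--requester-pays-allow-all"]
    args = []
    if buckets:
        # The flag takes one comma-separated list of bucket names.
        args += ["--requester-pays-allow-buckets", ",".join(buckets)]
    if annotation_db:
        args.append("--requester-pays-allow-annotation-db")
    return args


start_cmd = ["hailctl", "dataproc", "start", "my-cluster"]
start_cmd += requester_pays_args(buckets=["hail-bucket", "big-data"])
print(" ".join(start_cmd))
```

Preferring an explicit bucket list over allow_all keeps the protection described above: a read from an unlisted requester pays bucket is rejected rather than silently billed.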

Variant Effect Predictor (VEP)

The following cluster configuration enables Hail to run VEP in parallel on every variant in a dataset containing GRCh37 variants:

hailctl dataproc start CLUSTER_NAME --vep GRCh37

Hail also supports VEP for GRCh38 variants, but you must start a cluster with the argument --vep GRCh38. A cluster started without the --vep argument is unable to run VEP and cannot be modified to run VEP. You must start a new cluster using --vep.