Use Hail on Google Dataproc

Requirements

Before you can run Hail on Dataproc, you must have both Hail and the Google Cloud CLI installed on your machine.

Installing Hail

First, install Hail on your Mac OS X or Linux laptop or desktop. The Hail pip package includes a tool called hailctl dataproc which starts, stops, and manipulates Hail-enabled Dataproc clusters.

Installing and configuring the Google Cloud SDK

We recommend that you follow the Google Cloud SDK documentation to install the Google Cloud SDK.

You will need to configure your Google Cloud SDK after installation. This is the time to set up your Google Cloud project and billing, if you don’t already have one.

Running Hail on Dataproc requires passing in a Dataproc region.

If you’d like to set your Dataproc region globally, you can do so by running:

gcloud config set dataproc/region <your-region>

Otherwise, you can set your Dataproc region using the hailctl –region command line flag.

Starting your first Dataproc cluster

Start a dataproc cluster named “my-first-cluster”. Cluster names may only contain a mix lowercase letters and dashes. Starting a cluster can take as long as two minutes.

hailctl dataproc start my-first-cluster

Create a file called “hail-script.py” and place the following analysis of a randomly generated dataset with five-hundred samples and half-a-million variants.

import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=500,
                              n_variants=500_000,
                              n_partitions=32)
mt = mt.annotate_cols(drinks_coffee = hl.rand_bool(0.33))
gwas = hl.linear_regression_rows(y=mt.drinks_coffee,
                                 x=mt.GT.n_alt_alleles(),
                                 covariates=[1.0])
gwas.order_by(gwas.p_value).show(25)

Submit the analysis to the cluster and wait for the results. You should not have to wait more than a minute.

hailctl dataproc submit my-first-cluster hail-script.py

When the script is done running you’ll see 25 rows of variant association results.

You can also start a Jupyter Notebook running on the cluster:

hailctl dataproc connect my-first-cluster notebook

When you are finished with the cluster stop it:

hailctl dataproc stop my-first-cluster

Next Steps