Use Hail on Google Dataproc
Requirements
Before you can run Hail on Dataproc, you must have both Hail and the Google Cloud CLI installed on your machine.
Installing Hail
First, install Hail on your Mac OS X or Linux laptop or
desktop. The Hail pip package includes a tool called hailctl dataproc
which starts, stops, and
manipulates Hail-enabled Dataproc clusters.
Installing and configuring the Google Cloud SDK
We recommend that you follow the Google Cloud SDK documentation to install the Google Cloud SDK.
You will need to configure your Google Cloud SDK after installation. This is the time to set up your Google Cloud project and billing, if you don’t already have one.
Running Hail on Dataproc requires passing in a Dataproc region.
If you’d like to set your Dataproc region globally, you can do so by running:
gcloud config set dataproc/region <your-region>
Otherwise, you can set your Dataproc region using the hailctl –region command line flag.
Starting your first Dataproc cluster
Start a dataproc cluster named “my-first-cluster”. Cluster names may only contain a mix lowercase letters and dashes. Starting a cluster can take as long as two minutes.
hailctl dataproc start my-first-cluster
Create a file called “hail-script.py” and place the following analysis of a randomly generated dataset with five-hundred samples and half-a-million variants.
import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
n_samples=500,
n_variants=500_000,
n_partitions=32)
mt = mt.annotate_cols(drinks_coffee = hl.rand_bool(0.33))
gwas = hl.linear_regression_rows(y=mt.drinks_coffee,
x=mt.GT.n_alt_alleles(),
covariates=[1.0])
gwas.order_by(gwas.p_value).show(25)
Submit the analysis to the cluster and wait for the results. You should not have to wait more than a minute.
hailctl dataproc submit my-first-cluster hail-script.py
When the script is done running you’ll see 25 rows of variant association results.
You can also start a Jupyter Notebook running on the cluster:
hailctl dataproc connect my-first-cluster notebook
When you are finished with the cluster stop it:
hailctl dataproc stop my-first-cluster
Next Steps
Read more about Hail on Google Cloud
Get the Hail cheatsheets
Follow the Hail GWAS Tutorial