Hail on the Cloud¶
Public clouds are a natural place to run Hail: they offer on-demand, highly elastic compute. For example, Google and Amazon make it possible to rent Spark clusters with many thousands of cores on demand, meeting the elastic compute requirements of scientific research without an up-front capital investment in hardware.
Google Cloud Platform¶
As of version 0.2.15, pip installations of Hail come bundled with a command-line
tool, hailctl. This tool has a submodule called
dataproc, the successor
to Liam Abbott’s cloudtools, for
working with Google Dataproc clusters
configured for Hail.
This tool requires the Google Cloud SDK.
Until full documentation for the command-line interface is written, we encourage you to run the following command to see the list of modules:
hailctl dataproc
It is possible to print help for a specific command using the --help flag:
hailctl dataproc start --help
To start a cluster, use:
hailctl dataproc start CLUSTER_NAME [optional args...]
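For example, a typical invocation might look like the following sketch. The cluster name my-cluster is a placeholder, and the flags shown are common options in recent hailctl versions; run hailctl dataproc start --help to see what your version supports:
hailctl dataproc start my-cluster \
    --num-workers 4 \
    --max-idle 30m
The --max-idle flag asks Dataproc to delete the cluster automatically after a period of inactivity, a useful guard against forgotten clusters.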
To submit a Python job to that cluster, use:
hailctl dataproc submit CLUSTER_NAME SCRIPT [optional args to your python script...]
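SCRIPT is an ordinary Python file that imports and initializes Hail. A minimal sketch (the file name hello_hail.py and the toy table are illustrative):
# hello_hail.py: a minimal job to verify the cluster works.
import hail as hl

# On a cluster started by hailctl, init() should pick up the existing
# Spark configuration rather than starting a local backend.
hl.init()

# Build a small table distributed across the cluster and count its rows.
ht = hl.utils.range_table(1000)
print(ht.count())
You would then run, for example, hailctl dataproc submit my-cluster hello_hail.py.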
To connect to a Jupyter notebook running on that cluster, use:
hailctl dataproc connect CLUSTER_NAME notebook [optional args...]
To list active clusters, use:
hailctl dataproc list
Importantly, to shut down a cluster when done with it, use:
hailctl dataproc stop CLUSTER_NAME
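Putting these commands together, a complete session might look like this sketch (the cluster name and script are placeholders; note that a cluster continues to bill until it is stopped):
hailctl dataproc start my-cluster
hailctl dataproc submit my-cluster hello_hail.py
hailctl dataproc stop my-cluster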
Amazon Web Services¶
hailctl does not currently include a module for AWS, but open-source, community-maintained resources exist for running Hail on Amazon EMR.
Other Cloud Providers¶
There are no known open-source resources for working with Hail on cloud providers other than Google and AWS. If you know of one, please submit a pull request to add it here!
If you have scripts for working with Hail on other cloud providers, we may be
interested in including those scripts in
hailctl (see above) as new
modules. Stop by the dev forum to chat!