Microsoft Azure

hailctl hdinsight

As of version 0.2.82, pip installations of Hail come bundled with a command-line tool, hailctl hdinsight, for working with Microsoft Azure HDInsight Spark clusters configured for Hail.

This tool requires the Azure CLI.

An HDInsight cluster always consists of two “head” nodes, two or more “worker” nodes, and an Azure Blob Storage container. The head nodes are automatically configured to serve Jupyter Notebooks at https://CLUSTER_NAME.azurehdinsight.net/jupyter. The Jupyter server is protected by a username and password, which are printed to the terminal after the cluster is created.

Every HDInsight cluster is associated with one storage account, which your Jupyter notebooks may access. In addition, HDInsight creates a container within this storage account (sharing a name with the cluster) for its own purposes. When a cluster is stopped with hailctl hdinsight stop, this container is deleted.
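
You can check this behavior with the Azure CLI. The following sketch lists the containers in your storage account (STORAGE_ACCOUNT is a placeholder); after hailctl hdinsight stop completes, the cluster's container should no longer appear:

az storage container list --account-name STORAGE_ACCOUNT --auth-mode login --output table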

To start a cluster, you must specify the cluster name, a storage account, and a resource group. The storage account must be in the given resource group.

hailctl hdinsight start CLUSTER_NAME STORAGE_ACCOUNT RESOURCE_GROUP
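
For example, with a hypothetical cluster named my-hail-cluster, a storage account named myhailstorage, and a resource group named my-hail-rg, the invocation might look like:

hailctl hdinsight start my-hail-cluster myhailstorage my-hail-rg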

To submit a Python job to that cluster, use the following command, where HTTP_PASSWORD is the password printed when the cluster was created:

hailctl hdinsight submit CLUSTER_NAME STORAGE_ACCOUNT HTTP_PASSWORD SCRIPT [optional args to your python script...]
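
For example, submitting a hypothetical script analysis.py, which takes one argument of its own, to the same hypothetical cluster and storage account as above might look like:

hailctl hdinsight submit my-hail-cluster myhailstorage HTTP_PASSWORD analysis.py --chromosome 22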

To list running clusters:

hailctl hdinsight list

Importantly, to shut down a cluster when done with it, use:

hailctl hdinsight stop CLUSTER_NAME STORAGE_ACCOUNT RESOURCE_GROUP

Variant Effect Predictor (VEP)

The following cluster configuration enables Hail to run VEP in parallel on every variant in a dataset containing GRCh37 variants:

hailctl hdinsight start CLUSTER_NAME STORAGE_ACCOUNT RESOURCE_GROUP \
        --vep GRCh37 \
        --vep-loftee-uri https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER/loftee-GRCh37 \
        --vep-homo-sapiens-uri https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER/homo-sapiens-GRCh37

Those two URIs must point to directories containing the VEP data files. You can populate them by downloading the two tar files, gs://hail-us-central1-vep/loftee-beta/GRCh37.tar and gs://hail-us-central1-vep/homo-sapiens/85_GRCh37.tar, with gcloud storage cp; extracting each into a local folder; and uploading those folders to your storage account with az storage copy, as sketched below. The hail-us-central1-vep Google Cloud Storage bucket is a requester-pays bucket, which means you must pay the cost of transferring the files out of Google Cloud. We do not provide these files in Azure because Azure Blob Storage lacks an equivalent cost-control mechanism.
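
The following sketch stages the loftee data, assuming PROJECT is a Google Cloud billing project you control and STORAGE_ACCOUNT and CONTAINER match the URIs above; the homo-sapiens tar file is handled the same way:

# Download from the requester-pays bucket, billing the transfer to your project.
gcloud storage cp --billing-project=PROJECT gs://hail-us-central1-vep/loftee-beta/GRCh37.tar .

# Extract the archive into a local folder.
mkdir -p loftee-GRCh37
tar -xf GRCh37.tar -C loftee-GRCh37

# Upload the folder into the container; verify that the resulting blob paths
# match the directory passed to --vep-loftee-uri.
az storage copy --recursive -s loftee-GRCh37 -d https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER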

Hail also supports VEP for GRCh38 variants. The required tar files are located at gs://hail-REGION-vep/loftee-beta/GRCh38.tar and gs://hail-REGION-vep/homo-sapiens/95_GRCh38.tar, where REGION is a Google Cloud region such as us-central1.
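
By analogy with the GRCh37 example above, a GRCh38 cluster might be started as follows, assuming the GRCh38 data has been staged under the (placeholder) directory names shown:

hailctl hdinsight start CLUSTER_NAME STORAGE_ACCOUNT RESOURCE_GROUP \
        --vep GRCh38 \
        --vep-loftee-uri https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER/loftee-GRCh38 \
        --vep-homo-sapiens-uri https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER/homo-sapiens-GRCh38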

A cluster started without the --vep argument cannot run VEP and cannot later be modified to run it; you must start a new cluster with --vep.