Hail Query-on-Batch


Hail Query-on-Batch (the Batch backend) is currently in beta. This means some functionality is not yet working. Please contact us if you would like to use missing functionality on Query-on-Batch!

Hail Query-on-Batch uses Hail Batch instead of Apache Spark to execute jobs. Instead of a Dataproc cluster, you will need a Hail Batch cluster. For more information on using Hail Batch, see the Hail Batch docs. For more information on deploying a Hail Batch cluster, please contact the Hail Team at our discussion forum.

Getting Started

  1. Install Hail version 0.2.93 or later:

     pip install 'hail>=0.2.93'

  2. Sign up for a Hail Batch account (currently only available to Broad affiliates).

  3. Authenticate with Hail Batch:

     hailctl auth login

  4. Specify a bucket for Hail to use for temporary intermediate files. In Google Cloud, we recommend using a bucket with automatic deletion after a set period of time.

     hailctl config set batch/remote_tmpdir gs://my-auto-delete-bucket/hail-query-temporaries

  5. Specify a Hail Batch billing project (these are different from Google Cloud projects). Every new user has a trial billing project loaded with 10 USD. The name is available on the Hail User account page.

     hailctl config set batch/billing_project my-billing-project

  6. Set the default Hail Query backend to batch:

     hailctl config set query/backend batch

  7. Now you are ready to try Hail! If you want to switch back to Query-on-Spark, run the previous command again with “spark” in place of “batch”.
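
The configuration steps above (temporary bucket, billing project, backend) can be collected into one script. The following is a minimal sketch, not an official setup tool. It assumes that the bucket already exists, that `gsutil` is installed, and that a lifecycle rule is an acceptable way to get automatic deletion. The 7-day retention period, the `DRY_RUN` switch, and the `run` helper are illustrative choices that do not come from Hail. The bucket and billing project names are the same placeholders used in the steps.

```shell
#!/bin/sh
# Sketch only: the names below are placeholders, replace them before use.
set -eu

BUCKET="${BUCKET:-my-auto-delete-bucket}"                 # placeholder bucket name
BILLING_PROJECT="${BILLING_PROJECT:-my-billing-project}"  # placeholder billing project
MAX_AGE_DAYS="${MAX_AGE_DAYS:-7}"                         # assumed retention period
DRY_RUN="${DRY_RUN:-1}"                                   # 1 = only print the commands

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Write a GCS lifecycle rule that deletes objects older than MAX_AGE_DAYS days.
LIFECYCLE_FILE="$(mktemp)"
cat > "$LIFECYCLE_FILE" <<EOF
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": ${MAX_AGE_DAYS}}}]}
EOF

# Apply the lifecycle rule to the bucket, then point Hail at it.
run gsutil lifecycle set "$LIFECYCLE_FILE" "gs://${BUCKET}"
run hailctl config set batch/remote_tmpdir "gs://${BUCKET}/hail-query-temporaries"
run hailctl config set batch/billing_project "$BILLING_PROJECT"
run hailctl config set query/backend batch
```

Running the script with its defaults only prints the commands so they can be reviewed. Setting `DRY_RUN=0` executes them. `hailctl auth login` is left out because it is interactive and should be run by hand first.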

Variant Effect Predictor (VEP)

More information coming very soon. If you want to use VEP with Hail Query-on-Batch, please contact the Hail Team at our discussion forum.