Batch Service

Warning

The Batch Service is currently only available to Broad Institute users. If you are interested in using the Batch Service, please send us an email at hail-team@broadinstitute.org.

Warning

Ensure you have installed the Google Cloud SDK as described in the Batch Service section of Getting Started.

What is the Batch Service?

Instead of executing jobs on your local computer (the default in Batch), you can execute them on the Batch Service, a multi-tenant compute cluster in Google Cloud that is managed by the Hail team. The Batch Service consists of a scheduler that receives job submission requests from users and executes the jobs in Docker containers on Google Compute Engine VMs (workers) that are shared among all Batch users. A UI at https://batch.hail.is lets you see job progress and access logs.

Sign Up

Broad Institute users can click the Sign Up button at notebook.hail.is. This lets you authorize with your Broad Institute email address and create a Batch Service account. A Google service account is created on your behalf. A trial Batch billing project named <USERNAME>-trial is also created for you. You can view both at https://auth.hail.is/user.

File Localization

A job is executed in three separate Docker containers: input, main, and output. The input container downloads files from Google Storage onto a disk shared by the containers. These input files are either inputs to the batch or output files generated by jobs that this job depends on. The main container, where the user’s code is executed, then reads the downloaded files from the shared disk. Finally, the output container runs and uploads the files on the shared disk that have been marked for upload. These are either files specified with Batch.write_output() or files that downstream jobs depend on.

_images/file_localization.png
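As a rough illustration of the three stages above, the sketch below adds a job with one input file and one uploaded output file. The helper name add_copy_job and the bucket paths in the comments are hypothetical; the function assumes b is a hailtop.batch.Batch and uses only Batch.read_input(), Batch.new_job(), Job.command(), and Batch.write_output().

```python
def add_copy_job(b, input_path, output_path):
    """Add a job that copies one input file to one output file.

    `b` is assumed to be a `hailtop.batch.Batch`; `add_copy_job` itself is a
    hypothetical helper written for this illustration.
    """
    # Declaring the input tells Batch that the input container must download
    # this file onto the shared disk before the main container starts.
    infile = b.read_input(input_path)

    j = b.new_job(name='copy')
    # The main container runs this command. `j.ofile` names a file on the
    # shared disk that the command writes to.
    j.command(f'cat {infile} > {j.ofile}')

    # Marking the file for output tells the output container to upload it.
    b.write_output(j.ofile, output_path)
    return j


# Hypothetical usage with the Batch Service (paths are placeholders):
#   import hailtop.batch as hb
#   backend = hb.ServiceBackend('my-billing-project',
#                               remote_tmpdir='gs://my-bucket/batch/tmp/')
#   b = hb.Batch(backend=backend, name='copy-example')
#   add_copy_job(b, 'gs://my-bucket/input.txt', 'gs://my-bucket/output.txt')
#   b.run()
```

Passing the batch in as an argument keeps the job definition separate from the choice of backend, so the same helper can be used with a local batch while testing.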

Service Accounts

A Google service account is automatically created for each new Batch user. Batch uses this account to download and upload data on your behalf. To get the name of the service account, click on your name on the header bar or go to https://auth.hail.is/user.

To give the service account read and write access to a Google Storage bucket, run the following command, substituting SERVICE_ACCOUNT_NAME with the full service account name (e.g., test@my-project.iam.gserviceaccount.com) and BUCKET_NAME with your bucket name. See the Google Cloud Storage documentation for more information about access control.

gsutil iam ch serviceAccount:[SERVICE_ACCOUNT_NAME]:objectAdmin gs://[BUCKET_NAME]

The Google Container Registry (GCR) is a Docker registry hosted by Google that is an alternative to Docker Hub for storing images. We recommend using GCR for images that shouldn’t be publicly available. If you have a GCR associated with your project, you can let the service account view Docker images with the command below, where SERVICE_ACCOUNT_NAME is your full service account name and PROJECT_ID is the ID of the project you want to grant access to:

gsutil iam ch serviceAccount:[SERVICE_ACCOUNT_NAME]:objectViewer gs://artifacts.[PROJECT_ID].appspot.com

If you want to run gcloud or gsutil commands within your Batch jobs, the service account key file is available at /gsa-key/key.json in the main container. You can authenticate as the service account by adding the following line to your user code and using a Docker image that has gcloud and gsutil installed.

gcloud -q auth activate-service-account --key-file=/gsa-key/key.json
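A job that uses this might be put together as in the sketch below. The helper add_gcloud_job is hypothetical, b is assumed to be a hailtop.batch.Batch, and the image name in the usage comment is only an example of an image that ships gcloud and gsutil; substitute whichever image you actually use.

```python
# Path of the service account key inside the main container (from the text above).
GSA_KEY_FILE = '/gsa-key/key.json'


def add_gcloud_job(b, image, command):
    """Add a job that authenticates as the service account, then runs `command`.

    `b` is assumed to be a `hailtop.batch.Batch`; `add_gcloud_job` is a
    hypothetical helper written for this illustration.
    """
    j = b.new_job(name='gcloud-job')
    # The image must have gcloud and gsutil installed.
    j.image(image)
    # Authenticate first so that later gcloud/gsutil calls use the service account.
    j.command(f'gcloud -q auth activate-service-account --key-file={GSA_KEY_FILE}')
    j.command(command)
    return j


# Hypothetical usage (image and bucket are placeholders):
#   add_gcloud_job(b, 'google/cloud-sdk:slim', 'gsutil ls gs://my-bucket/')
```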

Billing

The cost of executing a job depends on the underlying machine type and on how much CPU and memory the job requests. Currently, Batch runs most jobs on 16-core, preemptible n1 machines with 10 GB of persistent SSD boot disk and 375 GB of local SSD. The costs are as follows:

  • Compute cost

    = $0.01 per core per hour for standard worker types

    = $0.012453 per core per hour for highmem worker types

    = $0.0074578 per core per hour for highcpu worker types

  • Disk cost
    • Boot disk (10 GB persistent SSD)

      Average number of days per month = 365.25 / 12 = 30.4375

      Cost per GB per month = $0.17

      Cost per core per hour = $0.17 * 10 / 30.4375 / 24 / 16 = $0.000145

    • Local SSD (375 GB)

      Average number of days per month = 365.25 / 12 = 30.4375

      Cost per GB per month = $0.048

      Cost per core per hour = $0.048 * 375 / 30.4375 / 24 / 16 = $0.001540

    = $0.001685 per core per hour (boot disk plus local SSD)

    • Extra storage (only if requested by the job)

      Average number of days per month = 365.25 / 12 = 30.4375

      Cost per GB per month = $0.17

      Cost per GB per hour = $0.17 / 30.4375 / 24 = $0.00023

  • IP network cost

    = $0.00025 per core per hour

  • Service cost

    = $0.01 per core per hour

The sum of these costs is $0.021935 per core/hour for standard workers, $0.024388 per core/hour for highmem workers, and $0.019393 per core/hour for highcpu workers. There is also an additional cost of $0.00023 per GB per hour of extra storage requested.
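The arithmetic behind these totals can be reproduced with a few lines of Python. This is only a sketch that uses the prices quoted above for 16-core workers; the function and constant names are made up for the illustration, and actual billing is computed by the service.

```python
# All prices are the USD figures quoted in the text above.
HOURS_PER_MONTH = 365.25 / 12 * 24  # 30.4375 days * 24 hours = 730.5
CORES_PER_WORKER = 16

COMPUTE_PER_CORE_HOUR = {
    'standard': 0.01,
    'highmem': 0.012453,
    'highcpu': 0.0074578,
}
BOOT_DISK_GB = 10
LOCAL_SSD_GB = 375
PERSISTENT_SSD_PER_GB_MONTH = 0.17
LOCAL_SSD_PER_GB_MONTH = 0.048
IP_PER_CORE_HOUR = 0.00025
SERVICE_PER_CORE_HOUR = 0.01


def disk_cost_per_core_hour():
    """Boot disk plus local SSD, spread over the cores of one worker."""
    boot = PERSISTENT_SSD_PER_GB_MONTH * BOOT_DISK_GB / HOURS_PER_MONTH / CORES_PER_WORKER
    local = LOCAL_SSD_PER_GB_MONTH * LOCAL_SSD_GB / HOURS_PER_MONTH / CORES_PER_WORKER
    return boot + local


def cost_per_core_hour(worker_type):
    """Sum of compute, disk, IP network, and service costs for one core-hour."""
    return (COMPUTE_PER_CORE_HOUR[worker_type]
            + disk_cost_per_core_hour()
            + IP_PER_CORE_HOUR
            + SERVICE_PER_CORE_HOUR)


def extra_storage_cost_per_gb_hour():
    """Hourly price of one GB of extra storage requested by a job."""
    return PERSISTENT_SSD_PER_GB_MONTH / HOURS_PER_MONTH


def estimate_job_cost(worker_type, cores, hours, extra_storage_gb=0):
    """Rough estimate of a job's cost on a 16-core worker."""
    per_hour = (cores * cost_per_core_hour(worker_type)
                + extra_storage_gb * extra_storage_cost_per_gb_hour())
    return per_hour * hours


for worker_type in COMPUTE_PER_CORE_HOUR:
    print(f'{worker_type}: ${cost_per_core_hour(worker_type):.6f} per core per hour')
```

For example, estimate_job_cost('standard', cores=4, hours=2, extra_storage_gb=100) estimates a two-hour, four-core job with 100 GB of extra storage.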

At any given moment, as many as four cores of the cluster may come from a 4-core machine if the worker type is standard. If a job is scheduled on this machine, then the cost per core hour is $0.02774 plus $0.00023 per GB per hour of extra storage requested.

Note

If the memory is specified as ‘lowmem’, ‘standard’, or ‘highmem’, then the worker type used is ‘highcpu’, ‘standard’, or ‘highmem’, respectively. Otherwise, we choose the cheapest worker type for you based on the cpu and memory requests. In this case, the cheapest configuration may round the requested cpu up to the next power of two in order to obtain more memory on a cheaper worker type.

Note

The storage for the root file system (/) is 5 Gi per job for jobs with at least 1 core. If a job requests less than 1 core, then it receives that fraction of 5 Gi. If you need more storage than this, you can request it explicitly with the Job.storage() method. The minimum storage request is 10 GB; requests can be increased in increments of 1 GB up to a maximum of 64 TB. The additional storage is mounted at /io.

Note

If a worker is preempted by Google in the middle of running a job, you will be billed for the time the job was running up until the preemption time. The job will be rescheduled on a different worker and run again. Therefore, if a job takes 5 minutes to run, but was preempted after running for 2 minutes and then runs successfully the next time it is scheduled, you will be billed for a total of 7 minutes for that job.
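The preemption example can be written out as a small calculation. This is only an illustrative sketch: the function names are made up, and the rate passed in the example is the standard-worker figure quoted earlier in this section.

```python
def billed_minutes(attempt_minutes):
    """Total billed time: every attempt counts, including preempted ones."""
    return sum(attempt_minutes)


def preempted_job_cost(attempt_minutes, cores, rate_per_core_hour):
    """Cost of a job given the duration in minutes of each of its attempts."""
    return billed_minutes(attempt_minutes) / 60 * cores * rate_per_core_hour


# One attempt preempted after 2 minutes, then a successful 5-minute attempt.
attempts = [2, 5]
print(billed_minutes(attempts))  # 7
print(preempted_job_cost(attempts, cores=1, rate_per_core_hour=0.021935))
```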

Setup

We assume you’ve already installed Batch and the Google Cloud SDK as described in the Getting Started section and we have created a user account for you and given you a billing project.

To authenticate your computer with the Batch service, run the following commands in a terminal window:

gcloud auth application-default login
hailctl auth login

Executing these commands will take you to a login page in your browser window where you can select the Google account to authenticate with. If everything works successfully, you should see the message “hailctl is now authenticated.” in your browser window and no error messages in the terminal window.

Submitting a Batch to the Service

To execute a batch on the Batch service rather than locally, first construct a ServiceBackend object with a billing project and bucket for storing intermediate files. Your service account must have read and write access to the bucket.

Warning

To avoid expensive egress charges, make sure your bucket is multi-regional in the United States because Batch runs your job in any US region.

Next, pass the ServiceBackend object to the Batch constructor with the parameter name backend.

An example of running “Hello World” on the Batch service rather than locally is shown below. You can open IPython or a Jupyter notebook and execute the following batch:

>>> import hailtop.batch as hb 
>>> backend = hb.ServiceBackend('my-billing-project', remote_tmpdir='gs://my-bucket/batch/tmp/') 
>>> b = hb.Batch(backend=backend, name='test') 
>>> j = b.new_job(name='hello') 
>>> j.command('echo "hello world"') 
>>> b.run(open=True) 

You may omit the billing_project and remote_tmpdir parameters if you have previously set them with hailctl:

hailctl config set batch/billing_project my-billing-project
hailctl config set batch/remote_tmpdir my-remote-tmpdir

Note

A trial billing project is automatically created for you with the name <USERNAME>-trial.

Using the UI

If you have submitted the batch above successfully, then a UI page for that batch should open in your browser. It shows a list of all the jobs in the batch with their current state, exit code, duration, and cost. The possible job states are as follows:

  • Pending - A job is waiting for its dependencies to complete

  • Ready - All of a job’s dependencies have completed, but the job has not been scheduled to run

  • Running - A job has been scheduled to run on a worker

  • Success - A job finished with exit code 0

  • Failure - A job finished with exit code not equal to 0

  • Error - The Docker container had an error (e.g., out of memory)

Clicking on a specific job will take you to a page with the logs for each of the three containers run per job (see above) as well as a copy of the job spec and detailed information about the job such as where the job was run, how long it took to pull the image for each container, and any error messages.

To see all batches you’ve submitted, go to https://batch.hail.is. Each batch is listed with its current state, the total number of jobs, the number of pending, succeeded, failed, and cancelled jobs, and the running cost of the batch (computed from completed jobs only). The possible batch states are as follows:

  • open - Not all jobs in the batch have been successfully submitted.

  • running - All jobs in the batch have been successfully submitted.

  • success - All jobs in the batch have completed with state “Success”.

  • failure - Any job has completed with state “Failure” or “Error”.

  • cancelled - Any job has been cancelled and no jobs have completed with state “Failure” or “Error”.

Note

Jobs can still be running even if the batch has been marked as failure or cancelled. In the case of ‘failure’, other jobs that do not depend on the failed job will still run. In the case of cancelled, it takes time to cancel a batch, especially for larger batches.
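One way to read the batch states listed above is as a function of the job states. The sketch below is an interpretation for illustration only, not the service’s implementation: the ‘Cancelled’ job label and the order in which the rules are checked are assumptions.

```python
FAILED_STATES = ('Failure', 'Error')


def batch_state(job_states, all_jobs_submitted=True):
    """Derive a batch state from its job states, following the list above.

    'Cancelled' as a job label and the precedence of the checks are
    assumptions made for this sketch.
    """
    if not all_jobs_submitted:
        return 'open'
    # A failed job marks the batch as failure even while other jobs still run.
    if any(state in FAILED_STATES for state in job_states):
        return 'failure'
    # Cancelled only applies when no job has failed.
    if 'Cancelled' in job_states:
        return 'cancelled'
    if all(state == 'Success' for state in job_states):
        return 'success'
    return 'running'


print(batch_state(['Success', 'Failure', 'Running']))
```

In this reading, a batch with one failed job and one running job is reported as failure, which matches the note that jobs can still be running in a batch marked as failure.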

Individual jobs cannot be cancelled or deleted. Instead, you can cancel the entire batch with the “Cancel” button next to the row for that batch. You can also delete a batch with the “Delete” button.

Warning

Deleting a batch only removes it from the UI. You will still be billed for a deleted batch.

Important Notes

Warning

To avoid expensive egress charges, input and output files should be located in buckets that are multi-regional in the United States because Batch runs jobs in any US region.