=====================
Google Cloud Platform
=====================

If you're new to Google Cloud in general and would like an overview, `here `__
is a document written to onboard new users within our lab to cloud computing.

``hailctl dataproc``
--------------------

As of version 0.2.15, pip installations of Hail come bundled with a
command-line tool, ``hailctl``. This tool has a submodule called ``dataproc``
for working with `Google Dataproc `__ clusters configured for Hail. This tool
requires the `Google Cloud SDK `__.

Until full documentation for the command-line interface is written, we
encourage you to run the following command to see the list of modules:

.. code-block:: text

    hailctl dataproc

It is possible to print help for a specific command using the ``--help`` flag:

.. code-block:: text

    hailctl dataproc start --help

To start a cluster, use:

.. code-block:: text

    hailctl dataproc start CLUSTER_NAME [optional args...]

To submit a Python job to that cluster, use:

.. code-block:: text

    hailctl dataproc submit CLUSTER_NAME SCRIPT [optional args to your python script...]

To connect to a Jupyter notebook running on that cluster, use:

.. code-block:: text

    hailctl dataproc connect CLUSTER_NAME notebook [optional args...]

To list active clusters, use:

.. code-block:: text

    hailctl dataproc list

Importantly, to shut down a cluster when you are done with it, use:

.. code-block:: text

    hailctl dataproc stop CLUSTER_NAME

Reading from Google Cloud Storage
---------------------------------

A Dataproc cluster created through ``hailctl dataproc`` will automatically be
configured to allow Hail to read files from Google Cloud Storage (GCS). To
allow Hail to read from GCS when running locally, you need to install the
`Cloud Storage Connector `_. The easiest way to do that is to run the
following script from your command line:

.. code-block:: text

    curl -sSL https://broad.io/install-gcs-connector | python3

After this is installed, you'll be able to read from paths beginning with
``gs://`` directly from your laptop.

.. _GCP Requester Pays:

Requester Pays
--------------

Some Google Cloud buckets are `Requester Pays `_, meaning that accessing them
incurs charges on the requester. Google breaks down the charges in the linked
document, but the most important class of charges to be aware of is
`Network Charges `_, specifically egress charges. You should always be
careful when reading data from a bucket in a different region than your own
project, as it is easy to rack up a large bill.

For this reason, you must specifically enable requester pays on your
``hailctl dataproc`` cluster if you'd like to use it. To allow your cluster
to read from any requester pays bucket, use:

.. code-block:: text

    hailctl dataproc start CLUSTER_NAME --requester-pays-allow-all

To make it easier to avoid accidentally reading from a requester pays bucket,
we also provide ``--requester-pays-allow-buckets``. If you'd like to enable
reading only from buckets named ``hail-bucket`` and ``big-data``, you can
specify the following:

.. code-block:: text

    hailctl dataproc start my-cluster --requester-pays-allow-buckets hail-bucket,big-data

Users of the :ref:`Annotation Database` will find that many of the files are
stored in requester pays buckets. To allow the Dataproc cluster to read from
them, you can either use ``--requester-pays-allow-all`` from above or the
special ``--requester-pays-allow-annotation-db`` flag, which enables the
specific list of buckets that the Annotation Database relies on.
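Putting the pieces above together, here is a minimal sketch of a script you
might submit with ``hailctl dataproc submit``. The bucket and dataset names
(``my-bucket``, ``my-dataset.mt``) are placeholders, not real paths; on a
cluster started by ``hailctl dataproc``, reading from ``gs://`` paths works
without further configuration.

.. code-block:: python

    # my_script.py -- a sketch only; the bucket and dataset names are placeholders.
    import hail as hl

    hl.init()

    # Read a MatrixTable directly from Google Cloud Storage.
    mt = hl.read_matrix_table('gs://my-bucket/my-dataset.mt')

    # A trivial computation to confirm the cluster can reach the data.
    print(mt.count())

You could then submit this script with, for example,
``hailctl dataproc submit CLUSTER_NAME my_script.py``.
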
.. _vep_dataproc:

Variant Effect Predictor (VEP)
------------------------------

The following cluster configuration enables Hail to run VEP in parallel on
every variant in a dataset containing GRCh37 variants:

.. code-block:: text

    hailctl dataproc start NAME --vep GRCh37

Hail also supports VEP for GRCh38 variants, but you must start a cluster with
the argument ``--vep GRCh38``.

A cluster started without the ``--vep`` argument is unable to run VEP and
cannot be modified to run VEP. You must start a new cluster using ``--vep``.
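As a usage sketch, on a cluster started with ``--vep GRCh37`` you can then
annotate a dataset from a submitted script with Hail's ``hl.vep`` method. The
dataset path below is a placeholder:

.. code-block:: python

    # A sketch only; the dataset path is a placeholder.
    import hail as hl

    hl.init()

    mt = hl.read_matrix_table('gs://my-bucket/grch37-dataset.mt')

    # Annotate every variant with VEP; on a cluster started with --vep,
    # the VEP configuration is supplied by the cluster setup.
    mt = hl.vep(mt)

    # The annotations are added as a new row field named `vep`.
    mt.rows().select('vep').show(5)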