Python API

This is the API documentation for Hail; it provides detailed information on the Python programming interface.

Use import hail as hl to access this functionality.

Classes

hail.Table

Hail’s distributed implementation of a dataframe or SQL table.

hail.GroupedTable

Table grouped by row that can be aggregated into a new table.

hail.MatrixTable

Hail’s distributed implementation of a structured matrix.

hail.GroupedMatrixTable

Matrix table grouped by row or column that can be aggregated into a new matrix table.
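
These classes compose: Table.group_by() yields a GroupedTable, and aggregating a GroupedTable yields a Table again (likewise for MatrixTable and GroupedMatrixTable). A minimal illustrative sketch, where the table and field names are arbitrary:

>>> ht = hl.utils.range_table(100)
>>> ht = ht.annotate(bucket = ht.idx % 3)
>>> grouped = ht.group_by(ht.bucket)                # GroupedTable
>>> result = grouped.aggregate(n = hl.agg.count())  # back to a Table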

Top-Level Functions

hail.init(sc=None, app_name=None, master=None, local='local[*]', log=None, quiet=False, append=False, min_block_size=0, branching_factor=50, tmp_dir=None, default_reference='GRCh37', idempotent=False, global_seed=6348563392232659379, spark_conf=None, skip_logging_configuration=False, local_tmpdir=None, _optimizer_iterations=None, *, backend=None, driver_cores=None, driver_memory=None, worker_cores=None, worker_memory=None)[source]

Initialize and configure Hail.

This function will be called with default arguments if any Hail functionality is used. If you need custom configuration, you must explicitly call this function before using Hail. For example, to set the default reference genome to GRCh38, import Hail and immediately call init():

>>> import hail as hl
>>> hl.init(default_reference='GRCh38')  

Hail has two backends, spark and batch. Hail selects a backend by consulting, in order, these configuration locations:

  1. The backend parameter of this function.

  2. The HAIL_QUERY_BACKEND environment variable.

  3. The value of hailctl config get query/backend.

If no configuration is found, Hail will select the Spark backend.
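
For example, a sketch of selecting a backend through the environment variable; it must be set before the first call into Hail:

>>> import os
>>> os.environ['HAIL_QUERY_BACKEND'] = 'spark'
>>> import hail as hl
>>> hl.init()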

Examples

Configure Hail to use the Batch backend:

>>> import hail as hl
>>> hl.init(backend='batch')  

If a pyspark.SparkContext is already running, then Hail must be initialized with it as an argument:

>>> hl.init(sc=sc)  

See also

stop()

Parameters
  • sc (pyspark.SparkContext, optional) – Spark Backend only. Spark context. If not specified, the Spark backend will create a new Spark context.

  • app_name (str) – A name for this pipeline. In the Spark backend, this becomes the Spark application name. In the Batch backend, this is a prefix for the name of every Batch.

  • master (str, optional) – Spark Backend only. URL identifying the Spark leader (master) node or local[N] for local clusters.

  • local (str) – Spark Backend only. Local-mode core limit indicator. Must either be local[N] where N is a positive integer or local[*]. The latter indicates Spark should use all cores available. local[*] does not respect most containerization CPU limits. This option is only used if master is unset and spark.master is not set in the Spark configuration.

  • log (str) – Local path for Hail log file. Does not currently support distributed file systems like Google Storage, S3, or HDFS.

  • quiet (bool) – Print fewer log messages.

  • append (bool) – Append to the end of the log file.

  • min_block_size (int) – Minimum file block size in MB.

  • branching_factor (int) – Branching factor for tree aggregation.

  • tmp_dir (str, optional) – Networked temporary directory. Must be a network-visible file path. Defaults to /tmp in the default scheme.

  • default_reference (str) – Default reference genome. Either 'GRCh37', 'GRCh38', 'GRCm38', or 'CanFam3'.

  • idempotent (bool) – If True, calling this function is a no-op if Hail has already been initialized.

  • global_seed (int, optional) – Global random seed.

  • spark_conf (dict of str to str, optional) – Spark Backend only. Spark configuration parameters.

  • skip_logging_configuration (bool) – Spark Backend only. Skip logging configuration in Java and Python.

  • local_tmpdir (str, optional) – Local temporary directory. Used on driver and executor nodes. Must use the file scheme. Defaults to TMPDIR, or /tmp.

  • driver_cores (str or int, optional) – Batch backend only. Number of cores to use for the driver process. May be 1 or 8. Default is 1.

  • driver_memory (str, optional) – Batch backend only. Memory tier to use for the driver process. May be standard or highmem. Default is standard.

  • worker_cores (str or int, optional) – Batch backend only. Number of cores to use for the worker processes. May be 1 or 8. Default is 1.

  • worker_memory (str, optional) – Batch backend only. Memory tier to use for the worker processes. May be standard or highmem. Default is standard.
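
As an illustrative sketch combining several of the Batch-backend parameters above:

>>> hl.init(backend='batch', app_name='my-pipeline',
...         driver_cores=8, driver_memory='highmem',
...         worker_memory='highmem')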

hail.asc(col)[source]

Sort by col ascending.

hail.desc(col)[source]

Sort by col descending.
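
Both are intended for use with sorting methods such as Table.order_by(). A short sketch using an illustrative table:

>>> ht = hl.utils.range_table(10)
>>> ht.order_by(hl.asc(ht.idx)).show()
>>> ht.order_by(hl.desc(ht.idx)).show()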

hail.stop()[source]

Stop the currently running Hail session.
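
For example, a sketch of tearing down the current session in order to re-initialize with different settings:

>>> hl.stop()
>>> hl.init(default_reference='GRCh38')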

hail.spark_context()[source]

Returns the active Spark context.

Returns

pyspark.SparkContext
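
A minimal sketch (Spark backend only); defaultParallelism is an ordinary pyspark.SparkContext attribute:

>>> sc = hl.spark_context()
>>> sc.defaultParallelism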

hail.tmp_dir()[source]

Returns the Hail shared temporary directory.

Returns

str
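
A sketch of a common use, writing a checkpoint under the shared temporary directory (the file name is illustrative):

>>> ht = hl.utils.range_table(10)
>>> ht = ht.checkpoint(hl.tmp_dir() + '/example.ht')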

hail.default_reference()[source]

Returns the default reference genome, 'GRCh37' unless configured otherwise with init().

Returns

ReferenceGenome
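
For example, a sketch inspecting the returned ReferenceGenome:

>>> rg = hl.default_reference()
>>> rg.name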

hail.get_reference(name)[source]

Returns the reference genome corresponding to name.

Notes

Hail’s built-in references are 'GRCh37', 'GRCh38', 'GRCm38', and 'CanFam3'. The contig names and lengths come from the GATK resource bundle: human_g1k_v37.dict and Homo_sapiens_assembly38.dict.

If name='default', the value of default_reference() is returned.

Parameters

name (str) – Name of a previously loaded reference genome or one of Hail’s built-in references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3', or 'default'.

Returns

ReferenceGenome
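
For example, a sketch inspecting a built-in reference:

>>> rg = hl.get_reference('GRCh38')
>>> rg.contigs[:5]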

hail.set_global_seed(seed)[source]

Sets Hail’s global seed to seed.

Parameters

seed (int) – Integer used to seed Hail’s random number generator.
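
For example, a sketch making randomized expressions reproducible across runs:

>>> hl.set_global_seed(0)
>>> hl.eval(hl.rand_unif(0, 1))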

hail.citation(*, bibtex=False)[source]

Generate a Hail citation.

Parameters

bibtex (bool) – Generate a citation in BibTeX form.

Returns

str
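
For example:

>>> print(hl.citation(bibtex=True))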

hail.version()[source]

Get the installed Hail version.

Returns

str
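
For example:

>>> hl.version()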