Python API

This is the API documentation for Hail. It provides detailed information on the Python programming interface.

Use import hail as hl to access this functionality.


hail.Table – Hail’s distributed implementation of a dataframe or SQL table.
hail.GroupedTable – Table grouped by row that can be aggregated into a new table.
hail.MatrixTable – Hail’s distributed implementation of a structured matrix.
hail.GroupedMatrixTable – Matrix table grouped by row or column that can be aggregated into a new matrix table.

Top-Level Functions

hail.init(sc=None, app_name='Hail', master=None, local='local[*]', log=None, quiet=False, append=False, min_block_size=0, branching_factor=50, tmp_dir='/tmp', default_reference='GRCh37', idempotent=False, global_seed=6348563392232659379, spark_conf=None, skip_logging_configuration=False, local_tmpdir=None, _optimizer_iterations=None)

Initialize Hail and Spark.


Import and initialize Hail using GRCh38 as the default reference genome:

>>> import hail as hl
>>> hl.init(default_reference='GRCh38')  


Hail is not only a Python library; most of Hail is written in Java/Scala and runs together with Apache Spark in the Java Virtual Machine (JVM). In order to use Hail, a JVM needs to run as well. The init() function is used to initialize Hail and Spark.

This function also sets global configuration parameters used for the Hail session, like the default reference genome and log file location.

This function will be called automatically (with default parameters) if any Hail functionality requiring the backend (most of the library!) is used. To initialize Hail explicitly with non-default arguments, be sure to do so directly after importing the module, as in the above example.


If a pyspark.SparkContext is already running, then Hail must be initialized with it as an argument:

>>> hl.init(sc=sc)  


Parameters:

  • sc (pyspark.SparkContext, optional) – Spark context. By default, a Spark context will be created.
  • app_name (str) – Spark application name.
  • master (str, optional) – URL identifying the Spark leader (master) node or local[N] for local clusters.
  • local (str) – Local-mode core limit indicator. Must either be local[N] where N is a positive integer or local[*]. The latter indicates Spark should use all cores available. local[*] does not respect most containerization CPU limits. This option is only used if master is unset and spark.master is not set in the Spark configuration.
  • log (str) – Local path for Hail log file. Does not currently support distributed file systems like Google Storage, S3, or HDFS.
  • quiet (bool) – Print fewer log messages.
  • append (bool) – Append to the end of the log file.
  • min_block_size (int) – Minimum file block size in MB.
  • branching_factor (int) – Branching factor for tree aggregation.
  • tmp_dir (str, optional) – Networked temporary directory. Must be a network-visible file path. Defaults to /tmp in the default scheme.
  • default_reference (str) – Default reference genome. Either 'GRCh37', 'GRCh38', 'GRCm38', or 'CanFam3'.
  • idempotent (bool) – If True, calling this function is a no-op if Hail has already been initialized.
  • global_seed (int, optional) – Global random seed.
  • spark_conf (dict[str, str], optional) – Spark configuration parameters.
  • skip_logging_configuration (bool) – Skip logging configuration in Java and Python.
  • local_tmpdir (str, optional) – Local temporary directory. Used on driver and executor nodes. Must use the file scheme. Defaults to TMPDIR, or /tmp.
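
As a minimal sketch of how several of these parameters might be combined, the snippet below assembles keyword arguments for init() and checks the local string against the local[N] / local[*] form described above. The helper names build_init_kwargs and start_hail are hypothetical and not part of Hail; only the keyword names passed to init() come from the parameter list.

```python
import re

# Form required for the `local` argument: local[N] with N a positive
# integer, or local[*] to use all available cores.
_LOCAL_RE = re.compile(r"^local\[(\*|[1-9][0-9]*)\]$")


def build_init_kwargs(reference="GRCh38", cores=None, log_path=None, spark_conf=None):
    """Hypothetical helper: assemble keyword arguments for hl.init()."""
    local = "local[*]" if cores is None else f"local[{cores}]"
    if not _LOCAL_RE.match(local):
        raise ValueError(f"invalid local specification: {local!r}")
    kwargs = {
        "default_reference": reference,
        "local": local,
        "idempotent": True,  # a repeated call becomes a no-op
    }
    if log_path is not None:
        kwargs["log"] = log_path  # must be a local path, not a distributed file system
    if spark_conf is not None:
        kwargs["spark_conf"] = dict(spark_conf)
    return kwargs


def start_hail(**options):
    """Hypothetical wrapper: import Hail lazily and initialize it."""
    import hail as hl  # requires Hail and a JVM to be installed

    hl.init(**build_init_kwargs(**options))
    return hl
```

Under these assumptions, start_hail(cores=4, log_path='/tmp/hail.log') would initialize a session limited to four local cores, and a value such as cores=0 is rejected before Hail is ever started.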

hail.stop()

Stop the currently running Hail session.


hail.spark_context()

Returns the active Spark context.


hail.default_reference()

Returns the default reference genome, which is 'GRCh37' unless a different default_reference was passed to init().

hail.get_reference(name) → hail.genetics.reference_genome.ReferenceGenome

Returns the reference genome corresponding to name.


Hail’s built-in references are 'GRCh37', 'GRCh38', 'GRCm38', and 'CanFam3'. The contig names and lengths come from the GATK resource bundle: human_g1k_v37.dict and Homo_sapiens_assembly38.dict.

If name='default', the value of default_reference() is returned.

Parameters: name (str) – Name of a previously loaded reference genome or one of Hail’s built-in references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3', and 'default'.
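
The following sketch shows one way get_reference() might be used to look up contig lengths. The helper contig_lengths is hypothetical, and it assumes the returned ReferenceGenome exposes contigs (a list of names) and lengths (a mapping from contig name to length), neither of which is described in this section. The hail module is passed in explicitly so the helper does not import it itself.

```python
def contig_lengths(hl, name="default"):
    """Hypothetical helper: map each contig of a reference genome to its length.

    `hl` is the imported hail module; `name` is any name accepted by
    get_reference(), with 'default' resolving to default_reference().
    """
    rg = hl.get_reference(name)
    # Assumes `rg.contigs` lists contig names and `rg.lengths` maps name -> length.
    return {contig: rg.lengths[contig] for contig in rg.contigs}
```

With Hail installed, a call might look like contig_lengths(hl, 'GRCh38') after import hail as hl.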

hail.set_global_seed(seed)

Sets Hail’s global seed to seed.

Parameters: seed (int) – Integer used to seed Hail’s random number generator.

hail.citation(*, bibtex=False)

Generate a Hail citation.

Parameters: bibtex (bool) – Generate a citation in BibTeX form.

hail.version()

Get the installed Hail version.
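
As an illustration, the version and citation functions could be combined to record provenance for an analysis. This is a sketch only: the helper provenance is hypothetical, and it assumes both version() and citation() return strings. The hail module is passed in as an argument.

```python
def provenance(hl):
    """Hypothetical helper: collect version and citation strings for a run log.

    `hl` is the imported hail module.
    """
    return {
        "hail_version": hl.version(),
        "citation": hl.citation(),             # plain-text form
        "bibtex": hl.citation(bibtex=True),    # BibTeX form
    }
```

The resulting dictionary could be written alongside analysis output so that results can later be traced to a specific Hail release.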