Python API

This is the API documentation for Hail; it provides detailed information on the Python programming interface.

Use import hail as hl to access this functionality.

Classes

hail.Table Hail’s distributed implementation of a dataframe or SQL table.
hail.GroupedTable Table grouped by row that can be aggregated into a new table.
hail.MatrixTable Hail’s distributed implementation of a structured matrix.
hail.GroupedMatrixTable Matrix table grouped by row or column that can be aggregated into a new matrix table.

Top-Level Functions

hail.init(sc=None, app_name='Hail', master=None, local='local[*]', log=None, quiet=False, append=False, min_block_size=1, branching_factor=50, tmp_dir='/tmp', default_reference='GRCh37', idempotent=False, global_seed=6348563392232659379, _backend=None)[source]

Initialize Hail and Spark.

Parameters:
  • sc (pyspark.SparkContext, optional) – Spark context. By default, a Spark context will be created.
  • app_name (str) – Spark application name.
  • master (str) – Spark master.
  • local (str) – Local-mode master, used if master is not defined here or in the Spark configuration.
  • log (str) – Local path for Hail log file. Does not currently support distributed file systems like Google Storage, S3, or HDFS.
  • quiet (bool) – Print fewer log messages.
  • append (bool) – Append to the end of the log file.
  • min_block_size (int) – Minimum file block size in MB.
  • branching_factor (int) – Branching factor for tree aggregation.
  • tmp_dir (str) – Temporary directory for Hail files. Must be a network-visible file path.
  • default_reference (str) – Default reference genome. Either 'GRCh37', 'GRCh38', or 'GRCm38'.
  • idempotent (bool) – If True, calling this function is a no-op if Hail has already been initialized.
hail.stop()[source]

Stop the currently running Hail session.

hail.spark_context()[source]

Returns the active Spark context.

Returns: pyspark.SparkContext
hail.default_reference()[source]

Returns the default reference genome 'GRCh37'.

Returns: ReferenceGenome
hail.get_reference(name) → hail.genetics.reference_genome.ReferenceGenome[source]

Returns the reference genome corresponding to name.

Notes

Hail’s built-in references are 'GRCh37', 'GRCh38', and 'GRCm38'. The contig names and lengths come from the GATK resource bundle: human_g1k_v37.dict and Homo_sapiens_assembly38.dict.

If name='default', the value of default_reference() is returned.

Parameters: name (str) – Name of a previously loaded reference genome or one of Hail’s built-in references: 'GRCh37', 'GRCh38', 'GRCm38', and 'default'.
Returns: ReferenceGenome
hail.set_global_seed(seed)[source]

Sets Hail’s global seed to seed.

Parameters: seed (int) – Integer used to seed Hail’s random number generator.
hail.set_upload_email(email)[source]

Set upload email.

If email is not set, uploads will be anonymous. Upload email can also be set through the HAIL_UPLOAD_EMAIL environment variable or the hail.uploadEmail Spark configuration property.

Parameters:email (str) – Email contact to include with uploaded data. If email is None, uploads will be anonymous.
hail.enable_pipeline_upload()[source]

Upload all subsequent pipelines to the Hail team in order to help improve Hail.

Pipeline upload can also be enabled by setting the environment variable HAIL_ENABLE_PIPELINE_UPLOAD or the Spark configuration property hail.enablePipelineUpload to true.

Warning

Shares potentially sensitive data with the Hail team.

hail.disable_pipeline_upload()[source]

Disable the uploading of pipelines. By default, pipeline upload is disabled.

hail.upload_log()[source]

Uploads the Hail log to the Hail team.

Warning

Shares potentially sensitive data with the Hail team.