Hail Query Python API
This is the API documentation for Hail Query, and provides detailed information on the Python programming interface.
Use import hail as hl to access this functionality.
Classes
Table
    Hail's distributed implementation of a dataframe or SQL table.
GroupedTable
    Table grouped by row that can be aggregated into a new table.
MatrixTable
    Hail's distributed implementation of a structured matrix.
GroupedMatrixTable
    Matrix table grouped by row or column that can be aggregated into a new matrix table.
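As a brief illustration of how these classes relate, the sketch below builds a small Table, groups it into a GroupedTable, and aggregates back into a new Table. The calls (hl.utils.range_table, annotate, group_by, aggregate, hl.agg.count) are standard Hail API; the field names are invented for illustration:

>>> import hail as hl
>>> ht = hl.utils.range_table(10)               # a Table with one field, 'idx'
>>> ht = ht.annotate(parity=ht.idx % 2)         # add a row field
>>> grouped = ht.group_by(ht.parity)            # a GroupedTable
>>> result = grouped.aggregate(n=hl.agg.count())  # aggregate into a new Table
>>> result.show()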
Top-Level Functions
- hail.init(sc=None, app_name=None, master=None, local='local[*]', log=None, quiet=False, append=False, min_block_size=0, branching_factor=50, tmp_dir=None, default_reference=None, idempotent=False, global_seed=None, spark_conf=None, skip_logging_configuration=False, local_tmpdir=None, _optimizer_iterations=None, *, backend=None, driver_cores=None, driver_memory=None, worker_cores=None, worker_memory=None, gcs_requester_pays_configuration=None, regions=None, gcs_bucket_allow_list=None, copy_spark_log_on_error=False)[source]
Initialize and configure Hail.
This function will be called with default arguments if any Hail functionality is used. If you need custom configuration, you must explicitly call this function before using Hail. For example, to set the global random seed to 0, import Hail and immediately call init():

>>> import hail as hl
>>> hl.init(global_seed=0)
Hail has two backends, spark and batch. Hail selects a backend by consulting, in order, these configuration locations:

1. The backend parameter of this function.
2. The HAIL_QUERY_BACKEND environment variable.
3. The value of hailctl config get query/backend.

If no configuration is found, Hail will select the Spark backend.
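For instance, a minimal sketch of selecting the Batch backend through the environment variable rather than the parameter, assuming the variable is set in-process before init() runs:

>>> import os
>>> os.environ['HAIL_QUERY_BACKEND'] = 'batch'  # consulted when backend= is not passed
>>> import hail as hl
>>> hl.init()  # resolves to the Batch backend via HAIL_QUERY_BACKEND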
Examples
Configure Hail to use the Batch backend:
>>> import hail as hl
>>> hl.init(backend='batch')
If a pyspark.SparkContext is already running, then Hail must be initialized with it as an argument:

>>> hl.init(sc=sc)
Configure Hail to bill to my-project when accessing any Google Cloud Storage bucket that has requester pays enabled:
>>> hl.init(gcs_requester_pays_configuration='my-project')
Configure Hail to bill to my-project when accessing the Google Cloud Storage buckets named bucket_of_fish and bucket_of_eels:
>>> hl.init(
...     gcs_requester_pays_configuration=('my-project', ['bucket_of_fish', 'bucket_of_eels'])
... )
You may also use hailctl config set gcs_requester_pays/project and hailctl config set gcs_requester_pays/buckets to achieve the same effect.
- Parameters:
    sc (pyspark.SparkContext, optional) – Spark backend only. Spark context. If not specified, the Spark backend will create a new Spark context.
    app_name (str, optional) – A name for this pipeline. In the Spark backend, this becomes the Spark application name. In the Batch backend, this is a prefix for the name of every Batch.
    master (str, optional) – Spark backend only. URL identifying the Spark leader (master) node or local[N] for local clusters.
    local (str) – Spark backend only. Local-mode core limit indicator. Must either be local[N] where N is a positive integer or local[*]. The latter indicates Spark should use all cores available. local[*] does not respect most containerization CPU limits. This option is only used if master is unset and spark.master is not set in the Spark configuration.
    log (str) – Local path for Hail log file. Does not currently support distributed file systems like Google Storage, S3, or HDFS.
    quiet (bool) – Print fewer log messages.
    append (bool) – Append to the end of the log file.
    min_block_size (int) – Minimum file block size in MB.
    branching_factor (int) – Branching factor for tree aggregation.
    tmp_dir (str, optional) – Networked temporary directory. Must be a network-visible file path. Defaults to /tmp in the default scheme.
    default_reference (str) – Deprecated. Please use default_reference() to set the default reference genome. Default reference genome. Either 'GRCh37', 'GRCh38', 'GRCm38', or 'CanFam3'.
    idempotent (bool) – If True, calling this function is a no-op if Hail has already been initialized.
    global_seed (int, optional) – Global random seed.
    spark_conf (dict of str to str, optional) – Spark backend only. Spark configuration parameters.
    skip_logging_configuration (bool) – Spark backend only. Skip logging configuration in Java and Python.
    local_tmpdir (str, optional) – Local temporary directory. Used on driver and executor nodes. Must use the file scheme. Defaults to TMPDIR, or /tmp.
    driver_cores (str or int, optional) – Batch backend only. Number of cores to use for the driver process. May be 1, 2, 4, or 8. Default is 1.
    driver_memory (str, optional) – Batch backend only. Memory tier to use for the driver process. May be standard or highmem. Default is standard.
    worker_cores (str or int, optional) – Batch backend only. Number of cores to use for the worker processes. May be 1, 2, 4, or 8. Default is 1.
    worker_memory (str, optional) – Batch backend only. Memory tier to use for the worker processes. May be standard or highmem. Default is standard.
    gcs_requester_pays_configuration (either str or tuple of str and list of str, optional) – If a string is provided, configure the Google Cloud Storage file system to bill usage to the project identified by that string. If a tuple is provided, configure the Google Cloud Storage file system to bill usage to the specified project for buckets specified in the list. See examples above.
    regions (list of str, optional) – List of regions to run jobs in when using the Batch backend. Use ANY_REGION to specify any region is allowed or use None to use the underlying default regions from the hailctl environment configuration. For example, use hailctl config set batch/regions region1,region2 to set the default regions to use.
    gcs_bucket_allow_list – A list of buckets that Hail should be permitted to read from or write to, even if their default policy is to use “cold” storage. Should look like ["bucket1", "bucket2"].
    copy_spark_log_on_error (bool, optional) – Spark backend only. If True, copy the log from the Spark driver node to tmp_dir on error.
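Putting several of these parameters together, here is a hedged sketch of a Batch-backend initialization; the application name, region, and project are invented placeholders:

>>> import hail as hl
>>> hl.init(
...     backend='batch',
...     app_name='my-pipeline',                         # placeholder name
...     worker_cores=4,
...     worker_memory='highmem',
...     regions=['us-central1'],                        # placeholder region
...     gcs_requester_pays_configuration='my-project',  # placeholder project
... )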
- hail.default_reference(new_default_reference=None)[source]
With no argument, returns the default reference genome ('GRCh37' by default). With an argument, sets the default reference genome to the argument.
- Returns:
    ReferenceGenome
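A minimal sketch of reading and setting the default; this assumes the argument is a ReferenceGenome, such as one returned by get_reference():

>>> import hail as hl
>>> hl.default_reference().name  # 'GRCh37' unless previously changed
>>> hl.default_reference(hl.get_reference('GRCh38'))  # set a new default
>>> hl.default_reference().name  # now 'GRCh38'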
- hail.get_reference(name)[source]
Returns the reference genome corresponding to name.
Notes
Hail’s built-in references are 'GRCh37', 'GRCh38', 'GRCm38', and 'CanFam3'. The contig names and lengths come from the GATK resource bundle: human_g1k_v37.dict and Homo_sapiens_assembly38.dict.
If name='default', the value of default_reference() is returned.
- Parameters:
    name (str) – Name of a previously loaded reference genome or one of Hail’s built-in references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3', and 'default'.
- Returns:
    ReferenceGenome
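A short sketch of fetching a built-in reference and inspecting it; name and contigs are standard ReferenceGenome attributes:

>>> import hail as hl
>>> rg = hl.get_reference('GRCh38')
>>> rg.name
'GRCh38'
>>> rg.contigs[:3]  # contig names drawn from the GATK resource bundle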
- hail.set_global_seed(seed)[source]
Deprecated.
Has no effect. To ensure reproducible randomness, use the global_seed argument to init() and reset_global_randomness(). See the random functions reference docs for more.
- Parameters:
seed (int) – Integer used to seed Hail’s random number generator
- hail.reset_global_randomness()[source]
Restore global randomness to initial state for test reproducibility.
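As an illustrative sketch of the reproducibility pattern these functions support, assuming identical expressions produce identical draws after a reset:

>>> import hail as hl
>>> hl.init(global_seed=0)
>>> x = hl.eval(hl.rand_unif(0, 1))
>>> hl.reset_global_randomness()
>>> y = hl.eval(hl.rand_unif(0, 1))
>>> x == y  # the same draw is reproduced after the reset
True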