HailContext

class hail.HailContext(sc=None, app_name='Hail', master=None, local='local[*]', log='hail.log', quiet=False, append=False, parquet_compression='snappy', min_block_size=1, branching_factor=50, tmp_dir='/tmp')[source]

The main entry point for Hail functionality.

Warning

Only one Hail context may be running in a Python session at any time. If you need to reconfigure settings, restart the Python session or use the HailContext.stop() method.

Parameters:
  • sc (pyspark.SparkContext) – spark context, will be auto-generated if None
  • appName – Spark application identifier
  • master – Spark cluster master
  • local – local resources to use
  • log – log path
  • quiet – suppress log messages
  • append – write to end of log file instead of overwriting
  • parquet_compression – level of on-disk annotation compression
  • min_block_size – minimum file split size in MB
  • branching_factor – branching factor for tree aggregation
  • tmp_dir – temporary directory for file merging
Variables:

sc (pyspark.SparkContext) – Spark context

Attributes

version Return the version of Hail associated with this HailContext.

Methods

__init__
balding_nichols_model Simulate a variant dataset using the Balding-Nichols model.
eval_expr Evaluate an expression.
eval_expr_typed Evaluate an expression and return the result as well as its type.
get_running Return the running Hail context in this Python session.
grep Grep big files, like, really fast.
import_bgen Import .bgen file(s) as variant dataset.
import_gen Import .gen file(s) as variant dataset.
import_plink Import PLINK binary file (BED, BIM, FAM) as variant dataset.
import_table Import delimited text file (text table) as key table.
import_vcf Import VCF file(s) as variant dataset.
index_bgen Index .bgen files.
read Read .vds files as variant dataset.
read_table Read a KT file as key table.
report Print information and warnings about VCF + GEN import and deduplication.
stop Shut down the Hail context.
write_partitioning Write partitioning.json.gz file for legacy VDS file.
balding_nichols_model(populations, samples, variants, num_partitions=None, pop_dist=None, fst=None, af_dist=<hail.stats.UniformDist instance>, seed=0)[source]

Simulate a variant dataset using the Balding-Nichols model.

Examples

To generate a VDS with 3 populations, 100 samples in total, and 1000 variants:

>>> vds = hc.balding_nichols_model(3, 100, 1000)

To generate a VDS with 4 populations, 2000 samples, 5000 variants, 10 partitions, population distribution [0.1, 0.2, 0.3, 0.4], \(F_{ST}\) values [.02, .06, .04, .12], ancestral allele frequencies drawn from a truncated beta distribution with a = .01 and b = .05 over the interval [0.05, 1], and random seed 1:

>>> from hail.stats import TruncatedBetaDist
>>> vds = hc.balding_nichols_model(4, 40, 150, 10,
...                                pop_dist=[0.1, 0.2, 0.3, 0.4],
...                                fst=[.02, .06, .04, .12],
...                                af_dist=TruncatedBetaDist(a=0.01, b=2.0, minVal=0.05, maxVal=1.0),
...                                seed=1)

Notes

Hail is able to randomly generate a VDS using the Balding-Nichols model.

  • \(K\) populations are labeled by integers 0, 1, ..., K - 1
  • \(N\) samples are named by strings 0, 1, ..., N - 1
  • \(M\) variants are defined as 1:1:A:C, 1:2:A:C, ..., 1:M:A:C
  • The default ancestral frequency distribution \(P_0\) is uniform on [0.1, 0.9]. Options are UniformDist(minVal, maxVal), BetaDist(a, b), and TruncatedBetaDist(a, b, minVal, maxVal). All three classes are located in hail.stats.
  • The population distribution \(\pi\) defaults to uniform
  • The \(F_{ST}\) values default to 0.1
  • The number of partitions defaults to one partition per million genotypes (i.e., samples * variants / 10^6) or 8, whichever is larger

The Balding-Nichols model models genotypes of individuals from a structured population comprising \(K\) homogeneous subpopulations that have each diverged from a single ancestral population (a star phylogeny). We take \(N\) samples and \(M\) bi-allelic variants in perfect linkage equilibrium. The relative sizes of the subpopulations are given by a probability vector \(\pi\); the ancestral allele frequencies are drawn independently from a frequency spectrum \(P_0\); the subpopulations have diverged with possibly different \(F_{ST}\) parameters \(F_k\) (here and below, lowercase indices run over a range bounded by the corresponding uppercase parameter, e.g. \(k = 1, \ldots, K\)). For each variant, the subpopulation allele frequencies are drawn a beta distribution, a useful continuous approximation of the effect of genetic drift. We denote the individual subpopulation memberships by \(k_n\), the ancestral allele frequences by \(p_{0, m}\), the subpopulation allele frequencies by \(p_{k, m}\), and the genotypes by \(g_{n, m}\). The generative model in then given by:

\[ \begin{align}\begin{aligned}k_n \,&\sim\, \pi\\p_{0,m}\,&\sim\, P_0\\p_{k,m}\mid p_{0,m}\,&\sim\, \mathrm{Beta}(\mu = p_{0,m},\, \sigma^2 = F_k p_{0,m}(1 - p_{0,m}))\\g_{n,m}\mid k_n, p_{k, m} \,&\sim\, \mathrm{Binomial}(2, p_{k_n, m})\end{aligned}\end{align} \]

We have parametrized the beta distribution by its mean and variance; the usual parameters are \(a = (1 - p)(1 - F)/F,\; b = p(1-F)/F\) with \(F = F_k,\; p = p_{0,m}\).

Annotations

balding_nichols_model() adds the following global, sample, and variant annotations:

  • global.nPops (Int) – Number of populations
  • global.nSamples (Int) – Number of samples
  • global.nVariants (Int) – Number of variants
  • global.popDist (Array[Double]) – Normalized population distribution indexed by population
  • global.Fst (Array[Double]) – \(F_{ST}\) values indexed by population
  • global.seed (Int) – Random seed
  • global.ancestralAFDist (Struct) – Description of the ancestral allele frequency distribution
  • sa.pop (Int) – Population of sample
  • va.ancestralAF (Double) – Ancestral allele frequency
  • va.AF (Array[Double]) – Allele frequency indexed by population
Parameters:
  • populations (int) – Number of populations.
  • samples (int) – Number of samples.
  • variants (int) – Number of variants.
  • num_partitions (int) – Number of partitions.
  • pop_dist (array of float or None) – Unnormalized population distribution
  • fst (array of float or None) – \(F_{ST}\) values
  • af_dist (UniformDist or BetaDist or TruncatedBetaDist) – Ancestral allele frequency distribution
  • seed (int) – Random seed.
Returns:

Variant dataset simulated using the Balding-Nichols model.

Return type:

VariantDataset

eval_expr(expr)[source]

Evaluate an expression.

Parameters:expr (str) – Expression to evaluate.
Return type:annotation
eval_expr_typed(expr)[source]

Evaluate an expression and return the result as well as its type.

Parameters:expr (str) – Expression to evaluate.
Return type:(annotation, Type)
static get_running()[source]

Return the running Hail context in this Python session.

Example

>>> HailContext()  # oops! Forgot to bind to 'hc'
>>> hc = HailContext.get_running()  # recovery

Useful to recover a Hail context that has been created but is unbound.

Returns:Current Hail context.
Return type:HailContext
grep(regex, path, max_count=100)[source]

Grep big files, like, really fast.

Examples

Print all lines containing the string hello in file.txt:

>>> hc.grep('hello','data/file.txt')

Print all lines containing digits in file1.txt and file2.txt:

>>> hc.grep('\d', ['data/file1.txt','data/file2.txt'])

Background

grep() mimics the basic functionality of Unix grep in parallel, printing results to screen. This command is provided as a convenience to those in the statistical genetics community who often search enormous text files like VCFs. Find background on regular expressions at RegExr.

Parameters:
  • regex (str) – The regular expression to match.
  • path (str or list of str) – The files to search.
  • max_count (int) – The maximum number of matches to return.
import_bgen(path, tolerance=0.2, sample_file=None, min_partitions=None)[source]

Import .bgen file(s) as variant dataset.

Examples

Importing a BGEN file as a VDS (assuming it has already been indexed).

>>> vds = hc.import_bgen("data/example3.bgen", sample_file="data/example3.sample")

Notes

Hail supports importing data in the BGEN file format. For more information on the BGEN file format, see here. Note that only v1.1 and v1.2 BGEN files are supported at this time. For v1.2 BGEN files, only unphased and diploid genotype probabilities are allowed and the genotype probability blocks must be either compressed with zlib or uncompressed.

Before importing, ensure that:

  • The sample file has the same number of samples as the BGEN file.
  • No duplicate sample IDs are present.

To load multiple files at the same time, use Hadoop Glob Patterns.

Genotype probability (``gp``) representation:

The following modifications are made to genotype probabilities in BGEN v1.1 files:

  • Since genotype probabilities are understood to define a probability distribution, import_bgen() automatically sets to missing those genotypes for which the sum of the probabilities is a distance greater than the tolerance parameter from 1.0. The default tolerance is 0.2, so a genotype with sum .79 or 1.21 is filtered out, whereas a genotype with sum .8 or 1.2 remains.
  • import_bgen() normalizes all probabilities to sum to 1.0. Therefore, an input distribution of (0.98, 0.0, 0.0) will be stored as (1.0, 0.0, 0.0) in Hail.

Annotations

import_bgen() adds the following variant annotations:

  • va.varid (String) – 2nd column of .gen file if chromosome present, otherwise 1st column.
  • va.rsid (String) – 3rd column of .gen file if chromosome present, otherwise 2nd column.
Parameters:
  • path (str or list of str) – .bgen files to import.
  • tolerance (float) – If the sum of the probabilities for a genotype differ from 1.0 by more than the tolerance, set the genotype to missing. Only applicable if the BGEN files are v1.1.
  • sample_file (str or None) – Sample file.
  • min_partitions (int or None) – Number of partitions.
Returns:

Variant dataset imported from .bgen file.

Return type:

VariantDataset

import_gen(path, sample_file=None, tolerance=0.2, min_partitions=None, chromosome=None)[source]

Import .gen file(s) as variant dataset.

Examples

Read a .gen file and a .sample file and write to a .vds file:

>>> (hc.import_gen('data/example.gen', sample_file='data/example.sample')
...    .write('output/gen_example1.vds'))

Load multiple files at the same time with Hadoop glob patterns:

>>> (hc.import_gen('data/example.chr*.gen', sample_file='data/example.sample')
...    .write('output/gen_example2.vds'))

Notes

For more information on the .gen file format, see here.

To ensure that the .gen file(s) and .sample file are correctly prepared for import:

  • If there are only 5 columns before the start of the genotype probability data (chromosome field is missing), you must specify the chromosome using the chromosome parameter
  • No duplicate sample IDs are allowed

The first column in the .sample file is used as the sample ID s.

Also, see section in import_bgen() linked here for information about Hail’s genotype probability representation.

Annotations

import_gen() adds the following variant annotations:

  • va.varid (String) – 2nd column of .gen file if chromosome present, otherwise 1st column.
  • va.rsid (String) – 3rd column of .gen file if chromosome present, otherwise 2nd column.
Parameters:
  • path (str or list of str) – .gen files to import.
  • sample_file (str) – The sample file.
  • tolerance (float) – If the sum of the genotype probabilities for a genotype differ from 1.0 by more than the tolerance, set the genotype to missing.
  • min_partitions (int or None) – Number of partitions.
  • chromosome (str or None) – Chromosome if not listed in the .gen file.
Returns:

Variant dataset imported from .gen and .sample files.

Return type:

VariantDataset

Import PLINK binary file (BED, BIM, FAM) as variant dataset.

Examples

Import data from a PLINK binary file:

>>> vds = hc.import_plink(bed="data/test.bed",
...                       bim="data/test.bim",
...                       fam="data/test.fam")

Notes

Only binary SNP-major mode files can be read into Hail. To convert your file from individual-major mode to SNP-major mode, use PLINK to read in your fileset and use the --make-bed option.

The centiMorgan position is not currently used in Hail (Column 3 in BIM file).

The ID (s) used by Hail is the individual ID (column 2 in FAM file).

Warning

No duplicate individual IDs are allowed.

Chromosome names (Column 1) are automatically converted in the following cases:

  • 23 => “X”
  • 24 => “Y”
  • 25 => “X”
  • 26 => “MT”

Annotations

import_plink() adds the following annotations:

  • va.rsid (String) – Column 2 in the BIM file.
  • sa.famID (String) – Column 1 in the FAM file. Set to missing if ID equals “0”.
  • sa.patID (String) – Column 3 in the FAM file. Set to missing if ID equals “0”.
  • sa.matID (String) – Column 4 in the FAM file. Set to missing if ID equals “0”.
  • sa.isFemale (String) – Column 5 in the FAM file. Set to missing if value equals “-9”, “0”, or “N/A”. Set to true if value equals “2”. Set to false if value equals “1”.
  • sa.isCase (String) – Column 6 in the FAM file. Only present if quantpheno equals False. Set to missing if value equals “-9”, “0”, “N/A”, or the value specified by missing. Set to true if value equals “2”. Set to false if value equals “1”.
  • sa.qPheno (String) – Column 6 in the FAM file. Only present if quantpheno equals True. Set to missing if value equals missing.
Parameters:
  • bed (str) – PLINK BED file.
  • bim (str) – PLINK BIM file.
  • fam (str) – PLINK FAM file.
  • min_partitions (int or None) – Number of partitions.
  • missing (str) – The string used to denote missing values only for the phenotype field. This is in addition to “-9”, “0”, and “N/A” for case-control phenotypes.
  • delimiter (str) – FAM file field delimiter regex.
  • quantpheno (bool) – If True, FAM phenotype is interpreted as quantitative.
Returns:

Variant dataset imported from PLINK binary file.

Return type:

VariantDataset

import_table(paths, key=[], min_partitions=None, impute=False, no_header=False, comment=None, delimiter='\t', missing='NA', types={})[source]

Import delimited text file (text table) as key table.

The resulting key table will have no key columns, use KeyTable.key_by() to specify keys.

Example

Given this file

$ cat data/samples1.tsv
Sample      Height  Status  Age
PT-1234     154.1   ADHD    24
PT-1236     160.9   Control 19
PT-1238     NA      ADHD    89
PT-1239     170.3   Control 55

The interesting thing about this table is that column Height is a floating-point number, and column Age is an integer. We can either provide have Hail impute these types from the file, or pass them ourselves:

Pass the types ourselves:

>>> table = hc.import_table('data/samples1.tsv', types={'Height': TDouble(), 'Age': TInt()})

Note that string columns like Sample and Status do not need to be typed, because String is the default type.

Use type imputation (a bit easier, but requires reading the file twice):

>>> table = hc.import_table('data/samples1.tsv', impute=True)

Detailed examples

Let’s import annotations from a CSV file with missing data and special characters:

$ cat data/samples2.tsv
Batch,PT-ID
1kg,PT-0001
1kg,PT-0002
study1,PT-0003
study3,PT-0003
.,PT-0004
1kg,PT-0005
.,PT-0006
1kg,PT-0007

In this case, we should:

  • Pass the non-default delimiter ,
  • Pass the non-default missing value .
>>> table = hc.import_table('data/samples2.tsv', delimiter=',', missing='.')

Let’s import annotations from a file with no header and sample IDs that need to be transformed. Suppose the vds sample IDs are of the form NA#####. This file has no header line, and the sample ID is hidden in a field with other information.

To import:

>>> annotations = (hc.import_table('data/samples3.tsv', no_header=True)
...                   .annotate('sample = f0.split("_")[1]')
...                   .key_by('sample'))

Notes

The impute option tells Hail to scan the file an extra time to gather information about possible field types. While this is a bit slower for large files, (the file is parsed twice), the convenience is often worth this cost.

The delimiter parameter is a field separator regex. This regex follows the
Java regex standard.

Note

Use delimiter='\s+' to specify whitespace delimited files.

The comment is an optional parameter which causes Hail to skip any line that starts in the given pattern. Passing comment='#' will skip any line beginning in a pound sign, for example.

The missing parameter defines the representation of missing data in the table.

Note

The comment and missing parameters are NOT regexes.

The no_header option indicates that the file has no header line. If this option is passed, then the column names will be f0, f1, ... fN (0-indexed).

The types option allows the user to pass the types of columns in the table. This is a dict keyed by str, with Type values. See the examples above for a standard usage. Additionally, this option can be used to override type imputation. For example, if a column in a file refers to chromosome and does not contain any sex chromosomes, it will be imputed as an integer, while most Hail methods expect chromosome to be passed as a string. Using the impute=True mode and passing types={'Chromosome': TString()} will solve this problem.

The min_partitions option can be used to increase the number of partitions (level of sharding) of an imported table. The default partition size depends on file system and a number of other factors (including the min_block_size of the hail context), but usually is between 32M and 128M.

Parameters:
  • paths (str or list of str) – Files to import.
  • key (str or list of str) – Key column(s).
  • min_partitions (int or None) – Minimum number of partitions.
  • no_header (bool) – File has no header and the N columns are named f0, f1, ... fN (0-indexed)
  • impute (bool) – Impute column types from the file
  • comment (str or None) – Skip lines beginning with the given pattern
  • delimiter (str) – Field delimiter regex
  • missing (str) – Specify identifier to be treated as missing
  • types (dict with str keys and Type values) – Define types of fields in annotations files
Returns:

Key table constructed from text table.

Return type:

KeyTable

import_vcf(path, force=False, force_bgz=False, header_file=None, min_partitions=None, drop_samples=False, store_gq=False, pp_as_pl=False, skip_bad_ad=False, generic=False, call_fields=[])[source]

Import VCF file(s) as variant dataset.

Examples

>>> vds = hc.import_vcf('data/example2.vcf.bgz')

Notes

Hail is designed to be maximally compatible with files in the VCF v4.2 spec.

import_vcf() takes a list of VCF files to load. All files must have the same header and the same set of samples in the same order (e.g., a variant dataset split by chromosome). Files can be specified as Hadoop glob patterns.

Ensure that the VCF file is correctly prepared for import: VCFs should either be uncompressed (.vcf) or block compressed (.vcf.bgz). If you have a large compressed VCF that ends in .vcf.gz, it is likely that the file is actually block-compressed, and you should rename the file to ”.vcf.bgz” accordingly. If you actually have a standard gzipped file, it is possible to import it to Hail using the force optional parameter. However, this is not recommended – all parsing will have to take place on one node because gzip decompression is not parallelizable. In this case, import could take significantly longer.

If generic equals False (default), Hail makes certain assumptions about the genotype fields, see Representation. On import, Hail filters (sets to no-call) any genotype that violates these assumptions. Hail interprets the format fields: GT, AD, OD, DP, GQ, PL; all others are silently dropped.

If generic equals True, the genotype schema is a TStruct with field names equal to the IDs of the FORMAT fields. The GT field is automatically read in as a TCall type. To specify additional fields to import as a TCall type, use the call_fields parameter. All other fields are imported as the type specified in the FORMAT header field.

An example genotype schema after importing a VCF with generic=True is

Struct {
    GT: Call,
    AD: Array[Int],
    DP: Int,
    GQ: Int,
    PL: Array[Int]
}

Warning

  • The variant dataset generated with generic=True will have significantly slower performance.
  • Not all VariantDataset methods will work with a generic genotype schema.

import_vcf() does not perform deduplication - if the provided VCF(s) contain multiple records with the same chrom, pos, ref, alt, all these records will be imported and will not be collapsed into a single variant.

Since Hail’s genotype representation does not yet support ploidy other than 2, this method imports haploid genotypes as diploid. If generic=False, Hail fills in missing indices in PL / PP arrays with 1000 to support the standard VCF / VDS “genotype schema.

Below are two example haploid genotypes and diploid equivalents that Hail sees.

Haploid:     1:0,6:7:70:70,0
Imported as: 1/1:0,6:7:70:70,1000,0

Haploid:     2:0,0,9:9:24:24,40,0
Imported as: 2/2:0,0,9:9:24:24,1000,40,1000:1000:0

Note

Using the FILTER field:

The information in the FILTER field of a VCF is contained in the va.filters annotation. This annotation is a Set and can be queried for filter membership with expressions like va.filters.contains("VQSRTranche99.5..."). Variants that are flagged as “PASS” will have no filters applied; for these variants, va.filters.isEmpty() is true. Thus, filtering to PASS variants can be done with VariantDataset.filter_variants_expr() as follows:

>>> pass_vds = vds.filter_variants_expr('va.filters.isEmpty()', keep=True)

Annotations

  • va.filters (Set[String]) – Set containing all filters applied to a variant.
  • va.rsid (String) – rsID of the variant.
  • va.qual (Double) – Floating-point number in the QUAL field.
  • va.info (Struct) – All INFO fields defined in the VCF header can be found in the struct va.info. Data types match the type specified in the VCF header, and if the declared Number is not 1, the result will be stored as an array.
Parameters:
  • path (str or list of str) – VCF file(s) to read.
  • force (bool) – If True, load .gz files serially. This means that no downstream operations can be parallelized, so using this mode is strongly discouraged for VCFs larger than a few MB.
  • force_bgz (bool) – If True, load .gz files as blocked gzip files (BGZF)
  • header_file (str or None) – File to load VCF header from. If not specified, the first file in path is used.
  • min_partitions (int or None) – Number of partitions.
  • drop_samples (bool) – If True, create sites-only variant dataset. Don’t load sample ids, sample annotations or genotypes.
  • store_gq (bool) – If True, store GQ FORMAT field instead of computing from PL. Only applies if generic=False.
  • pp_as_pl (bool) – If True, store PP FORMAT field as PL. EXPERIMENTAL. Only applies if generic=False.
  • skip_bad_ad (bool) – If True, set AD FORMAT field with wrong number of elements to missing, rather than setting the entire genotype to missing. Only applies if generic=False.
  • generic (bool) – If True, read the genotype with a generic schema.
  • call_fields (str or list of str) – FORMAT fields in VCF to treat as a TCall. Only applies if generic=True.
Returns:

Variant dataset imported from VCF file(s)

Return type:

VariantDataset

index_bgen(path)[source]

Index .bgen files. HailContext.import_bgen() cannot run without these indices.

Example

>>> hc.index_bgen("data/example3.bgen")

Warning

While this method parallelizes over a list of BGEN files, each file is indexed serially by one core. Indexing several BGEN files on a large cluster is a waste of resources, so indexing should generally be done as a one-time step separately from large analyses.

Parameters:path (str or list of str) – .bgen files to index.
read(path, drop_samples=False, drop_variants=False)[source]

Read .vds files as variant dataset.

When loading multiple VDS files, they must have the same sample IDs, genotype schema, split status and variant metadata.

Parameters:
  • path (str or list of str) – VDS files to read.
  • drop_samples (bool) – If True, create sites-only variant dataset. Don’t load sample ids, sample annotations or gneotypes.
  • drop_variants (bool) – If True, create samples-only variant dataset (no variants or genotypes).
Returns:

Variant dataset read from disk.

Return type:

VariantDataset

read_table(path)[source]

Read a KT file as key table.

Parameters:path (str) – KT file to read.
Returns:Key table read from disk.
Return type:KeyTable
report()[source]

Print information and warnings about VCF + GEN import and deduplication.

stop()[source]

Shut down the Hail context.

It is not possible to have multiple Hail contexts running in a single Python session, so use this if you need to reconfigure the Hail context. Note that this also stops a running Spark context.

version

Return the version of Hail associated with this HailContext.

Return type:str
write_partitioning(path)[source]

Write partitioning.json.gz file for legacy VDS file.

Parameters:path (str) – path to VDS file.