HailContext

class hail.HailContext(sc=None, app_name='Hail', master=None, local='local[*]', log='hail.log', quiet=False, append=False, parquet_compression='snappy', min_block_size=1, branching_factor=50, tmp_dir='/tmp')

The main entry point for Hail functionality.
Warning
Only one Hail context may be running in a Python session at any time. If you need to reconfigure settings, restart the Python session or use the HailContext.stop() method.

If passing in a Spark context, ensure that the configuration parameters spark.sql.files.openCostInBytes and spark.sql.files.maxPartitionBytes are set to at least 50GB.

Parameters: - sc (pyspark.SparkContext) – Spark context; one will be created if None.
- app_name – Spark application identifier.
- master – Spark cluster master.
- local – Local resources to use.
- log – Log path.
- quiet (bool) – Don’t write logging information to standard error.
- append – Write to end of log file instead of overwriting.
- parquet_compression – Level of on-disk annotation compression.
- min_block_size – Minimum file split size in MB.
- branching_factor – Branching factor for tree aggregation.
- tmp_dir – Temporary directory for file merging.
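For example, a minimal sketch of passing in a preconfigured Spark context with the settings named in the warning above (the 50GB value is written out in bytes here; adjust to your cluster):

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
...         .set('spark.sql.files.openCostInBytes', '53687091200')
...         .set('spark.sql.files.maxPartitionBytes', '53687091200'))
>>> sc = SparkContext(conf=conf)
>>> hc = HailContext(sc)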
Variables: sc (pyspark.SparkContext) – Spark context

Attributes

version – Return the version of Hail associated with this HailContext.

Methods
__init__ – x.__init__(…) initializes x; see help(type(x)) for signature
balding_nichols_model – Simulate a variant dataset using the Balding-Nichols model.
eval_expr – Evaluate an expression.
eval_expr_typed – Evaluate an expression and return the result as well as its type.
get_running – Return the running Hail context in this Python session.
grep – Grep big files, like, really fast.
import_bgen – Import .bgen file(s) as variant dataset.
import_gen – Import .gen file(s) as variant dataset.
import_plink – Import PLINK binary file (BED, BIM, FAM) as variant dataset.
import_table – Import delimited text file (text table) as key table.
import_vcf – Import VCF file(s) as variant dataset.
index_bgen – Index .bgen files.
read – Read .vds files as variant dataset.
read_table – Read a KT file as key table.
report – Print information and warnings about VCF + GEN import and deduplication.
stop – Shut down the Hail context.
write_partitioning – Write partitioning.json.gz file for legacy VDS file.
balding_nichols_model(populations, samples, variants, num_partitions=None, pop_dist=None, fst=None, af_dist=<hail.stats.UniformDist instance>, seed=0)

Simulate a variant dataset using the Balding-Nichols model.
Examples
To generate a VDS with 3 populations, 100 samples in total, and 1000 variants:
>>> vds = hc.balding_nichols_model(3, 100, 1000)
To generate a VDS with 4 populations, 40 samples, 150 variants, 10 partitions, population distribution [0.1, 0.2, 0.3, 0.4], \(F_{ST}\) values [.02, .06, .04, .12], ancestral allele frequencies drawn from a truncated beta distribution with a = 0.01 and b = 2.0 over the interval [0.05, 1], and random seed 1:
>>> from hail.stats import TruncatedBetaDist
>>> vds = hc.balding_nichols_model(4, 40, 150, 10,
...                                pop_dist=[0.1, 0.2, 0.3, 0.4],
...                                fst=[.02, .06, .04, .12],
...                                af_dist=TruncatedBetaDist(a=0.01, b=2.0, minVal=0.05, maxVal=1.0),
...                                seed=1)
Notes
Hail is able to randomly generate a VDS using the Balding-Nichols model.
- \(K\) populations are labeled by integers 0, 1, …, K - 1
- \(N\) samples are named by strings 0, 1, …, N - 1
- \(M\) variants are defined as 1:1:A:C, 1:2:A:C, …, 1:M:A:C
- The default ancestral frequency distribution \(P_0\) is uniform on [0.1, 0.9]. Options are UniformDist(minVal, maxVal), BetaDist(a, b), and TruncatedBetaDist(a, b, minVal, maxVal). All three classes are located in hail.stats.
- The population distribution \(\pi\) defaults to uniform
- The \(F_{ST}\) values default to 0.1
- The number of partitions defaults to one partition per million genotypes (i.e., samples * variants / 10^6) or 8, whichever is larger
The Balding-Nichols model models genotypes of individuals from a structured population comprising \(K\) homogeneous subpopulations that have each diverged from a single ancestral population (a star phylogeny). We take \(N\) samples and \(M\) bi-allelic variants in perfect linkage equilibrium. The relative sizes of the subpopulations are given by a probability vector \(\pi\); the ancestral allele frequencies are drawn independently from a frequency spectrum \(P_0\); the subpopulations have diverged with possibly different \(F_{ST}\) parameters \(F_k\) (here and below, lowercase indices run over a range bounded by the corresponding uppercase parameter, e.g. \(k = 1, \ldots, K\)). For each variant, the subpopulation allele frequencies are drawn from a beta distribution, a useful continuous approximation of the effect of genetic drift. We denote the individual subpopulation memberships by \(k_n\), the ancestral allele frequencies by \(p_{0, m}\), the subpopulation allele frequencies by \(p_{k, m}\), and the genotypes by \(g_{n, m}\). The generative model is then given by:
\[\begin{aligned}
k_n \,&\sim\, \pi\\
p_{0,m} \,&\sim\, P_0\\
p_{k,m} \mid p_{0,m} \,&\sim\, \mathrm{Beta}(\mu = p_{0,m},\, \sigma^2 = F_k p_{0,m}(1 - p_{0,m}))\\
g_{n,m} \mid k_n, p_{k,m} \,&\sim\, \mathrm{Binomial}(2, p_{k_n, m})
\end{aligned}\]

We have parametrized the beta distribution by its mean and variance; the usual shape parameters are \(a = p(1-F)/F,\; b = (1 - p)(1 - F)/F\) with \(F = F_k,\; p = p_{0,m}\), so that the mean is \(p\).
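The following standalone numpy sketch illustrates this generative model; it is not Hail's implementation, and the function name, argument names, and defaults are purely illustrative:

import numpy as np

def balding_nichols_sketch(K, N, M, pop_dist, fst, seed=0):
    # Illustrative sketch of the generative model above, not the Hail implementation.
    rng = np.random.RandomState(seed)
    pi = np.asarray(pop_dist, dtype=float)
    pi /= pi.sum()                          # normalize the population distribution
    fst = np.asarray(fst, dtype=float)
    k = rng.choice(K, size=N, p=pi)         # k_n ~ pi
    p0 = rng.uniform(0.1, 0.9, size=M)      # p_{0,m} ~ Uniform(0.1, 0.9), the default P_0
    # Beta with mean p0 and variance F * p0 * (1 - p0), written with shape parameters
    nu = (1.0 - fst) / fst                  # total concentration a + b, one per population
    a = np.outer(nu, p0)                    # shape (K, M)
    b = np.outer(nu, 1.0 - p0)
    p = rng.beta(a, b)                      # p_{k,m} | p_{0,m}
    g = rng.binomial(2, p[k, :])            # g_{n,m} | k_n, p_{k,m}; shape (N, M)
    return g

g = balding_nichols_sketch(3, 100, 1000, pop_dist=[1, 1, 1], fst=[0.1, 0.1, 0.1])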
Annotations

balding_nichols_model() adds the following global, sample, and variant annotations:

- global.nPops (Int) – Number of populations
- global.nSamples (Int) – Number of samples
- global.nVariants (Int) – Number of variants
- global.popDist (Array[Double]) – Normalized population distribution indexed by population
- global.Fst (Array[Double]) – \(F_{ST}\) values indexed by population
- global.seed (Int) – Random seed
- global.ancestralAFDist (Struct) – Description of the ancestral allele frequency distribution
- sa.pop (Int) – Population of sample
- va.ancestralAF (Double) – Ancestral allele frequency
- va.AF (Array[Double]) – Allele frequency indexed by population
Parameters: - populations (int) – Number of populations.
- samples (int) – Number of samples.
- variants (int) – Number of variants.
- num_partitions (int) – Number of partitions.
- pop_dist (array of float or None) – Unnormalized population distribution
- fst (array of float or None) – \(F_{ST}\) values
- af_dist (UniformDist or BetaDist or TruncatedBetaDist) – Ancestral allele frequency distribution
- seed (int) – Random seed.
Returns: Variant dataset simulated using the Balding-Nichols model.
Return type: VariantDataset
eval_expr(expr)

Evaluate an expression.
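Example

A minimal illustration using simple arithmetic in the Hail expression language:

>>> hc.eval_expr('1 + 2')
3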
Parameters: expr (str) – Expression to evaluate.

Return type: annotation
eval_expr_typed(expr)

Evaluate an expression and return the result as well as its type.
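Example

A minimal illustration, again using simple arithmetic; the second element of the returned pair is the Type of the result:

>>> value, value_type = hc.eval_expr_typed('1 + 2')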
Parameters: expr (str) – Expression to evaluate.

Return type: (annotation, Type)
static get_running()

Return the running Hail context in this Python session.
Example
>>> HailContext()  # oops! Forgot to bind to 'hc'
>>> hc = HailContext.get_running()  # recovery
Useful to recover a Hail context that has been created but is unbound.
Returns: Current Hail context.

Return type: HailContext
grep(regex, path, max_count=100)

Grep big files, like, really fast.
Examples
Print all lines containing the string hello in file.txt:
>>> hc.grep('hello', 'data/file.txt')
Print all lines containing digits in file1.txt and file2.txt:
>>> hc.grep('\d', ['data/file1.txt','data/file2.txt'])
Background
grep() mimics the basic functionality of Unix grep in parallel, printing results to screen. This command is provided as a convenience to those in the statistical genetics community who often search enormous text files like VCFs. Find background on regular expressions at RegExr.

Parameters: - regex (str) – The regular expression to match.
- path (str or list of str) – The files to search.
- max_count (int) – The maximum number of matches to return.
import_bgen(path, tolerance=0.2, sample_file=None, min_partitions=None)

Import .bgen file(s) as variant dataset.
Examples
Importing a BGEN file as a VDS (assuming it has already been indexed).
>>> vds = hc.import_bgen("data/example3.bgen", sample_file="data/example3.sample")
Notes
Hail supports importing data in the BGEN file format. For more information on the BGEN file format, see here. Note that only v1.1 and v1.2 BGEN files are supported at this time. For v1.2 BGEN files, only unphased and diploid genotype probabilities are allowed and the genotype probability blocks must be either compressed with zlib or uncompressed.
Before importing, ensure that:
- The sample file has the same number of samples as the BGEN file.
- No duplicate sample IDs are present.
To load multiple files at the same time, use Hadoop Glob Patterns.
Genotype probability (gp) representation:

The following modifications are made to genotype probabilities in BGEN v1.1 files:

- Since genotype probabilities are understood to define a probability distribution, import_bgen() automatically sets to missing those genotypes for which the sum of the probabilities is a distance greater than the tolerance parameter from 1.0. The default tolerance is 0.2, so a genotype with sum .79 or 1.21 is filtered out, whereas a genotype with sum .8 or 1.2 remains.
- import_bgen() normalizes all probabilities to sum to 1.0. Therefore, an input distribution of (0.98, 0.0, 0.0) will be stored as (1.0, 0.0, 0.0) in Hail.
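For illustration, these two rules can be sketched in plain Python (a hedged sketch, not Hail's implementation; clean_gp is a hypothetical name):

>>> def clean_gp(gp, tolerance=0.2):
...     total = sum(gp)
...     if abs(total - 1.0) > tolerance:
...         return None                    # genotype is set to missing
...     return [p / total for p in gp]     # probabilities renormalized to sum to 1.0
>>> clean_gp([0.98, 0.0, 0.0])             # stored as (1.0, 0.0, 0.0)
[1.0, 0.0, 0.0]
>>> clean_gp([0.40, 0.39, 0.0])            # sum 0.79 is more than 0.2 from 1.0, so missing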
Annotations
import_bgen() adds the following variant annotations:

- va.varid (String) – 2nd column of .gen file if chromosome present, otherwise 1st column.
- va.rsid (String) – 3rd column of .gen file if chromosome present, otherwise 2nd column.
Parameters: - path (str or list of str) – .bgen files to import.
- tolerance (float) – If the sum of the probabilities for a genotype differs from 1.0 by more than the tolerance, set the genotype to missing. Only applicable if the BGEN files are v1.1.
- sample_file (str or None) – Sample file.
- min_partitions (int or None) – Number of partitions.
Returns: Variant dataset imported from .bgen file.
Return type: VariantDataset
import_gen(path, sample_file=None, tolerance=0.2, min_partitions=None, chromosome=None)

Import .gen file(s) as variant dataset.
Examples
Read a .gen file and a .sample file and write to a .vds file:
>>> (hc.import_gen('data/example.gen', sample_file='data/example.sample')
...    .write('output/gen_example1.vds'))
Load multiple files at the same time with Hadoop glob patterns:
>>> (hc.import_gen('data/example.chr*.gen', sample_file='data/example.sample')
...    .write('output/gen_example2.vds'))
Notes
For more information on the .gen file format, see here.
To ensure that the .gen file(s) and .sample file are correctly prepared for import:
- If there are only 5 columns before the start of the genotype probability data (the chromosome field is missing), you must specify the chromosome using the chromosome parameter.
- No duplicate sample IDs are allowed.

The first column in the .sample file is used as the sample ID s.

Also, see the section in import_bgen() for information about Hail's genotype probability representation.

Annotations

import_gen() adds the following variant annotations:

- va.varid (String) – 2nd column of .gen file if chromosome present, otherwise 1st column.
- va.rsid (String) – 3rd column of .gen file if chromosome present, otherwise 2nd column.
Parameters: - path (str or list of str) – .gen files to import.
- sample_file (str) – The sample file.
- tolerance (float) – If the sum of the genotype probabilities for a genotype differs from 1.0 by more than the tolerance, set the genotype to missing.
- min_partitions (int or None) – Number of partitions.
- chromosome (str or None) – Chromosome if not listed in the .gen file.
Returns: Variant dataset imported from .gen and .sample files.
Return type: VariantDataset
import_plink(bed, bim, fam, min_partitions=None, delimiter='\\s+', missing='NA', quantpheno=False)

Import PLINK binary file (BED, BIM, FAM) as variant dataset.
Examples
Import data from a PLINK binary file:
>>> vds = hc.import_plink(bed="data/test.bed",
...                       bim="data/test.bim",
...                       fam="data/test.fam")
Notes
Only binary SNP-major mode files can be read into Hail. To convert your file from individual-major mode to SNP-major mode, use PLINK to read in your fileset and use the --make-bed option.

The centiMorgan position (column 3 in the BIM file) is not currently used in Hail.

The sample ID s used by Hail is the individual ID (column 2 in the FAM file).

Warning
No duplicate individual IDs are allowed.
Chromosome names (Column 1) are automatically converted in the following cases:
- 23 => “X”
- 24 => “Y”
- 25 => “X”
- 26 => “MT”
Annotations
import_plink() adds the following annotations:

- va.rsid (String) – Column 2 in the BIM file.
- sa.famID (String) – Column 1 in the FAM file. Set to missing if ID equals “0”.
- sa.patID (String) – Column 3 in the FAM file. Set to missing if ID equals “0”.
- sa.matID (String) – Column 4 in the FAM file. Set to missing if ID equals “0”.
- sa.isFemale (Boolean) – Column 5 in the FAM file. Set to missing if value equals “-9”, “0”, or “N/A”. Set to true if value equals “2”. Set to false if value equals “1”.
- sa.isCase (Boolean) – Column 6 in the FAM file. Only present if quantpheno equals False. Set to missing if value equals “-9”, “0”, “N/A”, or the value specified by missing. Set to true if value equals “2”. Set to false if value equals “1”.
- sa.qPheno (Double) – Column 6 in the FAM file. Only present if quantpheno equals True. Set to missing if value equals missing.
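For example, a fileset whose sixth FAM column is a quantitative phenotype coded with -99 for missing values (the -99 coding is illustrative, and the file paths are reused from the example above) could be imported as:

>>> vds = hc.import_plink(bed="data/test.bed", bim="data/test.bim",
...                       fam="data/test.fam", quantpheno=True, missing="-99")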
Parameters: - bed (str) – PLINK BED file.
- bim (str) – PLINK BIM file.
- fam (str) – PLINK FAM file.
- min_partitions (int or None) – Number of partitions.
- missing (str) – The string used to denote missing values only for the phenotype field. This is in addition to “-9”, “0”, and “N/A” for case-control phenotypes.
- delimiter (str) – FAM file field delimiter regex.
- quantpheno (bool) – If True, FAM phenotype is interpreted as quantitative.
Returns: Variant dataset imported from PLINK binary file.
Return type: VariantDataset
import_table(paths, key=[], min_partitions=None, impute=False, no_header=False, comment=None, delimiter='\t', missing='NA', types={}, quote=None)

Import delimited text file (text table) as key table.
The resulting key table will have no key columns; use KeyTable.key_by() to specify keys.

Example
Given this file:

$ cat data/samples1.tsv
Sample   Height  Status   Age
PT-1234  154.1   ADHD     24
PT-1236  160.9   Control  19
PT-1238  NA      ADHD     89
PT-1239  170.3   Control  55
The interesting thing about this table is that column Height is a floating-point number and column Age is an integer. We can either have Hail impute these types from the file or pass them ourselves.

Pass the types ourselves:
>>> table = hc.import_table('data/samples1.tsv', types={'Height': TDouble(), 'Age': TInt()})
Note that string columns like Sample and Status do not need to be typed, because String is the default type.

Use type imputation (a bit easier, but requires reading the file twice):
>>> table = hc.import_table('data/samples1.tsv', impute=True)
Detailed examples
Let’s import annotations from a CSV file with missing data and special characters:
$ cat data/samples2.tsv
Batch,PT-ID
1kg,PT-0001
1kg,PT-0002
study1,PT-0003
study3,PT-0003
.,PT-0004
1kg,PT-0005
.,PT-0006
1kg,PT-0007
In this case, we should:
- Pass the non-default delimiter ,
- Pass the non-default missing value .

>>> table = hc.import_table('data/samples2.tsv', delimiter=',', missing='.')
Let’s import annotations from a file with no header and sample IDs that need to be transformed. Suppose the VDS sample IDs are of the form NA#####. This file has no header line, and the sample ID is hidden in a field with other information.

To import:
>>> annotations = (hc.import_table('data/samples3.tsv', no_header=True)
...                  .annotate('sample = f0.split("_")[1]')
...                  .key_by('sample'))
Notes
The impute option tells Hail to scan the file an extra time to gather information about possible field types. While this is a bit slower for large files (the file is parsed twice), the convenience is often worth the cost.

The delimiter parameter is either a delimiter character (if a single character) or a field separator regex (2 or more characters). This regex follows the Java regex standard.

Note

Use delimiter='\s+' to specify whitespace-delimited files.

The comment parameter is optional and causes Hail to skip any line that starts with the given pattern. Passing comment='#' will skip any line beginning in a pound sign, for example.

The missing parameter defines the representation of missing data in the table.

Note

The comment and missing parameters are NOT regexes.

The no_header option indicates that the file has no header line. If this option is passed, then the column names will be f0, f1, … fN (0-indexed).

The types option allows the user to pass the types of columns in the table. This is a dict keyed by str, with Type values. See the examples above for standard usage. Additionally, this option can be used to override type imputation. For example, if a column in a file refers to chromosome and does not contain any sex chromosomes, it will be imputed as an integer, while most Hail methods expect chromosome to be passed as a string. Using impute=True and passing types={'Chromosome': TString()} will solve this problem.

The min_partitions option can be used to increase the number of partitions (level of sharding) of an imported table. The default partition size depends on the file system and a number of other factors (including the min_block_size of the Hail context), but is usually between 32M and 128M.

Parameters: - paths (str or list of str) – Files to import.
- key (str or list of str) – Key column(s).
- min_partitions (int or None) – Minimum number of partitions.
- no_header (bool) – File has no header and the N columns are named f0, f1, … fN (0-indexed)
- impute (bool) – Impute column types from the file
- comment (str or None) – Skip lines beginning with the given pattern
- delimiter (str) – Field delimiter regex
- missing (str) – Specify identifier to be treated as missing
- types (dict with str keys and Type values) – Define types of fields in annotations files
- quote (str or None) – Quote character
Returns: Key table constructed from text table.
Return type: KeyTable
import_vcf(path, force=False, force_bgz=False, header_file=None, min_partitions=None, drop_samples=False, store_gq=False, pp_as_pl=False, skip_bad_ad=False, generic=False, call_fields=[])

Import VCF file(s) as variant dataset.
Examples
>>> vds = hc.import_vcf('data/example2.vcf.bgz')
Notes
Hail is designed to be maximally compatible with files in the VCF v4.2 spec.
import_vcf() takes a list of VCF files to load. All files must have the same header and the same set of samples in the same order (e.g., a variant dataset split by chromosome). Files can be specified as Hadoop glob patterns.

Ensure that the VCF file is correctly prepared for import: VCFs should either be uncompressed (.vcf) or block compressed (.vcf.bgz). If you have a large compressed VCF that ends in .vcf.gz, it is likely that the file is actually block-compressed, and you should rename the file to .vcf.bgz accordingly. If you actually have a standard gzipped file, it is possible to import it to Hail using the force optional parameter. However, this is not recommended – all parsing will have to take place on one node because gzip decompression is not parallelizable. In this case, import could take significantly longer.

If generic equals False (default), Hail makes certain assumptions about the genotype fields; see Representation. On import, Hail filters (sets to no-call) any genotype that violates these assumptions. Hail interprets the format fields GT, AD, OD, DP, GQ, PL; all others are silently dropped.

If generic equals True, the genotype schema is a TStruct with field names equal to the IDs of the FORMAT fields. The GT field is automatically read in as a TCall type. To specify additional fields to import as a TCall type, use the call_fields parameter. All other fields are imported as the type specified in the FORMAT header field.

An example genotype schema after importing a VCF with generic=True is

Struct { GT: Call, AD: Array[Int], DP: Int, GQ: Int, PL: Array[Int] }
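For example, a hedged sketch of a generic import that also treats the GATK-style PGT FORMAT field as a call (the file path and the presence of PGT in its header are assumptions for illustration):

>>> vds = hc.import_vcf('data/my.vcf.bgz', generic=True, call_fields=['PGT'])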
Warning
- The variant dataset generated with generic=True will have significantly slower performance.
- Not all VariantDataset methods will work with a generic genotype schema.
- The Hail call representation does not support partially missing calls (e.g. 0/.). Partially missing calls will be treated as (fully) missing.

import_vcf() does not perform deduplication - if the provided VCF(s) contain multiple records with the same chrom, pos, ref, alt, all these records will be imported and will not be collapsed into a single variant.

Since Hail’s genotype representation does not yet support ploidy other than 2, this method imports haploid genotypes as diploid. If generic=False, Hail fills in missing indices in PL / PP arrays with 1000 to support the standard VCF / VDS genotype schema.
Haploid:     1:0,6:7:70:70,0
Imported as: 1/1:0,6:7:70:70,1000,0

Haploid:     2:0,0,9:9:24:24,40,0
Imported as: 2/2:0,0,9:9:24:24,1000,40,1000,1000,0
Note
Using the FILTER field:
The information in the FILTER field of a VCF is contained in the va.filters annotation. This annotation is a Set and can be queried for filter membership with expressions like va.filters.contains("VQSRTranche99.5..."). Variants that are flagged as "PASS" will have no filters applied; for these variants, va.filters.isEmpty() is true. Thus, filtering to PASS variants can be done with VariantDataset.filter_variants_expr() as follows:

>>> pass_vds = vds.filter_variants_expr('va.filters.isEmpty()', keep=True)
Annotations
- va.filters (Set[String]) – Set containing all filters applied to a variant.
- va.rsid (String) – rsID of the variant.
- va.qual (Double) – Floating-point number in the QUAL field.
- va.info (Struct) – All INFO fields defined in the VCF header can be found in the struct va.info. Data types match the type specified in the VCF header, and if the declared Number is not 1, the result will be stored as an array.
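For example, assuming the imported VCF header declares an INFO field AC with a non-1 Number (so that va.info.AC is stored as an array), variants with allele count above 10 could be kept using the filter method shown above:

>>> common_vds = vds.filter_variants_expr('va.info.AC[0] > 10', keep=True)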
Parameters: - path (str or list of str) – VCF file(s) to read.
- force (bool) – If True, load .gz files serially. This means that no downstream operations can be parallelized, so using this mode is strongly discouraged for VCFs larger than a few MB.
- force_bgz (bool) – If True, load .gz files as blocked gzip files (BGZF)
- header_file (str or None) – File to load VCF header from. If not specified, the first file in path is used.
- min_partitions (int or None) – Number of partitions.
- drop_samples (bool) – If True, create sites-only variant dataset. Don’t load sample ids, sample annotations or genotypes.
- store_gq (bool) – If True, store GQ FORMAT field instead of computing from PL. Only applies if generic=False.
- pp_as_pl (bool) – If True, store PP FORMAT field as PL. EXPERIMENTAL. Only applies if generic=False.
- skip_bad_ad (bool) – If True, set AD FORMAT field with wrong number of elements to missing, rather than setting the entire genotype to missing. Only applies if generic=False.
- generic (bool) – If True, read the genotype with a generic schema.
- call_fields (str or list of str) – FORMAT fields in VCF to treat as a TCall. Only applies if generic=True.
Returns: Variant dataset imported from VCF file(s)
Return type: VariantDataset
index_bgen(path)

Index .bgen files. HailContext.import_bgen() cannot run without these indices.

Example
>>> hc.index_bgen("data/example3.bgen")
Warning
While this method parallelizes over a list of BGEN files, each file is indexed serially by one core. Indexing several BGEN files on a large cluster is a waste of resources, so indexing should generally be done as a one-time step separately from large analyses.
Parameters: path (str or list of str) – .bgen files to index.
read(path, drop_samples=False, drop_variants=False)

Read .vds files as variant dataset.
When loading multiple VDS files, they must have the same sample IDs, genotype schema, split status and variant metadata.
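Example

Read back a dataset written earlier on this page (the path matches the import_gen example above):

>>> vds = hc.read('output/gen_example1.vds')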
Parameters: - path (str or list of str) – VDS files to read.
- drop_samples (bool) – If True, create sites-only variant dataset. Don’t load sample ids, sample annotations or genotypes.
- drop_variants (bool) – If True, create samples-only variant dataset (no variants or genotypes).
Returns: Variant dataset read from disk.
Return type: VariantDataset
read_table(path)

Read a KT file as key table.
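Example

A minimal illustration (the .kt path is hypothetical and would have been produced by a previous KeyTable write):

>>> kt = hc.read_table('output/example.kt')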
Parameters: path (str) – KT file to read.

Returns: Key table read from disk.

Return type: KeyTable
stop()

Shut down the Hail context.
It is not possible to have multiple Hail contexts running in a single Python session, so use this if you need to reconfigure the Hail context. Note that this also stops a running Spark context.
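For example, to reconfigure the context with a larger minimum block size (the value shown is illustrative):

>>> hc.stop()
>>> hc = HailContext(min_block_size=64)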
version

Return the version of Hail associated with this HailContext.
Return type: str