Genetics functions

locus(contig, pos, reference_genome, …) Construct a locus expression from a chromosome and position.
locus_from_global_position(global_pos, …) Constructs a locus expression from a global position and a reference genome.
locus_interval(contig, start, end[, …]) Construct a locus interval expression.
parse_locus(s, reference_genome, …) Construct a locus expression by parsing a string or string expression.
parse_variant(s, reference_genome, …) Construct a struct with a locus and alleles by parsing a string.
parse_locus_interval(s, reference_genome, …) Construct a locus interval expression by parsing a string or string expression.
call(*alleles[, phased]) Construct a call expression.
unphased_diploid_gt_index_call(gt_index) Construct an unphased, diploid call from a genotype index.
parse_call(s) Construct a call expression by parsing a string or string expression.
downcode(c, i) Create a new call by setting all alleles other than i to ref.
triangle(n) Returns the triangle number of n.
is_snp(ref, alt) Returns True if the alleles constitute a single nucleotide polymorphism.
is_mnp(ref, alt) Returns True if the alleles constitute a multiple nucleotide polymorphism.
is_transition(ref, alt) Returns True if the alleles constitute a transition.
is_transversion(ref, alt) Returns True if the alleles constitute a transversion.
is_insertion(ref, alt) Returns True if the alleles constitute an insertion.
is_deletion(ref, alt) Returns True if the alleles constitute a deletion.
is_indel(ref, alt) Returns True if the alleles constitute an insertion or deletion.
is_star(ref, alt) Returns True if the alleles constitute an upstream deletion.
is_complex(ref, alt) Returns True if the alleles constitute a complex polymorphism.
is_strand_ambiguous(ref, alt) Returns True if the alleles are strand ambiguous.
is_valid_contig(contig[, reference_genome]) Returns True if contig is a valid contig name in reference_genome.
is_valid_locus(contig, position[, …]) Returns True if contig and position form a valid site in reference_genome.
allele_type(ref, alt) Returns the type of the polymorphism as a string.
pl_dosage(pl) Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior.
gp_dosage(gp) Return expected genotype dosage from array of genotype probabilities.
get_sequence(contig, position[, before, …]) Return the reference sequence at a given locus.
mendel_error_code(locus, is_female, father, …) Compute a Mendelian violation code for genotypes.
liftover(x, dest_reference_genome[, …]) Lift over coordinates to a different reference genome.
min_rep(locus, alleles) Computes the minimal representation of a (locus, alleles) polymorphism.
hail.expr.functions.locus(contig, pos, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression

Construct a locus expression from a chromosome and position.

Examples

>>> hl.eval(hl.locus("1", 10000))
Locus(contig=1, position=10000, reference_genome=GRCh37)
Parameters:
Returns:

LocusExpression

hail.expr.functions.locus_from_global_position(global_pos, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression

Constructs a locus expression from a global position and a reference genome. The inverse of LocusExpression.global_position().

Examples

>>> hl.eval(hl.locus_from_global_position(0))
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054))
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054, 'GRCh38'))
Locus(contig=chr22, position=1, reference_genome=GRCh38)
Parameters:
  • global_pos (int or Expression of type tint64) – Global base position along the reference genome.
  • reference_genome (str or ReferenceGenome) – Reference genome to use for converting the global position to a contig and local position.
Returns:

LocusExpression
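The mapping between global and local coordinates can be sketched in plain Python. This is not Hail's implementation: the contig names and lengths below are made-up toy values, and the assumption is that contigs are laid end to end in reference order, with the global position 0-based and the local position 1-based (as the first example above suggests).

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy genome: (contig name, length) in reference order.
CONTIGS = [("1", 100), ("2", 80), ("X", 50)]

# Global 0-based offset at which each contig starts, plus the total length.
STARTS = [0] + list(accumulate(length for _, length in CONTIGS))


def locus_from_global_position(global_pos):
    """Return (contig, 1-based position) for a 0-based global position."""
    if not 0 <= global_pos < STARTS[-1]:
        raise ValueError("global position outside the toy genome")
    # Index of the last contig whose start offset is <= global_pos.
    i = bisect_right(STARTS, global_pos) - 1
    return CONTIGS[i][0], global_pos - STARTS[i] + 1


def global_position(contig, position):
    """Inverse mapping: 0-based global position of a 1-based locus."""
    i = [name for name, _ in CONTIGS].index(contig)
    return STARTS[i] + position - 1
```

With these toy lengths, global position 100 is the first base of contig "2", and applying global_position to the result returns 100 again.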

hail.expr.functions.locus_interval(contig, start, end, includes_start=True, includes_end=False, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.IntervalExpression

Construct a locus interval expression.

Examples

>>> hl.eval(hl.locus_interval("1", 100, 1000))
Interval(start=Locus(contig=1, position=100, reference_genome=GRCh37),
         end=Locus(contig=1, position=1000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)
Parameters:
Returns:

IntervalExpression

hail.expr.functions.parse_locus(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression

Construct a locus expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_locus("1:10000"))
Locus(contig=1, position=10000, reference_genome=GRCh37)

Notes

This method expects strings of the form contig:position, e.g. 16:29500000 or X:123456.

Parameters:
Returns:

LocusExpression

hail.expr.functions.parse_variant(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.StructExpression

Construct a struct with a locus and alleles by parsing a string.

Examples

>>> hl.eval(hl.parse_variant('1:100000:A:T,C'))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['A', 'T', 'C'])

Notes

This method returns an expression of type tstruct with the fields locus (tlocus) and alleles (tarray of tstr).

Parameters:
Returns:

StructExpression – Struct with fields locus and alleles.

hail.expr.functions.parse_locus_interval(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.IntervalExpression

Construct a locus interval expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_locus_interval('1:1000-2000'))
Interval(start=Locus(contig=1, position=1000, reference_genome=GRCh37),
         end=Locus(contig=1, position=2000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)
>>> hl.eval(hl.parse_locus_interval('1:start-10M'))
Interval(start=Locus(contig=1, position=1, reference_genome=GRCh37),
         end=Locus(contig=1, position=10000000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)

Notes

The start locus must precede the end locus. The default bounds of the interval are left-inclusive and right-exclusive. To change this, add one of [ or ( at the beginning of the string for left-inclusive or left-exclusive respectively. Likewise, add one of ] or ) at the end of the string for right-inclusive or right-exclusive respectively.

There are several acceptable representations for s.

CHR1:POS1-CHR2:POS2 is the fully specified representation, and we use this to define the various shortcut representations.

In a POS field, start (Start, START) stands for 1.

In a POS field, end (End, END) stands for the contig length.

In a POS field, the qualifiers m (M) and k (K) multiply the given number by 1,000,000 and 1,000, respectively. 1.6K is short for 1600, and 29M is short for 29000000.

CHR:POS1-POS2 stands for CHR:POS1-CHR:POS2

CHR1-CHR2 stands for CHR1:START-CHR2:END

CHR stands for CHR:START-CHR:END

Note

The bounds of the interval must be valid loci for the reference genome (the contig is in the reference genome and the position is within the range [1, END]), with two exceptions. A start position of 0 on a left-exclusive interval is normalized to position 1, left-inclusive. Likewise, an end position of END + 1 on a right-exclusive interval is normalized to END, right-inclusive.
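The shortcut representations described in the Notes can be illustrated with a small plain-Python expander. This is a sketch under stated assumptions, not Hail's parser: parse_pos, expand_interval, and the contig lengths are hypothetical, bracket prefixes and suffixes are not handled, no bounds validation is performed, and contig names are assumed to contain neither ":" nor "-".

```python
from decimal import Decimal

# Hypothetical contig lengths, for illustration only.
CONTIG_LENGTHS = {"1": 50_000_000, "2": 40_000_000}


def parse_pos(token, contig_length):
    """Expand a POS field: start/end keywords and the k/m qualifiers."""
    t = token.lower()
    if t == "start":
        return 1
    if t == "end":
        return contig_length
    scale = 1
    if t.endswith("m"):
        scale, t = 1_000_000, t[:-1]
    elif t.endswith("k"):
        scale, t = 1_000, t[:-1]
    # Decimal avoids float rounding for inputs such as "1.6K".
    return int(Decimal(t) * scale)


def expand_interval(s, lengths=CONTIG_LENGTHS):
    """Return (chr1, pos1, chr2, pos2), the fully specified representation."""
    left, sep, right = s.partition("-")
    if not sep:                      # CHR
        return s, 1, s, lengths[s]
    if ":" not in left:              # CHR1-CHR2
        return left, 1, right, lengths[right]
    chr1, pos1 = left.split(":")
    if ":" in right:                 # CHR1:POS1-CHR2:POS2
        chr2, pos2 = right.split(":")
    else:                            # CHR:POS1-POS2
        chr2, pos2 = chr1, right
    return (chr1, parse_pos(pos1, lengths[chr1]),
            chr2, parse_pos(pos2, lengths[chr2]))
```

For example, expand_interval('1:start-10M') yields ('1', 1, '1', 10000000), matching the bounds in the second example above.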

Parameters:
Returns:

IntervalExpression

hail.expr.functions.call(*alleles, phased=False) → hail.expr.expressions.typed_expressions.CallExpression

Construct a call expression.

Examples

>>> hl.eval(hl.call(1, 0))
Call(alleles=[0, 1], phased=False)
Parameters:
  • alleles (variable-length args of int or Expression of type tint32) – List of allele indices.
  • phased (bool) – If True, preserve the order of alleles.
Returns:

CallExpression

hail.expr.functions.unphased_diploid_gt_index_call(gt_index) → hail.expr.expressions.typed_expressions.CallExpression

Construct an unphased, diploid call from a genotype index.

Examples

>>> hl.eval(hl.unphased_diploid_gt_index_call(4))
Call(alleles=[1, 2], phased=False)
Parameters: gt_index (int or Expression of type tint32) – Unphased, diploid genotype index.
Returns: CallExpression
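The relationship between a genotype index and its allele pair can be sketched in plain Python. The sketch assumes the standard VCF ordering, in which the unphased genotype j/k with j <= k has index k * (k + 1) / 2 + j; the helper names are illustrative and this is not Hail's implementation.

```python
def triangle(n):
    """Triangle number of n, using integer arithmetic."""
    return n * (n + 1) // 2


def gt_index(j, k):
    """Index of the unphased diploid genotype with alleles j and k."""
    j, k = sorted((j, k))
    return triangle(k) + j


def alleles_from_gt_index(index):
    """Invert gt_index: find the largest k with triangle(k) <= index."""
    k = 0
    while triangle(k + 1) <= index:
        k += 1
    return index - triangle(k), k
```

Index 4 maps to the allele pair (1, 2), which agrees with the example above.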
hail.expr.functions.parse_call(s) → hail.expr.expressions.typed_expressions.CallExpression

Construct a call expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_call('0|2'))
Call(alleles=[0, 2], phased=True)

Notes

This method expects strings in the following format:

Ploidy  Phased         Unphased
0       |-             -
1       |i             i
2       i|j            i/j
3       i|j|k          i/j/k
N       i|j|k|...|N    i/j/k/.../N
Parameters: s (str or StringExpression) – String to parse.
Returns: CallExpression
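The string formats in the table above can be read with a few lines of plain Python. This is only a sketch of the format, not Hail's parser: parse_call_string is a hypothetical helper, it returns a plain (alleles, phased) tuple, and it does not normalize the order of unphased alleles.

```python
def parse_call_string(s):
    """Parse a call string into (alleles, phased) following the table above."""
    if s == "-":                     # ploidy 0, unphased
        return [], False
    if s == "|-":                    # ploidy 0, phased
        return [], True
    if s.startswith("|"):            # ploidy 1, phased: "|i"
        return [int(s[1:])], True
    if "|" in s and "/" in s:
        raise ValueError("mixed phased and unphased separators: " + s)
    if "|" in s:                     # ploidy >= 2, phased
        return [int(a) for a in s.split("|")], True
    # Ploidy 1 unphased ("i") or ploidy >= 2 unphased ("i/j/...").
    return [int(a) for a in s.split("/")], False
```

For the example above, parse_call_string('0|2') gives ([0, 2], True).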
hail.expr.functions.downcode(c, i) → hail.expr.expressions.typed_expressions.CallExpression

Create a new call by setting all alleles other than i to ref.

Examples

Preserve the third allele and downcode all other alleles to reference.

>>> hl.eval(hl.downcode(hl.call(1, 2), 2))
Call(alleles=[0, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(2, 2), 2))
Call(alleles=[1, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(0, 1), 2))
Call(alleles=[0, 0], phased=False)
Parameters:
  • c (CallExpression) – A call.
  • i (Expression of type tint32) – The index of the allele that will be sent to the alternate allele. All other alleles will be downcoded to reference.
Returns:

CallExpression
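The downcoding rule can be expressed in plain Python on a list of allele indices. This sketch is not Hail's implementation; it assumes, as with call() above, that unphased alleles are kept in sorted order and phased alleles keep their order.

```python
def downcode(alleles, i, phased=False):
    """Map allele i to 1 (alternate) and every other allele to 0 (reference)."""
    recoded = [1 if a == i else 0 for a in alleles]
    # Unphased calls have no meaningful order, so sort them.
    return recoded if phased else sorted(recoded)
```

Applied to the three examples above, downcode([1, 2], 2), downcode([2, 2], 2), and downcode([0, 1], 2) give [0, 1], [1, 1], and [0, 0].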

hail.expr.functions.triangle(n) → hail.expr.expressions.typed_expressions.Int32Expression

Returns the triangle number of n.

Examples

>>> hl.eval(hl.triangle(3))
6

Notes

The calculation is n * (n + 1) / 2.

Parameters: n (Expression of type tint32)
Returns: Expression of type tint32
hail.expr.functions.is_snp(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a single nucleotide polymorphism.

Examples

>>> hl.eval(hl.is_snp('A', 'T'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_mnp(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a multiple nucleotide polymorphism.

Examples

>>> hl.eval(hl.is_mnp('AA', 'GT'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_transition(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a transition.

Examples

>>> hl.eval(hl.is_transition('A', 'T'))
False
>>> hl.eval(hl.is_transition('AAA', 'AGA'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_transversion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a transversion.

Examples

>>> hl.eval(hl.is_transversion('A', 'T'))
True
>>> hl.eval(hl.is_transversion('AAA', 'AGA'))
False
Parameters:
Returns:

BooleanExpression
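The transition/transversion distinction can be sketched in plain Python. A transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); any other single-base change is a transversion. The sketch assumes, based on the 'AAA'/'AGA' examples above, that equal-length alleles differing at exactly one base are classified by that base; it is not Hail's implementation and it does not validate that bases are A, C, G, or T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def single_mismatch(ref, alt):
    """Return the (ref_base, alt_base) pair if exactly one base differs."""
    if len(ref) != len(alt):
        return None
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    return diffs[0] if len(diffs) == 1 else None


def is_transition(ref, alt):
    pair = single_mismatch(ref, alt)
    if pair is None:
        return False
    bases = set(pair)
    # Both bases fall in the same chemical class.
    return bases <= PURINES or bases <= PYRIMIDINES


def is_transversion(ref, alt):
    # A single-base change that is not a transition.
    return single_mismatch(ref, alt) is not None and not is_transition(ref, alt)
```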

hail.expr.functions.is_insertion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute an insertion.

Examples

>>> hl.eval(hl.is_insertion('A', 'ATT'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_deletion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a deletion.

Examples

>>> hl.eval(hl.is_deletion('ATT', 'A'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_indel(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute an insertion or deletion.

Examples

>>> hl.eval(hl.is_indel('ATT', 'A'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_star(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute an upstream deletion.

Examples

>>> hl.eval(hl.is_star('A', '*'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_complex(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a complex polymorphism.

Examples

>>> hl.eval(hl.is_complex('ATT', 'GCAC'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_strand_ambiguous(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles are strand ambiguous.

Strand ambiguous allele pairs are A/T, T/A, C/G, and G/C where the first allele is ref and the second allele is alt.

Examples

>>> hl.eval(hl.is_strand_ambiguous('A', 'T'))
True
Parameters:
Returns:

BooleanExpression
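The four strand-ambiguous pairs are exactly the pairs in which alt is the Watson-Crick complement of ref, so reading the variant on the opposite strand yields the same pair of alleles. A minimal plain-Python sketch of that check (not Hail's implementation):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(ref, alt):
    """True if alt is the complement of the single-base ref allele."""
    return COMPLEMENT.get(ref) == alt
```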

hail.expr.functions.is_valid_contig(contig, reference_genome='default') → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if contig is a valid contig name in reference_genome.

Examples

>>> hl.eval(hl.is_valid_contig('1', 'GRCh37'))
True
>>> hl.eval(hl.is_valid_contig('chr1', 'GRCh37'))
False
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_valid_locus(contig, position, reference_genome='default') → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if contig and position form a valid site in reference_genome.

Examples

>>> hl.eval(hl.is_valid_locus('1', 324254, 'GRCh37'))
True
>>> hl.eval(hl.is_valid_locus('chr1', 324254, 'GRCh37'))
False
Parameters:
Returns:

BooleanExpression

hail.expr.functions.allele_type(ref, alt) → hail.expr.expressions.typed_expressions.StringExpression

Returns the type of the polymorphism as a string.

Examples

>>> hl.eval(hl.allele_type('A', 'T'))
'SNP'
>>> hl.eval(hl.allele_type('ATT', 'A'))
'Deletion'

Notes

The possible return values are:
  • "SNP"
  • "MNP"
  • "Insertion"
  • "Deletion"
  • "Complex"
  • "Star"
  • "Symbolic"
  • "Unknown"
Parameters:
Returns:

StringExpression
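One plausible set of classification rules that reproduces the examples on this page is sketched below in plain Python. The rules are assumptions chosen for illustration: Hail's actual criteria may differ, in particular for symbolic alleles and for the "Unknown" case, which is assigned here to identical alleles. Both alleles are assumed to be non-empty strings.

```python
def allele_type(ref, alt):
    """Classify a (ref, alt) pair; illustrative rules, not Hail's."""
    if alt == "*":
        return "Star"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "Symbolic"            # assumed markers of symbolic alleles
    if len(ref) == len(alt):
        mismatches = sum(r != a for r, a in zip(ref, alt))
        if mismatches == 1:
            return "SNP"
        if mismatches > 1:
            return "MNP"
        return "Unknown"             # assumption: identical alleles
    if len(ref) < len(alt):
        # Insertion: same first base, and alt ends with the rest of ref.
        if ref[0] == alt[0] and alt.endswith(ref[1:]):
            return "Insertion"
        return "Complex"
    # Deletion: same first base, and ref ends with the rest of alt.
    if ref[0] == alt[0] and ref.endswith(alt[1:]):
        return "Deletion"
    return "Complex"
```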

hail.expr.functions.pl_dosage(pl) → hail.expr.expressions.typed_expressions.Float64Expression

Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. Only defined for bi-allelic variants. The pl argument must be length 3.

For a PL array [a, b, c], let:

\[\begin{split}a^\prime = 10^{-a/10} \\ b^\prime = 10^{-b/10} \\ c^\prime = 10^{-c/10} \\\end{split}\]

The genotype dosage is given by:

\[\frac{b^\prime + 2 c^\prime}{a^\prime + b^\prime + c^\prime}\]

Examples

>>> hl.eval(hl.pl_dosage([5, 10, 100]))
0.24025307377482674
Parameters: pl (ArrayNumericExpression of type tint32) – Length 3 array of bi-allelic Phred-scaled genotype likelihoods
Returns: Expression of type tfloat64
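The dosage formula above can be checked with a direct plain-Python transcription. This is a sketch of the arithmetic only, operating on an ordinary list rather than a Hail expression.

```python
def pl_dosage(pl):
    """Expected dosage from a length-3 Phred-scaled likelihood array."""
    if len(pl) != 3:
        raise ValueError("pl must have length 3 (bi-allelic variants only)")
    # Convert each Phred-scaled value back to a linear-scale likelihood.
    a, b, c = (10 ** (-x / 10) for x in pl)
    # Weight the het likelihood by 1 and the hom-var likelihood by 2.
    return (b + 2 * c) / (a + b + c)
```

For [5, 10, 100] this evaluates to approximately 0.2402531, in agreement with the example above.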
hail.expr.functions.gp_dosage(gp) → hail.expr.expressions.typed_expressions.Float64Expression

Return expected genotype dosage from array of genotype probabilities.

Examples

>>> hl.eval(hl.gp_dosage([0.0, 0.5, 0.5]))
1.5

Notes

This function is only defined for bi-allelic variants. The gp argument must be length 3. The value is gp[1] + 2 * gp[2].

Parameters: gp (ArrayFloat64Expression) – Length 3 array of bi-allelic genotype probabilities
Returns: Expression of type tfloat64
hail.expr.functions.get_sequence(contig, position, before=0, after=0, reference_genome='default') → hail.expr.expressions.typed_expressions.StringExpression

Return the reference sequence at a given locus.

Examples

Return the reference allele for 'GRCh37' at the locus '1:45323':

>>> hl.eval(hl.get_sequence('1', 45323, reference_genome='GRCh37'))
"T"

Notes

This function requires that the reference genome has an attached reference sequence. Use ReferenceGenome.add_sequence() to load and attach a reference sequence to a reference genome.

Returns None if contig and position are not valid coordinates in reference_genome.

Parameters:
  • contig (Expression of type tstr) – Locus contig.
  • position (Expression of type tint32) – Locus position.
  • before (Expression of type tint32, optional) – Number of bases to include before the locus of interest. Truncates at contig boundary.
  • after (Expression of type tint32, optional) – Number of bases to include after the locus of interest. Truncates at contig boundary.
  • reference_genome (str or ReferenceGenome) – Reference genome to use. Must have a reference sequence available.
Returns:

StringExpression
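The effect of before and after, including truncation at the contig boundary, can be sketched in plain Python on a toy sequence. The contig name "toy" and its sequence are made up for illustration, and the function is not Hail's implementation.

```python
# Hypothetical reference: one short contig, positions are 1-based.
SEQUENCES = {"toy": "ACGTACGT"}


def get_sequence(contig, position, before=0, after=0):
    """Return the bases around a 1-based position, or None if invalid."""
    seq = SEQUENCES.get(contig)
    if seq is None or not 1 <= position <= len(seq):
        return None
    # Clamp the window to the contig boundaries.
    start = max(1, position - before)
    end = min(len(seq), position + after)
    return seq[start - 1:end]
```

For instance, get_sequence('toy', 3, before=1, after=1) returns 'CGT', while get_sequence('toy', 1, before=5) returns only 'A' because the window is truncated at the start of the contig.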

hail.expr.functions.mendel_error_code(locus, is_female, father, mother, child)

Compute a Mendelian violation code for genotypes.

>>> father = hl.call(0, 0)
>>> mother = hl.call(1, 1)
>>> child1 = hl.call(0, 1)  # consistent
>>> child2 = hl.call(0, 0)  # Mendel error
>>> locus = hl.locus('2', 2000000)
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child1))
None
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child2))
7

Note

Ignores call phasing and assumes diploid, biallelic calls. Haploid calls for hemiploid samples on sex chromosomes are also acceptable input.

Notes

In the table below, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region of X and Y defined by the reference genome and the autosome is defined by LocusExpression.in_autosome():

  • Auto – in autosome or in PAR, or in non-PAR of X and female child
  • HemiX – in non-PAR of X and male child
  • HemiY – in non-PAR of Y and male child

Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.

Code  Dad      Mom      Kid     Copy State  Implicated
1     HomVar   HomVar   Het     Auto        Dad, Mom, Kid
2     HomRef   HomRef   Het     Auto        Dad, Mom, Kid
3     HomRef   ~HomRef  HomVar  Auto        Dad, Kid
4     ~HomRef  HomRef   HomVar  Auto        Mom, Kid
5     HomRef   HomRef   HomVar  Auto        Kid
6     HomVar   ~HomVar  HomRef  Auto        Dad, Kid
7     ~HomVar  HomVar   HomRef  Auto        Mom, Kid
8     HomVar   HomVar   HomRef  Auto        Kid
9     Any      HomVar   HomRef  HemiX       Mom, Kid
10    Any      HomRef   HomVar  HemiX       Mom, Kid
11    HomVar   Any      HomRef  HemiY       Dad, Kid
12    HomRef   Any      HomVar  HemiY       Dad, Kid
Parameters:
Returns:

Int32Expression
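The code table can be read as a pattern-matching lookup, sketched here in plain Python. Genotypes are given as the category strings used in the table (HomRef, Het, HomVar, NoCall) and the copy state is supplied directly; deriving the copy state from a locus and the child's sex is not shown. This is an illustration of the table, not Hail's implementation.

```python
# (code, dad, mom, kid, copy state), transcribed from the table above.
MENDEL_TABLE = [
    (1, "HomVar", "HomVar", "Het", "Auto"),
    (2, "HomRef", "HomRef", "Het", "Auto"),
    (3, "HomRef", "~HomRef", "HomVar", "Auto"),
    (4, "~HomRef", "HomRef", "HomVar", "Auto"),
    (5, "HomRef", "HomRef", "HomVar", "Auto"),
    (6, "HomVar", "~HomVar", "HomRef", "Auto"),
    (7, "~HomVar", "HomVar", "HomRef", "Auto"),
    (8, "HomVar", "HomVar", "HomRef", "Auto"),
    (9, "Any", "HomVar", "HomRef", "HemiX"),
    (10, "Any", "HomRef", "HomVar", "HemiX"),
    (11, "HomVar", "Any", "HomRef", "HemiY"),
    (12, "HomRef", "Any", "HomVar", "HemiY"),
]


def matches(pattern, genotype):
    """Match a table entry: 'Any', a complement '~X', or an exact category."""
    if pattern == "Any":
        return True
    if pattern.startswith("~"):
        return genotype != pattern[1:]
    return genotype == pattern


def mendel_code(dad, mom, kid, copy_state):
    """Return the violation code for a trio, or None if consistent."""
    for code, d, m, k, state in MENDEL_TABLE:
        if (state == copy_state and matches(d, dad)
                and matches(m, mom) and matches(k, kid)):
            return code
    return None
```

The second example above (father HomRef, mother HomVar, child HomRef on an autosome) matches row 7, and the first (child Het) matches no row.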

hail.expr.functions.liftover(x, dest_reference_genome, min_match=0.95, include_strand=False)

Lift over coordinates to a different reference genome.

Examples

Lift over the locus coordinates from reference genome 'GRCh37' to 'GRCh38':

>>> hl.eval(hl.liftover(hl.locus('1', 1034245, 'GRCh37'), 'GRCh38')) 
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')

Lift over the locus interval coordinates from reference genome 'GRCh37' to 'GRCh38':

>>> hl.eval(hl.liftover(hl.locus_interval('20', 60001, 82456, True, True, 'GRCh37'), 'GRCh38')) 
Interval(Locus(contig='chr20', position=79360, reference_genome='GRCh38'),
         Locus(contig='chr20', position=101815, reference_genome='GRCh38'),
         True,
         True)

Notes

This function requires that the reference genome of x has a chain file loaded for dest_reference_genome. Use ReferenceGenome.add_liftover() to load and attach a chain file to a reference genome.

Returns None if x could not be converted.

Warning

Before using the result of liftover() as a new row key or column key, be sure to filter out missing values.

Parameters:
  • x (Expression of type tlocus or tinterval of tlocus) – Locus or locus interval to lift over.
  • dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.
  • min_match (float) – Minimum ratio of bases that must remap.
  • include_strand (bool) – If True, return a StructExpression with two fields: result, the converted locus or locus interval, and is_negative_strand, a boolean indicating whether the locus or locus interval was mapped to the negative strand of the destination reference genome. Otherwise, return only the converted locus or locus interval.
Returns:

Expression – A locus or locus interval converted to dest_reference_genome.

hail.expr.functions.min_rep(locus, alleles)

Computes the minimal representation of a (locus, alleles) polymorphism.

Examples

>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['TAA', 'TA']))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['TA', 'T'])
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['AATAA', 'AACAA']))
Struct(locus=Locus(contig=1, position=100002, reference_genome=GRCh37), alleles=['T', 'C'])

Notes

Computing the minimal representation can cause the locus to shift right (the position can increase).

Parameters:
Returns:

StructExpression – A tstruct expression with two fields, locus (LocusExpression) and alleles (ArrayExpression of type tstr).