Genetics functions

locus(contig, pos, reference_genome, …)

Construct a locus expression from a chromosome and position.

locus_from_global_position(global_pos, …)

Constructs a locus expression from a global position and a reference genome.

locus_interval(contig, start, end[, …])

Construct a locus interval expression.

parse_locus(s, reference_genome, …)

Construct a locus expression by parsing a string or string expression.

parse_variant(s, reference_genome, …)

Construct a struct with a locus and alleles by parsing a string.

parse_locus_interval(s, reference_genome, …)

Construct a locus interval expression by parsing a string or string expression.

variant_str(*args)

Create a variant colon-delimited string.

call(*alleles[, phased])

Construct a call expression.

unphased_diploid_gt_index_call(gt_index)

Construct an unphased, diploid call from a genotype index.

parse_call(s)

Construct a call expression by parsing a string or string expression.

downcode(c, i)

Create a new call by setting all alleles other than i to ref.

triangle(n)

Returns the triangle number of n.

is_snp(ref, alt)

Returns True if the alleles constitute a single nucleotide polymorphism.

is_mnp(ref, alt)

Returns True if the alleles constitute a multiple nucleotide polymorphism.

is_transition(ref, alt)

Returns True if the alleles constitute a transition.

is_transversion(ref, alt)

Returns True if the alleles constitute a transversion.

is_insertion(ref, alt)

Returns True if the alleles constitute an insertion.

is_deletion(ref, alt)

Returns True if the alleles constitute a deletion.

is_indel(ref, alt)

Returns True if the alleles constitute an insertion or deletion.

is_star(ref, alt)

Returns True if the alleles constitute an upstream deletion.

is_complex(ref, alt)

Returns True if the alleles constitute a complex polymorphism.

is_strand_ambiguous(ref, alt)

Returns True if the alleles are strand ambiguous.

is_valid_contig(contig[, reference_genome])

Returns True if contig is a valid contig name in reference_genome.

is_valid_locus(contig, position[, …])

Returns True if contig and position form a valid site in reference_genome.

allele_type(ref, alt)

Returns the type of the polymorphism as a string.

pl_dosage(pl)

Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior.

gp_dosage(gp)

Return expected genotype dosage from array of genotype probabilities.

get_sequence(contig, position[, before, …])

Return the reference sequence at a given locus.

mendel_error_code(locus, is_female, father, …)

Compute a Mendelian violation code for genotypes.

liftover(x, dest_reference_genome[, …])

Lift over coordinates to a different reference genome.

min_rep(locus, alleles)

Computes the minimal representation of a (locus, alleles) polymorphism.

hail.expr.functions.locus(contig, pos, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression

Construct a locus expression from a chromosome and position.

Examples

>>> hl.eval(hl.locus("1", 10000, reference_genome='GRCh37'))
Locus(contig=1, position=10000, reference_genome=GRCh37)

Returns

LocusExpression

hail.expr.functions.locus_from_global_position(global_pos, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression

Constructs a locus expression from a global position and a reference genome. The inverse of LocusExpression.global_position().

Examples

>>> hl.eval(hl.locus_from_global_position(0))
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054))
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054, reference_genome='GRCh38'))
Locus(contig=chr22, position=1, reference_genome=GRCh38)

Parameters
  • global_pos (int or Expression of type tint64) – Global base position along the reference genome.

  • reference_genome (str or ReferenceGenome) – Reference genome to use for converting the global position to a contig and local position.

Returns

LocusExpression
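The mapping between a global position and a (contig, position) pair can be sketched in plain Python. This is only an illustration under stated assumptions: the contig lengths below are a hypothetical toy genome rather than a real reference, the helper names are made up for this example, and global positions are taken to be 0-based offsets, as the first example above suggests (global position 0 maps to position 1 of the first contig).

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy genome: (contig name, contig length) in reference order.
# These are NOT real reference genome lengths.
TOY_CONTIGS = [("1", 100), ("2", 50), ("X", 30)]


def toy_locus_from_global_position(global_pos, contigs=TOY_CONTIGS):
    """Map a 0-based global offset to a (contig, 1-based position) pair."""
    # ends[i] is the global offset one past the last base of contig i.
    ends = list(accumulate(length for _, length in contigs))
    if not 0 <= global_pos < ends[-1]:
        raise ValueError("global position outside the toy genome")
    idx = bisect_right(ends, global_pos)
    contig_start = ends[idx - 1] if idx > 0 else 0
    return contigs[idx][0], global_pos - contig_start + 1


def toy_global_position(contig, position, contigs=TOY_CONTIGS):
    """Inverse mapping: (contig, 1-based position) to a 0-based global offset."""
    offset = 0
    for name, length in contigs:
        if name == contig:
            return offset + position - 1
        offset += length
    raise KeyError(contig)
```

In this sketch the two helpers are inverses of each other, which mirrors the relationship between locus_from_global_position() and LocusExpression.global_position() described above.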

hail.expr.functions.locus_interval(contig, start, end, includes_start=True, includes_end=False, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default', invalid_missing=False) → hail.expr.expressions.typed_expressions.IntervalExpression

Construct a locus interval expression.

Examples

>>> hl.eval(hl.locus_interval("1", 100, 1000, reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=100, reference_genome=GRCh37),
         end=Locus(contig=1, position=1000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)

Returns

IntervalExpression

hail.expr.functions.parse_locus(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression

Construct a locus expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_locus('1:10000', reference_genome='GRCh37'))
Locus(contig=1, position=10000, reference_genome=GRCh37)

Notes

This method expects strings of the form contig:position, e.g. 16:29500000 or X:123456.

Returns

LocusExpression

hail.expr.functions.parse_variant(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.StructExpression

Construct a struct with a locus and alleles by parsing a string.

Examples

>>> hl.eval(hl.parse_variant('1:100000:A:T,C', reference_genome='GRCh37'))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['A', 'T', 'C'])

Notes

This method returns an expression of type tstruct with two fields, locus and alleles.

Returns

StructExpression – Struct with fields locus and alleles.

hail.expr.functions.parse_locus_interval(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default', invalid_missing=False) → hail.expr.expressions.typed_expressions.IntervalExpression

Construct a locus interval expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_locus_interval('1:1000-2000', reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=1000, reference_genome=GRCh37),
         end=Locus(contig=1, position=2000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)
>>> hl.eval(hl.parse_locus_interval('1:start-10M', reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=1, reference_genome=GRCh37),
         end=Locus(contig=1, position=10000000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)

Notes

The start locus must precede the end locus. The default bounds of the interval are left-inclusive and right-exclusive. To change this, add one of [ or ( at the beginning of the string for left-inclusive or left-exclusive respectively. Likewise, add one of ] or ) at the end of the string for right-inclusive or right-exclusive respectively.

There are several acceptable representations for s.

CHR1:POS1-CHR2:POS2 is the fully specified representation, and we use this to define the various shortcut representations.

In a POS field, start (Start, START) stands for 1.

In a POS field, end (End, END) stands for the contig length.

In a POS field, the qualifiers m (M) and k (K) multiply the given number by 1,000,000 and 1,000, respectively. 1.6K is short for 1600, and 29M is short for 29000000.

CHR:POS1-POS2 stands for CHR:POS1-CHR:POS2

CHR1-CHR2 stands for CHR1:START-CHR2:END

CHR stands for CHR:START-CHR:END

Note

The bounds of the interval must be valid loci for the reference genome: the contig must exist in the reference genome and the position must lie in the range [1, END]. There are two exceptions. A left-exclusive bound at position 0 is normalized to a left-inclusive bound at position 1. Likewise, a right-exclusive bound at position END + 1 is normalized to a right-inclusive bound at position END.

Returns

IntervalExpression
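The POS shortcuts described in the notes above (start, end, and the k and m qualifiers) can be illustrated with a small standalone helper. This is a sketch, not Hail's parser: the function name parse_pos and the contig_length argument are hypothetical, and the lower-casing step accepts more spellings than the three listed in the notes.

```python
def parse_pos(token, contig_length):
    """Expand one POS field of a locus interval string into an integer."""
    lowered = token.lower()  # more lenient than start/Start/START
    if lowered == "start":
        return 1
    if lowered == "end":
        return contig_length
    multiplier = 1
    if lowered.endswith("m"):
        multiplier, lowered = 1_000_000, lowered[:-1]
    elif lowered.endswith("k"):
        multiplier, lowered = 1_000, lowered[:-1]
    # round() guards against floating-point error for inputs such as "1.6K".
    return int(round(float(lowered) * multiplier))
```

With this helper, a string such as 1:start-10M would expand to the bounds 1 and 10000000, in line with the second example above.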

hail.expr.functions.variant_str(*args) → hail.expr.expressions.typed_expressions.StringExpression

Create a variant colon-delimited string.

Parameters

args – Arguments (see notes).

Returns

StringExpression

Notes

Expects either one argument of type struct{locus: locus<RG>, alleles: array<str>}, or two arguments of type locus<RG> and array<str>. The function returns a string of the form

CHR:POS:REF:ALT1,ALT2,...ALTN
e.g.
1:1:A:T
16:250125:AAA:A,CAA
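The output format can be reproduced with ordinary string formatting. The helper below is a plain-Python sketch with a hypothetical name and signature (it takes the contig and position separately instead of a locus); it shows only the layout of the string, not how Hail builds it.

```python
def format_variant(contig, position, alleles):
    """Join a locus and its alleles as CHR:POS:REF:ALT1,ALT2,...,ALTN."""
    ref, *alts = alleles  # the first allele is the reference allele
    return f"{contig}:{position}:{ref}:{','.join(alts)}"
```

For instance, format_variant("16", 250125, ["AAA", "A", "CAA"]) yields the second string shown above.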

Examples

>>> hl.eval(hl.variant_str(hl.locus('1', 10000), ['A', 'T', 'C']))
'1:10000:A:T,C'

hail.expr.functions.call(*alleles, phased=False) → hail.expr.expressions.typed_expressions.CallExpression

Construct a call expression.

Examples

>>> hl.eval(hl.call(1, 0))
Call(alleles=[0, 1], phased=False)

Parameters
  • alleles (variable-length args of int or Expression of type tint32) – List of allele indices.

  • phased (bool) – If True, preserve the order of alleles.

Returns

CallExpression
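The example above shows hl.call(1, 0) evaluating to alleles [0, 1], which suggests that an unphased call stores its alleles in sorted order while a phased call keeps the order given. A minimal plain-Python sketch of that behavior, with a hypothetical helper name and a tuple in place of a Call object:

```python
def make_call(*alleles, phased=False):
    """Return (alleles, phased); unphased alleles are sorted (assumed from the example)."""
    ordered = list(alleles) if phased else sorted(alleles)
    return ordered, phased
```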

hail.expr.functions.unphased_diploid_gt_index_call(gt_index) → hail.expr.expressions.typed_expressions.CallExpression

Construct an unphased, diploid call from a genotype index.

Examples

>>> hl.eval(hl.unphased_diploid_gt_index_call(4))
Call(alleles=[1, 2], phased=False)

Parameters

gt_index (int or Expression of type tint32) – Unphased, diploid genotype index.

Returns

CallExpression
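The relationship between a genotype index and its allele pair can be sketched in plain Python, assuming the usual VCF ordering in which the unphased genotype j/k with j <= k has index k * (k + 1) / 2 + j. The helper names are hypothetical; the result agrees with the example above, where index 4 corresponds to alleles [1, 2].

```python
def triangle(n):
    """Triangle number n * (n + 1) / 2."""
    return n * (n + 1) // 2


def gt_index_to_alleles(gt_index):
    """Recover the allele pair [j, k] (j <= k) from an unphased diploid index."""
    k = 0
    # Find the largest k whose triangle number does not exceed the index.
    while triangle(k + 1) <= gt_index:
        k += 1
    return [gt_index - triangle(k), k]


def alleles_to_gt_index(j, k):
    """Inverse mapping; allele order does not matter for an unphased call."""
    j, k = sorted((j, k))
    return triangle(k) + j
```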

hail.expr.functions.parse_call(s) → hail.expr.expressions.typed_expressions.CallExpression

Construct a call expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_call('0|2'))
Call(alleles=[0, 2], phased=True)

Notes

This method expects strings in the following format:

Ploidy   Phased          Unphased
0        |-              -
1        |i              i
2        i|j             i/j
3        i|j|k           i/j/k
N        i|j|k|...|N     i/j/k/.../N

Parameters

s (str or StringExpression) – String to parse.

Returns

CallExpression
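The string formats in the table above can be read with a few lines of plain Python. This is a sketch with a hypothetical helper name that returns an (alleles, phased) tuple; it keeps the alleles in the order written and does not reproduce Hail's validation or any normalization Hail may apply.

```python
def parse_call_string(s):
    """Parse a call string into (alleles, phased) following the table above."""
    if s == "-":            # ploidy 0, unphased
        return [], False
    if s == "|-":           # ploidy 0, phased
        return [], True
    if s.startswith("|"):   # ploidy 1, phased: "|i"
        return [int(s[1:])], True
    if "|" in s:            # ploidy >= 2, phased: "i|j|..."
        return [int(a) for a in s.split("|")], True
    # Unphased: "i" (ploidy 1) or "i/j/..." (ploidy >= 2).
    return [int(a) for a in s.split("/")], False
```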

hail.expr.functions.downcode(c, i) → hail.expr.expressions.typed_expressions.CallExpression

Create a new call by setting all alleles other than i to ref.

Examples

Preserve the third allele and downcode all other alleles to reference.

>>> hl.eval(hl.downcode(hl.call(1, 2), 2))
Call(alleles=[0, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(2, 2), 2))
Call(alleles=[1, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(0, 1), 2))
Call(alleles=[0, 0], phased=False)

Parameters
  • c (CallExpression) – A call.

  • i (Expression of type tint32) – The index of the allele to preserve. This allele is recoded as the alternate allele (index 1); all other alleles are downcoded to reference (index 0).

Returns

CallExpression
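The recoding rule can be written out in plain Python. This sketch uses a hypothetical helper that works on a list of allele indices rather than a call, and it assumes unphased results are kept in sorted order, as in the examples above.

```python
def downcode_alleles(alleles, i, phased=False):
    """Recode allele i as 1 and every other allele as 0."""
    recoded = [1 if allele == i else 0 for allele in alleles]
    # Assumption: unphased calls keep their alleles sorted.
    return recoded if phased else sorted(recoded)
```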

hail.expr.functions.triangle(n) → hail.expr.expressions.typed_expressions.Int32Expression

Returns the triangle number of n.

Examples

>>> hl.eval(hl.triangle(3))
6

Notes

The calculation is n * (n + 1) / 2.

Parameters

n (Expression of type tint32)

Returns

Expression of type tint32

hail.expr.functions.is_snp(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a single nucleotide polymorphism.

Examples

>>> hl.eval(hl.is_snp('A', 'T'))
True

Returns

BooleanExpression

hail.expr.functions.is_mnp(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a multiple nucleotide polymorphism.

Examples

>>> hl.eval(hl.is_mnp('AA', 'GT'))
True

Returns

BooleanExpression

hail.expr.functions.is_transition(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a transition.

Examples

>>> hl.eval(hl.is_transition('A', 'T'))
False
>>> hl.eval(hl.is_transition('AAA', 'AGA'))
True

Returns

BooleanExpression

hail.expr.functions.is_transversion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a transversion.

Examples

>>> hl.eval(hl.is_transversion('A', 'T'))
True
>>> hl.eval(hl.is_transversion('AAA', 'AGA'))
False

Returns

BooleanExpression
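A transition swaps two purines (A, G) or two pyrimidines (C, T); a transversion swaps a purine for a pyrimidine or the reverse. The plain-Python sketch below applies that definition to equal-length alleles that differ at exactly one base, which is how the 'AAA' / 'AGA' example above behaves. The helper names are hypothetical, and only the bases A, C, G and T are assumed to occur.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def _single_mismatch(ref, alt):
    """Return the (ref_base, alt_base) pair if exactly one position differs."""
    if len(ref) != len(alt):
        return None
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    return diffs[0] if len(diffs) == 1 else None


def is_transition_sketch(ref, alt):
    pair = _single_mismatch(ref, alt)
    if pair is None:
        return False
    bases = set(pair)
    return bases <= PURINES or bases <= PYRIMIDINES


def is_transversion_sketch(ref, alt):
    # Any single-base substitution that is not a transition.
    return _single_mismatch(ref, alt) is not None and not is_transition_sketch(ref, alt)
```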

hail.expr.functions.is_insertion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute an insertion.

Examples

>>> hl.eval(hl.is_insertion('A', 'ATT'))
True

Returns

BooleanExpression

hail.expr.functions.is_deletion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a deletion.

Examples

>>> hl.eval(hl.is_deletion('ATT', 'A'))
True

Returns

BooleanExpression

hail.expr.functions.is_indel(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute an insertion or deletion.

Examples

>>> hl.eval(hl.is_indel('ATT', 'A'))
True

Returns

BooleanExpression

hail.expr.functions.is_star(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute an upstream deletion.

Examples

>>> hl.eval(hl.is_star('A', '*'))
True

Returns

BooleanExpression

hail.expr.functions.is_complex(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles constitute a complex polymorphism.

Examples

>>> hl.eval(hl.is_complex('ATT', 'GCAC'))
True

Returns

BooleanExpression

hail.expr.functions.is_strand_ambiguous(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if the alleles are strand ambiguous.

Strand ambiguous allele pairs are A/T, T/A, C/G, and G/C where the first allele is ref and the second allele is alt.

Examples

>>> hl.eval(hl.is_strand_ambiguous('A', 'T'))
True

Returns

BooleanExpression
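The four pairs listed above are exactly the pairs of complementary bases, so such a variant looks the same when read from the opposite strand. A plain-Python sketch of the check, using a hypothetical helper name:

```python
# (ref, alt) pairs whose alleles are each other's complement.
STRAND_AMBIGUOUS_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def is_strand_ambiguous_sketch(ref, alt):
    """True if ref/alt is one of the four strand-ambiguous allele pairs."""
    return (ref, alt) in STRAND_AMBIGUOUS_PAIRS
```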

hail.expr.functions.is_valid_contig(contig, reference_genome='default') → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if contig is a valid contig name in reference_genome.

Examples

>>> hl.eval(hl.is_valid_contig('1', reference_genome='GRCh37'))
True
>>> hl.eval(hl.is_valid_contig('chr1', reference_genome='GRCh37'))
False

Returns

BooleanExpression

hail.expr.functions.is_valid_locus(contig, position, reference_genome='default') → hail.expr.expressions.typed_expressions.BooleanExpression

Returns True if contig and position form a valid site in reference_genome.

Examples

>>> hl.eval(hl.is_valid_locus('1', 324254, 'GRCh37'))
True
>>> hl.eval(hl.is_valid_locus('chr1', 324254, 'GRCh37'))
False

Returns

BooleanExpression

hail.expr.functions.allele_type(ref, alt) → hail.expr.expressions.typed_expressions.StringExpression

Returns the type of the polymorphism as a string.

Examples

>>> hl.eval(hl.allele_type('A', 'T'))
'SNP'
>>> hl.eval(hl.allele_type('ATT', 'A'))
'Deletion'

Notes

The possible return values are:
  • "SNP"

  • "MNP"

  • "Insertion"

  • "Deletion"

  • "Complex"

  • "Star"

  • "Symbolic"

  • "Unknown"

Returns

StringExpression
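To make the categories above concrete, here is a simplified plain-Python classifier. It is an approximation written for illustration, not Hail's implementation: the rules for distinguishing indels from complex variants, the treatment of identical alleles, and the pattern used for symbolic alleles are all assumptions, and the helper name is hypothetical. It does agree with the examples given for the is_* functions in this section.

```python
import re

_BASES = re.compile(r"[ACGTN]+")


def allele_type_sketch(ref, alt):
    """Classify a (ref, alt) pair into one of the categories listed above."""
    if _BASES.fullmatch(ref) and _BASES.fullmatch(alt):
        if len(ref) == len(alt):
            mismatches = sum(r != a for r, a in zip(ref, alt))
            if mismatches == 0:
                return "Unknown"  # assumption: identical alleles are not a variant
            return "SNP" if mismatches == 1 else "MNP"
        if len(ref) < len(alt):
            # Assumption: an insertion keeps ref intact at one end of alt.
            return "Insertion" if alt.startswith(ref) or alt.endswith(ref) else "Complex"
        # Assumption: a deletion keeps alt intact at one end of ref.
        return "Deletion" if ref.startswith(alt) or ref.endswith(alt) else "Complex"
    if alt == "*":
        return "Star"
    if alt.startswith("<") and alt.endswith(">"):
        return "Symbolic"  # assumption: only <ID>-style symbolic alleles
    return "Unknown"
```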

hail.expr.functions.pl_dosage(pl) → hail.expr.expressions.typed_expressions.Float64Expression

Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. Only defined for bi-allelic variants. The pl argument must be length 3.

For a PL array [a, b, c], let:

\[\begin{split}a^\prime = 10^{-a/10} \\ b^\prime = 10^{-b/10} \\ c^\prime = 10^{-c/10} \\\end{split}\]

The genotype dosage is given by:

\[\frac{b^\prime + 2 c^\prime}{a^\prime + b^\prime + c^\prime}\]

Examples

>>> hl.eval(hl.pl_dosage([5, 10, 100]))
0.24025307377482674

Parameters

pl (ArrayNumericExpression of type tint32) – Length 3 array of bi-allelic Phred-scaled genotype likelihoods

Returns

Expression of type tfloat64
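The dosage formula above can be checked by hand with a few lines of plain Python. The helper name is hypothetical; the arithmetic follows the definitions of a', b' and c' given earlier in this entry.

```python
def pl_dosage_sketch(pl):
    """Expected dosage from a length-3 PL array under a uniform prior."""
    if len(pl) != 3:
        raise ValueError("pl must have exactly three entries")
    # Convert Phred-scaled likelihoods back to the linear scale.
    a, b, c = (10 ** (-x / 10) for x in pl)
    return (b + 2 * c) / (a + b + c)
```

Evaluating pl_dosage_sketch([5, 10, 100]) gives approximately 0.24025, in agreement with the example above.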

hail.expr.functions.gp_dosage(gp) → hail.expr.expressions.typed_expressions.Float64Expression

Return expected genotype dosage from array of genotype probabilities.

Examples

>>> hl.eval(hl.gp_dosage([0.0, 0.5, 0.5]))
1.5

Notes

This function is only defined for bi-allelic variants. The gp argument must be length 3. The value is gp[1] + 2 * gp[2].

Parameters

gp (ArrayFloat64Expression) – Length 3 array of bi-allelic genotype probabilities

Returns

Expression of type tfloat64

hail.expr.functions.get_sequence(contig, position, before=0, after=0, reference_genome='default') → hail.expr.expressions.typed_expressions.StringExpression

Return the reference sequence at a given locus.

Examples

Return the reference allele for 'GRCh37' at the locus '1:45323':

>>> hl.eval(hl.get_sequence('1', 45323, reference_genome='GRCh37')) # doctest: +SKIP
"T"

Notes

This function requires that the reference genome have an attached reference sequence. Use ReferenceGenome.add_sequence() to load and attach a reference sequence to a reference genome.

Returns None if contig and position are not valid coordinates in reference_genome.

Parameters
  • contig (Expression of type tstr) – Locus contig.

  • position (Expression of type tint32) – Locus position.

  • before (Expression of type tint32, optional) – Number of bases to include before the locus of interest. Truncates at contig boundary.

  • after (Expression of type tint32, optional) – Number of bases to include after the locus of interest. Truncates at contig boundary.

  • reference_genome (str or ReferenceGenome) – Reference genome to use. Must have a reference sequence available.

Returns

StringExpression
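The effect of before and after, including truncation at the contig boundary, can be illustrated on a toy sequence in plain Python. The helper below is a sketch with a hypothetical name: it takes the contig's sequence as a string instead of consulting a reference genome, and it returns None for an out-of-range position, mirroring the note above.

```python
def window_sequence(sequence, position, before=0, after=0):
    """Bases around a 1-based position, truncated at the ends of the contig."""
    if not 1 <= position <= len(sequence):
        return None  # not a valid coordinate on this contig
    start = max(position - 1 - before, 0)          # 0-based, inclusive
    end = min(position + after, len(sequence))     # 0-based, exclusive
    return sequence[start:end]
```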

hail.expr.functions.mendel_error_code(locus, is_female, father, mother, child)

Compute a Mendelian violation code for genotypes.

>>> father = hl.call(0, 0)
>>> mother = hl.call(1, 1)
>>> child1 = hl.call(0, 1)  # consistent
>>> child2 = hl.call(0, 0)  # Mendel error
>>> locus = hl.locus('2', 2000000)
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child1))
None
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child2))
7

Note

Ignores call phasing and assumes diploid, biallelic genotypes. Haploid calls for hemiploid samples on sex chromosomes are also acceptable input.

Notes

In the table below, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region of X and Y as defined by the reference genome, and the autosome is defined by LocusExpression.in_autosome():

  • Auto – in autosome or in PAR, or in non-PAR of X and female child

  • HemiX – in non-PAR of X and male child

  • HemiY – in non-PAR of Y and male child

Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.

Code  Dad      Mom      Kid     Copy State  Implicated
1     HomVar   HomVar   Het     Auto        Dad, Mom, Kid
2     HomRef   HomRef   Het     Auto        Dad, Mom, Kid
3     HomRef   ~HomRef  HomVar  Auto        Dad, Kid
4     ~HomRef  HomRef   HomVar  Auto        Mom, Kid
5     HomRef   HomRef   HomVar  Auto        Kid
6     HomVar   ~HomVar  HomRef  Auto        Dad, Kid
7     ~HomVar  HomVar   HomRef  Auto        Mom, Kid
8     HomVar   HomVar   HomRef  Auto        Kid
9     Any      HomVar   HomRef  HemiX       Mom, Kid
10    Any      HomRef   HomVar  HemiX       Mom, Kid
11    HomVar   Any      HomRef  HemiY       Dad, Kid
12    HomRef   Any      HomVar  HemiY       Dad, Kid

Returns

Int32Expression
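The table above can be read as a lookup: each row is a pattern over the father's, mother's and child's genotypes for a given copy state. The plain-Python sketch below transcribes the table and matches against it. It uses string genotype labels and a hypothetical helper name, takes the copy state as an argument instead of deriving it from the locus and the child's sex, and is not Hail's implementation.

```python
GENOTYPES = {"HomRef", "Het", "HomVar", "NoCall"}

# (code, dad, mom, kid, copy_state), transcribed from the table above.
MENDEL_TABLE = [
    (1, "HomVar", "HomVar", "Het", "Auto"),
    (2, "HomRef", "HomRef", "Het", "Auto"),
    (3, "HomRef", "~HomRef", "HomVar", "Auto"),
    (4, "~HomRef", "HomRef", "HomVar", "Auto"),
    (5, "HomRef", "HomRef", "HomVar", "Auto"),
    (6, "HomVar", "~HomVar", "HomRef", "Auto"),
    (7, "~HomVar", "HomVar", "HomRef", "Auto"),
    (8, "HomVar", "HomVar", "HomRef", "Auto"),
    (9, "Any", "HomVar", "HomRef", "HemiX"),
    (10, "Any", "HomRef", "HomVar", "HemiX"),
    (11, "HomVar", "Any", "HomRef", "HemiY"),
    (12, "HomRef", "Any", "HomVar", "HemiY"),
]


def _matches(pattern, genotype):
    """Match a genotype against 'Any', '~X' (complement of X) or a literal."""
    if genotype not in GENOTYPES:
        return False
    if pattern == "Any":
        return True
    if pattern.startswith("~"):
        return genotype != pattern[1:]
    return genotype == pattern


def mendel_code_sketch(dad, mom, kid, copy_state="Auto"):
    """Return the violation code for a trio, or None if no row matches."""
    for code, d, m, k, state in MENDEL_TABLE:
        if state == copy_state and _matches(d, dad) and _matches(m, mom) and _matches(k, kid):
            return code
    return None
```

With a HomRef father and a HomVar mother, a Het child matches no row, while a HomRef child matches row 7, consistent with the example at the top of this entry.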

hail.expr.functions.liftover(x, dest_reference_genome, min_match=0.95, include_strand=False)

Lift over coordinates to a different reference genome.

Examples

Lift over the locus coordinates from reference genome 'GRCh37' to 'GRCh38':

>>> hl.eval(hl.liftover(hl.locus('1', 1034245, 'GRCh37'), 'GRCh38')) # doctest: +SKIP
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')

Lift over the locus interval coordinates from reference genome 'GRCh37' to 'GRCh38':

>>> hl.eval(hl.liftover(hl.locus_interval('20', 60001, 82456, True, True, 'GRCh37'), 'GRCh38')) # doctest: +SKIP
Interval(Locus(contig='chr20', position=79360, reference_genome='GRCh38'),
         Locus(contig='chr20', position=101815, reference_genome='GRCh38'),
         True,
         True)

See Liftover variants from one coordinate system to another for more instructions on lifting over a Table or MatrixTable.

Notes

This function requires that the reference genome of x have a chain file loaded for dest_reference_genome. Use ReferenceGenome.add_liftover() to load and attach a chain file to a reference genome.

Returns None if x could not be converted.

Warning

Before using the result of liftover() as a new row key or column key, be sure to filter out missing values.

Parameters
  • x (Expression of type tlocus or tinterval of tlocus) – Locus or locus interval to lift over.

  • dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.

  • min_match (float) – Minimum ratio of bases that must remap.

  • include_strand (bool) – If True, output the result as a StructExpression with two fields: result, the converted locus or locus interval, and is_negative_strand, a boolean indicating whether the locus or locus interval was mapped to the negative strand of the destination reference genome. Otherwise, output only the converted locus or locus interval.

Returns

Expression – A locus or locus interval converted to dest_reference_genome.

hail.expr.functions.min_rep(locus, alleles)

Computes the minimal representation of a (locus, alleles) polymorphism.

Examples

>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['TAA', 'TA']))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['TA', 'T'])
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['AATAA', 'AACAA']))
Struct(locus=Locus(contig=1, position=100002, reference_genome=GRCh37), alleles=['T', 'C'])

Notes

Computing the minimal representation can cause the locus to shift right (the position can increase).

Returns

StructExpression – A tstruct expression with two fields, locus (LocusExpression) and alleles (ArrayExpression of type tstr).
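One common way to compute a minimal representation is to trim bases shared by all alleles, first from the right and then from the left, moving the position forward by one for each base removed on the left. The plain-Python sketch below follows that approach; the trimming order is an assumption rather than a statement about Hail's implementation, the helper name is hypothetical, and only plain base-string alleles are handled (no star or symbolic alleles).

```python
def min_rep_sketch(position, alleles):
    """Trim shared trailing, then leading, bases; return (position, alleles)."""
    alleles = list(alleles)
    # Trim shared trailing bases while every allele keeps at least one base.
    while all(len(a) > 1 for a in alleles) and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Trim shared leading bases; each one shifts the locus right by one.
    while all(len(a) > 1 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        position += 1
    return position, alleles
```

Applied to the two examples above, this sketch returns position 100000 with alleles ['TA', 'T'] and position 100002 with alleles ['T', 'C'].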