Genetics functions

locus(contig, pos[, reference_genome])

Construct a locus expression from a chromosome and position.

locus_from_global_position(global_pos[, ...])

Constructs a locus expression from a global position and a reference genome.

locus_interval(contig, start, end[, ...])

Construct a locus interval expression.

parse_locus(s[, reference_genome])

Construct a locus expression by parsing a string or string expression.

parse_variant(s[, reference_genome])

Construct a struct with a locus and alleles by parsing a string.

parse_locus_interval(s[, reference_genome, ...])

Construct a locus interval expression by parsing a string or string expression.

variant_str(*args)

Create a variant colon-delimited string.

call(*alleles[, phased])

Construct a call expression.

unphased_diploid_gt_index_call(gt_index)

Construct an unphased, diploid call from a genotype index.

parse_call(s)

Construct a call expression by parsing a string or string expression.

downcode(c, i)

Create a new call by setting all alleles other than i to ref.

triangle(n)

Returns the triangle number of n.

is_snp(ref, alt)

Returns True if the alleles constitute a single nucleotide polymorphism.

is_mnp(ref, alt)

Returns True if the alleles constitute a multiple nucleotide polymorphism.

is_transition(ref, alt)

Returns True if the alleles constitute a transition.

is_transversion(ref, alt)

Returns True if the alleles constitute a transversion.

is_insertion(ref, alt)

Returns True if the alleles constitute an insertion.

is_deletion(ref, alt)

Returns True if the alleles constitute a deletion.

is_indel(ref, alt)

Returns True if the alleles constitute an insertion or deletion.

is_star(ref, alt)

Returns True if the alleles constitute an upstream deletion.

is_complex(ref, alt)

Returns True if the alleles constitute a complex polymorphism.

is_strand_ambiguous(ref, alt)

Returns True if the alleles are strand ambiguous.

is_valid_contig(contig[, reference_genome])

Returns True if contig is a valid contig name in reference_genome.

is_valid_locus(contig, position[, ...])

Returns True if contig and position form a valid site in reference_genome.

contig_length(contig[, reference_genome])

Returns the length of contig in reference_genome.

allele_type(ref, alt)

Returns the type of the polymorphism as a string.

numeric_allele_type(ref, alt)

Returns the type of the polymorphism as an integer.

pl_dosage(pl)

Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior.

gp_dosage(gp)

Return expected genotype dosage from array of genotype probabilities.

get_sequence(contig, position[, before, ...])

Return the reference sequence at a given locus.

mendel_error_code(locus, is_female, father, ...)

Compute a Mendelian violation code for genotypes.

liftover(x, dest_reference_genome[, ...])

Lift over coordinates to a different reference genome.

min_rep(locus, alleles)

Computes the minimal representation of a (locus, alleles) polymorphism.

reverse_complement(s[, rna])

Reverses the string and translates base pairs into their complements.

hail.expr.functions.locus(contig, pos, reference_genome='default')[source]

Construct a locus expression from a chromosome and position.

Examples

>>> hl.eval(hl.locus("1", 10000, reference_genome='GRCh37'))
Locus(contig=1, position=10000, reference_genome=GRCh37)
Parameters:
Returns:

LocusExpression

hail.expr.functions.locus_from_global_position(global_pos, reference_genome='default')[source]

Constructs a locus expression from a global position and a reference genome. The inverse of LocusExpression.global_position().

Examples

>>> hl.eval(hl.locus_from_global_position(0))
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054))
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054, reference_genome='GRCh38'))
Locus(contig=chr22, position=1, reference_genome=GRCh38)
Parameters:
  • global_pos (int or Expression of type tint64) – Global base position along the reference genome.

  • reference_genome (str or ReferenceGenome) – Reference genome to use for converting the global position to a contig and local position.

Returns:

LocusExpression
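The mapping between a global position and a (contig, position) pair can be sketched in plain Python. This is an illustration only, not Hail's implementation. It assumes the global position is a 0-based offset into the contigs laid end to end in reference-genome order, which is consistent with global position 0 mapping to position 1 of the first contig in the examples above. The contig names and lengths in `TOY_CONTIGS` and both function names are made up for the example.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy genome: (contig name, length) in reference order.
TOY_CONTIGS = [("1", 100), ("2", 50), ("3", 25)]

# Global offset at which each contig starts: [0, 100, 150].
_STARTS = [0] + list(accumulate(length for _, length in TOY_CONTIGS))[:-1]
_TOTAL = sum(length for _, length in TOY_CONTIGS)


def toy_locus_from_global_position(global_pos):
    """Map a 0-based global offset to a (contig, 1-based position) pair."""
    if not 0 <= global_pos < _TOTAL:
        raise ValueError(f"global position {global_pos} is out of range")
    # Index of the last contig whose start offset is <= global_pos.
    idx = bisect_right(_STARTS, global_pos) - 1
    contig, _ = TOY_CONTIGS[idx]
    return contig, global_pos - _STARTS[idx] + 1


def toy_global_position(contig, position):
    """Inverse mapping: (contig, 1-based position) to a 0-based global offset."""
    names = [name for name, _ in TOY_CONTIGS]
    return _STARTS[names.index(contig)] + position - 1
```

With the toy genome, offset 100 is the first base of contig "2", and applying the two functions in sequence returns the original offset.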

hail.expr.functions.locus_interval(contig, start, end, includes_start=True, includes_end=False, reference_genome='default', invalid_missing=False)[source]

Construct a locus interval expression.

Examples

>>> hl.eval(hl.locus_interval("1", 100, 1000, reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=100, reference_genome=GRCh37),
         end=Locus(contig=1, position=1000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)
Parameters:
Returns:

IntervalExpression

hail.expr.functions.parse_locus(s, reference_genome='default')[source]

Construct a locus expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_locus('1:10000', reference_genome='GRCh37'))
Locus(contig=1, position=10000, reference_genome=GRCh37)

Notes

This method expects strings of the form contig:position, e.g. 16:29500000 or X:123456.

Parameters:
Returns:

LocusExpression

hail.expr.functions.parse_variant(s, reference_genome='default')[source]

Construct a struct with a locus and alleles by parsing a string.

Examples

>>> hl.eval(hl.parse_variant('1:100000:A:T,C', reference_genome='GRCh37'))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['A', 'T', 'C'])

Notes

This method returns an expression of type tstruct with two fields: locus (the parsed locus) and alleles (the array of allele strings).

Parameters:
Returns:

StructExpression – Struct with fields locus and alleles.

hail.expr.functions.parse_locus_interval(s, reference_genome='default', invalid_missing=False)[source]

Construct a locus interval expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_locus_interval('1:1000-2000', reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=1000, reference_genome=GRCh37),
         end=Locus(contig=1, position=2000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)
>>> hl.eval(hl.parse_locus_interval('1:start-10M', reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=1, reference_genome=GRCh37),
         end=Locus(contig=1, position=10000000, reference_genome=GRCh37),
         includes_start=True,
         includes_end=False)

Notes

The start locus must precede the end locus. The default bounds of the interval are left-inclusive and right-exclusive. To change this, add one of [ or ( at the beginning of the string for left-inclusive or left-exclusive respectively. Likewise, add one of ] or ) at the end of the string for right-inclusive or right-exclusive respectively.

There are several acceptable representations for s.

CHR1:POS1-CHR2:POS2 is the fully specified representation, and we use this to define the various shortcut representations.

In a POS field, start (Start, START) stands for 1.

In a POS field, end (End, END) stands for the contig length.

In a POS field, the qualifiers m (M) and k (K) multiply the given number by 1,000,000 and 1,000, respectively. 1.6K is short for 1600, and 29M is short for 29000000.

CHR:POS1-POS2 stands for CHR:POS1-CHR:POS2

CHR1-CHR2 stands for CHR1:START-CHR2:END

CHR stands for CHR:START-CHR:END
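The POS shortcuts described above (start, end, and the k and m qualifiers) can be sketched in plain Python. This is a rough illustration, not Hail's parser. The function name `parse_pos_field` is hypothetical, the contig length is passed in by the caller rather than looked up in a reference genome, and the sketch accepts any letter casing, which is more lenient than the three spellings listed above.

```python
from decimal import Decimal

# Multipliers for the k/K and m/M qualifiers.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_pos_field(field, contig_length):
    """Resolve one POS field to an integer position.

    `contig_length` is supplied by the caller; a real parser would look it
    up in the reference genome.
    """
    lowered = field.lower()
    if lowered == "start":
        return 1
    if lowered == "end":
        return contig_length
    if lowered[-1:] in _MULTIPLIERS:
        # Decimal avoids floating-point surprises for inputs such as "1.6K".
        value = Decimal(lowered[:-1]) * _MULTIPLIERS[lowered[-1]]
    else:
        value = Decimal(lowered)
    if value != value.to_integral_value():
        raise ValueError(f"{field!r} does not resolve to a whole position")
    return int(value)
```

For instance, `parse_pos_field("1.6K", 1000)` gives 1600 and `parse_pos_field("29M", 10**9)` gives 29000000, matching the shorthand described in the notes.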

Note

The bounds of the interval must be valid loci for the reference genome: the contig must be in the reference genome and the position must lie within the range [1, END]. There are two exceptions. If the start position is 0 and the interval is left-exclusive, the start is normalized to position 1 and left-inclusive. Likewise, if the end position is END + 1 and the interval is right-exclusive, the end is normalized to END and right-inclusive.

Parameters:
Returns:

IntervalExpression

hail.expr.functions.variant_str(*args)[source]

Create a variant colon-delimited string.

Parameters:

args – Arguments (see notes).

Returns:

StringExpression

Notes

Expects either one argument of type struct{locus: locus<RG>, alleles: array<str>}, or two arguments of type locus<RG> and array<str>. The function returns a string of the form

CHR:POS:REF:ALT1,ALT2,...ALTN
e.g.
1:1:A:T
16:250125:AAA:A,CAA

Examples

>>> hl.eval(hl.variant_str(hl.locus('1', 10000), ['A', 'T', 'C']))
'1:10000:A:T,C'
hail.expr.functions.call(*alleles, phased=False)[source]

Construct a call expression.

Examples

>>> hl.eval(hl.call(1, 0))
Call(alleles=[0, 1], phased=False)
Parameters:
  • alleles (variable-length args of int or Expression of type tint32) – List of allele indices.

  • phased (bool) – If True, preserve the order of alleles.

Returns:

CallExpression

hail.expr.functions.unphased_diploid_gt_index_call(gt_index)[source]

Construct an unphased, diploid call from a genotype index.

Examples

>>> hl.eval(hl.unphased_diploid_gt_index_call(4))
Call(alleles=[1, 2], phased=False)
Parameters:

gt_index (int or Expression of type tint32) – Unphased, diploid genotype index.

Returns:

CallExpression
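As a sketch of how a genotype index relates to an allele pair, the plain-Python functions below assume the VCF-style ordering, in which the unphased diploid genotype with alleles j <= k has index k * (k + 1) / 2 + j. That assumption is consistent with the example above (index 4 gives alleles [1, 2]), but the code is illustrative and is not Hail's implementation. All three function names are hypothetical.

```python
from math import isqrt


def triangle_py(n):
    """Triangle number n * (n + 1) / 2, using integer arithmetic."""
    return n * (n + 1) // 2


def gt_index_py(j, k):
    """Genotype index of the unphased diploid genotype with alleles j and k."""
    j, k = sorted((j, k))
    return triangle_py(k) + j


def alleles_from_gt_index_py(index):
    """Inverse of gt_index_py: recover the allele pair (j, k) with j <= k."""
    if index < 0:
        raise ValueError("genotype index must be non-negative")
    # Largest k with triangle_py(k) <= index.
    k = (isqrt(8 * index + 1) - 1) // 2
    return index - triangle_py(k), k
```

Under this ordering the first six indices enumerate 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.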

hail.expr.functions.parse_call(s)[source]

Construct a call expression by parsing a string or string expression.

Examples

>>> hl.eval(hl.parse_call('0|2'))
Call(alleles=[0, 2], phased=True)

Notes

This method expects strings in the following format:

Ploidy   Phased         Unphased
0        |-             -
1        |i             i
2        i|j            i/j
3        i|j|k          i/j/k
N        i|j|k|...|N    i/j/k/.../N

Parameters:

s (str or StringExpression) – String to parse.

Returns:

CallExpression
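The string formats listed in the notes above can be read with a few lines of plain Python. This is a minimal sketch, not Hail's parser: the name `parse_call_string` is hypothetical, and sorting the alleles of unphased calls is an assumption modelled on the output of the hl.call(1, 0) example earlier on this page.

```python
def parse_call_string(s):
    """Return (alleles, phased) for a call string such as '0|2' or '1/1'."""
    if s == "-":
        return [], False           # ploidy 0, unphased
    if s == "|-":
        return [], True            # ploidy 0, phased
    has_bar, has_slash = "|" in s, "/" in s
    if has_bar and has_slash:
        raise ValueError(f"mixed phasing separators in {s!r}")
    if s.startswith("|"):
        return [int(s[1:])], True  # ploidy 1, phased: "|i"
    if has_bar:
        # Phased: allele order is meaningful, so keep it as written.
        return [int(a) for a in s.split("|")], True
    # Unphased (including haploid "i"): assumed order-free, so sort.
    return sorted(int(a) for a in s.split("/")), False
```

Malformed input such as "0|x" raises ValueError from the int() conversion.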

hail.expr.functions.downcode(c, i)[source]

Create a new call by setting all alleles other than i to ref.

Examples

Preserve the third allele and downcode all other alleles to reference.

>>> hl.eval(hl.downcode(hl.call(1, 2), 2))
Call(alleles=[0, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(2, 2), 2))
Call(alleles=[1, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(0, 1), 2))
Call(alleles=[0, 0], phased=False)
Parameters:
  • c (CallExpression) – A call.

  • i (Expression of type tint32) – The index of the allele that will be sent to the alternate allele. All other alleles will be downcoded to reference.

Returns:

CallExpression
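The recoding rule can be sketched on plain lists of allele indices. This is an illustration under assumptions rather than Hail's code: `downcode_alleles` is a hypothetical name, and sorting the result for unphased calls simply mirrors how the examples above display their alleles.

```python
def downcode_alleles(alleles, i, phased=False):
    """Recode allele indices: i becomes 1 (alt), everything else becomes 0 (ref)."""
    recoded = [1 if allele == i else 0 for allele in alleles]
    # Unphased calls are shown with alleles in ascending order in the examples.
    return recoded if phased else sorted(recoded)
```

Applied to the three examples above, `downcode_alleles([1, 2], 2)`, `downcode_alleles([2, 2], 2)` and `downcode_alleles([0, 1], 2)` give [0, 1], [1, 1] and [0, 0].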

hail.expr.functions.triangle(n)[source]

Returns the triangle number of n.

Examples

>>> hl.eval(hl.triangle(3))
6

Notes

The calculation is n * (n + 1) / 2.

Parameters:

n (Expression of type tint32)

Returns:

Expression of type tint32

hail.expr.functions.is_snp(ref, alt)[source]

Returns True if the alleles constitute a single nucleotide polymorphism.

Examples

>>> hl.eval(hl.is_snp('A', 'T'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_mnp(ref, alt)[source]

Returns True if the alleles constitute a multiple nucleotide polymorphism.

Examples

>>> hl.eval(hl.is_mnp('AA', 'GT'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_transition(ref, alt)[source]

Returns True if the alleles constitute a transition.

Examples

>>> hl.eval(hl.is_transition('A', 'T'))
False
>>> hl.eval(hl.is_transition('AAA', 'AGA'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_transversion(ref, alt)[source]

Returns True if the alleles constitute a transversion.

Examples

>>> hl.eval(hl.is_transversion('A', 'T'))
True
>>> hl.eval(hl.is_transversion('AAA', 'AGA'))
False
Parameters:
Returns:

BooleanExpression
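A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); a transversion swaps a purine with a pyrimidine. The plain-Python sketch below applies that definition. It assumes, based on the 'AAA' / 'AGA' examples above, that equal-length alleles differing at exactly one position are treated as a single substitution. The function names are hypothetical and the code is not Hail's implementation.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def single_mismatch(ref, alt):
    """Return the one differing (ref_base, alt_base) pair, or None if the
    alleles differ in length or do not differ at exactly one position."""
    if len(ref) != len(alt):
        return None
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    return diffs[0] if len(diffs) == 1 else None


def is_transition_py(ref, alt):
    """True for a purine<->purine or pyrimidine<->pyrimidine substitution."""
    pair = single_mismatch(ref, alt)
    if pair is None:
        return False
    return set(pair) <= PURINES or set(pair) <= PYRIMIDINES


def is_transversion_py(ref, alt):
    """True for a substitution that swaps a purine with a pyrimidine."""
    pair = single_mismatch(ref, alt)
    if pair is None:
        return False
    r, a = pair
    return (r in PURINES and a in PYRIMIDINES) or (r in PYRIMIDINES and a in PURINES)
```

The sketch reproduces the four documented results: A to T is a transversion, and AAA to AGA is a transition.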

hail.expr.functions.is_insertion(ref, alt)[source]

Returns True if the alleles constitute an insertion.

Examples

>>> hl.eval(hl.is_insertion('A', 'ATT'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_deletion(ref, alt)[source]

Returns True if the alleles constitute a deletion.

Examples

>>> hl.eval(hl.is_deletion('ATT', 'A'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_indel(ref, alt)[source]

Returns True if the alleles constitute an insertion or deletion.

Examples

>>> hl.eval(hl.is_indel('ATT', 'A'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_star(ref, alt)[source]

Returns True if the alleles constitute an upstream deletion.

Examples

>>> hl.eval(hl.is_star('A', '*'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_complex(ref, alt)[source]

Returns True if the alleles constitute a complex polymorphism.

Examples

>>> hl.eval(hl.is_complex('ATT', 'GCAC'))
True
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_strand_ambiguous(ref, alt)[source]

Returns True if the alleles are strand ambiguous.

Strand ambiguous allele pairs are A/T, T/A, C/G, and G/C where the first allele is ref and the second allele is alt.

Examples

>>> hl.eval(hl.is_strand_ambiguous('A', 'T'))
True
Parameters:
Returns:

BooleanExpression
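The four pairs listed above are exactly the single-base pairs in which alt is the Watson-Crick complement of ref, so the variant reads the same pair of bases on either strand. A minimal plain-Python check, with the hypothetical name `is_strand_ambiguous_py`, might look like this:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous_py(ref, alt):
    """True when ref is a single base and alt is its complement."""
    return COMPLEMENT.get(ref) == alt
```

A pair such as A/G is not ambiguous: on the opposite strand it would appear as T/C, which is distinguishable from A/G.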

hail.expr.functions.is_valid_contig(contig, reference_genome='default')[source]

Returns True if contig is a valid contig name in reference_genome.

Examples

>>> hl.eval(hl.is_valid_contig('1', reference_genome='GRCh37'))
True
>>> hl.eval(hl.is_valid_contig('chr1', reference_genome='GRCh37'))
False
Parameters:
Returns:

BooleanExpression

hail.expr.functions.is_valid_locus(contig, position, reference_genome='default')[source]

Returns True if contig and position form a valid site in reference_genome.

Examples

>>> hl.eval(hl.is_valid_locus('1', 324254, 'GRCh37'))
True
>>> hl.eval(hl.is_valid_locus('chr1', 324254, 'GRCh37'))
False
Parameters:
Returns:

BooleanExpression

hail.expr.functions.contig_length(contig, reference_genome='default')[source]

Returns the length of contig in reference_genome.

Examples

>>> hl.eval(hl.contig_length('5', reference_genome='GRCh37'))
180915260
Parameters:
Returns:

Int32Expression

hail.expr.functions.allele_type(ref, alt)[source]

Returns the type of the polymorphism as a string.

Examples

>>> hl.eval(hl.allele_type('A', 'T'))
'SNP'
>>> hl.eval(hl.allele_type('ATT', 'A'))
'Deletion'

Notes

The possible return values are:
  • "SNP"

  • "MNP"

  • "Insertion"

  • "Deletion"

  • "Complex"

  • "Star"

  • "Symbolic"

  • "Unknown"

Parameters:
Returns:

StringExpression

hail.expr.functions.numeric_allele_type(ref, alt)[source]

Returns the type of the polymorphism as an integer. The value returned is the integer value of AlleleType representing that kind of polymorphism.

Examples

>>> hl.eval(hl.numeric_allele_type('A', 'T')) == AlleleType.SNP
True

Notes

The values of AlleleType are not stable and thus should not be relied upon across hail versions.

hail.expr.functions.pl_dosage(pl)[source]

Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. Only defined for bi-allelic variants. The pl argument must be length 3.

For a PL array [a, b, c], let:

\[a^\prime = 10^{-a/10} \\ b^\prime = 10^{-b/10} \\ c^\prime = 10^{-c/10}\]

The genotype dosage is given by:

\[\frac{b^\prime + 2 c^\prime}{a^\prime + b^\prime + c^\prime}\]

Examples

>>> hl.eval(hl.pl_dosage([5, 10, 100]))
0.24025307377482674
Parameters:

pl (ArrayNumericExpression of type tint32) – Length 3 array of bi-allelic Phred-scaled genotype likelihoods

Returns:

Expression of type tfloat64
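The dosage formula above translates directly into plain Python. The sketch below is only a worked version of that formula on a Python list, not Hail's implementation, and the name `pl_dosage_py` is hypothetical.

```python
def pl_dosage_py(pl):
    """Expected alt-allele dosage from a length-3 Phred-scaled likelihood array."""
    if len(pl) != 3:
        raise ValueError("pl must have exactly three entries (bi-allelic)")
    # Convert each Phred-scaled value back to a linear-scale likelihood.
    a, b, c = (10 ** (-x / 10) for x in pl)
    # Weight the het likelihood by 1 and the hom-var likelihood by 2,
    # then normalize by the total (uniform prior).
    return (b + 2 * c) / (a + b + c)
```

For the PL array [5, 10, 100] from the example, this evaluates to approximately 0.2402530738, in agreement with the value shown above.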

hail.expr.functions.gp_dosage(gp)[source]

Return expected genotype dosage from array of genotype probabilities.

Examples

>>> hl.eval(hl.gp_dosage([0.0, 0.5, 0.5]))
1.5

Notes

This function is only defined for bi-allelic variants. The gp argument must be length 3. The value is gp[1] + 2 * gp[2].

Parameters:

gp (Expression of type tarray of tfloat64) – Length 3 array of bi-allelic genotype probabilities

Returns:

Expression of type tfloat64

hail.expr.functions.get_sequence(contig, position, before=0, after=0, reference_genome='default')[source]

Return the reference sequence at a given locus.

Examples

Return the reference allele for 'GRCh37' at the locus '1:45323':

>>> hl.eval(hl.get_sequence('1', 45323, reference_genome='GRCh37')) 
"T"

Notes

This function requires that the reference genome have an attached reference sequence. Use ReferenceGenome.add_sequence() to load and attach a reference sequence to a reference genome.

Returns None if contig and position are not valid coordinates in reference_genome.

Parameters:
  • contig (Expression of type tstr) – Locus contig.

  • position (Expression of type tint32) – Locus position.

  • before (Expression of type tint32, optional) – Number of bases to include before the locus of interest. Truncates at contig boundary.

  • after (Expression of type tint32, optional) – Number of bases to include after the locus of interest. Truncates at contig boundary.

  • reference_genome (str or ReferenceGenome) – Reference genome to use. Must have a reference sequence available.

Returns:

StringExpression

hail.expr.functions.mendel_error_code(locus, is_female, father, mother, child)[source]

Compute a Mendelian violation code for genotypes.

>>> father = hl.call(0, 0)
>>> mother = hl.call(1, 1)
>>> child1 = hl.call(0, 1)  # consistent
>>> child2 = hl.call(0, 0)  # Mendel error
>>> locus = hl.locus('2', 2000000)
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child1))
None
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child2))
7

Note

Ignores call phasing, and assumes diploid and biallelic. Haploid calls for hemiploid samples on sex chromosomes are also acceptable input.

Notes

In the table below, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region of X and Y defined by the reference genome and the autosome is defined by LocusExpression.in_autosome():

  • Auto – in autosome or in PAR, or in non-PAR of X and female child

  • HemiX – in non-PAR of X and male child

  • HemiY – in non-PAR of Y and male child

Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.

Code   Dad       Mom       Kid      Copy State   Implicated
1      HomVar    HomVar    Het      Auto         Dad, Mom, Kid
2      HomRef    HomRef    Het      Auto         Dad, Mom, Kid
3      HomRef    ~HomRef   HomVar   Auto         Dad, Kid
4      ~HomRef   HomRef    HomVar   Auto         Mom, Kid
5      HomRef    HomRef    HomVar   Auto         Kid
6      HomVar    ~HomVar   HomRef   Auto         Dad, Kid
7      ~HomVar   HomVar    HomRef   Auto         Mom, Kid
8      HomVar    HomVar    HomRef   Auto         Kid
9      Any       HomVar    HomRef   HemiX        Mom, Kid
10     Any       HomRef    HomVar   HemiX        Mom, Kid
11     HomVar    Any       HomRef   HemiY        Dad, Kid
12     HomRef    Any       HomVar   HemiY        Dad, Kid

Parameters:
Returns:

Int32Expression
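The autosomal rows of the table (codes 1 to 8) can be expressed as a small lookup in plain Python. This sketch covers only the Auto copy state and leaves out the HemiX and HemiY rows. It is an illustration of the table, not Hail's implementation, and the name `autosomal_mendel_code` and the encoding of genotypes as alternate-allele counts are assumptions made for the example.

```python
def autosomal_mendel_code(dad, mom, kid):
    """Return the table's code (1-8) for an autosomal trio, or None.

    Each genotype is the number of alternate alleles (0 = HomRef, 1 = Het,
    2 = HomVar) or None for NoCall.
    """
    HOM_REF, HET, HOM_VAR = 0, 1, 2
    if kid == HET:
        if dad == HOM_VAR and mom == HOM_VAR:
            return 1
        if dad == HOM_REF and mom == HOM_REF:
            return 2
    elif kid == HOM_VAR:
        if dad == HOM_REF and mom == HOM_REF:
            return 5
        if dad == HOM_REF:   # mom is ~HomRef
            return 3
        if mom == HOM_REF:   # dad is ~HomRef
            return 4
    elif kid == HOM_REF:
        if dad == HOM_VAR and mom == HOM_VAR:
            return 8
        if dad == HOM_VAR:   # mom is ~HomVar
            return 6
        if mom == HOM_VAR:   # dad is ~HomVar
            return 7
    return None
```

With a HomRef father and HomVar mother, a Het child is consistent (None) and a HomRef child yields code 7, matching the example at the top of this entry.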

hail.expr.functions.liftover(x, dest_reference_genome, min_match=0.95, include_strand=False)[source]

Lift over coordinates to a different reference genome.

Examples

Lift over the locus coordinates from reference genome 'GRCh37' to 'GRCh38':

>>> hl.eval(hl.liftover(hl.locus('1', 1034245, 'GRCh37'), 'GRCh38')) 
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')

Lift over the locus interval coordinates from reference genome 'GRCh37' to 'GRCh38':

>>> hl.eval(hl.liftover(hl.locus_interval('20', 60001, 82456, True, True, 'GRCh37'), 'GRCh38')) 
Interval(Locus(contig='chr20', position=79360, reference_genome='GRCh38'),
         Locus(contig='chr20', position=101815, reference_genome='GRCh38'),
         True,
         True)

See Liftover variants from one coordinate system to another for more instructions on lifting over a Table or MatrixTable.

Notes

This function requires that the reference genome of x have a chain file loaded for dest_reference_genome. Use ReferenceGenome.add_liftover() to load and attach a chain file to a reference genome.

Returns None if x could not be converted.

Warning

Before using the result of liftover() as a new row key or column key, be sure to filter out missing values.

Parameters:
  • x (Expression of type tlocus or tinterval of tlocus) – Locus or locus interval to lift over.

  • dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.

  • min_match (float) – Minimum ratio of bases that must remap.

  • include_strand (bool) – If True, output the result as a StructExpression with the first field result being the locus or locus interval and the second field is_negative_strand is a boolean indicating whether the locus or locus interval has been mapped to the negative strand of the destination reference genome. Otherwise, output the converted locus or locus interval.

Returns:

Expression – A locus or locus interval converted to dest_reference_genome.

hail.expr.functions.min_rep(locus, alleles)[source]

Computes the minimal representation of a (locus, alleles) polymorphism.

Examples

>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['TAA', 'TA']))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['TA', 'T'])
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['AATAA', 'AACAA']))
Struct(locus=Locus(contig=1, position=100002, reference_genome=GRCh37), alleles=['T', 'C'])

Notes

Computing the minimal representation can cause the locus to shift right (the position can increase).

Parameters:
Returns:

StructExpression – A tstruct expression with two fields, locus (LocusExpression) and alleles (ArrayExpression of type tstr).
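One common way to compute a minimal representation is to trim bases shared by every allele, first from the right and then from the left, while keeping at least one base in each allele. The plain-Python sketch below follows that approach and reproduces the two examples above. It is an assumption about the general technique, not a copy of Hail's implementation; the name `min_rep_py` is hypothetical, and the contig is omitted because trimming only changes the position.

```python
def min_rep_py(position, alleles):
    """Trim bases shared by all alleles; return (new_position, new_alleles).

    Expects a non-empty list of non-empty allele strings.
    """
    alleles = list(alleles)
    # Trim the shared suffix first; this never moves the position.
    while min(map(len, alleles)) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Then trim the shared prefix; each trimmed base shifts the position right.
    while min(map(len, alleles)) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        position += 1
    return position, alleles
```

Only prefix trimming moves the locus, which is why the position can increase but never decrease.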

hail.expr.functions.reverse_complement(s, rna=False)[source]

Reverses the string and translates base pairs into their complements.

Examples

>>> bases = hl.literal('NNGATTACA')
>>> hl.eval(hl.reverse_complement(bases))
'TGTAATCNN'
Parameters:
  • s (StringExpression) – Base string.

  • rna (bool) – If True, pair adenine (A) with uracil (U) instead of thymine (T).

Returns:

StringExpression
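For intuition, the same operation on an ordinary Python string can be sketched with str.translate. This is not Hail's implementation: the name `reverse_complement_py` is hypothetical, only uppercase bases are handled, and the RNA table (A pairs with U, U pairs with A) is an assumption about how the rna flag behaves. Characters outside the table, such as N, pass through unchanged.

```python
# Translation tables mapping each base to its complement.
_DNA = str.maketrans("ACGT", "TGCA")
_RNA = str.maketrans("ACGU", "UGCA")


def reverse_complement_py(s, rna=False):
    """Reverse s and swap each base for its complement."""
    return s[::-1].translate(_RNA if rna else _DNA)
```

Calling `reverse_complement_py('NNGATTACA')` returns 'TGTAATCNN', the same string as the example above.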