Genetics functions
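| locus | Construct a locus expression from a chromosome and position. |
| locus_from_global_position | Constructs a locus expression from a global position and a reference genome. |
| locus_interval | Construct a locus interval expression. |
| parse_locus | Construct a locus expression by parsing a string or string expression. |
| parse_variant | Construct a struct with a locus and alleles by parsing a string. |
| parse_locus_interval | Construct a locus interval expression by parsing a string or string expression. |
| variant_str | Create a variant colon-delimited string. |
| call | Construct a call expression. |
| unphased_diploid_gt_index_call | Construct an unphased, diploid call from a genotype index. |
| parse_call | Construct a call expression by parsing a string or string expression. |
| downcode | Create a new call by setting all alleles other than i to ref. |
| triangle | Returns the triangle number of n. |
| is_snp | Returns True if the alleles constitute a single nucleotide polymorphism. |
| is_mnp | Returns True if the alleles constitute a multiple nucleotide polymorphism. |
| is_transition | Returns True if the alleles constitute a transition. |
| is_transversion | Returns True if the alleles constitute a transversion. |
| is_insertion | Returns True if the alleles constitute an insertion. |
| is_deletion | Returns True if the alleles constitute a deletion. |
| is_indel | Returns True if the alleles constitute an insertion or deletion. |
| is_star | Returns True if the alleles constitute an upstream deletion. |
| is_complex | Returns True if the alleles constitute a complex polymorphism. |
| is_strand_ambiguous | Returns True if the alleles are strand ambiguous. |
| is_valid_contig | Returns True if contig is a valid contig name in reference_genome. |
| is_valid_locus | Returns True if contig and position is a valid site in reference_genome. |
| contig_length | Returns the length of contig in reference_genome. |
| allele_type | Returns the type of the polymorphism as a string. |
| numeric_allele_type | Returns the type of the polymorphism as an integer. |
| pl_dosage | Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. |
| gp_dosage | Return expected genotype dosage from array of genotype probabilities. |
| get_sequence | Return the reference sequence at a given locus. |
| mendel_error_code | Compute a Mendelian violation code for genotypes. |
| liftover | Lift over coordinates to a different reference genome. |
| min_rep | Computes the minimal representation of a (locus, alleles) polymorphism. |
| reverse_complement | Reverses the string and translates base pairs into their complements. |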
- hail.expr.functions.locus(contig, pos, reference_genome='default')
Construct a locus expression from a chromosome and position.
Examples
>>> hl.eval(hl.locus("1", 10000, reference_genome='GRCh37'))
Locus(contig=1, position=10000, reference_genome=GRCh37)
- Parameters:
contig (str or StringExpression) – Chromosome.
pos (int or Expression of type tint32) – Base position along the chromosome.
reference_genome (str or ReferenceGenome) – Reference genome to use.
- Returns:
- hail.expr.functions.locus_from_global_position(global_pos, reference_genome='default')
Constructs a locus expression from a global position and a reference genome. The inverse of LocusExpression.global_position().
Examples
>>> hl.eval(hl.locus_from_global_position(0))
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054))
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054, reference_genome='GRCh38'))
Locus(contig=chr22, position=1, reference_genome=GRCh38)
- Parameters:
global_pos (int or Expression of type tint64) – Global base position along the reference genome.
reference_genome (str or ReferenceGenome) – Reference genome to use for converting the global position to a contig and local position.
- Returns:
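
The relationship between a global position and a locus can be sketched in plain Python. This is an illustration only, not Hail's implementation: the contig table TOY_CONTIGS and both helper functions are hypothetical, and the sketch assumes (consistent with the first example above) that global position 0 corresponds to position 1 of the first contig.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy contigs (name, length); real reference genomes are far larger.
TOY_CONTIGS = [("1", 100), ("2", 50), ("3", 25)]


def toy_global_position(contig, position, contigs=TOY_CONTIGS):
    """Map a 1-based (contig, position) to a 0-based global offset."""
    offset = 0
    for name, length in contigs:
        if name == contig:
            if not 1 <= position <= length:
                raise ValueError(f"position {position} is outside contig {contig}")
            return offset + position - 1
        offset += length
    raise ValueError(f"unknown contig {contig!r}")


def toy_locus_from_global_position(global_pos, contigs=TOY_CONTIGS):
    """Invert toy_global_position: find the contig that contains the offset."""
    ends = list(accumulate(length for _, length in contigs))  # cumulative contig ends
    if not 0 <= global_pos < ends[-1]:
        raise ValueError(f"global position {global_pos} is out of range")
    index = bisect_right(ends, global_pos)
    start = ends[index - 1] if index > 0 else 0
    return contigs[index][0], global_pos - start + 1
```

Keeping the cumulative ends in a sorted list lets the lookup use binary search, which matters little for three contigs but scales to genomes with many contigs.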
- hail.expr.functions.locus_interval(contig, start, end, includes_start=True, includes_end=False, reference_genome='default', invalid_missing=False)
Construct a locus interval expression.
Examples
>>> hl.eval(hl.locus_interval("1", 100, 1000, reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=100, reference_genome=GRCh37), end=Locus(contig=1, position=1000, reference_genome=GRCh37), includes_start=True, includes_end=False)
- Parameters:
contig (StringExpression) – Contig name.
start (Int32Expression) – Starting base position.
end (Int32Expression) – End base position.
includes_start (BooleanExpression) – If True, interval includes start point.
includes_end (BooleanExpression) – If True, interval includes end point.
reference_genome (str or hail.genetics.ReferenceGenome) – Reference genome to use.
invalid_missing (BooleanExpression) – If True, invalid intervals are set to NA rather than causing an exception.
- Returns:
- hail.expr.functions.parse_locus(s, reference_genome='default')
Construct a locus expression by parsing a string or string expression.
Examples
>>> hl.eval(hl.parse_locus('1:10000', reference_genome='GRCh37'))
Locus(contig=1, position=10000, reference_genome=GRCh37)
Notes
This method expects strings of the form contig:position, e.g. 16:29500000 or X:123456.
- Parameters:
s (str or StringExpression) – String to parse.
reference_genome (str or ReferenceGenome) – Reference genome to use.
- Returns:
- hail.expr.functions.parse_variant(s, reference_genome='default')
Construct a struct with a locus and alleles by parsing a string.
Examples
>>> hl.eval(hl.parse_variant('1:100000:A:T,C', reference_genome='GRCh37'))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['A', 'T', 'C'])
Notes
This method returns an expression of type tstruct with the fields locus and alleles.
- Parameters:
s (StringExpression) – String to parse.
reference_genome (str or ReferenceGenome) – Reference genome to use.
- Returns:
StructExpression – Struct with fields locus and alleles.
- hail.expr.functions.parse_locus_interval(s, reference_genome='default', invalid_missing=False)
Construct a locus interval expression by parsing a string or string expression.
Examples
>>> hl.eval(hl.parse_locus_interval('1:1000-2000', reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=1000, reference_genome=GRCh37), end=Locus(contig=1, position=2000, reference_genome=GRCh37), includes_start=True, includes_end=False)
>>> hl.eval(hl.parse_locus_interval('1:start-10M', reference_genome='GRCh37'))
Interval(start=Locus(contig=1, position=1, reference_genome=GRCh37), end=Locus(contig=1, position=10000000, reference_genome=GRCh37), includes_start=True, includes_end=False)
Notes
The start locus must precede the end locus. The default bounds of the interval are left-inclusive and right-exclusive. To change this, add one of [ or ( at the beginning of the string for left-inclusive or left-exclusive, respectively. Likewise, add one of ] or ) at the end of the string for right-inclusive or right-exclusive, respectively.
There are several acceptable representations for s.
CHR1:POS1-CHR2:POS2 is the fully specified representation, and we use this to define the various shortcut representations.
In a POS field, start (Start, START) stands for 1.
In a POS field, end (End, END) stands for the contig length.
In a POS field, the qualifiers m (M) and k (K) multiply the given number by 1,000,000 and 1,000, respectively. 1.6K is short for 1600, and 29M is short for 29000000.
CHR:POS1-POS2 stands for CHR:POS1-CHR:POS2
CHR1-CHR2 stands for CHR1:START-CHR2:END
CHR stands for CHR:START-CHR:END
Note
The bounds of the interval must be valid loci for the reference genome (the contig is in the reference genome and the position is within the range [1-END]), with two exceptions. A position of 0 on a left-exclusive interval is normalized to position 1, left-inclusive. Likewise, a position of END + 1 on a right-exclusive interval is normalized to END, right-inclusive.
- Parameters:
s (str or StringExpression) – String to parse.
reference_genome (str or hail.genetics.ReferenceGenome) – Reference genome to use.
invalid_missing (BooleanExpression) – If True, invalid intervals are set to NA rather than causing an exception.
- Returns:
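
The POS shortcuts described in the notes can be illustrated with a small plain-Python parser. This is a sketch under stated assumptions, not Hail's parser: the function parse_pos and its contig_length argument are hypothetical, it handles a single POS field only, and it lowercases the input, so it accepts more spellings of start and end than the three listed above.

```python
from decimal import Decimal


def parse_pos(field, contig_length):
    """Resolve one POS field (e.g. 'start', 'END', '1.6K', '29M') to an integer."""
    text = field.strip().lower()
    if text == "start":
        return 1
    if text == "end":
        return contig_length
    multiplier = 1
    if text.endswith("k"):
        multiplier, text = 1_000, text[:-1]
    elif text.endswith("m"):
        multiplier, text = 1_000_000, text[:-1]
    # Decimal avoids binary floating-point rounding for inputs such as '1.6K'.
    value = Decimal(text) * multiplier
    if value != value.to_integral_value():
        raise ValueError(f"{field!r} does not resolve to a whole base position")
    return int(value)
```

For example, with an arbitrary contig length of 5000, parse_pos('1.6K', 5000) gives 1600 and parse_pos('End', 5000) gives 5000.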
- hail.expr.functions.variant_str(*args)
Create a variant colon-delimited string.
- Parameters:
args – Arguments (see notes).
- Returns:
Notes
Expects either one argument of type struct{locus: locus<RG>, alleles: array<str>}, or two arguments of type locus<RG> and array<str>. The function returns a string of the form
CHR:POS:REF:ALT1,ALT2,...ALTN
e.g.
1:1:A:T
16:250125:AAA:A,CAA
Examples
>>> hl.eval(hl.variant_str(hl.locus('1', 10000), ['A', 'T', 'C']))
'1:10000:A:T,C'
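
The output format can be reproduced in plain Python as follows. The helper format_variant is hypothetical and takes the contig and position separately rather than a Hail locus; it is meant only to show how the fields are joined.

```python
def format_variant(contig, position, alleles):
    """Join a locus and its alleles as CHR:POS:REF:ALT1,ALT2,...,ALTN."""
    if len(alleles) < 2:
        raise ValueError("expected a reference allele and at least one alternate")
    ref, *alts = alleles  # the first allele is the reference
    return f"{contig}:{position}:{ref}:{','.join(alts)}"
```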
- hail.expr.functions.call(*alleles, phased=False)
Construct a call expression.
Examples
>>> hl.eval(hl.call(1, 0))
Call(alleles=[0, 1], phased=False)
- Parameters:
alleles (variable-length args of int or Expression of type tint32) – List of allele indices.
phased (bool) – If True, preserve the order of alleles.
- Returns:
- hail.expr.functions.unphased_diploid_gt_index_call(gt_index)
Construct an unphased, diploid call from a genotype index.
Examples
>>> hl.eval(hl.unphased_diploid_gt_index_call(4))
Call(alleles=[1, 2], phased=False)
- Parameters:
gt_index (int or Expression of type tint32) – Unphased, diploid genotype index.
- Returns:
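
The mapping between a genotype index and an allele pair can be sketched in plain Python. The sketch assumes the standard VCF genotype ordering, in which the unphased pair (j, k) with j <= k has index k * (k + 1) / 2 + j; this agrees with the example above, where index 4 yields alleles [1, 2]. The helper names are hypothetical and this is not Hail's implementation.

```python
def triangle_number(n):
    """Return n * (n + 1) / 2 as an integer."""
    return n * (n + 1) // 2


def gt_index_from_alleles(j, k):
    """Index of the unphased diploid genotype with alleles j and k."""
    j, k = sorted((j, k))
    return triangle_number(k) + j


def alleles_from_gt_index(gt_index):
    """Invert gt_index_from_alleles, returning (j, k) with j <= k."""
    if gt_index < 0:
        raise ValueError("genotype index must be non-negative")
    k = 0
    # Advance k until the next triangle number would exceed the index.
    while triangle_number(k + 1) <= gt_index:
        k += 1
    return gt_index - triangle_number(k), k
```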
- hail.expr.functions.parse_call(s)
Construct a call expression by parsing a string or string expression.
Examples
>>> hl.eval(hl.parse_call('0|2'))
Call(alleles=[0, 2], phased=True)
Notes
This method expects strings in the following format:
ploidy  Phased       Unphased
0       |-           -
1       |i           i
2       i|j          i/j
3       i|j|k        i/j/k
N       i|j|k|...|N  i/j/k/.../N
- Parameters:
s (str or StringExpression) – String to parse.
- Returns:
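
A plain-Python reading of the format table is sketched below. The function parse_call_string is hypothetical and returns a (alleles, phased) tuple rather than a Hail call. It assumes that unphased alleles are stored in sorted order, as in the hl.call(1, 0) example earlier on this page, and that a string mixing | and / is invalid.

```python
def parse_call_string(s):
    """Parse strings such as '0|2', '1/0', '|1', '1', '-' or '|-'."""
    if s == "-":
        return (), False  # ploidy 0, unphased
    if s == "|-":
        return (), True  # ploidy 0, phased
    if "|" in s and "/" in s:
        raise ValueError(f"mixed phasing separators in {s!r}")
    if s.startswith("|"):
        return (int(s[1:]),), True  # ploidy 1, phased
    if "|" in s:
        return tuple(int(a) for a in s.split("|")), True  # order is preserved
    # Unphased: allele order carries no meaning, so store it sorted.
    return tuple(sorted(int(a) for a in s.split("/"))), False
```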
- hail.expr.functions.downcode(c, i)
Create a new call by setting all alleles other than i to ref.
Examples
Preserve the third allele and downcode all other alleles to reference.
>>> hl.eval(hl.downcode(hl.call(1, 2), 2))
Call(alleles=[0, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(2, 2), 2))
Call(alleles=[1, 1], phased=False)
>>> hl.eval(hl.downcode(hl.call(0, 1), 2))
Call(alleles=[0, 0], phased=False)
- Parameters:
c (CallExpression) – A call.
i (Expression of type tint32) – The index of the allele to recode as the alternate allele. All other alleles will be downcoded to reference.
- Returns:
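
The recoding shown in the examples can be mimicked in plain Python for unphased calls. The helper downcode_alleles is hypothetical and works on a list of allele indices; phased calls are outside this sketch.

```python
def downcode_alleles(alleles, i):
    """Recode allele i as 1 and every other allele as 0 (unphased calls only)."""
    recoded = [1 if allele == i else 0 for allele in alleles]
    return sorted(recoded)  # unphased calls are shown with sorted alleles
```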
- hail.expr.functions.triangle(n)
Returns the triangle number of n.
Examples
>>> hl.eval(hl.triangle(3))
6
Notes
The calculation is n * (n + 1) / 2.
- Parameters:
n (Expression of type tint32)
- Returns:
Expression of type tint32
- hail.expr.functions.is_snp(ref, alt)
Returns True if the alleles constitute a single nucleotide polymorphism.
Examples
>>> hl.eval(hl.is_snp('A', 'T'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_mnp(ref, alt)
Returns True if the alleles constitute a multiple nucleotide polymorphism.
Examples
>>> hl.eval(hl.is_mnp('AA', 'GT'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_transition(ref, alt)
Returns True if the alleles constitute a transition.
Examples
>>> hl.eval(hl.is_transition('A', 'T'))
False
>>> hl.eval(hl.is_transition('AAA', 'AGA'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_transversion(ref, alt)
Returns True if the alleles constitute a transversion.
Examples
>>> hl.eval(hl.is_transversion('A', 'T'))
True
>>> hl.eval(hl.is_transversion('AAA', 'AGA'))
False
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
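
The distinction between transitions and transversions can be sketched in plain Python. A transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); a transversion swaps across the two classes. The helpers below are hypothetical and assume, consistent with the 'AAA' / 'AGA' examples, that the alleles have equal length and differ at exactly one base; they are not Hail's implementation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def single_mismatch(ref, alt):
    """Return the (ref_base, alt_base) pair if exactly one base differs, else None."""
    if len(ref) != len(alt):
        return None
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    return diffs[0] if len(diffs) == 1 else None


def looks_like_transition(ref, alt):
    """True for a purine-purine or pyrimidine-pyrimidine substitution."""
    pair = single_mismatch(ref, alt)
    return pair is not None and set(pair) in (PURINES, PYRIMIDINES)


def looks_like_transversion(ref, alt):
    """True for a substitution between a purine and a pyrimidine."""
    pair = single_mismatch(ref, alt)
    if pair is None:
        return False
    r, a = pair
    return (r in PURINES and a in PYRIMIDINES) or (r in PYRIMIDINES and a in PURINES)
```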
- hail.expr.functions.is_insertion(ref, alt)
Returns True if the alleles constitute an insertion.
Examples
>>> hl.eval(hl.is_insertion('A', 'ATT'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_deletion(ref, alt)
Returns True if the alleles constitute a deletion.
Examples
>>> hl.eval(hl.is_deletion('ATT', 'A'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_indel(ref, alt)
Returns True if the alleles constitute an insertion or deletion.
Examples
>>> hl.eval(hl.is_indel('ATT', 'A'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_star(ref, alt)
Returns True if the alleles constitute an upstream deletion.
Examples
>>> hl.eval(hl.is_star('A', '*'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_complex(ref, alt)
Returns True if the alleles constitute a complex polymorphism.
Examples
>>> hl.eval(hl.is_complex('ATT', 'GCAC'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_strand_ambiguous(ref, alt)
Returns True if the alleles are strand ambiguous.
Strand ambiguous allele pairs are A/T, T/A, C/G, and G/C where the first allele is ref and the second allele is alt.
Examples
>>> hl.eval(hl.is_strand_ambiguous('A', 'T'))
True
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.is_valid_contig(contig, reference_genome='default')
Returns True if contig is a valid contig name in reference_genome.
Examples
>>> hl.eval(hl.is_valid_contig('1', reference_genome='GRCh37'))
True
>>> hl.eval(hl.is_valid_contig('chr1', reference_genome='GRCh37'))
False
- Parameters:
contig (Expression of type tstr)
reference_genome (str or ReferenceGenome)
- Returns:
- hail.expr.functions.is_valid_locus(contig, position, reference_genome='default')
Returns True if contig and position form a valid site in reference_genome.
Examples
>>> hl.eval(hl.is_valid_locus('1', 324254, 'GRCh37'))
True
>>> hl.eval(hl.is_valid_locus('chr1', 324254, 'GRCh37'))
False
- Parameters:
contig (Expression of type tstr)
position (Expression of type tint)
reference_genome (str or ReferenceGenome)
- Returns:
- hail.expr.functions.contig_length(contig, reference_genome='default')
Returns the length of contig in reference_genome.
Examples
>>> hl.eval(hl.contig_length('5', reference_genome='GRCh37'))
180915260
- Parameters:
contig (Expression of type tstr)
reference_genome (str or ReferenceGenome)
- Returns:
- hail.expr.functions.allele_type(ref, alt)
Returns the type of the polymorphism as a string.
Examples
>>> hl.eval(hl.allele_type('A', 'T'))
'SNP'
>>> hl.eval(hl.allele_type('ATT', 'A'))
'Deletion'
Notes
The possible return values are:
"SNP"
"MNP"
"Insertion"
"Deletion"
"Complex"
"Star"
"Symbolic"
"Unknown"
- Parameters:
ref (StringExpression) – Reference allele.
alt (StringExpression) – Alternate allele.
- Returns:
- hail.expr.functions.numeric_allele_type(ref, alt)
Returns the type of the polymorphism as an integer. The value returned is the integer value of AlleleType representing that kind of polymorphism.
Examples
>>> hl.eval(hl.numeric_allele_type('A', 'T')) == AlleleType.SNP
True
Notes
The values of AlleleType are not stable and thus should not be relied upon across Hail versions.
- hail.expr.functions.pl_dosage(pl)
Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. Only defined for bi-allelic variants. The pl argument must be length 3.
For a PL array [a, b, c], let:
\[a^\prime = 10^{-a/10} \\ b^\prime = 10^{-b/10} \\ c^\prime = 10^{-c/10}\]
The genotype dosage is given by:
\[\frac{b^\prime + 2 c^\prime}{a^\prime + b^\prime + c^\prime}\]
Examples
>>> hl.eval(hl.pl_dosage([5, 10, 100]))
0.24025307377482674
- Parameters:
pl (ArrayNumericExpression of type tint32) – Length 3 array of bi-allelic Phred-scaled genotype likelihoods
- Returns:
Expression of type tfloat64
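
The formula above translates directly into plain Python. The helper pl_to_dosage is hypothetical and operates on an ordinary list rather than a Hail expression; it is a sketch of the arithmetic only.

```python
def pl_to_dosage(pl):
    """Expected alternate-allele dosage from a length-3 Phred-scaled PL array."""
    if len(pl) != 3:
        raise ValueError("pl must have exactly three entries (bi-allelic variant)")
    # Convert each Phred-scaled likelihood back to a linear-scale likelihood.
    a, b, c = (10 ** (-x / 10) for x in pl)
    # Weight het by 1 and hom-alt by 2, then normalize by the total likelihood.
    return (b + 2 * c) / (a + b + c)
```

Applied to [5, 10, 100], this gives approximately 0.2403, in line with the example above.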
- hail.expr.functions.gp_dosage(gp)
Return expected genotype dosage from array of genotype probabilities.
Examples
>>> hl.eval(hl.gp_dosage([0.0, 0.5, 0.5]))
1.5
Notes
This function is only defined for bi-allelic variants. The gp argument must be length 3. The value is gp[1] + 2 * gp[2].
- Parameters:
gp (Expression of type tarray of tfloat64) – Length 3 array of bi-allelic genotype probabilities
- Returns:
Expression of type tfloat64
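
In plain Python the calculation is a weighted sum. The sketch below also includes a hypothetical helper, pl_to_gp, that normalizes Phred-scaled likelihoods into probabilities under a uniform prior, to show how the PL-based and GP-based dosages relate. Neither helper is part of Hail.

```python
def gp_to_dosage(gp):
    """Expected alternate-allele dosage: gp[1] + 2 * gp[2]."""
    if len(gp) != 3:
        raise ValueError("gp must have exactly three entries (bi-allelic variant)")
    return gp[1] + 2 * gp[2]


def pl_to_gp(pl):
    """Normalize Phred-scaled likelihoods into probabilities (uniform prior)."""
    likelihoods = [10 ** (-x / 10) for x in pl]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]
```

Chaining the two, gp_to_dosage(pl_to_gp([5, 10, 100])), reproduces the PL-based dosage up to floating-point rounding.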
- hail.expr.functions.get_sequence(contig, position, before=0, after=0, reference_genome='default')
Return the reference sequence at a given locus.
Examples
Return the reference allele for 'GRCh37' at the locus '1:45323':
>>> hl.eval(hl.get_sequence('1', 45323, reference_genome='GRCh37'))
"T"
Notes
This function requires that the reference genome has an attached reference sequence. Use ReferenceGenome.add_sequence() to load and attach a reference sequence to a reference genome.
Returns None if contig and position are not valid coordinates in reference_genome.
- Parameters:
contig (Expression of type tstr) – Locus contig.
position (Expression of type tint32) – Locus position.
before (Expression of type tint32, optional) – Number of bases to include before the locus of interest. Truncates at contig boundary.
after (Expression of type tint32, optional) – Number of bases to include after the locus of interest. Truncates at contig boundary.
reference_genome (str or ReferenceGenome) – Reference genome to use. Must have a reference sequence available.
- Returns:
- hail.expr.functions.mendel_error_code(locus, is_female, father, mother, child)
Compute a Mendelian violation code for genotypes.
Examples
>>> father = hl.call(0, 0)
>>> mother = hl.call(1, 1)
>>> child1 = hl.call(0, 1)  # consistent
>>> child2 = hl.call(0, 0)  # Mendel error
>>> locus = hl.locus('2', 2000000)
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child1))
None
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child2))
7
Note
Ignores call phasing, and assumes diploid and biallelic. Haploid calls for hemiploid samples on sex chromosomes are also acceptable input.
Notes
In the table below, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region of X and Y defined by the reference genome and the autosome is defined by LocusExpression.in_autosome():
Auto – in autosome or in PAR, or in non-PAR of X and female child
HemiX – in non-PAR of X and male child
HemiY – in non-PAR of Y and male child
Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.
| Code | Dad | Mom | Kid | Copy State | Implicated |
| 1 | HomVar | HomVar | Het | Auto | Dad, Mom, Kid |
| 2 | HomRef | HomRef | Het | Auto | Dad, Mom, Kid |
| 3 | HomRef | ~HomRef | HomVar | Auto | Dad, Kid |
| 4 | ~HomRef | HomRef | HomVar | Auto | Mom, Kid |
| 5 | HomRef | HomRef | HomVar | Auto | Kid |
| 6 | HomVar | ~HomVar | HomRef | Auto | Dad, Kid |
| 7 | ~HomVar | HomVar | HomRef | Auto | Mom, Kid |
| 8 | HomVar | HomVar | HomRef | Auto | Kid |
| 9 | Any | HomVar | HomRef | HemiX | Mom, Kid |
| 10 | Any | HomRef | HomVar | HemiX | Mom, Kid |
| 11 | HomVar | Any | HomRef | HemiY | Dad, Kid |
| 12 | HomRef | Any | HomVar | HemiY | Dad, Kid |
- Parameters:
locus (LocusExpression)
is_female (BooleanExpression)
father (CallExpression)
mother (CallExpression)
child (CallExpression)
- Returns:
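
The autosomal rows of the table (codes 1 to 8) can be expressed as a small plain-Python lookup. This is a sketch of the table only, not Hail's implementation: the helper names are hypothetical, calls are given as pairs of allele indices in {0, 1}, and NoCall inputs and the HemiX and HemiY copy states are left out.

```python
HOM_REF, HET, HOM_VAR = "HomRef", "Het", "HomVar"


def classify(call):
    """Classify a called diploid, biallelic genotype given as (allele, allele)."""
    if len(call) != 2 or any(allele not in (0, 1) for allele in call):
        raise ValueError("expected a diploid call with allele indices 0 or 1")
    return (HOM_REF, HET, HOM_VAR)[sum(call)]


def autosomal_mendel_code(father, mother, child):
    """Return codes 1-8 from the table for the Auto copy state, or None."""
    dad, mom, kid = classify(father), classify(mother), classify(child)
    if kid == HET:
        if dad == HOM_VAR and mom == HOM_VAR:
            return 1
        if dad == HOM_REF and mom == HOM_REF:
            return 2
    elif kid == HOM_VAR:
        if dad == HOM_REF and mom == HOM_REF:
            return 5
        if dad == HOM_REF:  # mom is ~HomRef here
            return 3
        if mom == HOM_REF:  # dad is ~HomRef here
            return 4
    elif kid == HOM_REF:
        if dad == HOM_VAR and mom == HOM_VAR:
            return 8
        if dad == HOM_VAR:  # mom is ~HomVar here
            return 6
        if mom == HOM_VAR:  # dad is ~HomVar here
            return 7
    return None  # consistent with Mendelian inheritance
```

The both-parents cases (codes 5 and 8) are tested before the single-parent cases so that the ~ rows only match when the other parent really is in the complement set.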
- hail.expr.functions.liftover(x, dest_reference_genome, min_match=0.95, include_strand=False)
Lift over coordinates to a different reference genome.
Examples
Lift over the locus coordinates from reference genome 'GRCh37' to 'GRCh38':
>>> hl.eval(hl.liftover(hl.locus('1', 1034245, 'GRCh37'), 'GRCh38'))
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')
Lift over the locus interval coordinates from reference genome 'GRCh37' to 'GRCh38':
>>> hl.eval(hl.liftover(hl.locus_interval('20', 60001, 82456, True, True, 'GRCh37'), 'GRCh38'))
Interval(Locus(contig='chr20', position=79360, reference_genome='GRCh38'), Locus(contig='chr20', position=101815, reference_genome='GRCh38'), True, True)
See "Liftover variants from one coordinate system to another" for more instructions on lifting over a Table or MatrixTable.
Notes
This function requires that the reference genome of x has a chain file loaded for dest_reference_genome. Use ReferenceGenome.add_liftover() to load and attach a chain file to a reference genome.
Returns None if x could not be converted.
Warning
Before using the result of liftover() as a new row key or column key, be sure to filter out missing values.
- Parameters:
x (Expression of type tlocus or tinterval of tlocus) – Locus or locus interval to lift over.
dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.
min_match (float) – Minimum ratio of bases that must remap.
include_strand (bool) – If True, output the result as a StructExpression whose first field, result, is the locus or locus interval and whose second field, is_negative_strand, is a boolean indicating whether the locus or locus interval has been mapped to the negative strand of the destination reference genome. Otherwise, output the converted locus or locus interval.
- Returns:
Expression – A locus or locus interval converted to dest_reference_genome.
- hail.expr.functions.min_rep(locus, alleles)
Computes the minimal representation of a (locus, alleles) polymorphism.
Examples
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['TAA', 'TA']))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['TA', 'T'])
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['AATAA', 'AACAA']))
Struct(locus=Locus(contig=1, position=100002, reference_genome=GRCh37), alleles=['T', 'C'])
Notes
Computing the minimal representation can cause the locus to shift right (the position can increase).
- Parameters:
locus (LocusExpression)
alleles (ArrayExpression of type tstr)
- Returns:
StructExpression – A tstruct expression with two fields, locus (LocusExpression) and alleles (ArrayExpression of type tstr).
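
One common way to compute a minimal representation is to trim bases shared by every allele, first from the right and then from the left, advancing the position for each base removed on the left. The plain-Python sketch below follows that approach and reproduces the two examples above; it is not necessarily how Hail implements min_rep, and the helper trim_alleles is hypothetical.

```python
def trim_alleles(position, alleles):
    """Trim shared suffix, then shared prefix, keeping at least one base per allele."""
    alleles = list(alleles)
    if len(alleles) < 2:
        raise ValueError("expected a reference allele and at least one alternate")
    # Drop trailing bases that all alleles share; the position is unaffected.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Drop leading bases that all alleles share; each one shifts the locus right.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        position += 1
    return position, alleles
```

Trimming the suffix first is what keeps the deletion 'TAA' to 'TA' anchored at its original position instead of shifting it right.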
- hail.expr.functions.reverse_complement(s, rna=False)
Reverses the string and translates base pairs into their complements.
Examples
>>> bases = hl.literal('NNGATTACA')
>>> hl.eval(hl.reverse_complement(bases))
'TGTAATCNN'
- Parameters:
s (StringExpression) – Base string.
rna (bool) – If True, pair adenine (A) with uracil (U) instead of thymine (T).
- Returns:
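
The same transformation on an ordinary Python string can be sketched with str.translate. The helper revcomp is hypothetical; it handles uppercase bases only and leaves any other character, such as N, unchanged, which matches the example above.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def revcomp(s, rna=False):
    """Reverse the string, then swap each base for its complement."""
    table = RNA_COMPLEMENT if rna else DNA_COMPLEMENT
    return s[::-1].translate(table)
```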