Introduction to the Expression Language
This notebook starts with the basics of the Hail expression language, and builds up practical experience with the type system, syntax, and functionality. By the end of this notebook, we hope that you will be comfortable enough to start using the expression language to slice, dice, filter, and query genetic data. Those tasks are covered in the next notebook!
The best part about a Jupyter Notebook is that you don’t just have to run what we’ve written - you can and should change the code and see what happens!
Setup
Every Hail practical notebook starts the same: import the necessary modules, and construct a HailContext. This is the entry point for Hail functionality. This object also wraps a SparkContext, which can be accessed with hc.sc.
As always, visit the documentation on the Hail website for full reference.
In [1]:
from hail import *
hc = HailContext()
Running on Apache Spark version 2.0.2
SparkUI available at http://10.56.135.40:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-5a67787
Hail Expression Language
The Hail expression language is used everywhere in Hail: filtering conditions, describing covariates and phenotypes, storing summary statistics about variants and samples, generating synthetic data, plotting, exporting, and more. The Hail expression language takes the form of Python strings passed into various Hail methods like filter_variants_expr and linear regression.
The expression language is a programming language just like Python or R or Scala. While the syntax is different, programming experience will certainly translate. We have built the expression language with the hope that even people new to programming can use it to explore genetic data, even if this means copying motifs and expressions found in places like the Hail discussion forum.
For learning purposes, HailContext contains the method eval_expr_typed. This method takes a Python string of Hail expr code, evaluates it, and returns a tuple with the result and the type. We’ll be using this method throughout the expression language tutorial.
Hail Types
The Hail expression language is strongly typed, meaning that every expression has an associated type.
Hail defines the following types:
Primitives:
- Int
- Double
- Float
- Long
- Boolean
- String
Compound Types:
- Array[T]
- Set[T]
- Dict[K, V]
- Aggregable[T]
- Struct
Genetic Types:
- Variant
- Locus
- AltAllele
- Interval
- Genotype
- Call
Primitive Types
Let’s start with simple primitive types. Primitive types are a basic building block for any programming language - these are things like numbers and strings and boolean values.
Hail expressions are passed as Python strings to Hail methods.
In [2]:
# the Boolean literals are 'true' and 'false'
hc.eval_expr_typed('true')
Out[2]:
(True, Boolean)
The return value is True, not true. Why? When values are returned by Hail methods, they are returned as the corresponding Python value.
In [3]:
hc.eval_expr_typed('123')
Out[3]:
(123, Int)
In [4]:
hc.eval_expr_typed('123.45')
Out[4]:
(123.45, Double)
String literals are denoted with double-quotes. The ‘u’ preceding the printed result denotes a unicode string, and is safe to ignore.
In [5]:
hc.eval_expr_typed('"Hello, world"')
Out[5]:
(u'Hello, world', String)
Primitive types support all the usual operations you’d expect. For details, refer to the documentation on operators and types. Here are some examples.
In [6]:
hc.eval_expr_typed('3 + 8')
Out[6]:
(11, Int)
In [7]:
hc.eval_expr_typed('3.2 * 0.5')
Out[7]:
(1.6, Double)
In [8]:
hc.eval_expr_typed('3 ** 3')
Out[8]:
(27.0, Double)
In [9]:
hc.eval_expr_typed('25 ** 0.5')
Out[9]:
(5.0, Double)
In [10]:
hc.eval_expr_typed('true || false')
Out[10]:
(True, Boolean)
In [11]:
hc.eval_expr_typed('true && false')
Out[11]:
(False, Boolean)
Missingness
As in R, any value in Hail can be missing. Most operations, like addition, return missing if any of their inputs is missing. There are a few special operations for manipulating missing values. There is also a missing literal, but you have to specify its type. Missing Hail values are converted to None in Python.
In [12]:
hc.eval_expr_typed('NA: Int') # missing Int
Out[12]:
(None, Int)
In [13]:
hc.eval_expr_typed('NA: Dict[String, Int]')
Out[13]:
(None, Dict[String,Int])
In [14]:
hc.eval_expr_typed('1 + NA: Int')
Out[14]:
(None, Int)
You can test missingness with isDefined and isMissing.
In [15]:
hc.eval_expr_typed('isDefined(1)')
Out[15]:
(True, Boolean)
In [16]:
hc.eval_expr_typed('isDefined(NA: Int)')
Out[16]:
(False, Boolean)
In [17]:
hc.eval_expr_typed('isMissing(NA: Double)')
Out[17]:
(True, Boolean)
orElse lets you convert missing to a default value and orMissing lets you turn a value into missing based on a condition.
In [18]:
hc.eval_expr_typed('orElse(5, 2)')
Out[18]:
(5, Int)
In [19]:
hc.eval_expr_typed('orElse(NA: Int, 2)')
Out[19]:
(2, Int)
In [20]:
hc.eval_expr_typed('orMissing(true, 5)')
Out[20]:
(5, Int)
In [21]:
hc.eval_expr_typed('orMissing(false, 5)')
Out[21]:
(None, Int)
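The rules above can be mimicked in plain Python, with None standing in for a missing value. The helper names add, or_else and or_missing below are made up for illustration and are not Hail functions; treat this as a rough sketch of the semantics, not a description of how Hail implements them.

```python
# Plain-Python sketch of the missingness rules, with None playing the role of NA.
# The helper names are hypothetical; they are not part of Hail.

def add(x, y):
    """Addition that propagates missingness: missing in, missing out."""
    if x is None or y is None:
        return None
    return x + y


def or_else(value, default):
    """Analogue of orElse: fall back to a default when the value is missing."""
    return default if value is None else value


def or_missing(condition, value):
    """Analogue of orMissing: keep the value only when the condition is true."""
    return value if condition else None


print(add(1, None))          # None, like '1 + NA: Int'
print(or_else(5, 2))         # 5
print(or_else(None, 2))      # 2
print(or_missing(True, 5))   # 5
print(or_missing(False, 5))  # None
```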
Let
You can assign a value to a variable with a let expression. Here is an example.
In [22]:
hc.eval_expr_typed('let a = 5 in a + 1')
Out[22]:
(6, Int)
The variable, here a, is only visible in the body of the let, which is the expression following 'in'. You can assign multiple variables. Variable assignments are separated by 'and'. Each variable is visible in the right hand side of the following variables as well as the body of the let. For example:
In [23]:
hc.eval_expr_typed('''
let a = 5
and b = a + 1
in a * b
''')
Out[23]:
(30, Int)
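One way to read a multi-variable let is as nested function applications: each binding is a parameter whose value is visible to everything nested inside it. The Python below is only an analogy (Hail does not necessarily evaluate let this way), but it reproduces the scoping rule and the result above. The function name let_example is invented for this sketch.

```python
# Analogy only: 'let a = 5 and b = a + 1 in a * b' written as nested lambdas.
# The outer lambda binds a; the inner lambda binds b and can already see a.
result = (lambda a: (lambda b: a * b)(a + 1))(5)
print(result)  # 30


# The same thing with ordinary assignments inside a function, so that
# a and b stay local to the body, much as let variables stay local to the let.
def let_example():
    a = 5
    b = a + 1  # the right-hand side of b may refer to a
    return a * b


print(let_example())  # 30
```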
Conditionals
Unlike in many other languages, conditionals in Hail are expressions that return a value. The arms of the conditional must have the same type. The predicate must be of type Boolean. If the predicate is missing, the value of the entire conditional is missing. Here are some simple examples.
In [24]:
hc.eval_expr_typed('if (true) 1 else 2')
Out[24]:
(1, Int)
In [25]:
hc.eval_expr_typed('if (false) 1 else 2')
Out[25]:
(2, Int)
In [26]:
hc.eval_expr_typed('if (NA: Boolean) 1 else 2')
Out[26]:
(None, Int)
The if and else branches need to return the same type. The below expression is invalid.
In [27]:
# Uncomment and run the below code to see the error message
# hc.eval_expr_typed('if (true) 1 else "two"')
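The behaviour of a missing predicate can be sketched in plain Python, again using None for missing. The function name hail_if is hypothetical. Note one difference: the type check here runs when the function is called, whereas the text above describes a type mismatch as making the Hail expression itself invalid.

```python
def hail_if(predicate, if_true, if_false):
    """Illustrative stand-in for a Hail conditional; not a Hail function."""
    # Both arms must share a type (checked here at call time, for illustration).
    if type(if_true) is not type(if_false):
        raise TypeError('both branches must have the same type')
    # A missing predicate makes the whole conditional missing.
    if predicate is None:
        return None
    return if_true if predicate else if_false


print(hail_if(True, 1, 2))   # 1
print(hail_if(False, 1, 2))  # 2
print(hail_if(None, 1, 2))   # None

try:
    hail_if(True, 1, "two")
except TypeError as err:
    print('rejected: %s' % err)
```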
Compound Types
Hail has several compound types:
- Array[T]
- Set[T]
- Dict[K, V]
- Aggregable[T]
- Struct
T, K and V here mean any type, including other compound types. Hail’s Array[T] objects are similar to Python’s lists, except they must be homogeneous: that is, each element must be of the same type. Arrays are 0-indexed. Here are some examples of simple array expressions.
Array literals are constructed with square brackets.
In [28]:
hc.eval_expr_typed('[1, 2, 3, 4, 5]')
Out[28]:
([1, 2, 3, 4, 5], Array[Int])
Arrays are indexed with square brackets and support Python’s slice syntax.
In [29]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[0]')
Out[29]:
(1, Int)
In [30]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:3]')
Out[30]:
([2, 3], Array[Int])
In [31]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:]')
Out[31]:
([2, 3, 4, 5], Array[Int])
In [32]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a.length()')
Out[32]:
(5, Int)
Arrays can be transformed with the functional operators filter and map. These operations return a new array and never modify the original.
In [33]:
# keep the elements that are less than 10
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10)')
Out[33]:
([1, 2, 7], Array[Int])
In [34]:
# square the elements of an array
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x * x)')
Out[34]:
([1, 4, 484, 49, 100, 121], Array[Int])
In [35]:
# combine the two: keep elements less than 10 and then square them
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10).map(x => x * x)')
Out[35]:
([1, 4, 49], Array[Int])
In the above filter / map expressions, you can see a strange syntax:
x => x < 10
This syntax is a lambda function. The functions filter and map take functions as arguments! A Hail lambda function takes the form:
binding => expression
Naming the binding ‘x’ in every example above is purely a matter of preference. We can name the bindings anything we want.
In [36]:
# use 'foo' and 'bar' as bindings
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(foo => foo < 10).map(bar => bar * bar)')
Out[36]:
([1, 4, 49], Array[Int])
The full list of methods on arrays can be found here.
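For readers coming from Python, the same filter-then-map pipeline looks like this with Python's built-in filter and map, or with a list comprehension. This is a comparison written in plain Python, not Hail code; Python writes a lambda as lambda x: ..., where Hail writes x => ....

```python
a = [1, 2, 22, 7, 10, 11]

# filter and map take a function as their first argument
small = list(filter(lambda x: x < 10, a))
squares = list(map(lambda x: x * x, small))
print(squares)  # [1, 4, 49]

# the binding name is arbitrary, and a list comprehension does both steps at once
same = [foo * foo for foo in a if foo < 10]
print(same)     # [1, 4, 49]

# the original list is untouched
print(a)        # [1, 2, 22, 7, 10, 11]
```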
Numeric Arrays
Numeric arrays, like Array[Int] and Array[Double], have additional operations like max, mean, median, and sort. For a full list, see, for example, Array[Int]. Here are a few examples.
In [37]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].sum()')
Out[37]:
(53, Int)
In [38]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].max()')
Out[38]:
(22, Int)
In [39]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].mean()')
Out[39]:
(8.833333333333334, Double)
In [40]:
# take the square root of each element
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x ** 0.5)')
Out[40]:
([1.0,
1.4142135623730951,
4.69041575982343,
2.6457513110645907,
3.1622776601683795,
3.3166247903554],
Array[Double])
Exercise
Write an expression that calculates the sum of the squared residuals (x - mean) of an array.
In [41]:
# Uncomment the below code by deleting the triple-quotes and write an expression to calculate the sum of the squared residuals.
"""
result, t = hc.eval_expr_typed('''
let a = [1, -2, 11, 3, -2]
and mean = <FILL IN>
in a.map(x => <FILL IN> ).sum()
''')
"""
try:
    print('Your result: %s (%s)' % (result, t))
    print('Expected answer: 114.8 (Double)')
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###
What if a contains a missing value NA: Int? Will your code still work?
Structs
Structs are a collection of named values known as fields. Hail does not have tuples like Python. Unlike arrays, the values can be heterogeneous. Unlike Dicts, the set of names is part of the type and must be known statically. Structs are constructed with a syntax similar to Python’s dict syntax. Struct fields are accessed using the '.' syntax.
In [42]:
print(hc.eval_expr_typed('{gene: "ACBD", function: "LOF", nHet: 12}'))
(Struct{u'function': u'LOF', u'nHet': 12, u'gene': u'ACBD'}, Struct{gene:String,function:String,nHet:Int})
In [43]:
hc.eval_expr_typed('let s = {gene: "ACBD", function: "LOF", nHet: 12} in s.gene')
Out[43]:
(u'ACBD', String)
In [44]:
hc.eval_expr_typed('let s = NA: Struct { gene: String, function: String, nHet: Int} in s.gene')
Out[44]:
(None, String)
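A rough Python analogue of a Struct is a namedtuple: the field names are fixed when the type is defined, the values may have different types, and fields are read with '.'. The type name Annotation and the helper get_field are invented for this sketch, and None stands in for a missing struct.

```python
from collections import namedtuple

# The field names are part of the type, fixed up front.
Annotation = namedtuple('Annotation', ['gene', 'function', 'nHet'])

s = Annotation(gene='ACBD', function='LOF', nHet=12)
print(s.gene)  # ACBD
print(s.nHet)  # 12


def get_field(struct, name):
    """Field access that yields missing (None) when the struct itself is missing."""
    if struct is None:
        return None
    return getattr(struct, name)


print(get_field(None, 'gene'))  # None
```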
Genetic Types
Hail contains several genetic types:
- Variant
- Locus
- AltAllele
- Interval
- Genotype
- Call
These are designed to make it easy to manipulate genetic data. There are many built-in functions for asking common questions about these data types, like whether an alternate allele is a SNP, or the fraction of reads for a called genotype that belong to the reference allele.
Demo variables
To explore these types and constructs, we have defined five representative variables which you can access in eval_expr:
In [45]:
# 'v' is used to indicate 'Variant' in Hail
hc.eval_expr_typed('v')
Out[45]:
(Variant(contig=16, start=19200405, ref=C, alts=[AltAllele(ref=C, alt=G), AltAllele(ref=C, alt=CCC)]),
Variant)
In [46]:
# 's' is used to refer to sample ID in Hail
hc.eval_expr_typed('s')
Out[46]:
(u'NA12878', String)
In [47]:
# 'g' is used to refer to the genotype in Hail
hc.eval_expr_typed('g')
Out[47]:
(Genotype(GT=1, AD=[14, 0, 12], DP=26, GQ=60, PL=[60, 65, 126, 0, 67, 65]),
Genotype)
In [48]:
# 'sa' is used to refer to sample annotations
hc.eval_expr_typed('sa')
Out[48]:
(Struct{u'cohort': u'1KG', u'covariates': Struct{u'PC2': -0.61512, u'PC3': 0.3166666, u'age': 34, u'PC1': 0.102312, u'isFemale': True}},
Struct{cohort:String,covariates:Struct{PC1:Double,PC2:Double,PC3:Double,age:Int,isFemale:Boolean}})
The above output is a bit wordy. Let’s try 'va':
In [49]:
# 'va' is used to refer to variant annotations
hc.eval_expr_typed('va')
Out[49]:
(Struct{u'info': Struct{u'AC': [40, 1], u'AF': [0.00784, 0.000196], u'AN': 5102}, u'transcripts': [Struct{u'consequence': u'SYN', u'isoform': u'GENE1.1', u'gene': u'GENE1', u'canonical': False}, Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True}, Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False}, Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False}, Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False}, Struct{u'consequence': u'SYN', u'isoform': u'GENE3.1', u'gene': u'GENE3', u'canonical': False}, Struct{u'consequence': u'SYN', u'isoform': u'GENE3.2', u'gene': u'GENE3', u'canonical': False}]},
Struct{info:Struct{AC:Array[Int],AN:Int,AF:Array[Double]},transcripts:Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}]})
This is totally illegible. pprint can solve our problems!
pprint is a Python standard library module that tries to print objects legibly. Let’s try it out here:
In [50]:
from pprint import pprint
In [51]:
# 'va' is used to refer to variant annotations
pprint(hc.eval_expr_typed('va'))
({u'info': {u'AC': [40, 1], u'AF': [0.00784, 0.000196], u'AN': 5102},
u'transcripts': [{u'canonical': False,
u'consequence': u'SYN',
u'gene': u'GENE1',
u'isoform': u'GENE1.1'},
{u'canonical': True,
u'consequence': u'LOF',
u'gene': u'GENE1',
u'isoform': u'GENE1.2'},
{u'canonical': False,
u'consequence': u'MIS',
u'gene': u'GENE2',
u'isoform': u'GENE2.1'},
{u'canonical': False,
u'consequence': u'MIS',
u'gene': u'GENE2',
u'isoform': u'GENE2.2'},
{u'canonical': False,
u'consequence': u'MIS',
u'gene': u'GENE2',
u'isoform': u'GENE2.3'},
{u'canonical': False,
u'consequence': u'SYN',
u'gene': u'GENE3',
u'isoform': u'GENE3.1'},
{u'canonical': False,
u'consequence': u'SYN',
u'gene': u'GENE3',
u'isoform': u'GENE3.2'}]},
Struct{
info: Struct{
AC: Array[Int],
AN: Int,
AF: Array[Double]
},
transcripts: Array[Struct{
gene: String,
isoform: String,
canonical: Boolean,
consequence: String
}]
})
You’ll rarely need to construct a Variant or Genotype object inside the Hail expression language. More commonly, these objects will be provided to you as variables. In the remainder of this notebook, we will explore how to manipulate the demo variables. In the next notebook, we start using the expression language to annotate and filter a dataset.
First, a short demonstration of some of the methods accessible on Variant and Genotype objects:
In [52]:
hc.eval_expr_typed('v')
Out[52]:
(Variant(contig=16, start=19200405, ref=C, alts=[AltAllele(ref=C, alt=G), AltAllele(ref=C, alt=CCC)]),
Variant)
In [53]:
hc.eval_expr_typed('v.contig')
Out[53]:
(u'16', String)
In [54]:
hc.eval_expr_typed('v.start')
Out[54]:
(19200405, Int)
In [55]:
hc.eval_expr_typed('v.ref')
Out[55]:
(u'C', String)
In [56]:
hc.eval_expr_typed('v.altAlleles')
Out[56]:
([AltAllele(ref=C, alt=G), AltAllele(ref=C, alt=CCC)], Array[AltAllele])
In [57]:
hc.eval_expr_typed('v.altAlleles.map(aa => aa.isSNP())')
Out[57]:
([True, False], Array[Boolean])
In [58]:
hc.eval_expr_typed('v.altAlleles.map(aa => aa.isInsertion())')
Out[58]:
([False, True], Array[Boolean])
In [59]:
hc.eval_expr_typed('g')
Out[59]:
(Genotype(GT=1, AD=[14, 0, 12], DP=26, GQ=60, PL=[60, 65, 126, 0, 67, 65]),
Genotype)
In [60]:
hc.eval_expr_typed('g.dp')
Out[60]:
(26, Int)
In [61]:
hc.eval_expr_typed('g.ad')
Out[61]:
([14, 0, 12], Array[Int])
In [62]:
hc.eval_expr_typed('g.fractionReadsRef()')
Out[62]:
(0.5384615384615384, Double)
In [63]:
hc.eval_expr_typed('g.isHet()')
Out[63]:
(True, Boolean)
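The last two results can be checked by hand from the fields of g. The sketch below is plain Python and rests on two assumptions that are not stated in this notebook: that GT is an index into the standard VCF ordering of unphased diploid genotypes (the allele pair j <= k has index k*(k+1)/2 + j), and that fractionReadsRef is the reference read depth AD[0] divided by the sum of AD. Both assumptions are consistent with the output shown above. The function name allele_pair is invented here.

```python
def allele_pair(gt_index):
    """Decode a diploid genotype index into allele indices (j, k) with j <= k."""
    k = 0
    # advance k until gt_index falls inside the block of pairs ending in k
    while (k + 1) * (k + 2) // 2 <= gt_index:
        k += 1
    j = gt_index - k * (k + 1) // 2
    return (j, k)


gt = 1            # GT from the demo genotype
ad = [14, 0, 12]  # AD from the demo genotype

j, k = allele_pair(gt)
print((j, k))   # (0, 1): one reference allele, one copy of the first alternate

is_het = j != k
print(is_het)   # True

fraction_ref = ad[0] / float(sum(ad))
print(fraction_ref)  # 14 / 26, roughly 0.538
```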
Wrangling complex nested types
Structs and Arrays allow arbitrarily deep grouping and nesting of values.
Remember the type of sa:
In [64]:
pprint(hc.eval_expr_typed('sa')[1])
Struct{
cohort: String,
covariates: Struct{
PC1: Double,
PC2: Double,
PC3: Double,
age: Int,
isFemale: Boolean
}
}
Select elements of a Struct with a '.'. If we want to select PC1 from the above type, we first index into the top-level struct with covariates, then select the field with PC1:
In [65]:
hc.eval_expr_typed('sa.covariates.PC1')
Out[65]:
(0.102312, Double)
We can construct an array from the struct elements:
In [66]:
hc.eval_expr_typed('[sa.covariates.PC1, sa.covariates.PC2, sa.covariates.PC3]')
Out[66]:
([0.102312, -0.61512, 0.3166666], Array[Double])
Now we’ll use va. Here’s the type of va:
In [67]:
pprint(hc.eval_expr_typed('va')[1])
Struct{
info: Struct{
AC: Array[Int],
AN: Int,
AF: Array[Double]
},
transcripts: Array[Struct{
gene: String,
isoform: String,
canonical: Boolean,
consequence: String
}]
}
This schema is somewhat representative of typical variant annotations: AC, AN, and AF are typically included in the INFO field of a VCF.
In [68]:
hc.eval_expr_typed('va.info.AF')
Out[68]:
([0.00784, 0.000196], Array[Double])
In [69]:
hc.eval_expr_typed('va.info.AF[1]')
Out[69]:
(0.000196, Double)
AC and AF mean “allele count” and “allele frequency” and are “A-indexed”, which means that there is one element per alternate allele. Perhaps we want to construct an array which contains each alternate allele and its count and frequency.
In [70]:
pprint(hc.eval_expr_typed('''range(v.altAlleles.length()).map(i =>
{allele: v.altAlleles[i],
count: va.info.AC[i],
frequency: va.info.AF[i]})'''))
([{u'allele': AltAllele(ref=C, alt=G), u'count': 40, u'frequency': 0.00784},
{u'allele': AltAllele(ref=C, alt=CCC), u'count': 1, u'frequency': 0.000196}],
Array[Struct{
allele: AltAllele,
count: Int,
frequency: Double
}])
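The expression above walks the indices of the alternate alleles and looks up the matching entry in each A-indexed array. In plain Python the same idea can be written with an index loop or with zip. The strings 'G' and 'CCC' below are simplified stand-ins for the AltAllele objects, and the variable names are chosen for this sketch.

```python
alt_alleles = ['G', 'CCC']   # stand-ins for the two AltAllele values
ac = [40, 1]                 # va.info.AC, one entry per alternate allele
af = [0.00784, 0.000196]     # va.info.AF, one entry per alternate allele

# index-based, mirroring range(...).map(i => ...)
by_index = [
    {'allele': alt_alleles[i], 'count': ac[i], 'frequency': af[i]}
    for i in range(len(alt_alleles))
]

# zip pairs up the i-th elements of each list directly
by_zip = [
    {'allele': a, 'count': c, 'frequency': f}
    for a, c, f in zip(alt_alleles, ac, af)
]

print(by_index == by_zip)  # True
print(by_zip[1]['count'])  # 1
```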
Now, let’s manipulate the va.transcripts array. Here’s what it looks like:
In [71]:
hc.eval_expr_typed('va.transcripts')
Out[71]:
([Struct{u'consequence': u'SYN', u'isoform': u'GENE1.1', u'gene': u'GENE1', u'canonical': False},
Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'SYN', u'isoform': u'GENE3.1', u'gene': u'GENE3', u'canonical': False},
Struct{u'consequence': u'SYN', u'isoform': u'GENE3.2', u'gene': u'GENE3', u'canonical': False}],
Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}])
We’ll start by pulling out just the gene field. Our result will be an Array[String]. We need to do this with the map function, to map each struct element of the array to its field gene.
In [72]:
hc.eval_expr_typed('va.transcripts.map(t => t.gene)')
Out[72]:
([u'GENE1', u'GENE1', u'GENE2', u'GENE2', u'GENE2', u'GENE3', u'GENE3'],
Array[String])
Perhaps we just want the set of unique genes:
In [73]:
hc.eval_expr_typed('va.transcripts.map(t => t.gene).toSet()')
Out[73]:
({u'GENE1', u'GENE2', u'GENE3'}, Set[String])
We can find the canonical transcript with find, which returns the first element where the predicate is true:
In [74]:
hc.eval_expr_typed('va.transcripts.find(t => t.canonical)')
Out[74]:
(Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
Struct{gene:String,isoform:String,canonical:Boolean,consequence:String})
However, find returns None if there isn’t an element where the predicate is true:
In [75]:
hc.eval_expr_typed('va.transcripts.find(t => t.gene == "GENE5")')
Out[75]:
(None, Struct{gene:String,isoform:String,canonical:Boolean,consequence:String})
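For comparison, here is how map, toSet and find might be written in plain Python over the same records, held as a list of dicts. This is a Python analogue rather than Hail code; the data is copied from the va.transcripts output above, and the variable names are chosen for this sketch.

```python
# (gene, isoform, canonical, consequence), copied from the output above
rows = [
    ('GENE1', 'GENE1.1', False, 'SYN'),
    ('GENE1', 'GENE1.2', True, 'LOF'),
    ('GENE2', 'GENE2.1', False, 'MIS'),
    ('GENE2', 'GENE2.2', False, 'MIS'),
    ('GENE2', 'GENE2.3', False, 'MIS'),
    ('GENE3', 'GENE3.1', False, 'SYN'),
    ('GENE3', 'GENE3.2', False, 'SYN'),
]
fields = ('gene', 'isoform', 'canonical', 'consequence')
transcripts = [dict(zip(fields, row)) for row in rows]

# map: one gene name per transcript
genes = [t['gene'] for t in transcripts]

# toSet: drop the duplicates
unique_genes = set(genes)
print(sorted(unique_genes))  # ['GENE1', 'GENE2', 'GENE3']

# find: the first element matching the predicate, or None if nothing matches
canonical = next((t for t in transcripts if t['canonical']), None)
gene5 = next((t for t in transcripts if t['gene'] == 'GENE5'), None)
print(canonical['isoform'])  # GENE1.2
print(gene5)                 # None
```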
Now, we’ll pull out all transcripts marked “MIS” (missense):
In [76]:
hc.eval_expr_typed('va.transcripts.filter(t => t.consequence == "MIS")')
Out[76]:
([Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False}],
Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}])
Here’s a bit of a complex motif - we can sort the transcripts by an arbitrary function. Here we’ll sort so that "LOF" comes before "MIS", and "MIS" comes before "SYN".
In [77]:
hc.eval_expr_typed('''va.transcripts.sortBy(t =>
if (t.consequence == "LOF") 1
else if (t.consequence == "MIS") 2
else 3)''')
Out[77]:
([Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False},
Struct{u'consequence': u'SYN', u'isoform': u'GENE1.1', u'gene': u'GENE1', u'canonical': False},
Struct{u'consequence': u'SYN', u'isoform': u'GENE3.1', u'gene': u'GENE3', u'canonical': False},
Struct{u'consequence': u'SYN', u'isoform': u'GENE3.2', u'gene': u'GENE3', u'canonical': False}],
Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}])
If we are interested in pulling out the worst-consequence transcript, we can use this sorting motif and then take the first element:
In [78]:
hc.eval_expr_typed('''va.transcripts.sortBy(t =>
if (t.consequence == "LOF") 1
else if (t.consequence == "MIS") 2
else 3)[0]''')
Out[78]:
(Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
Struct{gene:String,isoform:String,canonical:Boolean,consequence:String})
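The sort-by-key motif has a direct counterpart in Python's sorted with a key function. The sketch below is plain Python on a trimmed copy of the transcript records; the function name severity is invented. Python's sorted is stable, so ties keep their input order, which matches the output above. Whether Hail's sortBy guarantees this for ties is not stated in this notebook.

```python
transcripts = [
    {'isoform': 'GENE1.1', 'consequence': 'SYN'},
    {'isoform': 'GENE1.2', 'consequence': 'LOF'},
    {'isoform': 'GENE2.1', 'consequence': 'MIS'},
    {'isoform': 'GENE2.2', 'consequence': 'MIS'},
    {'isoform': 'GENE2.3', 'consequence': 'MIS'},
    {'isoform': 'GENE3.1', 'consequence': 'SYN'},
    {'isoform': 'GENE3.2', 'consequence': 'SYN'},
]


def severity(t):
    """Lower number means worse consequence, as in the Hail expression."""
    if t['consequence'] == 'LOF':
        return 1
    elif t['consequence'] == 'MIS':
        return 2
    else:
        return 3


ordered = sorted(transcripts, key=severity)
print([t['isoform'] for t in ordered])
# ['GENE1.2', 'GENE2.1', 'GENE2.2', 'GENE2.3', 'GENE1.1', 'GENE3.1', 'GENE3.2']

# Taking the first element of the sorted list, or using min with the same key,
# gives the worst-consequence transcript.
worst = min(transcripts, key=severity)
print(worst['isoform'])  # GENE1.2
```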
Learn more!
Exercises
Uncomment the code blocks, fill them in, and run each block to check your answers.
In [79]:
def check(answer, answer_key):
    print('Your answer / type:')
    pprint(answer)
    print('')
    if (answer == answer_key):
        print('Correct!')
    else:
        print('Incorrect. Expected:')
        pprint(answer_key)
Exercise 1: using filter and map to pull out the gene isoform for synonymous transcripts
In [80]:
"""
result_1 = hc.eval_expr_typed(
'''
va.transcripts.filter(t => <FILL IN>)
.map(t => <FILL IN>)
''')
"""
# check the answer
try:
    answer_key = [u'GENE1.1', u'GENE3.1', u'GENE3.2'], TArray(TString())
    check(result_1, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###
Exercise 2: using groupBy and mapValues to produce a mapping from gene to all observed consequences
Remember: <array>.toSet() converts an array to a Set, the desired type of the dictionary value.
Hint: Once you’ve grouped by gene, you can fill in the mapValues step with ts => ts to see the type of ts. It’s an Array[Struct{...}]. How do we pull just one field out?
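As a nudge that doesn't give the answer away, here is the general shape of a group-then-transform step in plain Python on unrelated toy data. The helpers group_by and map_values are written for this sketch and only approximate what the Hail methods of similar names do.

```python
from collections import defaultdict


def group_by(items, key):
    """Collect items into a dict of lists, keyed by key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


def map_values(mapping, f):
    """Apply f to every value of a dict, keeping the keys."""
    return {k: f(v) for k, v in mapping.items()}


fruit = [
    {'name': 'apple', 'colour': 'red'},
    {'name': 'cherry', 'colour': 'red'},
    {'name': 'banana', 'colour': 'yellow'},
    {'name': 'apple', 'colour': 'green'},
]

# each value is a list of whole records...
by_colour = group_by(fruit, lambda f: f['colour'])
print(sorted(by_colour))  # ['green', 'red', 'yellow']

# ...which a second step can reduce to a set of one field
names_by_colour = map_values(by_colour, lambda fs: {f['name'] for f in fs})
print(names_by_colour['red'] == {'apple', 'cherry'})  # True
```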
In [81]:
"""
result_2 = hc.eval_expr_typed(
'''
va.transcripts.groupBy(t => <FILL IN>)
.mapValues(ts => <FILL IN>)
''')
"""
# check the answer
try:
    answer_key = {u'GENE1': {u'LOF', u'SYN'}, u'GENE2': {u'MIS'}, u'GENE3': {u'SYN'}}, TDict(TString(), TSet(TString()))
    check(result_2, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###
Exercise 3: Do the reverse: group va.transcripts by consequence, and produce a mapping from consequence to all genes with that consequence
In [82]:
"""
result_3 = hc.eval_expr_typed(
'''
<FILL IN>
''')
"""
# check the answer
try:
    answer_key = {u'LOF': {u'GENE1'}, u'MIS': {u'GENE2'}, u'SYN': {u'GENE1', u'GENE3'}}, TDict(TString(), TSet(TString()))
    check(result_3, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###