Introduction to the Expression Language

This notebook starts with the basics of the Hail expression language and builds up practical experience with its type system, syntax, and functionality. By the end of this notebook, we hope that you will be comfortable enough to start using the expression language to slice, dice, filter, and query genetic data. Those tasks are covered in the next notebook!

The best part about a Jupyter Notebook is that you don’t just have to run what we’ve written - you can and should change the code and see what happens!

Setup

Every Hail practical notebook starts the same way: import the necessary modules and construct a HailContext. This object is the entry point for Hail functionality. It also wraps a SparkContext, which can be accessed with hc.sc.

As always, visit the documentation on the Hail website for full reference.

In [1]:
from hail import *
hc = HailContext()
Running on Apache Spark version 2.0.2
SparkUI available at http://10.56.135.40:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-5a67787

Hail Expression Language

The Hail expression language is used everywhere in Hail: filtering conditions, describing covariates and phenotypes, storing summary statistics about variants and samples, generating synthetic data, plotting, exporting, and more. Hail expressions are written as Python strings and passed to various Hail methods, such as filter_variants_expr and the linear regression method.

The expression language is a programming language just like Python, R, or Scala. While the syntax is different, programming experience will certainly translate. We have built the expression language with the hope that even people new to programming can use it to explore genetic data, even if this means copying motifs and expressions found in places like the Hail discussion forum.

For learning purposes, HailContext contains the method eval_expr_typed. This method takes a Python string of Hail expression code, evaluates it, and returns a tuple with the result and its type. We’ll be using this method throughout the expression language tutorial.

Hail Types

The Hail expression language is strongly typed, meaning that every expression has an associated type.

Hail defines the following types:

Primitives:
- Int
- Double
- Float
- Long
- Boolean
- String

Compound Types:
- Array[T]
- Set[T]
- Dict[K, V]
- Aggregable[T]
- Struct

Genetic Types:
- Variant
- Locus
- AltAllele
- Interval
- Genotype
- Call

Primitive Types

Let’s start with simple primitive types. Primitive types are a basic building block for any programming language - these are things like numbers and strings and boolean values.

Hail expressions are passed as Python strings to Hail methods.

In [2]:
# the Boolean literals are 'true' and 'false'
hc.eval_expr_typed('true')
Out[2]:
(True, Boolean)

The return value is True, not true. Why? When values are returned by Hail methods, they are returned as the corresponding Python value.

In [3]:
hc.eval_expr_typed('123')
Out[3]:
(123, Int)
In [4]:
hc.eval_expr_typed('123.45')
Out[4]:
(123.45, Double)

String literals are denoted with double-quotes. The ‘u’ preceding the printed result denotes a unicode string, and is safe to ignore.

In [5]:
hc.eval_expr_typed('"Hello, world"')
Out[5]:
(u'Hello, world', String)

Primitive types support all the usual operations you’d expect. For details, refer to the documentation on operators and types. Here are some examples.

In [6]:
hc.eval_expr_typed('3 + 8')
Out[6]:
(11, Int)
In [7]:
hc.eval_expr_typed('3.2 * 0.5')
Out[7]:
(1.6, Double)
In [8]:
hc.eval_expr_typed('3 ** 3')
Out[8]:
(27.0, Double)
In [9]:
hc.eval_expr_typed('25 ** 0.5')
Out[9]:
(5.0, Double)
In [10]:
hc.eval_expr_typed('true || false')
Out[10]:
(True, Boolean)
In [11]:
hc.eval_expr_typed('true && false')
Out[11]:
(False, Boolean)

Missingness

As in R, all values in Hail can be missing. Most operations, like addition, return missing if any of their inputs is missing. There are a few special operations for manipulating missing values. There is also a missing literal, NA, but you have to specify its type. Missing Hail values are converted to None in Python.

In [12]:
hc.eval_expr_typed('NA: Int') # missing Int
Out[12]:
(None, Int)
In [13]:
hc.eval_expr_typed('NA: Dict[String, Int]')
Out[13]:
(None, Dict[String,Int])
In [14]:
hc.eval_expr_typed('1 + NA: Int')
Out[14]:
(None, Int)
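
For intuition, the propagation rule can be mimicked in plain Python, with None standing in for a missing value. This is only a sketch of the semantics described above, not Hail code; the helper name add_or_missing is made up for illustration.

```python
def add_or_missing(a, b):
    """Add two values, returning None (missing) if either input is missing."""
    if a is None or b is None:
        return None
    return a + b


print(add_or_missing(1, 2))     # 3
print(add_or_missing(1, None))  # None, mirroring '1 + NA: Int'
```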

You can test missingness with isDefined and isMissing.

In [15]:
hc.eval_expr_typed('isDefined(1)')
Out[15]:
(True, Boolean)
In [16]:
hc.eval_expr_typed('isDefined(NA: Int)')
Out[16]:
(False, Boolean)
In [17]:
hc.eval_expr_typed('isMissing(NA: Double)')
Out[17]:
(True, Boolean)

orElse lets you convert missing to a default value, and orMissing lets you turn a value into missing based on a condition.

In [18]:
hc.eval_expr_typed('orElse(5, 2)')
Out[18]:
(5, Int)
In [19]:
hc.eval_expr_typed('orElse(NA: Int, 2)')
Out[19]:
(2, Int)
In [20]:
hc.eval_expr_typed('orMissing(true, 5)')
Out[20]:
(5, Int)
In [21]:
hc.eval_expr_typed('orMissing(false, 5)')
Out[21]:
(None, Int)
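
The four results above can be reproduced with two small plain-Python helpers, again using None for missing. The names or_else and or_missing are hypothetical stand-ins written for this sketch; they are not part of the Hail Python API.

```python
def or_else(value, default):
    """Return value unless it is missing (None); then return default."""
    return default if value is None else value


def or_missing(condition, value):
    """Return value when condition is True; otherwise return missing (None)."""
    return value if condition else None


print(or_else(5, 2))         # 5
print(or_else(None, 2))      # 2
print(or_missing(True, 5))   # 5
print(or_missing(False, 5))  # None
```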

Let

You can assign a value to a variable with a let expression. Here is an example.

In [22]:
hc.eval_expr_typed('let a = 5 in a + 1')
Out[22]:
(6, Int)

The variable, here a, is only visible in the body of the let, that is, the expression following in. You can assign multiple variables. Variable assignments are separated by and. Each variable is visible in the right-hand side of the variables that follow it, as well as in the body of the let. For example:

In [23]:
hc.eval_expr_typed('''
let a = 5
and b = a + 1
 in a * b
''')
Out[23]:
(30, Int)
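
One way to think about the scoping rule is to read each binding as a function parameter. The plain-Python sketch below rewrites the two-variable let as nested lambdas; it is an analogy for the visibility rules, not a description of how Hail evaluates let.

```python
# 'let a = 5 and b = a + 1 in a * b' written as nested function calls.
# 'a' is visible when computing b and in the body; 'b' is visible only in the body.
result = (lambda a: (lambda b: a * b)(a + 1))(5)

print(result)  # 30
```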

Conditionals

Unlike in many other languages, a conditional in Hail is an expression that returns a value. Both arms of the conditional must have the same type, and the predicate must be of type Boolean. If the predicate is missing, the value of the entire conditional is missing. Here are some simple examples.

In [24]:
hc.eval_expr_typed('if (true) 1 else 2')
Out[24]:
(1, Int)
In [25]:
hc.eval_expr_typed('if (false) 1 else 2')
Out[25]:
(2, Int)
In [26]:
hc.eval_expr_typed('if (NA: Boolean) 1 else 2')
Out[26]:
(None, Int)
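
The three cases above can be summarized in a small plain-Python function, with None standing in for missing. The name if_else is made up for this sketch. Note one difference from a real conditional: a Python function call evaluates both arms before choosing one.

```python
def if_else(predicate, if_true, if_false):
    """Mimic a conditional whose result is missing when the predicate is missing."""
    if predicate is None:
        return None
    return if_true if predicate else if_false


print(if_else(True, 1, 2))   # 1
print(if_else(False, 1, 2))  # 2
print(if_else(None, 1, 2))   # None
```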

The if and else branches need to return the same type. The expression below is invalid because one branch is an Int and the other is a String.

In [27]:
# Uncomment and run the below code to see the error message

# hc.eval_expr_typed('if (true) 1 else "two"')

Compound Types

Hail has several compound types:
- Array[T]
- Set[T]
- Dict[K, V]
- Aggregable[T]
- Struct

T, K, and V here mean any type, including other compound types. Hail’s Array[T] objects are similar to Python’s lists, except they must be homogeneous: that is, each element must be of the same type. Arrays are 0-indexed. Here are some examples of simple array expressions.

Array literals are constructed with square brackets.

In [28]:
hc.eval_expr_typed('[1, 2, 3, 4, 5]')
Out[28]:
([1, 2, 3, 4, 5], Array[Int])

Arrays are indexed with square brackets and support Python’s slice syntax.

In [29]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[0]')
Out[29]:
(1, Int)
In [30]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:3]')
Out[30]:
([2, 3], Array[Int])
In [31]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:]')
Out[31]:
([2, 3, 4, 5], Array[Int])
In [32]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a.length()')
Out[32]:
(5, Int)

Arrays can be transformed with the functional operators filter and map. These operations return a new array; they never modify the original.

In [33]:
# keep the elements that are less than 10
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10)')
Out[33]:
([1, 2, 7], Array[Int])
In [34]:
# square the elements of an array
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x * x)')
Out[34]:
([1, 4, 484, 49, 100, 121], Array[Int])
In [35]:
# combine the two: keep elements less than 10 and then square them
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10).map(x => x * x)')
Out[35]:
([1, 4, 49], Array[Int])

In the above filter / map expressions, you can see a strange syntax:

x => x < 10

This syntax is a lambda function. The functions filter and map take functions as arguments! A Hail lambda function takes the form:

binding => expression

That we named the binding ‘x’ in every example above is a point of preference, and no more. We can name the bindings anything we want.

In [36]:
# use 'foo' and 'bar' as bindings
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(foo => foo < 10).map(bar => bar * bar)')
Out[36]:
([1, 4, 49], Array[Int])
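
If you know Python, the same filter-then-map pipeline may look familiar. The sketch below is ordinary Python, not Hail; it shows the lambda form side by side with the equivalent list comprehension, and confirms that the input list is left unchanged.

```python
a = [1, 2, 22, 7, 10, 11]

# Lambda form, mirroring a.filter(foo => foo < 10).map(bar => bar * bar)
squares_of_small = list(map(lambda bar: bar * bar,
                            filter(lambda foo: foo < 10, a)))

# The same computation as a list comprehension
squares_of_small_lc = [x * x for x in a if x < 10]

print(squares_of_small)     # [1, 4, 49]
print(squares_of_small_lc)  # [1, 4, 49]
print(a)                    # [1, 2, 22, 7, 10, 11] (the original is untouched)
```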

The full list of methods on arrays can be found in the Array[T] section of the Hail documentation.

Numeric Arrays

Numeric arrays, like Array[Int] and Array[Double], have additional operations such as sum, max, mean, median, and sort. For a full list, see, for example, the documentation for Array[Int]. Here are a few examples.

In [37]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].sum()')
Out[37]:
(53, Int)
In [38]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].max()')
Out[38]:
(22, Int)
In [39]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].mean()')
Out[39]:
(8.833333333333334, Double)
In [40]:
# take the square root of each element
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x ** 0.5)')
Out[40]:
([1.0,
  1.4142135623730951,
  4.69041575982343,
  2.6457513110645907,
  3.1622776601683795,
  3.3166247903554],
 Array[Double])
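
As a cross-check, the same quantities can be computed in plain Python. This is not Hail code. The mean is written with an explicit float conversion so that it gives the same answer under Python 2, where dividing two ints truncates.

```python
a = [1, 2, 22, 7, 10, 11]

total = sum(a)                 # 53
largest = max(a)               # 22
mean = float(total) / len(a)   # float() avoids integer division on Python 2
roots = [x ** 0.5 for x in a]  # square root of each element

print(total, largest, mean)
print(roots)
```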

Exercise

Write an expression that calculates the sum of the squared residuals of an array, where the residual of an element x is (x - mean).

In [41]:
# Uncomment the code below by deleting the triple quotes, then write an expression to calculate the sum of squared residuals.

"""
result, t = hc.eval_expr_typed('''
let a = [1, -2, 11, 3, -2]
and mean = <FILL IN>
in a.map(x => <FILL IN> ).sum()
''')
"""

try:
    print('Your result: %s (%s)' % (result, t))
    print('Expected answer:  114.8 (Double)')
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###

What if a contains a missing value NA: Int? Will your code still work?

Structs

Structs are collections of named values known as fields. Hail does not have tuples like Python. Unlike arrays, the values can be heterogeneous. Unlike Dicts, the set of field names is part of the type and must be known statically. Structs are constructed with a syntax similar to Python’s dict syntax. Struct fields are accessed using the . syntax.

In [42]:
print(hc.eval_expr_typed('{gene: "ACBD", function: "LOF", nHet: 12}'))
(Struct{u'function': u'LOF', u'nHet': 12, u'gene': u'ACBD'}, Struct{gene:String,function:String,nHet:Int})
In [43]:
hc.eval_expr_typed('let s = {gene: "ACBD", function: "LOF", nHet: 12} in s.gene')
Out[43]:
(u'ACBD', String)
In [44]:
hc.eval_expr_typed('let s = NA: Struct { gene: String, function: String, nHet: Int} in s.gene')
Out[44]:
(None, String)
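
A rough Python analogue of a Struct is a namedtuple: the field names are fixed when the type is defined, the values may have different types, and fields are read with the . syntax. The sketch below is plain Python and the type name GeneSummary is made up; it illustrates the idea only and is not how Hail represents Structs.

```python
from collections import namedtuple

# The set of field names is part of the type, as with a Hail Struct
GeneSummary = namedtuple('GeneSummary', ['gene', 'function', 'nHet'])

s = GeneSummary(gene='ACBD', function='LOF', nHet=12)

print(s.gene)               # ACBD
print(s.nHet)               # 12
print(GeneSummary._fields)  # ('gene', 'function', 'nHet')
```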

Genetic Types

Hail contains several genetic types:
- Variant
- Locus
- AltAllele
- Interval
- Genotype
- Call

These are designed to make it easy to manipulate genetic data. There are many built-in functions for asking common questions about these data types, like whether an alternate allele is a SNP, or what fraction of the reads for a called genotype belong to the reference allele.

Demo variables

To explore these types and constructs, we have defined five representative variables (v, s, g, sa, and va) which you can access in eval_expr_typed:

In [45]:
# 'v' is used to indicate 'Variant' in Hail
hc.eval_expr_typed('v')
Out[45]:
(Variant(contig=16, start=19200405, ref=C, alts=[AltAllele(ref=C, alt=G), AltAllele(ref=C, alt=CCC)]),
 Variant)
In [46]:
# 's' is used to refer to sample ID in Hail
hc.eval_expr_typed('s')
Out[46]:
(u'NA12878', String)
In [47]:
# 'g' is used to refer to the genotype in Hail
hc.eval_expr_typed('g')
Out[47]:
(Genotype(GT=1, AD=[14, 0, 12], DP=26, GQ=60, PL=[60, 65, 126, 0, 67, 65]),
 Genotype)
In [48]:
# 'sa' is used to refer to sample annotations
hc.eval_expr_typed('sa')
Out[48]:
(Struct{u'cohort': u'1KG', u'covariates': Struct{u'PC2': -0.61512, u'PC3': 0.3166666, u'age': 34, u'PC1': 0.102312, u'isFemale': True}},
 Struct{cohort:String,covariates:Struct{PC1:Double,PC2:Double,PC3:Double,age:Int,isFemale:Boolean}})

The above output is a bit wordy. Let’s try 'va':

In [49]:
# 'va' is used to refer to variant annotations
hc.eval_expr_typed('va')
Out[49]:
(Struct{u'info': Struct{u'AC': [40, 1], u'AF': [0.00784, 0.000196], u'AN': 5102}, u'transcripts': [Struct{u'consequence': u'SYN', u'isoform': u'GENE1.1', u'gene': u'GENE1', u'canonical': False}, Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True}, Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False}, Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False}, Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False}, Struct{u'consequence': u'SYN', u'isoform': u'GENE3.1', u'gene': u'GENE3', u'canonical': False}, Struct{u'consequence': u'SYN', u'isoform': u'GENE3.2', u'gene': u'GENE3', u'canonical': False}]},
 Struct{info:Struct{AC:Array[Int],AN:Int,AF:Array[Double]},transcripts:Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}]})

This is totally illegible. pprint can solve our problems!

pprint is a Python standard library module that tries to print objects legibly. Let’s try it out here:

In [50]:
from pprint import pprint
In [51]:
# 'va' is used to refer to variant annotations
pprint(hc.eval_expr_typed('va'))
({u'info': {u'AC': [40, 1], u'AF': [0.00784, 0.000196], u'AN': 5102},
  u'transcripts': [{u'canonical': False,
                    u'consequence': u'SYN',
                    u'gene': u'GENE1',
                    u'isoform': u'GENE1.1'},
                   {u'canonical': True,
                    u'consequence': u'LOF',
                    u'gene': u'GENE1',
                    u'isoform': u'GENE1.2'},
                   {u'canonical': False,
                    u'consequence': u'MIS',
                    u'gene': u'GENE2',
                    u'isoform': u'GENE2.1'},
                   {u'canonical': False,
                    u'consequence': u'MIS',
                    u'gene': u'GENE2',
                    u'isoform': u'GENE2.2'},
                   {u'canonical': False,
                    u'consequence': u'MIS',
                    u'gene': u'GENE2',
                    u'isoform': u'GENE2.3'},
                   {u'canonical': False,
                    u'consequence': u'SYN',
                    u'gene': u'GENE3',
                    u'isoform': u'GENE3.1'},
                   {u'canonical': False,
                    u'consequence': u'SYN',
                    u'gene': u'GENE3',
                    u'isoform': u'GENE3.2'}]},
 Struct{
     info: Struct{
         AC: Array[Int],
         AN: Int,
         AF: Array[Double]
     },
     transcripts: Array[Struct{
         gene: String,
         isoform: String,
         canonical: Boolean,
         consequence: String
     }]
 })

You’ll rarely need to construct a Variant or Genotype object inside the Hail expression language. More commonly, these objects will be provided to you as variables. In the remainder of this notebook, we will explore how to manipulate the demo variables. In the next notebook, we start using the expression language to annotate and filter a dataset.

First, a short demonstration of some of the methods accessible on Variant and Genotype objects:

In [52]:
hc.eval_expr_typed('v')
Out[52]:
(Variant(contig=16, start=19200405, ref=C, alts=[AltAllele(ref=C, alt=G), AltAllele(ref=C, alt=CCC)]),
 Variant)
In [53]:
hc.eval_expr_typed('v.contig')
Out[53]:
(u'16', String)
In [54]:
hc.eval_expr_typed('v.start')
Out[54]:
(19200405, Int)
In [55]:
hc.eval_expr_typed('v.ref')
Out[55]:
(u'C', String)
In [56]:
hc.eval_expr_typed('v.altAlleles')
Out[56]:
([AltAllele(ref=C, alt=G), AltAllele(ref=C, alt=CCC)], Array[AltAllele])
In [57]:
hc.eval_expr_typed('v.altAlleles.map(aa => aa.isSNP())')
Out[57]:
([True, False], Array[Boolean])
In [58]:
hc.eval_expr_typed('v.altAlleles.map(aa => aa.isInsertion())')
Out[58]:
([False, True], Array[Boolean])
In [59]:
hc.eval_expr_typed('g')
Out[59]:
(Genotype(GT=1, AD=[14, 0, 12], DP=26, GQ=60, PL=[60, 65, 126, 0, 67, 65]),
 Genotype)
In [60]:
hc.eval_expr_typed('g.dp')
Out[60]:
(26, Int)
In [61]:
hc.eval_expr_typed('g.ad')
Out[61]:
([14, 0, 12], Array[Int])
In [62]:
hc.eval_expr_typed('g.fractionReadsRef()')
Out[62]:
(0.5384615384615384, Double)
In [63]:
hc.eval_expr_typed('g.isHet()')
Out[63]:
(True, Boolean)

Wrangling complex nested types

Structs and Arrays allow arbitrarily deep grouping and nesting of values.

Remember the type of sa:

In [64]:
pprint(hc.eval_expr_typed('sa')[1])
Struct{
     cohort: String,
     covariates: Struct{
         PC1: Double,
         PC2: Double,
         PC3: Double,
         age: Int,
         isFemale: Boolean
     }
 }

Select elements of a Struct with a '.'. If we want to select PC1 from the above type, we first index into the top-level struct with covariates, then select the field with PC1:

In [65]:
hc.eval_expr_typed('sa.covariates.PC1')
Out[65]:
(0.102312, Double)

We can construct an array from the struct elements:

In [66]:
hc.eval_expr_typed('[sa.covariates.PC1, sa.covariates.PC2, sa.covariates.PC3]')
Out[66]:
([0.102312, -0.61512, 0.3166666], Array[Double])

Now we’ll use va. Here’s the type of va:

In [67]:
pprint(hc.eval_expr_typed('va')[1])
Struct{
     info: Struct{
         AC: Array[Int],
         AN: Int,
         AF: Array[Double]
     },
     transcripts: Array[Struct{
         gene: String,
         isoform: String,
         canonical: Boolean,
         consequence: String
     }]
 }

This schema is somewhat representative of typical variant annotations: AC, AN, and AF are typically included in the INFO field of a VCF.

In [68]:
hc.eval_expr_typed('va.info.AF')
Out[68]:
([0.00784, 0.000196], Array[Double])
In [69]:
hc.eval_expr_typed('va.info.AF[1]')
Out[69]:
(0.000196, Double)

AC and AF mean “allele count” and “allele frequency” and are “A-indexed”, which means that there is one element per alternate allele. Perhaps we want to construct an array which contains each alternate allele and its count and frequency.

In [70]:
pprint(hc.eval_expr_typed('''range(v.altAlleles.length()).map(i =>
                      {allele: v.altAlleles[i],
                       count: va.info.AC[i],
                       frequency: va.info.AF[i]})'''))
([{u'allele': AltAllele(ref=C, alt=G), u'count': 40, u'frequency': 0.00784},
  {u'allele': AltAllele(ref=C, alt=CCC), u'count': 1, u'frequency': 0.000196}],
 Array[Struct{
     allele: AltAllele,
     count: Int,
     frequency: Double
 }])
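
The index-based pattern above, which walks several A-indexed arrays in parallel, looks like this in plain Python. The values are copied from the demo variables v and va shown earlier, with each alternate allele simplified to its alt string. This is an illustration of the pattern, not Hail code.

```python
alt_alleles = ['G', 'CCC']          # alt strings from v.altAlleles
allele_counts = [40, 1]             # va.info.AC
allele_freqs = [0.00784, 0.000196]  # va.info.AF

# One record per alternate allele, built by indexing all three lists with i
per_allele = [
    {'allele': alt_alleles[i],
     'count': allele_counts[i],
     'frequency': allele_freqs[i]}
    for i in range(len(alt_alleles))
]

for record in per_allele:
    print(record)
```

In Python, zip(alt_alleles, allele_counts, allele_freqs) would be the more idiomatic way to pair the elements; the index form is kept here because it matches the expression above.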

Now, let’s manipulate the va.transcripts array. Here’s what it looks like:

In [71]:
hc.eval_expr_typed('va.transcripts')
Out[71]:
([Struct{u'consequence': u'SYN', u'isoform': u'GENE1.1', u'gene': u'GENE1', u'canonical': False},
  Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'SYN', u'isoform': u'GENE3.1', u'gene': u'GENE3', u'canonical': False},
  Struct{u'consequence': u'SYN', u'isoform': u'GENE3.2', u'gene': u'GENE3', u'canonical': False}],
 Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}])

We’ll start by pulling out just the gene field. Our result will be an Array[String]. We need to do this with the map function, to map each struct element of the array to its field gene.

In [72]:
hc.eval_expr_typed('va.transcripts.map(t => t.gene)')
Out[72]:
([u'GENE1', u'GENE1', u'GENE2', u'GENE2', u'GENE2', u'GENE3', u'GENE3'],
 Array[String])

Perhaps we just want the set of unique genes:

In [73]:
hc.eval_expr_typed('va.transcripts.map(t => t.gene).toSet()')
Out[73]:
({u'GENE1', u'GENE2', u'GENE3'}, Set[String])

We can find the canonical transcript with find, which returns the first element where the predicate is true:

In [74]:
hc.eval_expr_typed('va.transcripts.find(t => t.canonical)')
Out[74]:
(Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
 Struct{gene:String,isoform:String,canonical:Boolean,consequence:String})

However, find returns missing (None in Python) if there isn’t an element where the predicate is true:

In [75]:
hc.eval_expr_typed('va.transcripts.find(t => t.gene == "GENE5")')
Out[75]:
(None, Struct{gene:String,isoform:String,canonical:Boolean,consequence:String})
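
The behavior of find can be sketched in plain Python with next and a default value. The helper below is hypothetical and operates on a shortened copy of the demo transcripts stored as dicts; it is meant only to make the "first match, else missing" rule concrete.

```python
# A shortened copy of the demo transcripts, as plain dicts
transcripts = [
    {'gene': 'GENE1', 'isoform': 'GENE1.1', 'canonical': False, 'consequence': 'SYN'},
    {'gene': 'GENE1', 'isoform': 'GENE1.2', 'canonical': True, 'consequence': 'LOF'},
    {'gene': 'GENE2', 'isoform': 'GENE2.1', 'canonical': False, 'consequence': 'MIS'},
]


def find(items, predicate):
    """Return the first item for which predicate is true, or None if there is none."""
    return next((item for item in items if predicate(item)), None)


canonical = find(transcripts, lambda t: t['canonical'])
no_match = find(transcripts, lambda t: t['gene'] == 'GENE5')

print(canonical['isoform'])  # GENE1.2
print(no_match)              # None
```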

Now, we’ll pull out all transcripts marked “MIS” (missense):

In [76]:
hc.eval_expr_typed('va.transcripts.filter(t => t.consequence == "MIS")')
Out[76]:
([Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False}],
 Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}])

Here’s a bit of a complex motif - we can sort the transcripts by an arbitrary function. Here we’ll sort so that "LOF" comes before "MIS", and "MIS" comes before "SYN".

In [77]:
hc.eval_expr_typed('''va.transcripts.sortBy(t =>
                        if (t.consequence == "LOF") 1
                        else if (t.consequence == "MIS") 2
                        else 3)''')
Out[77]:
([Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.1', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.2', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'MIS', u'isoform': u'GENE2.3', u'gene': u'GENE2', u'canonical': False},
  Struct{u'consequence': u'SYN', u'isoform': u'GENE1.1', u'gene': u'GENE1', u'canonical': False},
  Struct{u'consequence': u'SYN', u'isoform': u'GENE3.1', u'gene': u'GENE3', u'canonical': False},
  Struct{u'consequence': u'SYN', u'isoform': u'GENE3.2', u'gene': u'GENE3', u'canonical': False}],
 Array[Struct{gene:String,isoform:String,canonical:Boolean,consequence:String}])

If we are interested in pulling out the worst-consequence transcript, we can use this sorting motif and then take the first element:

In [78]:
hc.eval_expr_typed('''va.transcripts.sortBy(t =>
                        if (t.consequence == "LOF") 1
                        else if (t.consequence == "MIS") 2
                        else 3)[0]''')
Out[78]:
(Struct{u'consequence': u'LOF', u'isoform': u'GENE1.2', u'gene': u'GENE1', u'canonical': True},
 Struct{gene:String,isoform:String,canonical:Boolean,consequence:String})
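
The same ranking idea carries over to plain Python, where sorted and min both accept a key function. The sketch below uses a shortened copy of the demo transcripts and a made-up severity helper. It is not Hail code, and it relies on Python's sorted being stable, which is a property of Python rather than something this notebook states about sortBy.

```python
# Lower rank means worse consequence; anything not listed ranks last (3)
SEVERITY = {'LOF': 1, 'MIS': 2}


def severity(transcript):
    """Rank a transcript by its consequence, mirroring the if/else chain above."""
    return SEVERITY.get(transcript['consequence'], 3)


transcripts = [
    {'gene': 'GENE1', 'isoform': 'GENE1.1', 'consequence': 'SYN'},
    {'gene': 'GENE1', 'isoform': 'GENE1.2', 'consequence': 'LOF'},
    {'gene': 'GENE2', 'isoform': 'GENE2.1', 'consequence': 'MIS'},
    {'gene': 'GENE3', 'isoform': 'GENE3.1', 'consequence': 'SYN'},
]

ranked = sorted(transcripts, key=severity)
worst = ranked[0]

# If only the worst transcript is needed, min avoids sorting the whole list
worst_direct = min(transcripts, key=severity)

print([t['isoform'] for t in ranked])
print(worst['isoform'])  # GENE1.2
```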

Exercises

Uncomment the code blocks, fill them in, and run each block to check your answers.

In [79]:
def check(answer, answer_key):
    print('Your answer / type:')
    pprint(answer)
    print('')
    if (answer == answer_key):
        print('Correct!')
    else:
        print('Incorrect. Expected:')
        pprint(answer_key)

Exercise 1: using filter and map to pull out the gene isoform for synonymous transcripts

In [80]:
"""
result_1 = hc.eval_expr_typed(
'''
va.transcripts.filter(t => <FILL IN>)
  .map(t => <FILL IN>)
''')
"""
# check the answer
try:
    answer_key = [u'GENE1.1', u'GENE3.1', u'GENE3.2'], TArray(TString())
    check(result_1, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###

Exercise 2: using groupBy and mapValues to produce a mapping from gene to all observed consequences

Remember: <array>.toSet() converts an array to a Set, the desired type of the dictionary value.

Hint: Once you’ve grouped by gene, you can fill in the mapValues step with ts => ts to see the type of ts. It’s an Array[Struct{...}]. How do we pull just one field out?

In [81]:
"""
result_2 = hc.eval_expr_typed(
'''
  va.transcripts.groupBy(t => <FILL IN>)
    .mapValues(ts => <FILL IN>)
''')
"""

# check the answer
try:
    answer_key = {u'GENE1': {u'LOF', u'SYN'}, u'GENE2': {u'MIS'}, u'GENE3': {u'SYN'}}, TDict(TString(), TSet(TString()))
    check(result_2, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###

Exercise 3: Do the reverse: group va.transcripts by consequence, and produce a mapping from consequence to all genes with that consequence

In [82]:
"""
result_3 = hc.eval_expr_typed(
'''
<FILL IN>
''')
"""

# check the answer
try:
    answer_key = {u'LOF': {u'GENE1'}, u'MIS': {u'GENE2'}, u'SYN': {u'GENE1', u'GENE3'}}, TDict(TString(), TSet(TString()))
    check(result_3, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')
### Remove the triple quotes around the above code to start the exercise ###