Expression Tutorial

This tutorial covers data representation with Hail’s expression classes. We will go over Hail’s data types and the expressions that represent them, as well as a few features of expressions, such as lazy evaluation and missingness. We will also cover how expressions can refer to fields in a table or matrix table.

As you are working through the tutorial, you can also check out the expression API for documentation on specific expressions and their methods, or the expression page in the Hailpedia for more information on expressions.

Start by importing the Hail module, which we typically abbreviate as hl, and initializing Hail and Spark with the init method:

In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://10.32.4.4:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.5-2595d91d83e0
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181215-1612-0.2.5-2595d91d83e0.log

Hail’s Data Types

Each object in Python has a data type, which can be accessed with Python’s built-in type function. Here is a Python string, which has type str.

In [2]:
type("Python")
Out[2]:
str

Hail has its own data types for representing data. Here is a Hail string, which we construct with the hl.str function. We can access the string’s Hail type with the dtype field.

In [3]:
hl.str("Hail").dtype
Out[3]:
dtype('str')

Hail has primitive and container types, as well as a few types specific to the field of genetics.

Each of these types has its own constructor method, which returns an expression:

In [4]:
hl.str("Hail")
Out[4]:
<StringExpression of type str>

What is an Expression?

Data types in Hail are represented by expression classes. Each data type has its own expression class. For example, an integer of type tint32 is represented by an Int32Expression.

We can construct an integer expression in Hail with the int32 function.

In [5]:
hl.int32(3)
Out[5]:
<Int32Expression of type int32>

To automatically impute the type when converting a Python object to a Hail expression, use the literal method. Let’s try it out on a Python list.

In [6]:
hl.literal(['a', 'b', 'c'])
Out[6]:
<ArrayExpression of type array<str>>

The Python list is converted to an ArrayExpression of type array<str>, that is, an array of strings.

Expressions are Lazy

In languages like Python and R, expressions are evaluated and stored immediately. This is called eager evaluation.

In [7]:
1 + 2
Out[7]:
3

Eager evaluation won’t work on datasets that won’t fit in memory. Consider the UK Biobank BGEN file, which is ~2TB but decompresses to >100TB in memory.

In order to process datasets of this size, Hail uses lazy evaluation. When you enter an expression, Hail doesn’t execute the expression immediately; it only records what you asked to do.

In [8]:
one = hl.int32(1)
three = one + 2
three
Out[8]:
<Int32Expression of type int32>
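To make the idea of recording rather than executing more concrete, here is a toy sketch in plain Python. It is not how Hail is implemented; LazyInt, lazy_int, and evaluate are hypothetical names invented for this illustration. Each object stores a recipe for computing a value, and the arithmetic runs only when evaluate is called.

```python
class LazyInt:
    """Toy stand-in for a lazy integer expression (not Hail's implementation)."""

    def __init__(self, compute, description):
        self._compute = compute          # zero-argument function, run only on demand
        self.description = description   # human-readable record of what was asked

    def __add__(self, other):
        # Wrap plain Python ints so both operands are LazyInt objects.
        other = other if isinstance(other, LazyInt) else lazy_int(other)
        # Build a new recipe; no addition happens here.
        return LazyInt(lambda: self._compute() + other._compute(),
                       f"({self.description} + {other.description})")

    def __repr__(self):
        return f"<LazyInt recording {self.description}>"


def lazy_int(value):
    """Wrap a Python int as a LazyInt."""
    return LazyInt(lambda: value, str(value))


def evaluate(expr):
    """Force evaluation, playing the role that hl.eval plays for Hail expressions."""
    return expr._compute()


one = lazy_int(1)
three = one + 2
print(three)            # shows the recorded recipe, not a number
print(evaluate(three))  # the addition runs only here
```

As with the Hail expression above, printing three shows a description of the computation, and the value 3 appears only after an explicit evaluation step.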

Hail evaluates an expression only when it must. For example:

  • when performing an aggregation
  • when calling the methods take, collect, and show
  • when exporting or writing to disk

Hail evaluates expressions by streaming to accommodate very large datasets.

If you want to force the evaluation of an expression, you can do so by passing it to hl.eval. Note that this can only be done on an expression with no index, such as hl.int32(1) + 2. If the expression has an index, e.g. table.idx + 1, then hl.eval will fail. The section on indices below explains this concept further.

In [9]:
hl.eval(three)
Out[9]:
3

The show method can also be used to evaluate and display the expression.

In [10]:
three.show()
+--------+
| <expr> |
+--------+
|  int32 |
+--------+
|      3 |
+--------+

Missing data

All expressions in Hail can represent missing data. Hail has a collection of primitive operations for dealing with missingness.

The null constructor can be used to create a missing expression of a specific type, such as a missing string:

In [11]:
missing_string = hl.null(hl.tstr)

Use is_defined or is_missing to test an expression for missingness.

In [12]:
hl.eval(hl.is_defined(missing_string))
Out[12]:
False
In [13]:
hl.eval(hl.is_missing(missing_string))
Out[13]:
True

Expressions handle missingness in the following ways:

  • a missing value plus another value is always missing
  • a conditional statement with a missing predicate is missing
  • when aggregating a sum of values, the missing values are ignored

This is different from Python’s treatment of missingness, where None + 5 would produce an error. In Hail, hl.null(hl.tint32) + 5 produces a missing result, not an error.

In [14]:
hl.eval(hl.is_missing(hl.null(hl.tint32) + 5))
Out[14]:
True

Here are a few more examples to illustrate how missingness is treated in Hail:

Missingness is ignored in a summation:

In [15]:
hl.eval(hl.sum(hl.array([1, 2, hl.null(hl.tint32)])))
Out[15]:
3

or_missing takes a predicate and a value. If the predicate is True, it returns the value; otherwise, it returns a missing value.

In [16]:
x = hl.int32(5)
hl.eval(hl.or_missing(x>0, x))
Out[16]:
5
In [17]:
print(hl.eval(hl.or_missing(x>10, x)))
None

Indices

Expressions carry another piece of information: indices. Indices record the Table or MatrixTable to which the expression refers, and the axes over which the expression can vary.

Let’s see some examples from the 1000 Genomes dataset:

In [18]:
hl.utils.get_1kg('data/')
2018-12-15 16:12:54 Hail: INFO: 1KG files found
In [19]:
mt = hl.read_matrix_table('data/1kg.mt')
mt
Out[19]:
<hail.matrixtable.MatrixTable at 0x7f9b404f6240>

Let’s add a global field.

In [20]:
mt = mt.annotate_globals(dataset = '1kg')

We can examine any field of the matrix table with the describe method. Examining the field we just added shows that it has no indices, because it is a global field.

In [21]:
mt.dataset.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f9b4050e0b8>
Index:
    []
--------------------------------------------------------

The locus field is a row field, so it will be indexed by row.

In [22]:
mt.locus.describe()
--------------------------------------------------------
Type:
    locus<GRCh37>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f9b4050e0b8>
Index:
    ['row']
--------------------------------------------------------

Likewise, a column field s will be indexed by column.

In [23]:
mt.s.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f9b4050e0b8>
Index:
    ['column']
--------------------------------------------------------

And finally, an entry field GT will be indexed by both the row and column.

In [24]:
mt.GT.describe()
--------------------------------------------------------
Type:
    call
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f9b4050e0b8>
Index:
    ['row', 'column']
--------------------------------------------------------

Expressions like locus, s, and GT above do not have a single value, but rather a value that varies across rows or columns of mt. Therefore, calling the hl.eval function with these expressions will lead to an error.

Global fields don’t vary across rows or columns, so they can be directly evaluated:

In [25]:
hl.eval(mt.dataset)
Out[25]:
'1kg'

show, take, and collect

Although expressions with indices do not have a single realizable value (calling hl.eval will fail), you can use show to print the first few values, take to localize the first few values into a Python list, or collect to localize all values into a Python list.

show and take grab the first 10 rows by default, but you can specify a number of rows to grab.

In [26]:
mt.s.show()
+-----------+
| s         |
+-----------+
| str       |
+-----------+
| "HG00096" |
| "HG00099" |
| "HG00105" |
| "HG00118" |
| "HG00129" |
| "HG00148" |
| "HG00177" |
| "HG00182" |
| "HG00242" |
| "HG00254" |
+-----------+
showing top 10 rows

In [27]:
mt.s.take(5)
Out[27]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']

You can collect an expression to localize all values, for example to get a list of all sample IDs in a dataset.

But be careful – don’t collect more data than can fit in memory!

In [28]:
all_sample_ids = mt.s.collect()
all_sample_ids[:5]
Out[28]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']

Learning more

Hail has a suite of functions to transform and build expressions.

For further documentation on expressions, see the expression API and the expression page.