Filtering and Annotation Tutorial

Filter

You can filter the rows of a table with Table.filter. This returns a table of those rows for which the expression evaluates to True.

In [1]:
import hail as hl
import seaborn

hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
Initializing Spark and Hail with default parameters...
Running on Apache Spark version 2.2.0
SparkUI available at http://10.32.4.4:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.5-2595d91d83e0
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181215-1613-0.2.5-2595d91d83e0.log
2018-12-15 16:13:32 Hail: INFO: Movie Lens files found!
In [2]:
users.filter(users.occupation == 'programmer').count()
Out[2]:
66

Annotate

You can add new fields to a table with annotate. Let’s mean-center and variance-normalize the age field.

In [3]:
stats = users.aggregate(hl.agg.stats(users.age))
missing_occupations = hl.set(['other', 'none'])

t = users.annotate(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()
+-------+-------+-----+-----------------+---------+--------------------+
|    id |   age | sex | occupation      | zipcode | cleaned_occupation |
+-------+-------+-----+-----------------+---------+--------------------+
| int32 | int32 | str | str             | str     | str                |
+-------+-------+-----+-----------------+---------+--------------------+
|     1 |    24 | "M" | "technician"    | "85711" | "technician"       |
|     2 |    53 | "F" | "other"         | "94043" | NA                 |
|     3 |    23 | "M" | "writer"        | "32067" | "writer"           |
|     4 |    24 | "M" | "technician"    | "43537" | "technician"       |
|     5 |    33 | "F" | "other"         | "15213" | NA                 |
|     6 |    42 | "M" | "executive"     | "98101" | "executive"        |
|     7 |    57 | "M" | "administrator" | "91344" | "administrator"    |
|     8 |    36 | "M" | "administrator" | "05201" | "administrator"    |
|     9 |    29 | "M" | "student"       | "01002" | "student"          |
|    10 |    53 | "M" | "lawyer"        | "90703" | "lawyer"           |
+-------+-------+-----+-----------------+---------+--------------------+
showing top 10 rows

Note: annotate is functional: it doesn’t mutate users, but returns a new table. This is also true of filter. In fact, all operations in Hail are functional.

In [4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

There are two other annotate methods: select and transmute. select returns a table with the key and an entirely new set of value fields. transmute replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged. transmute is useful for transforming data into a new form. How about some examples?

In [5]:
(users.select(len_occupation = hl.len(users.occupation))
 .describe())
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'len_occupation': int32
----------------------------------------
Key: ['id']
----------------------------------------
In [6]:
(users.transmute(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null(hl.tstr),
                                 users.occupation))
 .describe())
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'zipcode': str
    'cleaned_occupation': str
----------------------------------------
Key: ['id']
----------------------------------------

Finally, you can add global fields with annotate_globals. Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.

In [7]:
t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()
----------------------------------------
Global fields:
    'cohort': int32
    'cloudable': set<str>
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
In [8]:
t.cloudable
Out[8]:
<SetExpression of type set<str>>
In [9]:
hl.eval(t.cloudable)
Out[9]:
{'sample1', 'sample10', 'sample15'}

Exercises