Filtering and Annotation Tutorial

Filter

You can filter the rows of a table with Table.filter. This returns a table of those rows for which the expression evaluates to True.

[1]:

import hail as hl

hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')

Loading BokehJS ...

Initializing Hail with default parameters...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Running on Apache Spark version 3.5.0
SparkUI available at http://hostname-96b797a886:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.132-678e1f52b999
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20240709-1723-0.2.132-678e1f52b999.log
2024-07-09 17:23:12.323 Hail: INFO: Movie Lens files found!

[2]:

users.filter(users.occupation == 'programmer').count()

SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.

[2]:

We can also express this query in multiple ways using aggregations:

[3]:

users.aggregate(hl.agg.filter(users.occupation == 'programmer', hl.agg.count()))

[3]:

[4]:

users.aggregate(hl.agg.counter(users.occupation == 'programmer'))[True]

[4]:

Annotate

You can add new fields to a table with annotate. As an example, let’s create a new column called cleaned_occupation that replaces missing entries in the occupation field labeled as ‘other’ with ‘none.’

[5]:

missing_occupations = hl.set(['other', 'none'])

t = users.annotate(
    cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),
                                    hl.missing('str'),
                                    users.occupation))
t.show()


id	age	sex	occupation	zipcode	cleaned_occupation
int32	int32	str	str	str	str
1	24	"M"	"technician"	"85711"	"technician"
2	53	"F"	"other"	"94043"	NA
3	23	"M"	"writer"	"32067"	"writer"
4	24	"M"	"technician"	"43537"	"technician"
5	33	"F"	"other"	"15213"	NA
6	42	"M"	"executive"	"98101"	"executive"
7	57	"M"	"administrator"	"91344"	"administrator"
8	36	"M"	"administrator"	"05201"	"administrator"
9	29	"M"	"student"	"01002"	"student"
10	53	"M"	"lawyer"	"90703"	"lawyer"

showing top 10 rows

Compare this to what we had before:

[6]:

users.show()


id	age	sex	occupation	zipcode
int32	int32	str	str	str
1	24	"M"	"technician"	"85711"
2	53	"F"	"other"	"94043"
3	23	"M"	"writer"	"32067"
4	24	"M"	"technician"	"43537"
5	33	"F"	"other"	"15213"
6	42	"M"	"executive"	"98101"
7	57	"M"	"administrator"	"91344"
8	36	"M"	"administrator"	"05201"
9	29	"M"	"student"	"01002"
10	53	"M"	"lawyer"	"90703"

showing top 10 rows

Note: annotate is functional: it doesn’t mutate users, but returns a new table. This is also true of filter. In fact, all operations in Hail are functional.

[7]:

users.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

Select and Transmute

There are two other annotate methods: select and transmute. select allows you to create new tables from old ones by selecting existing fields, or creating new ones.

First, let’s extract the sex and occupation fields:

[8]:

users.select(users.sex, users.occupation).show()


id	sex	occupation
int32	str	str
1	"M"	"technician"
2	"F"	"other"
3	"M"	"writer"
4	"M"	"technician"
5	"F"	"other"
6	"M"	"executive"
7	"M"	"administrator"
8	"M"	"administrator"
9	"M"	"student"
10	"M"	"lawyer"

showing top 10 rows

We can also create a new field that stores the age relative to the average. Note that new fields must be assigned a name (in this case mean_shifted_age):

[9]:

mean_age = round(users.aggregate(hl.agg.stats(users.age)).mean)
users.select(users.sex, users.occupation, mean_shifted_age = users.age - mean_age).show()


id	sex	occupation	mean_shifted_age
int32	str	str	int32
1	"M"	"technician"	-10
2	"F"	"other"	19
3	"M"	"writer"	-11
4	"M"	"technician"	-10
5	"F"	"other"	-1
6	"M"	"executive"	8
7	"M"	"administrator"	23
8	"M"	"administrator"	2
9	"M"	"student"	-5
10	"M"	"lawyer"	19

showing top 10 rows

transmute replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged. transmute is useful for transforming data into a new form. Compare the following two snippts of code. The second is identical to the first with transmute replacing select.

[10]:

missing_occupations = hl.set(['other', 'none'])

t = users.select(
    cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),
                                    hl.missing('str'),
                                    users.occupation))
t.show()


id	cleaned_occupation
int32	str
1	"technician"
2	NA
3	"writer"
4	"technician"
5	NA
6	"executive"
7	"administrator"
8	"administrator"
9	"student"
10	"lawyer"

showing top 10 rows

[11]:

missing_occupations = hl.set(['other', 'none'])

t = users.transmute(
    cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),
                                    hl.missing('str'),
                                    users.occupation))
t.show()


id	age	sex	zipcode	cleaned_occupation
int32	int32	str	str	str
1	24	"M"	"85711"	"technician"
2	53	"F"	"94043"	NA
3	23	"M"	"32067"	"writer"
4	24	"M"	"43537"	"technician"
5	33	"F"	"15213"	NA
6	42	"M"	"98101"	"executive"
7	57	"M"	"91344"	"administrator"
8	36	"M"	"05201"	"administrator"
9	29	"M"	"01002"	"student"
10	53	"M"	"90703"	"lawyer"

showing top 10 rows

Global Fields

Finally, you can add global fields with annotate_globals. Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.

[12]:

t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()

----------------------------------------
Global fields:
    'cohort': int32
    'cloudable': set<str>
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

[13]:

t.cloudable

[13]:

<SetExpression of type set<str>>

[14]:

hl.eval(t.cloudable)

[14]:

{'sample1', 'sample10', 'sample15'}

Exercises

Z-score normalize the age field of users.
Convert zip to an integer. Hint: Not all zipcodes are US zipcodes! Use hl.int32 to convert a string to an integer. Use StringExpression.matches to see if a string matches a regular expression.

[ ]: