Filtering and Annotation Tutorial
Filter
You can filter the rows of a table with Table.filter. This returns a table of those rows for which the expression evaluates to True
.
[1]:
import hail as hl
hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
Initializing Hail with default parameters...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Running on Apache Spark version 3.5.0
SparkUI available at http://hostname-09f2439d4b:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.133-4c60fddb171a
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20241004-2009-0.2.133-4c60fddb171a.log
2024-10-04 20:09:44.088 Hail: INFO: Movie Lens files found!
[2]:
users.filter(users.occupation == 'programmer').count()
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[2]:
66
We can also express this query in multiple ways using aggregations:
[3]:
users.aggregate(hl.agg.filter(users.occupation == 'programmer', hl.agg.count()))
[3]:
66
[4]:
users.aggregate(hl.agg.counter(users.occupation == 'programmer'))[True]
[4]:
66
Annotate
You can add new fields to a table with annotate. As an example, let’s create a new column called cleaned_occupation
that replaces missing entries in the occupation
field labeled as ‘other’ with ‘none.’
[5]:
missing_occupations = hl.set(['other', 'none'])
t = users.annotate(
cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),
hl.missing('str'),
users.occupation))
t.show()
showing top 10 rows
Compare this to what we had before:
[6]:
users.show()
showing top 10 rows
Note: annotate
is functional: it doesn’t mutate users
, but returns a new table. This is also true of filter
. In fact, all operations in Hail are functional.
[7]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
Select and Transmute
There are two other annotate methods: select and transmute. select
allows you to create new tables from old ones by selecting existing fields, or creating new ones.
First, let’s extract the sex
and occupation
fields:
[8]:
users.select(users.sex, users.occupation).show()
showing top 10 rows
We can also create a new field that stores the age relative to the average. Note that new fields must be assigned a name (in this case mean_shifted_age
):
[9]:
mean_age = round(users.aggregate(hl.agg.stats(users.age)).mean)
users.select(users.sex, users.occupation, mean_shifted_age = users.age - mean_age).show()
showing top 10 rows
transmute
replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged. transmute
is useful for transforming data into a new form. Compare the following two snippts of code. The second is identical to the first with transmute
replacing select.
[10]:
missing_occupations = hl.set(['other', 'none'])
t = users.select(
cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),
hl.missing('str'),
users.occupation))
t.show()
showing top 10 rows
[11]:
missing_occupations = hl.set(['other', 'none'])
t = users.transmute(
cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),
hl.missing('str'),
users.occupation))
t.show()
showing top 10 rows
Global Fields
Finally, you can add global fields with annotate_globals. Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.
[12]:
t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()
----------------------------------------
Global fields:
'cohort': int32
'cloudable': set<str>
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
[13]:
t.cloudable
[13]:
<SetExpression of type set<str>>
[14]:
hl.eval(t.cloudable)
[14]:
{'sample1', 'sample10', 'sample15'}
Exercises
Z-score normalize the age field of
users
.Convert
zip
to an integer. Hint: Not all zipcodes are US zipcodes! Use hl.int32 to convert a string to an integer. Use StringExpression.matches to see if a string matches a regular expression.
[ ]: