Filtering and Annotation Tutorial

Filter

You can filter the rows of a table with Table.filter. This returns a table of those rows for which the expression evaluates to True; rows where the expression evaluates to missing are dropped as well.

[1]:
import hail as hl

hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
Initializing Spark and Hail with default parameters...
Running on Apache Spark version 2.4.0
SparkUI available at http://batch-13396-job-41-8c1c40:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.22-597b3bd86135
LOGGING: writing to /kb9qf6elhwvo/python/hail/docs/tutorials/hail-20190913-1606-0.2.22-597b3bd86135.log
2019-09-13 16:06:40 Hail: INFO: Movie Lens files found!
[2]:
users.filter(users.occupation == 'programmer').count()
[2]:
66

Annotate

You can add new fields to a table with annotate. Let's compute summary statistics of the age field with aggregate, then add a cleaned_occupation field that replaces the uninformative occupation values 'other' and 'none' with missing values.

[3]:
# Aggregate summary statistics (mean, stdev, min, max, ...) of the age field.
stats = users.aggregate(hl.agg.stats(users.age))

# Occupation values that carry no information.
missing_occupations = hl.set(['other', 'none'])

# Replace uninformative occupations with a missing value.
t = users.annotate(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()
+-------+-------+-----+-----------------+---------+--------------------+
|    id |   age | sex | occupation      | zipcode | cleaned_occupation |
+-------+-------+-----+-----------------+---------+--------------------+
| int32 | int32 | str | str             | str     | str                |
+-------+-------+-----+-----------------+---------+--------------------+
|     1 |    24 | "M" | "technician"    | "85711" | "technician"       |
|     2 |    53 | "F" | "other"         | "94043" | NA                 |
|     3 |    23 | "M" | "writer"        | "32067" | "writer"           |
|     4 |    24 | "M" | "technician"    | "43537" | "technician"       |
|     5 |    33 | "F" | "other"         | "15213" | NA                 |
|     6 |    42 | "M" | "executive"     | "98101" | "executive"        |
|     7 |    57 | "M" | "administrator" | "91344" | "administrator"    |
|     8 |    36 | "M" | "administrator" | "05201" | "administrator"    |
|     9 |    29 | "M" | "student"       | "01002" | "student"          |
|    10 |    53 | "M" | "lawyer"        | "90703" | "lawyer"           |
+-------+-------+-----+-----------------+---------+--------------------+

showing top 10 rows

Note: annotate is functional: it doesn’t mutate users, but returns a new table. This is also true of filter. In fact, all operations in Hail are functional.

[4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

There are two other methods for adding or replacing fields: select and transmute. select returns a table with the key and an entirely new set of value fields. transmute replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged; this makes it useful for transforming data into a new form. How about some examples?

[5]:
(users.select(len_occupation = hl.len(users.occupation))
 .describe())
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'len_occupation': int32
----------------------------------------
Key: ['id']
----------------------------------------
[6]:
(users.transmute(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null(hl.tstr),
                                 users.occupation))
 .describe())
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'zipcode': str
    'cleaned_occupation': str
----------------------------------------
Key: ['id']
----------------------------------------

Finally, you can add global fields with annotate_globals. Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.

[7]:
t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()
----------------------------------------
Global fields:
    'cohort': int32
    'cloudable': set<str>
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
[8]:
t.cloudable
[8]:
<SetExpression of type set<str>>
[9]:
hl.eval(t.cloudable)
[9]:
{'sample1', 'sample10', 'sample15'}

Exercises

[ ]: