Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but
Table differs in 3 important ways:
It is distributed. Hail tables can store far more data than can fit on a single computer.
It carries global fields.
It is keyed.
Table has two different kinds of fields: global fields and row fields.
Importing and Reading
Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.
You can read a dataset with hl.read_table. It takes a path and returns a Table.
ht stands for Hail Table.
We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!
Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.
import hail as hl
hl.init()
users = hl.read_table('data/users.ht')
The describe method prints the structure of a table: the fields and their types.
You can view the first few rows of the table using show.
10 rows are displayed by default. Pass a number to show, for example show(5), to display a different number of rows.
You can count the rows of a table.
You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).
users.occupation and users['occupation'] are Hail Expressions.
Let's peek at them using show. Notice that the key is shown as well!
The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.