Table Tutorial

Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:

  • It is distributed. Hail tables can store far more data than can fit on a single computer.
  • It carries global fields.
  • It is keyed.

A Table has two different kinds of fields:

  • global fields
  • row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It take a path and returns a Table. ht stands for Hail Table.

We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872.

In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://10.32.4.4:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.5-2595d91d83e0
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181215-1613-0.2.5-2595d91d83e0.log
In [2]:
hl.utils.get_movie_lens('data/')
2018-12-15 16:13:08 Hail: INFO: downloading MovieLens-100k data ...
  Source: http://files.grouplens.org/datasets/movielens/ml-100k.zip
2018-12-15 16:13:08 Hail: INFO: importing users table and writing to data/users.ht ...
2018-12-15 16:13:09 Hail: INFO: Reading table to impute column types
2018-12-15 16:13:09 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'int32' (imputed)
  Loading column 'f2' as type 'str' (imputed)
  Loading column 'f3' as type 'str' (imputed)
  Loading column 'f4' as type 'str' (imputed)
2018-12-15 16:13:10 Hail: INFO: Coerced sorted dataset
2018-12-15 16:13:10 Hail: INFO: wrote table with 943 rows in 1 partition to data/users.ht
2018-12-15 16:13:10 Hail: INFO: importing movies table and writing to data/movies.ht ...
2018-12-15 16:13:10 Hail: INFO: Reading table to impute column types
2018-12-15 16:13:11 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'str' (imputed)
  Loading column 'f2' as type 'str' (imputed)
  Loading column 'f3' as type 'str' (imputed)
  Loading column 'f4' as type 'str' (imputed)
  Loading column 'f5' as type 'int32' (imputed)
  Loading column 'f6' as type 'int32' (imputed)
  Loading column 'f7' as type 'int32' (imputed)
  Loading column 'f8' as type 'int32' (imputed)
  Loading column 'f9' as type 'int32' (imputed)
  Loading column 'f10' as type 'int32' (imputed)
  Loading column 'f11' as type 'int32' (imputed)
  Loading column 'f12' as type 'int32' (imputed)
  Loading column 'f13' as type 'int32' (imputed)
  Loading column 'f14' as type 'int32' (imputed)
  Loading column 'f15' as type 'int32' (imputed)
  Loading column 'f16' as type 'int32' (imputed)
  Loading column 'f17' as type 'int32' (imputed)
  Loading column 'f18' as type 'int32' (imputed)
  Loading column 'f19' as type 'int32' (imputed)
  Loading column 'f20' as type 'int32' (imputed)
  Loading column 'f21' as type 'int32' (imputed)
  Loading column 'f22' as type 'int32' (imputed)
  Loading column 'f23' as type 'int32' (imputed)
2018-12-15 16:13:11 Hail: INFO: Coerced sorted dataset
2018-12-15 16:13:11 Hail: INFO: wrote table with 1682 rows in 1 partition to data/movies.ht
2018-12-15 16:13:11 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2018-12-15 16:13:12 Hail: INFO: Reading table to impute column types
2018-12-15 16:13:12 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'int32' (imputed)
  Loading column 'f2' as type 'int32' (imputed)
  Loading column 'f3' as type 'int32' (imputed)
2018-12-15 16:13:12 Hail: INFO: wrote table with 100000 rows in 2 partitions to data/ratings.ht
In [3]:
users = hl.read_table('data/users.ht')
users
Out[3]:
<hail.table.Table at 0x7fb8b51f6518>

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

In [4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

You can view the first few rows of the table using show. 10 rows are displayed by default. Try changing the code in the cell below to users.show(5).

In [5]:
users.show()
+-------+-------+-----+-----------------+---------+
|    id |   age | sex | occupation      | zipcode |
+-------+-------+-----+-----------------+---------+
| int32 | int32 | str | str             | str     |
+-------+-------+-----+-----------------+---------+
|     1 |    24 | "M" | "technician"    | "85711" |
|     2 |    53 | "F" | "other"         | "94043" |
|     3 |    23 | "M" | "writer"        | "32067" |
|     4 |    24 | "M" | "technician"    | "43537" |
|     5 |    33 | "F" | "other"         | "15213" |
|     6 |    42 | "M" | "executive"     | "98101" |
|     7 |    57 | "M" | "administrator" | "91344" |
|     8 |    36 | "M" | "administrator" | "05201" |
|     9 |    29 | "M" | "student"       | "01002" |
|    10 |    53 | "M" | "lawyer"        | "90703" |
+-------+-------+-----+-----------------+---------+
showing top 10 rows

You can count the rows of a table.

In [6]:
users.count()
Out[6]:
943

You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

In [7]:
users.occupation.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7fb8b51f6518>
Index:
    ['row']
--------------------------------------------------------
In [8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7fb8b51f6518>
Index:
    ['row']
--------------------------------------------------------

The snippets users.occupation and users['occupation'] are Hail expressions. Each data type in Hail is represented by its own expression class. Since the occupation field is a string type, the expression users.occupation will be a StringExpression.

In [9]:
users.occupation
Out[9]:
<StringExpression of type str>

You can also show an Expression. Notice that the key is shown as well.

In [10]:
users.occupation.show()
+-------+-----------------+
|    id | occupation      |
+-------+-----------------+
| int32 | str             |
+-------+-----------------+
|     1 | "technician"    |
|     2 | "other"         |
|     3 | "writer"        |
|     4 | "technician"    |
|     5 | "other"         |
|     6 | "executive"     |
|     7 | "administrator" |
|     8 | "administrator" |
|     9 | "student"       |
|    10 | "lawyer"        |
+-------+-----------------+
showing top 10 rows

Exercise

The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.