Table Tutorial

Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:

  • It is distributed. Hail tables can store far more data than can fit on a single computer.

  • It carries global fields.

  • It is keyed.

A Table has two different kinds of fields:

  • global fields

  • row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It take a path and returns a Table. ht stands for Hail Table.

We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!

  1. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.

[1]:
import hail as hl
hl.init()
2023-01-17 19:32:25.371 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.1.3
SparkUI available at http://hostname-f1fbd52994:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.108-fc03e9d5dc08
LOGGING: writing to /fk8b84dy0e09/python/hail/docs/tutorials/hail-20230117-1932-0.2.108-fc03e9d5dc08.log
[2]:
hl.utils.get_movie_lens('data/')
2023-01-17 19:32:33.818 Hail: INFO: downloading MovieLens-100k data ...
  Source: https://files.grouplens.org/datasets/movielens/ml-100k.zip
2023-01-17 19:32:34.459 Hail: INFO: importing users table and writing to data/users.ht ...
2023-01-17 19:32:40.918 Hail: INFO: wrote table with 943 rows in 1 partition to /tmp/persist_tablebyjyAQzkBI
2023-01-17 19:32:42.757 Hail: INFO: Reading table to impute column types
2023-01-17 19:32:44.149 Hail: INFO: Finished type imputation
  Loading field 'f0' as type int32 (imputed)
  Loading field 'f1' as type int32 (imputed)
  Loading field 'f2' as type str (imputed)
  Loading field 'f3' as type str (imputed)
  Loading field 'f4' as type str (imputed)
2023-01-17 19:32:45.093 Hail: INFO: Coerced sorted dataset
2023-01-17 19:32:45.736 Hail: INFO: wrote table with 943 rows in 1 partition to data/users.ht
2023-01-17 19:32:45.775 Hail: INFO: importing movies table and writing to data/movies.ht ...
2023-01-17 19:32:46.421 Hail: INFO: wrote table with 1682 rows in 1 partition to /tmp/persist_tablezkf002zBNk
2023-01-17 19:32:47.118 Hail: INFO: Reading table to impute column types
2023-01-17 19:32:48.138 Hail: INFO: Finished type imputation
  Loading field 'f0' as type int32 (imputed)
  Loading field 'f1' as type str (imputed)
  Loading field 'f2' as type str (imputed)
  Loading field 'f3' as type str (imputed)
  Loading field 'f4' as type str (imputed)
  Loading field 'f5' as type int32 (imputed)
  Loading field 'f6' as type int32 (imputed)
  Loading field 'f7' as type int32 (imputed)
  Loading field 'f8' as type int32 (imputed)
  Loading field 'f9' as type int32 (imputed)
  Loading field 'f10' as type int32 (imputed)
  Loading field 'f11' as type int32 (imputed)
  Loading field 'f12' as type int32 (imputed)
  Loading field 'f13' as type int32 (imputed)
  Loading field 'f14' as type int32 (imputed)
  Loading field 'f15' as type int32 (imputed)
  Loading field 'f16' as type int32 (imputed)
  Loading field 'f17' as type int32 (imputed)
  Loading field 'f18' as type int32 (imputed)
  Loading field 'f19' as type int32 (imputed)
  Loading field 'f20' as type int32 (imputed)
  Loading field 'f21' as type int32 (imputed)
  Loading field 'f22' as type int32 (imputed)
  Loading field 'f23' as type int32 (imputed)
2023-01-17 19:32:48.959 Hail: INFO: Coerced sorted dataset
2023-01-17 19:32:49.771 Hail: INFO: wrote table with 1682 rows in 1 partition to data/movies.ht
2023-01-17 19:32:49.819 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2023-01-17 19:32:50.781 Hail: INFO: wrote table with 100000 rows in 1 partition to /tmp/persist_tabler0vUf41pdO
2023-01-17 19:32:51.563 Hail: INFO: Reading table to impute column types
2023-01-17 19:32:52.840 Hail: INFO: Finished type imputation
  Loading field 'f0' as type int32 (imputed)
  Loading field 'f1' as type int32 (imputed)
  Loading field 'f2' as type int32 (imputed)
  Loading field 'f3' as type int32 (imputed)
2023-01-17 19:32:54.573 Hail: INFO: wrote table with 100000 rows in 1 partition to data/ratings.ht
[3]:
users = hl.read_table('data/users.ht')

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

[4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

You can view the first few rows of the table using show.

10 rows are displayed by default. Try changing the code in the cell below to users.show(5).

[5]:
users.show()
id
age
sex
occupation
zipcode
int32int32strstrstr
124"M""technician""85711"
253"F""other""94043"
323"M""writer""32067"
424"M""technician""43537"
533"F""other""15213"
642"M""executive""98101"
757"M""administrator""91344"
836"M""administrator""05201"
929"M""student""01002"
1053"M""lawyer""90703"

showing top 10 rows

You can count the rows of a table.

[6]:
users.count()
[6]:
943

You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

[7]:
users.occupation.describe()
--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f55701e8e90>
Index:
    ['row']
--------------------------------------------------------
[8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f55701e8e90>
Index:
    ['row']
--------------------------------------------------------

users.occupation and users['occupation'] are Hail Expressions

Lets peak at their using show. Notice that the key is shown as well!

[9]:
users.occupation.show()
id
occupation
int32str
1"technician"
2"other"
3"writer"
4"technician"
5"other"
6"executive"
7"administrator"
8"administrator"
9"student"
10"lawyer"

showing top 10 rows

Exercise

The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.

[ ]: