Table Tutorial
Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:
It is distributed. Hail tables can store far more data than can fit on a single computer.
It carries global fields.
It is keyed.
A Table has two different kinds of fields:
global fields
row fields
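To make these concrete, here is a minimal sketch; the table and field names are invented for illustration, and Hail will initialize itself with defaults if hl.init has not been called yet.

import hail as hl

# range_table builds a tiny table with a single row field, 'idx',
# which also serves as the key.
t = hl.utils.range_table(3)

# Row fields hold one value per row.
t = t.annotate(squared=t.idx * t.idx)

# Global fields hold a single value shared by every row of the table.
t = t.annotate_globals(dataset_name='toy example')

t.describe()  # prints the global fields, the row fields, and the key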
Importing and Reading
Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.
You can read a dataset with hl.read_table. It takes a path and returns a Table. ht stands for Hail Table.
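As a hedged sketch of the import path (my_data.tsv is a hypothetical tab-separated file with a header row and an id column, not part of this tutorial’s data):

# impute=True asks Hail to infer the column types; key='id' sets the key field.
ht = hl.import_table('my_data.tsv', impute=True, key='id')

# Write in Hail's native format, then read it back with hl.read_table.
ht.write('my_data.ht', overwrite=True)
ht = hl.read_table('my_data.ht')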
We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!
Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.
[1]:
import hail as hl
hl.init()
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/local/lib/python3.9/dist-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.3
SparkUI available at http://hostname-7c7f0dc384:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  /  _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.126-ee77707f4fab
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20231030-1650-0.2.126-ee77707f4fab.log
[2]:
hl.utils.get_movie_lens('data/')
2023-10-30 16:51:02.904 Hail: INFO: downloading MovieLens-100k data ...
Source: https://files.grouplens.org/datasets/movielens/ml-100k.zip
2023-10-30 16:51:03.331 Hail: INFO: importing users table and writing to data/users.ht ...
2023-10-30 16:51:08.505 Hail: INFO: Reading table to impute column types
2023-10-30 16:51:09.763 Hail: INFO: Finished type imputation
Loading field 'f0' as type int32 (imputed)
Loading field 'f1' as type int32 (imputed)
Loading field 'f2' as type str (imputed)
Loading field 'f3' as type str (imputed)
Loading field 'f4' as type str (imputed)
2023-10-30 16:51:10.657 Hail: INFO: Coerced sorted dataset
2023-10-30 16:51:11.382 Hail: INFO: wrote table with 943 rows in 1 partition to data/users.ht
2023-10-30 16:51:11.406 Hail: INFO: importing movies table and writing to data/movies.ht ...
2023-10-30 16:51:12.245 Hail: INFO: Reading table to impute column types
2023-10-30 16:51:13.103 Hail: INFO: Finished type imputation
Loading field 'f0' as type int32 (imputed)
Loading field 'f1' as type str (imputed)
Loading field 'f2' as type str (imputed)
Loading field 'f3' as type str (imputed)
Loading field 'f4' as type str (imputed)
Loading field 'f5' as type int32 (imputed)
Loading field 'f6' as type int32 (imputed)
Loading field 'f7' as type int32 (imputed)
Loading field 'f8' as type int32 (imputed)
Loading field 'f9' as type int32 (imputed)
Loading field 'f10' as type int32 (imputed)
Loading field 'f11' as type int32 (imputed)
Loading field 'f12' as type int32 (imputed)
Loading field 'f13' as type int32 (imputed)
Loading field 'f14' as type int32 (imputed)
Loading field 'f15' as type int32 (imputed)
Loading field 'f16' as type int32 (imputed)
Loading field 'f17' as type int32 (imputed)
Loading field 'f18' as type int32 (imputed)
Loading field 'f19' as type int32 (imputed)
Loading field 'f20' as type int32 (imputed)
Loading field 'f21' as type int32 (imputed)
Loading field 'f22' as type int32 (imputed)
Loading field 'f23' as type int32 (imputed)
2023-10-30 16:51:13.858 Hail: INFO: Coerced sorted dataset
2023-10-30 16:51:14.557 Hail: INFO: wrote table with 1682 rows in 1 partition to data/movies.ht
2023-10-30 16:51:14.573 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2023-10-30 16:51:15.300 Hail: INFO: Reading table to impute column types
2023-10-30 16:51:16.310 Hail: INFO: Finished type imputation
Loading field 'f0' as type int32 (imputed)
Loading field 'f1' as type int32 (imputed)
Loading field 'f2' as type int32 (imputed)
Loading field 'f3' as type int32 (imputed)
2023-10-30 16:51:17.120 Hail: INFO: wrote table with 100000 rows in 1 partition to data/ratings.ht
[3]:
users = hl.read_table('data/users.ht')
Exploring Tables
The describe method prints the structure of a table: the fields and their types.
[4]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
You can view the first few rows of the table using show.
10 rows are displayed by default. Try changing the code in the cell below to users.show(5).
[5]:
users.show()
showing top 10 rows
You can count the rows of a table.
[6]:
users.count()
[6]:
943
You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).
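The next two cells demonstrate both notations on users. As an aside, here is a hedged sketch of the case index notation exists for; the field 'zip code' (with a space) is invented for illustration:

# A field name containing a space can't be written as a keyword argument,
# so build it with **{...}; attribute access is not valid Python for it,
# but index notation works fine.
t = users.annotate(**{'zip code': users.zipcode})
t['zip code'].show(3)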
[7]:
users.occupation.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f5397f2cfd0>
Index:
['row']
--------------------------------------------------------
[8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f5397f2cfd0>
Index:
['row']
--------------------------------------------------------
users.occupation and users['occupation'] are Hail Expressions. Let’s peek at them using show. Notice that the key is shown as well!
[9]:
users.occupation.show()
showing top 10 rows
Exercise
The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.
[ ]:
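If you’d like a starting point, one possible sketch (these are the paths written by get_movie_lens above):

# Read the two remaining tables and take a quick look at each.
movies = hl.read_table('data/movies.ht')
ratings = hl.read_table('data/ratings.ht')

movies.describe()
movies.show(5)
ratings.count()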