Table Tutorial
Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in three important ways:
It is distributed. Hail tables can store far more data than can fit on a single computer.
It carries global fields.
It is keyed.
A Table has two different kinds of fields:
global fields
row fields
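As a quick illustration of the difference (a minimal sketch with made-up data, not part of the MovieLens dataset used below): row fields have one value per row, global fields have a single value shared by the whole table, and the key determines how rows are ordered and joined.
import hail as hl
# Minimal sketch: a tiny keyed table with row fields and one global field.
t = hl.Table.parallelize(
    [{'id': 1, 'age': 24}, {'id': 2, 'age': 53}],
    hl.tstruct(id=hl.tint32, age=hl.tint32),
    key='id')                              # keyed by the 'id' row field
t = t.annotate_globals(source='toy data')  # one value for the entire table
t.describe()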
Importing and Reading
Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.
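For example, a tab-separated file can be imported with hl.import_table and written back out in the native format (a minimal sketch; the file path and the 'id' column are hypothetical):
import hail as hl
# Hypothetical TSV with a header row; impute=True asks Hail to infer column types.
ht = hl.import_table('data/users.tsv', impute=True, key='id')
ht.write('data/users.ht', overwrite=True)  # write in Hail's native on-disk format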
You can read a dataset with hl.read_table. It takes a path and returns a Table. ht stands for Hail Table.
We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.
[1]:
import hail as hl
hl.init()
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/local/lib/python3.8/dist-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.2
SparkUI available at http://hostname-40a21d68f8:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.117-2a7ef112c342
LOGGING: writing to /nh2xaqq6eidm/python/hail/docs/tutorials/hail-20230523-0158-0.2.117-2a7ef112c342.log
[2]:
hl.utils.get_movie_lens('data/')
2023-05-23 01:59:07.431 Hail: INFO: downloading MovieLens-100k data ...
Source: https://files.grouplens.org/datasets/movielens/ml-100k.zip
2023-05-23 01:59:07.941 Hail: INFO: importing users table and writing to data/users.ht ...
2023-05-23 01:59:13.993 Hail: INFO: Reading table to impute column types
2023-05-23 01:59:15.547 Hail: INFO: Finished type imputation
Loading field 'f0' as type int32 (imputed)
Loading field 'f1' as type int32 (imputed)
Loading field 'f2' as type str (imputed)
Loading field 'f3' as type str (imputed)
Loading field 'f4' as type str (imputed)
2023-05-23 01:59:16.507 Hail: INFO: Coerced sorted dataset
2023-05-23 01:59:17.588 Hail: INFO: wrote table with 943 rows in 1 partition to data/users.ht
2023-05-23 01:59:17.657 Hail: INFO: importing movies table and writing to data/movies.ht ...
2023-05-23 01:59:18.932 Hail: INFO: Reading table to impute column types
2023-05-23 01:59:20.186 Hail: INFO: Finished type imputation
Loading field 'f0' as type int32 (imputed)
Loading field 'f1' as type str (imputed)
Loading field 'f2' as type str (imputed)
Loading field 'f3' as type str (imputed)
Loading field 'f4' as type str (imputed)
Loading field 'f5' as type int32 (imputed)
Loading field 'f6' as type int32 (imputed)
Loading field 'f7' as type int32 (imputed)
Loading field 'f8' as type int32 (imputed)
Loading field 'f9' as type int32 (imputed)
Loading field 'f10' as type int32 (imputed)
Loading field 'f11' as type int32 (imputed)
Loading field 'f12' as type int32 (imputed)
Loading field 'f13' as type int32 (imputed)
Loading field 'f14' as type int32 (imputed)
Loading field 'f15' as type int32 (imputed)
Loading field 'f16' as type int32 (imputed)
Loading field 'f17' as type int32 (imputed)
Loading field 'f18' as type int32 (imputed)
Loading field 'f19' as type int32 (imputed)
Loading field 'f20' as type int32 (imputed)
Loading field 'f21' as type int32 (imputed)
Loading field 'f22' as type int32 (imputed)
Loading field 'f23' as type int32 (imputed)
2023-05-23 01:59:20.969 Hail: INFO: Coerced sorted dataset
2023-05-23 01:59:21.992 Hail: INFO: wrote table with 1682 rows in 1 partition to data/movies.ht
2023-05-23 01:59:22.037 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2023-05-23 01:59:23.105 Hail: INFO: Reading table to impute column types
2023-05-23 01:59:24.412 Hail: INFO: Finished type imputation
Loading field 'f0' as type int32 (imputed)
Loading field 'f1' as type int32 (imputed)
Loading field 'f2' as type int32 (imputed)
Loading field 'f3' as type int32 (imputed)
2023-05-23 01:59:25.604 Hail: INFO: wrote table with 100000 rows in 1 partition to data/ratings.ht
[3]:
users = hl.read_table('data/users.ht')
Exploring Tables
The describe method prints the structure of a table: the fields and their types.
[4]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
You can view the first few rows of the table using show.
10 rows are displayed by default. Try changing the code in the cell below to users.show(5).
[5]:
users.show()
showing top 10 rows
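For example, to display only the first five rows:
users.show(5)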
You can count the rows of a table.
[6]:
users.count()
[6]:
943
You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).
[7]:
users.occupation.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f69c9d52760>
Index:
['row']
--------------------------------------------------------
[8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f69c9d52760>
Index:
['row']
--------------------------------------------------------
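Index notation is the only option when a field name is not a valid Python identifier. A small sketch (the renamed field is hypothetical):
# Hypothetical example: give 'zipcode' a name containing a space, which can
# then only be accessed with index notation.
renamed = users.rename({'zipcode': 'zip code'})
renamed['zip code'].show(5)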
users.occupation and users['occupation'] are Hail Expressions. Let's peek at their values using show. Notice that the key is shown as well!
[9]:
users.occupation.show()
showing top 10 rows
Exercise
The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.
[ ]:
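One possible way to start (a sketch, not the only answer):
movies = hl.read_table('data/movies.ht')
ratings = hl.read_table('data/ratings.ht')
movies.describe()
ratings.show(5)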