Table Tutorial
Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in three important ways:
It is distributed. Hail tables can store far more data than can fit on a single computer.
It carries global fields.
It is keyed.
A Table has two different kinds of fields:
global fields
row fields
Importing and Reading
Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.
You can read a dataset with hl.read_table. It takes a path and returns a Table. ht stands for Hail Table.
We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.
[1]:
import hail as hl
hl.init()
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Running on Apache Spark version 3.5.0
SparkUI available at http://hostname-09f2439d4b:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20241004-2008-0.2.133-4c60fddb171a.log
[2]:
hl.utils.get_movie_lens('data/')
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[3]:
users = hl.read_table('data/users.ht')
Exploring Tables
The describe method prints the structure of a table: the fields and their types.
[4]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
You can view the first few rows of the table using show.
10 rows are displayed by default. Try changing the code in the cell below to users.show(5).
[5]:
users.show()
showing top 10 rows
You can count the rows of a table.
[6]:
users.count()
[6]:
943
You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).
[7]:
users.occupation.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f39046280d0>
Index:
['row']
--------------------------------------------------------
[8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f39046280d0>
Index:
['row']
--------------------------------------------------------
users.occupation and users['occupation'] are Hail Expressions. Let’s peek at them using show. Notice that the key is shown as well!
[9]:
users.occupation.show()
showing top 10 rows
Exercise
The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.
[ ]: