Table Tutorial

Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:

It is distributed. Hail tables can store far more data than can fit on a single computer.
It carries global fields.
It is keyed.

A Table has two different kinds of fields:

global fields
row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It take a path and returns a Table. ht stands for Hail Table.

We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!

Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.

[1]:

import hail as hl
hl.init()

Loading BokehJS ...

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Running on Apache Spark version 3.5.0
SparkUI available at http://hostname-09f2439d4b:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20241004-2008-0.2.133-4c60fddb171a.log

[2]:

hl.utils.get_movie_lens('data/')

SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
[Stage 3:>                                                          (0 + 1) / 1]

[3]:

users = hl.read_table('data/users.ht')

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

[4]:

users.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

You can view the first few rows of the table using show.

10 rows are displayed by default. Try changing the code in the cell below to users.show(5).

[5]:

users.show()


id	age	sex	occupation	zipcode
int32	int32	str	str	str
1	24	"M"	"technician"	"85711"
2	53	"F"	"other"	"94043"
3	23	"M"	"writer"	"32067"
4	24	"M"	"technician"	"43537"
5	33	"F"	"other"	"15213"
6	42	"M"	"executive"	"98101"
7	57	"M"	"administrator"	"91344"
8	36	"M"	"administrator"	"05201"
9	29	"M"	"student"	"01002"
10	53	"M"	"lawyer"	"90703"

showing top 10 rows

You can count the rows of a table.

[6]:

users.count()

[6]:

You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

[7]:

users.occupation.describe()

--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f39046280d0>
Index:
    ['row']
--------------------------------------------------------

[8]:

users['occupation'].describe()

--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f39046280d0>
Index:
    ['row']
--------------------------------------------------------

users.occupation and users['occupation'] are Hail Expressions

Lets peak at their using show. Notice that the key is shown as well!

[9]:

users.occupation.show()


id	occupation
int32	str
1	"technician"
2	"other"
3	"writer"
4	"technician"
5	"other"
6	"executive"
7	"administrator"
8	"administrator"
9	"student"
10	"lawyer"

showing top 10 rows

Exercise

The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.

[ ]: