Table Tutorial

Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:

  • It is distributed. Hail tables can store far more data than can fit on a single computer.

  • It carries global fields.

  • It is keyed.

A Table has two different kinds of fields:

  • global fields

  • row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It take a path and returns a Table. ht stands for Hail Table.

We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!

  1. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872.

[1]:
import hail as hl
hl.init()
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/local/lib/python3.8/dist-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.2
SparkUI available at http://hostname-40a21d68f8:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.117-2a7ef112c342
LOGGING: writing to /nh2xaqq6eidm/python/hail/docs/tutorials/hail-20230523-0158-0.2.117-2a7ef112c342.log
[2]:
hl.utils.get_movie_lens('data/')
2023-05-23 01:59:07.431 Hail: INFO: downloading MovieLens-100k data ...
  Source: https://files.grouplens.org/datasets/movielens/ml-100k.zip
2023-05-23 01:59:07.941 Hail: INFO: importing users table and writing to data/users.ht ...
2023-05-23 01:59:13.993 Hail: INFO: Reading table to impute column types
2023-05-23 01:59:15.547 Hail: INFO: Finished type imputation
  Loading field 'f0' as type int32 (imputed)
  Loading field 'f1' as type int32 (imputed)
  Loading field 'f2' as type str (imputed)
  Loading field 'f3' as type str (imputed)
  Loading field 'f4' as type str (imputed)
2023-05-23 01:59:16.507 Hail: INFO: Coerced sorted dataset
2023-05-23 01:59:17.588 Hail: INFO: wrote table with 943 rows in 1 partition to data/users.ht
2023-05-23 01:59:17.657 Hail: INFO: importing movies table and writing to data/movies.ht ...
2023-05-23 01:59:18.932 Hail: INFO: Reading table to impute column types
2023-05-23 01:59:20.186 Hail: INFO: Finished type imputation
  Loading field 'f0' as type int32 (imputed)
  Loading field 'f1' as type str (imputed)
  Loading field 'f2' as type str (imputed)
  Loading field 'f3' as type str (imputed)
  Loading field 'f4' as type str (imputed)
  Loading field 'f5' as type int32 (imputed)
  Loading field 'f6' as type int32 (imputed)
  Loading field 'f7' as type int32 (imputed)
  Loading field 'f8' as type int32 (imputed)
  Loading field 'f9' as type int32 (imputed)
  Loading field 'f10' as type int32 (imputed)
  Loading field 'f11' as type int32 (imputed)
  Loading field 'f12' as type int32 (imputed)
  Loading field 'f13' as type int32 (imputed)
  Loading field 'f14' as type int32 (imputed)
  Loading field 'f15' as type int32 (imputed)
  Loading field 'f16' as type int32 (imputed)
  Loading field 'f17' as type int32 (imputed)
  Loading field 'f18' as type int32 (imputed)
  Loading field 'f19' as type int32 (imputed)
  Loading field 'f20' as type int32 (imputed)
  Loading field 'f21' as type int32 (imputed)
  Loading field 'f22' as type int32 (imputed)
  Loading field 'f23' as type int32 (imputed)
2023-05-23 01:59:20.969 Hail: INFO: Coerced sorted dataset
2023-05-23 01:59:21.992 Hail: INFO: wrote table with 1682 rows in 1 partition to data/movies.ht
2023-05-23 01:59:22.037 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2023-05-23 01:59:23.105 Hail: INFO: Reading table to impute column types
2023-05-23 01:59:24.412 Hail: INFO: Finished type imputation
  Loading field 'f0' as type int32 (imputed)
  Loading field 'f1' as type int32 (imputed)
  Loading field 'f2' as type int32 (imputed)
  Loading field 'f3' as type int32 (imputed)
2023-05-23 01:59:25.604 Hail: INFO: wrote table with 100000 rows in 1 partition to data/ratings.ht
[3]:
users = hl.read_table('data/users.ht')

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

[4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

You can view the first few rows of the table using show.

10 rows are displayed by default. Try changing the code in the cell below to users.show(5).

[5]:
users.show()
id
age
sex
occupation
zipcode
int32int32strstrstr
124"M""technician""85711"
253"F""other""94043"
323"M""writer""32067"
424"M""technician""43537"
533"F""other""15213"
642"M""executive""98101"
757"M""administrator""91344"
836"M""administrator""05201"
929"M""student""01002"
1053"M""lawyer""90703"

showing top 10 rows

You can count the rows of a table.

[6]:
users.count()
[6]:
943

You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

[7]:
users.occupation.describe()
--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f69c9d52760>
Index:
    ['row']
--------------------------------------------------------
[8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f69c9d52760>
Index:
    ['row']
--------------------------------------------------------

users.occupation and users['occupation'] are Hail Expressions

Lets peak at their using show. Notice that the key is shown as well!

[9]:
users.occupation.show()
id
occupation
int32str
1"technician"
2"other"
3"writer"
4"technician"
5"other"
6"executive"
7"administrator"
8"administrator"
9"student"
10"lawyer"

showing top 10 rows

Exercise

The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.

[ ]: