Table

A Table is the Hail equivalent of a SQL table, a Pandas Dataframe, an R Dataframe, a dyplr Tibble, or a Spark Dataframe. It consists of rows of data conforming to a given schema where each column (row field) in the dataset is of a specific type.

Import

Hail has functions to create tables from a variety of data sources. The most common use case is to load data from a TSV or CSV file, which can be done with the import_table() function.

>>> ht = hl.import_table("data/kt_example1.tsv", impute=True)

Examples of genetics-specific import methods are import_locus_intervals(), import_fam(), and import_bed(). Many Hail methods also return tables.

An example of a table is below. We recommend ht as a variable name for tables, referring to a “Hail table”.

>>> ht.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+

Global Fields

In addition to row fields, Hail tables also have global fields. You can think of globals as extra fields in the table whose values are identical for every row. For example, the same table above with the global field G = 5 can be thought of as

+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |     G |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |     5 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |     5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |     5 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+

but the value 5 is only stored once for the entire dataset and NOT once per row of the table. The output of Table.describe() lists what all of the row fields and global fields are.

>>> ht.describe()  
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'ID': int32
    'HT': int32
    'SEX': str
    'X': int32
    'Z': int32
    'C1': int32
    'C2': int32
    'C3': int32
----------------------------------------
Key:
    None
----------------------------------------

Keys

Row fields can be specified to be the key of the table with the method Table.key_by(). Keys are important for joining tables together (discussed below).

Referencing Fields

Each Table object has all of its row fields and global fields as attributes in its namespace. This means that the row field ID can be accessed from table ht with ht.Sample or ht['Sample']. If ht also had a global field G, then it could be accessed by either ht.G or ht['G']. Both row fields and global fields are top level fields. Be aware that accessing a field with the dot notation will not work if the field name has spaces or special characters in it. The Python type of each attribute is an Expression that also contains context about its type and source, in this case a row field of table ht.

>>> ht  
<hail.table.Table at 0x110791a20>
>>> ht.ID  
<Int32Expression of type int32>

Updating Fields

Add or remove row fields from a Table with Table.select() and Table.drop().

>>> ht.drop('C1', 'C2')
>>> ht.drop(*['C1', 'C2'])
>>> ht.select(ht.ID, ht.SEX)
>>> ht.select(*['ID', 'C3'])

Use Table.annotate() to add new row fields or update the values of existing row fields and use Table.filter() to either keep or remove rows based on a condition:

>>> ht_new = ht.filter(ht['C1'] >= 10)
>>> ht_new = ht_new.annotate(id_times_2 = ht_new.ID * 2)

Aggregation

To compute an aggregate statistic over the rows of a dataset, Hail provides an Table.aggregate() method which can be passed a wide variety of aggregator functions (see Aggregators):

>>> ht.aggregate(hl.agg.fraction(ht.SEX == 'F'))
0.5

We also might want to compute the mean value of HT for each sex. This is possible with a combination of Table.group_by() and GroupedTable.aggregate():

>>> ht_agg = (ht.group_by(ht.SEX)
...             .aggregate(mean = hl.agg.mean(ht.HT)))
>>> ht_agg.show()
+-----+----------+
| SEX |     mean |
+-----+----------+
| str |  float64 |
+-----+----------+
| "F" | 6.50e+01 |
| "M" | 6.85e+01 |
+-----+----------+

Note that the result of ht.group_by(...).aggregate(...) is a new Table while the result of ht.aggregate(...) is a Python value.

Joins

To join the row fields of two tables together, Hail provides a Table.join() method with options for how to join the rows together (left, right, inner, outer). The tables are joined by the row fields designated as keys. The number of keys and their types must be identical between the two tables. However, the names of the keys do not need to be identical. Use the Table.key attribute to view the current table row keys and the Table.key_by() method to change the table keys. If top level row field names overlap between the two tables, the second table’s field names will be appended with a unique identifier “_N”.

>>> ht = ht.key_by('ID')
>>> ht2 = hl.import_table("data/kt_example2.tsv", impute=True).key_by('ID')
>>> ht_join = ht.join(ht2)
>>> ht_join.show()
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |     A | B        |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 | str      |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 |    65 | "cat"    |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 |    72 | "dog"    |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 |    70 | "mouse"  |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 |    60 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+

In addition to the Table.join() method, Hail provides another join syntax using Python’s bracket indexing syntax. The syntax looks like right_table[left_table.key], which will return an Expression instead of a Table. This expression is a dictionary mapping the keys in the left table to the rows in the right table. We can annotate the left table with this expression to perform a left join: left_table.annotate(x = right_table[left_table.key].x]. For example, below we add the field ‘B’ from ht2 to ht:

>>> ht1 = ht.annotate(B = ht2[ht.ID].B)
>>> ht1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 | B        |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | str      |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 | "cat"    |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 | "dog"    |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 | "mouse"  |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+

Interacting with Tables Locally

Hail has many useful methods for interacting with tables locally such as in an Jupyter notebook. Use the Table.show() method to see the first few rows of a table.

Table.take() will collect the first n rows of a table into a local Python list:

>>> first3 = ht.take(3)
>>> first3
[Struct(ID=1, HT=65, SEX='M', X=5, Z=4, C1=2, C2=50, C3=5),
 Struct(ID=2, HT=72, SEX='M', X=6, Z=3, C1=2, C2=61, C3=1),
 Struct(ID=3, HT=70, SEX='F', X=7, Z=3, C1=10, C2=81, C3=-5)]

Note that each element of the list is a Struct whose elements can be accessed using Python’s get attribute or get item notation:

>>> first3[0].ID
1
>>> first3[0]['ID']
1

The Table.head() method is helpful for testing pipelines. It subsets a table to the first n rows, causing downstream operations to run much more quickly.

Table.describe() is a useful method for showing all of the fields of the table and their types. The types themselves can be accessed using the fields (e.g. ht.ID.dtype), and the full row and global types can be accessed with ht.row.dtype and ht.globals.dtype. The row fields that are part of the key can be accessed with Table.key. The Table.count() method returns the number of rows.

Export

Hail provides multiple methods to export data to other formats. Tables can be exported to TSV files with the Table.export() method or written to disk in Hail’s on-disk format with Table.write() (these files may be read in with read_table()). Tables can also be exported to pandas DataFrames with Table.to_pandas() or to pyspark Dataframes with Table.to_spark().