Table

class hail.Table(tir)[source]

Hail’s distributed implementation of a dataframe or SQL table.

Use read_table() to read a table that was written with Table.write(). Use to_spark() and Table.from_spark() to interoperate with PySpark’s SQL and machine learning functionality.

Examples

The examples below use table1 and table2, which are imported from text files using import_table().

>>> table1 = hl.import_table('data/kt_example1.tsv', impute=True, key='ID')
>>> table1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
>>> table2 = hl.import_table('data/kt_example2.tsv', impute=True, key='ID')
>>> table2.show()
+-------+-------+--------+
|    ID |     A | B      |
+-------+-------+--------+
| int32 | int32 | str    |
+-------+-------+--------+
|     1 |    65 | cat    |
|     2 |    72 | dog    |
|     3 |    70 | mouse  |
|     4 |    60 | rabbit |
+-------+-------+--------+

Define new annotations:

>>> height_mean_m = 68
>>> height_sd_m = 3
>>> height_mean_f = 65
>>> height_sd_f = 2.5
>>>
>>> def get_z(height, sex):
...     return hl.cond(sex == 'M',
...                    (height - height_mean_m) / height_sd_m,
...                    (height - height_mean_f) / height_sd_f)
>>>
>>> table1 = table1.annotate(height_z = get_z(table1.HT, table1.SEX))
>>> table1 = table1.annotate_globals(global_field_1 = [1, 2, 3])

Filter rows of the table:

>>> table2 = table2.filter(table2.B != 'rabbit')

Compute global aggregation statistics:

>>> t1_stats = table1.aggregate(hl.struct(mean_c1 = hl.agg.mean(table1.C1),
...                                       mean_c2 = hl.agg.mean(table1.C2),
...                                       stats_c3 = hl.agg.stats(table1.C3)))
>>> print(t1_stats)

Group by a field and aggregate to produce a new table:

>>> table3 = (table1.group_by(table1.SEX)
...                 .aggregate(mean_height_data = hl.agg.mean(table1.HT)))
>>> table3.show()

Join tables together inside an annotation expression:

>>> table2 = table2.key_by('ID')
>>> table1 = table1.annotate(B = table2[table1.ID].B)
>>> table1.show()

Attributes

globals Returns a struct expression including all global fields.
key Row key struct.
row Returns a struct expression of all row-indexed fields, including keys.
row_value Returns a struct expression including all non-key row-indexed fields.

Methods

__init__ Initialize self.
add_index Add the integer index of each row as a new row field.
aggregate Aggregate over rows into a local value.
all Evaluate whether a boolean expression is true for all rows.
annotate Add new fields.
annotate_globals Add new global fields.
anti_join Filters the table to rows whose key does not appear in other.
any Evaluate whether a Boolean expression is true for at least one row.
cache Persist this table in memory.
checkpoint Checkpoint the table to disk by writing and reading.
collect Collect the rows of the table into a local list.
collect_by_key Collect values for each unique key into an array.
count Count the number of rows in the table.
describe Print information about the fields in the table.
distinct Deduplicate keys, keeping only one row for each unique key.
drop Drop fields from the table.
expand_types Expand complex types into structs and arrays.
explode Explode rows along a field of type array or set, copying the entire row for each element.
export Export to a TSV file.
filter Filter rows.
flatten Flatten nested structs.
from_pandas Create table from Pandas DataFrame.
from_spark Convert PySpark SQL DataFrame to a table.
group_by Group by a new key for use with GroupedTable.aggregate().
head Subset table to first n rows.
index Expose the row values as if looked up in a dictionary, indexing with exprs.
index_globals Return this table’s global variables for use in another expression context.
join Join two tables together.
key_by Key table by a new set of fields.
multi_way_zip_join Combine many tables in a zip join.
n_partitions Returns the number of partitions in the table.
naive_coalesce Naively decrease the number of partitions.
order_by Sort by the specified fields.
parallelize Parallelize a local array of structs into a distributed table.
persist Persist this table in memory or on disk.
rename Rename fields of the table.
repartition Change the number of partitions.
sample Downsample the table by keeping each row with probability p.
select Select existing fields or create new fields by name, dropping the rest.
select_globals Select existing global fields or create new fields by name, dropping the rest.
semi_join Filters the table to rows whose key appears in other.
show Print the first few rows of the table to the console.
summarize Compute and print summary information about the fields in the table.
take Collect the first n rows of the table into a local list.
to_matrix_table Construct a matrix table from a table in coordinate representation.
to_matrix_table_row_major Construct a matrix table from a table in row major representation.
to_pandas Converts this table to a Pandas DataFrame.
to_spark Converts this table to a Spark DataFrame.
transmute Add new fields and drop fields referenced.
transmute_globals Similar to Table.annotate_globals(), but drops referenced fields.
union Union the rows of multiple tables.
unpersist Unpersists this table from memory/disk.
write Write to disk.
add_index(name='idx') → Table[source]

Add the integer index of each row as a new row field.

Examples

>>> table_result = table1.add_index()
>>> table_result.show()  
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |   idx |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int64 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |     0 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |     2 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |     3 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+

Notes

This method returns a table with a new field whose name is given by the name parameter, with type tint64. The value of this field is the integer index of each row, starting from 0. Methods that respect ordering (like Table.take() or Table.export()) will return rows in order.

This method is also helpful for creating a unique integer index for rows of a table so that more complex types can be encoded as a simple number for performance reasons.

Parameters:name (str) – Name of index field.
Returns:Table – Table with a new index field.
aggregate(expr, _localize=True)[source]

Aggregate over rows into a local value.

Examples

Aggregate over rows:

>>> table1.aggregate(hl.struct(fraction_male=hl.agg.fraction(table1.SEX == 'M'),
...                            mean_x=hl.agg.mean(table1.X)))
Struct(fraction_male=0.5, mean_x=6.5)

Note

This method supports (and expects!) aggregation over rows.

Parameters:expr (Expression) – Aggregation expression.
Returns:any – Aggregated value dependent on expr.
all(expr)[source]

Evaluate whether a boolean expression is true for all rows.

Examples

Test whether C1 is equal to 5 in all rows of the table:

>>> if table1.all(table1.C1 == 5):
...     print("All rows have C1 equal 5.")
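
With the example data above, every value of C1 is at least 2, so the following check is expected to return True (a sketch based on the table1 data shown earlier):

>>> table1.all(table1.C1 >= 2)
True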
Parameters:expr (BooleanExpression) – Expression to test.
Returns:bool
annotate(**named_exprs) → Table[source]

Add new fields.

Examples

Add field Y by computing the square of X:

>>> table_result = table1.annotate(Y = table1.X ** 2)

Add multiple fields simultaneously:

>>> table_result = table1.annotate(A = table1.X / 2,
...                                B = table1.X + 21)
Parameters:named_exprs (keyword args of Expression) – Expressions for new fields.
Returns:Table – Table with new fields.
annotate_globals(**named_exprs) → Table[source]

Add new global fields.

Examples

Add a new global field:

>>> table_result = table1.annotate_globals(pops = ['EUR', 'AFR', 'EAS', 'SAS'])

Note

This method does not support aggregation.

Parameters:named_exprs (varargs of Expression) – Annotation expressions.
Returns:Table – Table with new global field(s).
anti_join(other: Table) → Table[source]

Filters the table to rows whose key does not appear in other.

Parameters:other (Table) – Table with compatible key field(s).
Returns:Table

Notes

The key type of the table must match the key type of other.

This method does not change the schema of the table; it is a method of filtering the table to keys not present in another table.

To restrict to keys present in other, use semi_join().

Examples

>>> table_result = table1.anti_join(table2)

It may be expensive to key the left-side table by the right-side key. In this case, it is possible to implement an anti-join using a non-key field as follows:

>>> table_result = table1.filter(hl.is_missing(table2.index(table1['ID'])))

See also

semi_join(), filter()

any(expr)[source]

Evaluate whether a Boolean expression is true for at least one row.

Examples

Test whether C1 is equal to 5 in any row of the table:

>>> if table1.any(table1.C1 == 5):
...     print("At least one row has C1 equal 5.")
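
With the example data above, C1 takes the value 10 in the third row, so the following is expected to return True (a sketch based on the table1 data shown earlier):

>>> table1.any(table1.C1 == 10)
True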
Parameters:expr (BooleanExpression) – Boolean expression.
Returns:bool – True if the predicate evaluated to True for any row, otherwise False.
cache() → hail.table.Table[source]

Persist this table in memory.

Examples

Persist the table in memory:

>>> table = table.cache() 

Notes

This method is an alias for persist("MEMORY_ONLY").

Returns:Table – Cached table.
checkpoint(output: str, overwrite: bool = False, stage_locally: bool = False, _codec_spec: Optional[str] = None, _read_if_exists: bool = False) → Table[source]

Checkpoint the table to disk by writing and reading.

Parameters:
  • output (str) – Path at which to write.
  • stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.
  • overwrite (bool) – If True, overwrite an existing file at the destination.
Returns:

Table

Warning

Do not checkpoint to a path that is being read from in the same computation.

Notes

An alias for write() followed by read_table(). It is possible to read the file at this path later with read_table().

Examples

>>> table1 = table1.checkpoint('output/table_checkpoint.ht')
collect(_localize=True)[source]

Collect the rows of the table into a local list.

Examples

Collect a list of all X records:

>>> all_xs = [row['X'] for row in table1.select(table1.X).collect()]

Notes

This method returns a list whose elements are of type Struct. Fields of these structs can be accessed similarly to fields on a table, using dot methods (struct.foo) or string indexing (struct['foo']).

Warning

Using this method can cause out of memory errors. Only collect small tables.

Returns:list of Struct – List of rows.
collect_by_key(name: str = 'values') → hail.table.Table[source]

Collect values for each unique key into an array.

Note

Requires a keyed table.

Examples

>>> t1 = hl.Table.parallelize([
...     {'t': 'foo', 'x': 4, 'y': 'A'},
...     {'t': 'bar', 'x': 2, 'y': 'B'},
...     {'t': 'bar', 'x': -3, 'y': 'C'},
...     {'t': 'quam', 'x': 0, 'y': 'D'}],
...     hl.tstruct(t=hl.tstr, x=hl.tint32, y=hl.tstr),
...     key='t')
>>> t1.show()
+--------+-------+-----+
| t      |     x | y   |
+--------+-------+-----+
| str    | int32 | str |
+--------+-------+-----+
| "bar"  |     2 | "B" |
| "bar"  |    -3 | "C" |
| "foo"  |     4 | "A" |
| "quam" |     0 | "D" |
+--------+-------+-----+
>>> t1.collect_by_key().show()
+--------+---------------------------------+
| t      | values                          |
+--------+---------------------------------+
| str    | array<struct{x: int32, y: str}> |
+--------+---------------------------------+
| "bar"  | [(2,"B"),(-3,"C")]              |
| "foo"  | [(4,"A")]                       |
| "quam" | [(0,"D")]                       |
+--------+---------------------------------+

Notes

The order of the values array is not guaranteed.

Parameters:name (str) – Field name for all values per key.
Returns:Table
count()[source]

Count the number of rows in the table.

Examples

>>> table1.count()
4
Returns:int
describe(handler=<built-in function print>)[source]

Print information about the fields in the table.
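
Examples

Print the table schema (output omitted; the description is passed to handler, which defaults to print):

>>> table1.describe()  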

distinct() → hail.table.Table[source]

Deduplicate keys, keeping only one row for each unique key.

Note

Requires a keyed table.

Examples

>>> t1 = hl.Table.parallelize([
...     {'a': 'foo', 'b': 1},
...     {'a': 'bar', 'b': 5},
...     {'a': 'bar', 'b': 2}],
...     hl.tstruct(a=hl.tstr, b=hl.tint32),
...     key='a')
>>> t1.show()
+-------+-------+
| a     |     b |
+-------+-------+
| str   | int32 |
+-------+-------+
| "bar" |     5 |
| "bar" |     2 |
| "foo" |     1 |
+-------+-------+
>>> t1.distinct().show()
+-------+-------+
| a     |     b |
+-------+-------+
| str   | int32 |
+-------+-------+
| "bar" |     5 |
| "foo" |     1 |
+-------+-------+

Notes

The row chosen per distinct key is not guaranteed.

Returns:Table
drop(*exprs) → Table[source]

Drop fields from the table.

Examples

Drop fields C1 and C2 using strings:

>>> table_result = table1.drop('C1', 'C2')

Drop fields C1 and C2 using field references:

>>> table_result = table1.drop(table1.C1, table1.C2)

Drop a list of fields:

>>> fields_to_drop = ['C1', 'C2']
>>> table_result = table1.drop(*fields_to_drop)

Notes

This method can be used to drop global or row-indexed fields. The arguments can be either strings ('field'), or top-level field references (table.field or table['field']).

Parameters:exprs (varargs of str or Expression) – Names of fields to drop or field reference expressions.
Returns:Table – Table without specified fields.
expand_types() → hail.table.Table[source]

Expand complex types into structs and arrays.

Examples

>>> table_result = table1.expand_types()

Notes

Expands the following types: tlocus, tinterval, tset, tdict, ttuple.

The only types that will remain after this method are: tbool, tint32, tint64, tfloat64, tfloat32, tarray, tstruct.

Note, expand_types always returns an unkeyed table.

Returns:Table – Expanded table.
explode(field, name=None) → Table[source]

Explode rows along a field of type array or set, copying the entire row for each element.

Examples

people_table is a Table with three fields: Name, Age and Children.

>>> people_table.show()
+------------+-------+--------------------------+
| Name       |   Age | Children                 |
+------------+-------+--------------------------+
| str        | int32 | array<str>               |
+------------+-------+--------------------------+
| "Alice"    |    34 | ["Dave","Ernie","Frank"] |
| "Bob"      |    51 | ["Gaby","Helen"]         |
| "Caroline" |    10 | []                       |
+------------+-------+--------------------------+

Table.explode() can be used to produce a distinct row for each element in the Children field:

>>> exploded = people_table.explode('Children')
>>> exploded.show() 
+---------+-------+----------+
| Name    |   Age | Children |
+---------+-------+----------+
| str     | int32 | str      |
+---------+-------+----------+
| "Alice" |    34 | "Dave"   |
| "Alice" |    34 | "Ernie"  |
| "Alice" |    34 | "Frank"  |
| "Bob"   |    51 | "Gaby"   |
| "Bob"   |    51 | "Helen"  |
+---------+-------+----------+

The name parameter can be used to produce more appropriate field names:

>>> exploded = people_table.explode('Children', name='Child')
>>> exploded.show() 
+---------+-------+---------+
| Name    |   Age | Child   |
+---------+-------+---------+
| str     | int32 | str     |
+---------+-------+---------+
| "Alice" |    34 | "Dave"  |
| "Alice" |    34 | "Ernie" |
| "Alice" |    34 | "Frank" |
| "Bob"   |    51 | "Gaby"  |
| "Bob"   |    51 | "Helen" |
+---------+-------+---------+

Notes

Each row is copied for each element of field. The explode operation unpacks each element of an array or set field into its own row. If an empty array or set is exploded, the entire row is removed from the table. In the example above, notice that the name “Caroline” is not found in the exploded table.

Missing arrays or sets are treated as empty.

Currently, the name argument may not be used if field is not a top-level field of the table (e.g. name may be used with ht.foo but not ht.foo.bar).

Parameters:
  • field (str or Expression) – Top-level field name or expression.
  • name (str or None) – If not None, rename the exploded field to name.
Returns:

Table

export(output, types_file=None, header=True, parallel=None, delimiter='\t')[source]

Export to a TSV file.

Examples

Export to a tab-separated file:

>>> table1.export('output/table1.tsv.bgz')

Note

It is highly recommended to export large files with a .bgz extension, which will use a block gzipped compression codec. These files can be read natively with any Hail method, as well as with Python’s gzip.open and R’s read.table.

Warning

Do not export to a path that is being read from in the same pipeline.

Parameters:
  • output (str) – URI at which to write exported file.
  • types_file (str, optional) – URI at which to write file containing field type information.
  • header (bool) – Include a header in the file.
  • parallel (str, optional) – If None, a single file is produced (this is slower, since the shards must be concatenated into one file); otherwise a folder of file shards is produced. If ‘separate_header’, the header file is output separately from the file shards. If ‘header_per_shard’, each file shard has a header.
  • delimiter (str) – Field delimiter.
filter(expr, keep=True) → Table[source]

Filter rows.

Examples

Keep rows where C1 equals 5:

>>> table_result = table1.filter(table1.C1 == 5)

Remove rows where C1 equals 10:

>>> table_result = table1.filter(table1.C1 == 10, keep=False)

Notes

The expression expr will be evaluated for every row of the table. If keep is True, then rows where expr evaluates to True will be kept (the filter removes the rows where the predicate evaluates to False). If keep is False, then rows where expr evaluates to True will be removed (the filter keeps the rows where the predicate evaluates to False).

Warning

When expr evaluates to missing, the row will be removed regardless of keep.
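
To keep rows where the predicate is missing, convert the missingness to an explicit Boolean first; a minimal sketch using hl.or_else() (missing becomes False, so with keep=False such rows are retained):

>>> table_result = table1.filter(hl.or_else(table1.C1 == 10, False), keep=False)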

Note

This method does not support aggregation.

Parameters:
  • expr (bool or BooleanExpression) – Filter expression.
  • keep (bool) – Keep rows where expr is true.
Returns:

Table – Filtered table.

flatten() → hail.table.Table[source]

Flatten nested structs.

Examples

Flatten table:

>>> table_result = table1.flatten()

Notes

Consider a table with signature

a: struct{
    p: int32,
    q: str
},
b: int32,
c: struct{
    x: str,
    y: array<struct{
        y: str,
        z: str
    }>
}

and key a. The result of flatten is

a.p: int32
a.q: str
b: int32
c.x: str
c.y: array<struct{
    y: str,
    z: str
}>

with key a.p, a.q.

Note, structures inside collections like arrays or sets will not be flattened.

Note, the result of flatten is always unkeyed.

Warning

Flattening a table produces fields that cannot be referenced using the table.<field> syntax, e.g. “a.b”. Reference these fields using square bracket lookups: table['a.b'].

Returns:Table – Table with a flat schema (no struct fields).
static from_pandas(df, key=[]) → Table[source]

Create table from Pandas DataFrame.

Examples

>>> t = hl.Table.from_pandas(df) 
Parameters:
  • df (pandas.DataFrame) – Pandas DataFrame.
  • key (str or list of str) – Key fields.
Returns:

Table

static from_spark(df, key=[]) → Table[source]

Convert PySpark SQL DataFrame to a table.

Examples

>>> t = Table.from_spark(df) 

Notes

Spark SQL data types are converted to Hail types as follows:

BooleanType => tbool
IntegerType => tint32
LongType => tint64
FloatType => tfloat32
DoubleType => tfloat64
StringType => tstr
BinaryType => TBinary
ArrayType => tarray
StructType => tstruct

Unlisted Spark SQL data types are currently unsupported.

Parameters:
  • df (pyspark.sql.DataFrame) – PySpark DataFrame.
  • key (str or list of str) – Key fields.
Returns:

Table – Table constructed from the Spark SQL DataFrame.

globals

Returns a struct expression including all global fields.

Examples

The data type of the globals struct:

>>> table1.globals.dtype
dtype('struct{global_field_1: int32, global_field_2: int32}')

The number of global fields:

>>> len(table1.globals)
2
Returns:StructExpression – Struct of all global fields.
group_by(*exprs, **named_exprs) → hail.table.GroupedTable[source]

Group by a new key for use with GroupedTable.aggregate().

Examples

Compute the mean value of X and the sum of Z per unique ID:

>>> table_result = (table1.group_by(table1.ID)
...                       .aggregate(meanX = hl.agg.mean(table1.X), sumZ = hl.agg.sum(table1.Z)))

Group by a height bin and compute sex ratio per bin:

>>> table_result = (table1.group_by(height_bin = table1.HT // 20)
...                       .aggregate(fraction_female = hl.agg.fraction(table1.SEX == 'F')))

Notes

This function is always followed by GroupedTable.aggregate(). Follow the link for documentation on the aggregation step.

Note

Using group_by

group_by and its sibling methods (MatrixTable.group_rows_by() and MatrixTable.group_cols_by()) accept both variable-length (f(x, y, z)) and keyword (f(a=x, b=y, c=z)) arguments.

Variable-length arguments can be either strings or expressions that reference a (possibly nested) field of the table. Keyword arguments can be arbitrary expressions.

The following three usages are all equivalent, producing a GroupedTable grouped by fields C1 and C2 of table1.

First, variable-length string arguments:

>>> table_result = (table1.group_by('C1', 'C2')
...                       .aggregate(meanX = hl.agg.mean(table1.X)))

Second, field reference variable-length arguments:

>>> table_result = (table1.group_by(table1.C1, table1.C2)
...                       .aggregate(meanX = hl.agg.mean(table1.X)))

Last, expression keyword arguments:

>>> table_result = (table1.group_by(C1 = table1.C1, C2 = table1.C2)
...                       .aggregate(meanX = hl.agg.mean(table1.X)))

Additionally, the variable-length argument syntax also permits nested field references. Given the following struct field s:

>>> table3 = table1.annotate(s = hl.struct(x=table1.X, z=table1.Z))

The following two usages are equivalent, grouping by one field, x:

>>> table_result = (table3.group_by(table3.s.x)
...                       .aggregate(meanX = hl.agg.mean(table3.X)))
>>> table_result = (table3.group_by(x = table3.s.x)
...                       .aggregate(meanX = hl.agg.mean(table3.X)))

The keyword argument syntax permits arbitrary expressions:

>>> table_result = (table1.group_by(foo=table1.X ** 2 + 1)
...                       .aggregate(meanZ = hl.agg.mean(table1.Z)))

These syntaxes can be mixed together, with the stipulation that all keyword arguments must come at the end due to Python language restrictions.

>>> table_result = (table1.group_by(table1.C1, 'C2', height_bin = table1.HT // 20)
...                       .aggregate(meanX = hl.agg.mean(table1.X)))

Note

This method does not support aggregation in key expressions.

Parameters:
  • exprs (varargs of type str or Expression) – Field names or field reference expressions.
  • named_exprs (keyword args of type Expression) – Field names and expressions to compute them.
Returns:

GroupedTable – Grouped table; use GroupedTable.aggregate() to complete the aggregation.

head(n) → Table[source]

Subset table to first n rows.

Examples

Subset to the first three rows:

>>> table_result = table1.head(3)
>>> table_result.count()
3

Notes

The number of partitions in the new table is equal to the number of partitions containing the first n rows.

Parameters:n (int) – Number of rows to include.
Returns:Table – Table including the first n rows.
index(*exprs, all_matches=False) → hail.expr.expressions.base_expression.Expression[source]

Expose the row values as if looked up in a dictionary, indexing with exprs.

Examples

In the example below, both table1 and table2 are keyed by one field ID of type int.

>>> table_result = table1.select(B = table2.index(table1.ID).B)
>>> table_result.B.show()
+-------+----------+
|    ID | B        |
+-------+----------+
| int32 | str      |
+-------+----------+
|     1 | "cat"    |
|     2 | "dog"    |
|     3 | "mouse"  |
|     4 | "rabbit" |
+-------+----------+

Using key as the sole index expression is equivalent to passing all key fields individually:

>>> table_result = table1.select(B = table2.index(table1.key).B)

It is also possible to use non-key fields or expressions as the index expressions:

>>> table_result = table1.select(B = table2.index(table1.C1 % 4).B)
>>> table_result.show()
+-------+---------+
|    ID | B       |
+-------+---------+
| int32 | str     |
+-------+---------+
|     1 | "dog"   |
|     2 | "dog"   |
|     3 | "dog"   |
|     4 | "mouse" |
+-------+---------+

Notes

Table.index() is used to expose one table’s fields for use in expressions involving another table’s or matrix table’s fields. The result of the method call is a struct expression that is usable in the same scope as exprs, just as if exprs were used to look up values of the table in a dictionary.

The type of the struct expression is the same as the indexed table’s row_value() (the key fields are removed, as they are available in the form of the index expressions).

Note

There is a shorthand syntax for Table.index() using square brackets (the Python __getitem__ syntax). This syntax is preferred.

>>> table_result = table1.select(B = table2[table1.ID].B)
Parameters:
  • exprs (variable-length args of Expression) – Index expressions.
  • all_matches (bool) – Experimental. If True, value of expression is array of all matches.
Returns:

Expression

index_globals() → hail.expr.expressions.typed_expressions.StructExpression[source]

Return this table’s global variables for use in another expression context.

Examples

>>> table_result = table2.annotate(C = table2.A * table1.index_globals().global_field_1)
Returns:StructExpression
join(right: Table, how='inner', _mangle: Callable[[str, int], str] = <function Table.<lambda>>) → Table[source]

Join two tables together.

Examples

Join table1 to table2 to produce table_joined:

>>> table_joined = table1.key_by('ID').join(table2.key_by('ID'))

Notes

Hail supports four types of joins specified by how:

  • inner – Key must be present in both the left and right tables.
  • outer – Key present in either the left or the right. For keys only in the left table, the right table’s fields will be missing. For keys only in the right table, the left table’s fields will be missing.
  • left – Key present in the left table. For keys not found on the right, the right table’s fields will be missing.
  • right – Key present in the right table. For keys only in the right table, the left table’s fields will be missing.

Both tables must have the same number of keys and the corresponding types of each key must be the same (order matters), but the key names can be different. For example, if table1 is keyed by fields ['a', 'b'], both of type int32, and table2 is keyed by fields ['c', 'd'], both of type int32, then the two tables can be joined (their rows will be joined where table1.a == table2.c and table1.b == table2.d).

The key fields and order from the left table are preserved, while the key fields from the right table are not present in the result.

Note

These join methods implement a traditional Cartesian product join, and the number of records in the resulting table can be larger than the number of records on the left or right if duplicate keys are present.

Parameters:
  • right (Table) – Table with which to join.
  • how (str) – Join type. One of “inner”, “outer”, “left”, “right”.
Returns:

Table – Joined table.

key

Row key struct.

Examples

List of key field names:

>>> list(table1.key)
['ID']

Number of key fields:

>>> len(table1.key)
1
Returns:StructExpression
key_by(*keys, **named_keys) → Table[source]

Key table by a new set of fields.

Examples

Assume table1 is a Table with fields C1, C2 and C3, among others.

Changing key fields:

>>> table_result = table1.key_by('C2', 'C3')

This keys the table by ‘C2’ and ‘C3’, preserving old keys as value fields.

>>> table_result = table1.key_by(table1.C1)

This keys the table by ‘C1’, preserving old keys as value fields.

>>> table_result = table1.key_by(C1 = table1.C2, foo = table1.C1)

This keys the table by fields named ‘C1’ and ‘foo’, which have values corresponding to the original ‘C2’ and ‘C1’ fields respectively. The original ‘C1’ field has been overwritten by the new assignment, but the original ‘C2’ field is preserved as a value field.

Remove key:

>>> table_result = table1.key_by()

Notes

This method is used to specify all the fields of a new row key. The old key fields may be overwritten by newly-assigned fields, as described in Table.annotate(). If not overwritten, they are preserved as non-key fields.

See Table.select() for more information about how to define new key fields.

Parameters:keys (varargs of type str) – Field(s) to key by.
Returns:Table – Table with a new key.
static multi_way_zip_join(tables, data_field_name, global_field_name) → Table[source]

Combine many tables in a zip join.

Notes

The row type of the returned table is a struct with the key fields, plus one extra field, data_field_name, which is an array of structs containing the non-key fields, one element per input table. An array element is missing if the corresponding input had no row with that key or, possibly, if another input has more rows with that key than the corresponding input.

The global type of the returned table is an array of structs of the global type of all of the inputs.

The types for every input must be identical, not merely compatible, including the keys.

A zip join is similar to an outer join; however, rows are not duplicated to create the full Cartesian product of duplicate keys. Instead, there is exactly one entry in some data_field_name array for every row in the inputs.
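
As a sketch, zipping a table with itself (the inputs trivially have identical types) produces one row per key with a two-element array in the data field; the field names 'rows' and 'globals' below are arbitrary:

>>> t_zip = hl.Table.multi_way_zip_join([table1, table1], 'rows', 'globals')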

Parameters:
  • tables (List[Table]) – A list of tables to combine
  • data_field_name (str) – The name of the resulting data field
  • global_field_name (str) – The name of the resulting global field

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
n_partitions()[source]

Returns the number of partitions in the table.

Returns:int
naive_coalesce(max_partitions: int) → Table[source]

Naively decrease the number of partitions.

Example

Naively repartition to 10 partitions:

>>> table_result = table1.naive_coalesce(10)

Warning

naive_coalesce() simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance, unlike repartition(), so it can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.

Parameters:max_partitions (int) – Desired number of partitions. If the current number of partitions is less than or equal to max_partitions, do nothing.
Returns:Table – Table with at most max_partitions partitions.
order_by(*exprs) → Table[source]

Sort by the specified fields. Unkeys the table, if keyed.

Examples

Four equivalent ways to order the table by field HT, ascending:

>>> sorted_table = table1.order_by(table1.HT)
>>> sorted_table = table1.order_by('HT')
>>> sorted_table = table1.order_by(hl.asc(table1.HT))
>>> sorted_table = table1.order_by(hl.asc('HT'))

Notes

Missing values are sorted after non-missing values. When multiple fields are passed, the table will be sorted first by the first argument, then the second, etc.
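
For example, to sort by HT descending and break ties by SEX ascending, mixing an hl.desc() expression with a field name (a sketch):

>>> sorted_table = table1.order_by(hl.desc(table1.HT), 'SEX')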

Note

This method unkeys the table.

Parameters:exprs (varargs of Ascending or Descending or Expression or str) – Fields to sort by.
Returns:Table – Table sorted by the given fields.
classmethod parallelize(rows, schema=None, key=None, n_partitions=None) → Table[source]

Parallelize a local array of structs into a distributed table.

Examples

Parallelize a list of dictionaries:

>>> a = [ {'a': 5, 'b': 10}, {'a': 0, 'b': 200} ]
>>> table = hl.Table.parallelize(hl.literal(a, 'array<struct{a: int, b: int}>'))
>>> table.show()

Warning

Parallelizing very large local arrays will be slow.

Parameters:
  • rows – List of row values, or expression of type array<struct{...}>.
  • schema (str or HailType, optional) – Value type.
  • key (str or list of str, optional) – Key field(s).
  • n_partitions (int, optional)
Returns:

Table

persist(storage_level='MEMORY_AND_DISK') → Table[source]

Persist this table in memory or on disk.

Examples

Persist the table to both memory and disk:

>>> table = table.persist() 

Notes

The Table.persist() and Table.cache() methods store the current table on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitute for Table.write(), which stores a permanent file.

Most users should use the “MEMORY_AND_DISK” storage level. See the Spark documentation for a more in-depth discussion of persisting data.

Parameters:storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Returns:Table – Persisted table.
rename(mapping) → Table[source]

Rename fields of the table.

Examples

Rename C1 to col1 and C2 to col2:

>>> table_result = table1.rename({'C1' : 'col1', 'C2' : 'col2'})
Parameters:mapping (dict of str, str) – Mapping from old field names to new field names.

Notes

Any field that does not appear as a key in mapping will not be renamed.

Returns:Table – Table with renamed fields.
repartition(n, shuffle=True) → Table[source]

Change the number of partitions.

Examples

Repartition to 500 partitions:

>>> table_result = table1.repartition(500)

Notes

Check the current number of partitions with n_partitions().

The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. When a table with \(M\) rows is first imported, each of the \(k\) partitions will contain about \(M/k\) of the rows. Since each partition has some computational overhead, decreasing the number of partitions can improve performance after significant filtering. Since it’s recommended to have at least 2 - 4 partitions per core, increasing the number of partitions can allow one to take advantage of more cores. Partitions are a core concept of distributed computation in Spark, see their documentation for details.

When shuffle=True, Hail does a full shuffle of the data and creates equal sized partitions. When shuffle=False, Hail combines existing partitions to avoid a full shuffle. These algorithms correspond to the repartition and coalesce commands in Spark, respectively. In particular, when shuffle=False, n_partitions cannot exceed current number of partitions.

Parameters:
  • n (int) – Desired number of partitions.
  • shuffle (bool) – If True, use full shuffle to repartition.
Returns:

Table – Repartitioned table.

row

Returns a struct expression of all row-indexed fields, including keys.

Examples

The data type of the row struct:

>>> table1.row.dtype
dtype('struct{ID: int32, HT: int32, SEX: str, X: int32, Z: int32, C1: int32, C2: int32, C3: int32}')

The number of row fields:

>>> len(table1.row)
8
Returns:StructExpression – Struct of all row fields, including key fields.
row_value

Returns a struct expression including all non-key row-indexed fields.

Examples

The data type of the row struct:

>>> table1.row_value.dtype
dtype('struct{HT: int32, SEX: str, X: int32, Z: int32, C1: int32, C2: int32, C3: int32}')

The number of row fields:

>>> len(table1.row_value)
7
Returns:StructExpression – Struct of all non-key row fields.
sample(p, seed=None) → Table[source]

Downsample the table by keeping each row with probability p.

Examples

Downsample the table to approximately 1% of its rows.

>>> small_table1 = table1.sample(0.01)
Parameters:
  • p (float) – Probability of keeping each row.
  • seed (int) – Random seed.
Returns:

Table – Table with approximately p * n_rows rows.

select(*exprs, **named_exprs) → Table[source]

Select existing fields or create new fields by name, dropping the rest.

Examples

Select a few old fields and compute a new one:

>>> table_result = table1.select(table1.C1, Y=table1.Z - table1.X)

Notes

This method creates new row-indexed fields. If a created field shares its name with a global field of the table, the method will fail.

Note

Using select

Select and its sibling methods (Table.select_globals(), MatrixTable.select_globals(), MatrixTable.select_rows(), MatrixTable.select_cols(), and MatrixTable.select_entries()) accept both variable-length (f(x, y, z)) and keyword (f(a=x, b=y, c=z)) arguments.

Select methods will always preserve the key along that axis; e.g. for Table.select(), the table key will always be kept. To modify the key, use key_by().

Variable-length arguments can be either strings or expressions that reference a (possibly nested) field of the table. Keyword arguments can be arbitrary expressions.

The following three usages are all equivalent, producing a new table with fields C1 and C2 of table1, and the table key ID.

First, variable-length string arguments:

>>> table_result = table1.select('C1', 'C2')

Second, field reference variable-length arguments:

>>> table_result = table1.select(table1.C1, table1.C2)

Last, expression keyword arguments:

>>> table_result = table1.select(C1 = table1.C1, C2 = table1.C2)

Additionally, the variable-length argument syntax also permits nested field references. Given the following struct field s:

>>> table3 = table1.annotate(s = hl.struct(x=table1.X, z=table1.Z))

The following two usages are equivalent, producing a table with one field, x:

>>> table3_result = table3.select(table3.s.x)
>>> table3_result = table3.select(x = table3.s.x)

The keyword argument syntax permits arbitrary expressions:

>>> table_result = table1.select(foo=table1.X ** 2 + 1)

These syntaxes can be mixed together, with the stipulation that all keyword arguments must come at the end due to Python language restrictions.

>>> table_result = table1.select(table1.X, 'Z', bar = [table1.C1, table1.C2])

Note

This method does not support aggregation.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

Table – Table with specified fields.

select_globals(*exprs, **named_exprs) → hail.table.Table[source]

Select existing global fields or create new fields by name, dropping the rest.

Examples

Select one existing field and compute a new one:

>>> table_result = table1.select_globals(table1.global_field_1,
...                                      another_global=['AFR', 'EUR', 'EAS', 'AMR', 'SAS'])

Notes

This method creates new global fields. If a created field shares its name with a row-indexed field of the table, the method will fail.

Note

See Table.select() for more information about using select methods.

Note

This method does not support aggregation.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

Table – Table with specified global fields.

semi_join(other: Table) → Table[source]

Filters the table to rows whose key appears in other.

Parameters:other (Table) – Table with compatible key field(s).
Returns:Table

Notes

The key type of the table must match the key type of other.

This method does not change the schema of the table; it is a method of filtering the table to keys present in another table.

To discard keys present in other, use anti_join().

Examples

>>> table_result = table1.semi_join(table2)

It may be expensive to key the left-side table by the right-side key. In this case, it is possible to implement a semi-join using a non-key field as follows:

>>> table_result = table1.filter(hl.is_defined(table2.index(table1['ID'])))

See also

anti_join()

show(n=None, width=None, truncate=None, types=True, handler=None, n_rows=None)[source]

Print the first few rows of the table to the console.

Examples

Show the first lines of the table:

>>> table1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
Parameters:
  • n or n_rows (int) – Maximum number of rows to show, or negative to show all rows.
  • width (int) – Horizontal width at which to break fields.
  • truncate (int, optional) – Truncate each field to the given number of characters. If None, truncate fields to the given width.
  • types (bool) – Print an extra header line with the type of each field.
  • handler (Callable[[str], Any]) – Handler function for data string.
summarize()[source]

Compute and print summary information about the fields in the table.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

take(n, _localize=True)[source]

Collect the first n rows of the table into a local list.

Examples

Take the first three rows:

>>> first3 = table1.take(3)
>>> first3
[Struct(ID=1, HT=65, SEX='M', X=5, Z=4, C1=2, C2=50, C3=5),
 Struct(ID=2, HT=72, SEX='M', X=6, Z=3, C1=2, C2=61, C3=1),
 Struct(ID=3, HT=70, SEX='F', X=7, Z=3, C1=10, C2=81, C3=-5)]

Notes

This method does not need to look at all the data in the table, and allows for fast queries of the start of the table.

This method is equivalent to Table.head() followed by Table.collect().

Parameters:n (int) – Number of rows to take.
Returns:list of Struct – List of row structs.
to_matrix_table(row_key, col_key, row_fields=[], col_fields=[], n_partitions=None) → hl.MatrixTable[source]

Construct a matrix table from a table in coordinate representation.

Notes

Any row fields in the table that do not appear in one of the arguments to this method are assumed to be entry fields of the resulting matrix table.
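
Examples

A minimal sketch with hypothetical fields r (row key), c (column key), and v, which becomes the entry field since it is not named in any argument:

>>> t = hl.Table.parallelize(
...     [{'r': 'a', 'c': 1, 'v': 2.0}, {'r': 'b', 'c': 1, 'v': 3.0}],
...     hl.tstruct(r=hl.tstr, c=hl.tint32, v=hl.tfloat64))
>>> mt = t.to_matrix_table(row_key=['r'], col_key=['c'])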

Parameters:
  • row_key (Sequence[str]) – Fields to be used as row key.
  • col_key (Sequence[str]) – Fields to be used as column key.
  • row_fields (Sequence[str]) – Fields to be stored once per row.
  • col_fields (Sequence[str]) – Fields to be stored once per column.
  • n_partitions (int or None) – Number of partitions.
Returns:

MatrixTable

to_matrix_table_row_major(columns, entry_field_name=None, col_field_name='col')[source]

Construct a matrix table from a table in row major representation.

Notes

Each element in columns is a field that will become an entry field in the matrix table. Fields omitted from columns become row fields. If columns are structs, then the matrix table will have the entry fields of those structs. Otherwise, the matrix table will have one entry field named entry_field_name whose values come from the values of the columns fields. The matrix table is column indexed by col_field_name.

All fields in columns must have the same type.
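
Examples

A minimal sketch treating the int32 fields C1, C2 and C3 of table1 as three columns, with entry values collected into a field named x:

>>> mt = table1.to_matrix_table_row_major(['C1', 'C2', 'C3'], entry_field_name='x')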

Parameters:
  • columns (Sequence[str]) – Fields to be used as columns.
  • entry_field_name (str or None) – Field name for the entries of the matrix table.
  • col_field_name (str) – Field name for the columns of the matrix table.
Returns:

MatrixTable

to_pandas(flatten=True)[source]

Converts this table to a Pandas DataFrame.

Because conversion to Pandas is done through Spark, and Spark cannot represent complex types, types are expanded before flattening or conversion.
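
Examples

Convert table1 to a local DataFrame:

>>> df = table1.to_pandas()  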

Parameters:flatten (bool) – If True, flatten() before converting to Pandas DataFrame.
Returns:pandas.DataFrame
to_spark(flatten=True)[source]

Converts this table to a Spark DataFrame.

Because Spark cannot represent complex types, types are expanded before flattening or conversion.

Parameters:flatten (bool) – If True, flatten() before converting to Spark DataFrame.
Returns:pyspark.sql.DataFrame
transmute(**named_exprs) → Table[source]

Add new fields and drop fields referenced.

Examples

Create a single field from an expression of C1, C2, and C3.

>>> table4.show()
+-------+------+---------+-------+-------+-------+-------+-------+
|     A | B.B0 | B.B1    | C     | D.cat | D.dog |   E.A |   E.B |
+-------+------+---------+-------+-------+-------+-------+-------+
| int32 | bool | str     | bool  | int32 | int32 | int32 | int32 |
+-------+------+---------+-------+-------+-------+-------+-------+
|    32 | true | "hello" | false |     5 |     7 |     5 |     7 |
+-------+------+---------+-------+-------+-------+-------+-------+
>>> table_result = table4.transmute(F=table4.A + 2 * table4.E.B)
>>> table_result.show()
+------+---------+-------+-------+-------+-------+
| B.B0 | B.B1    | C     | D.cat | D.dog |     F |
+------+---------+-------+-------+-------+-------+
| bool | str     | bool  | int32 | int32 | int32 |
+------+---------+-------+-------+-------+-------+
| true | "hello" | false |     5 |     7 |    46 |
+------+---------+-------+-------+-------+-------+

Notes

This method functions to create new row-indexed fields and consume fields found in the expressions in named_exprs.

All row-indexed top-level fields found in an expression are dropped after the new fields are created.

Note

transmute() will not drop key fields.

Warning

References to fields inside a top-level struct will remove the entire struct, as field E was removed in the example above since E.B was referenced.

Note

This method does not support aggregation.

Parameters:named_exprs (keyword args of Expression) – New field expressions.
Returns:Table – Table with transmuted fields.
transmute_globals(**named_exprs) → Table[source]

Similar to Table.annotate_globals(), but drops referenced fields.

Notes

This method adds new global fields according to named_exprs, and drops all global fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
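
Examples

A sketch using the list-valued global_field_1 annotated earlier: this drops global_field_1 and adds its length as a new global (the name n_values is arbitrary):

>>> table_result = table1.transmute_globals(n_values = hl.len(table1.global_field_1))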

Parameters:named_exprs (keyword args of Expression) – Annotation expressions.
Returns:Table
union(*tables, unify: bool = False) → Table[source]

Union the rows of multiple tables.

Examples

Take the union of rows from two tables:

>>> union_table = table1.union(other_table)

Notes

If a row appears in more than one table identically, it is duplicated in the result. All tables must have the same key names and types. They must also have the same row types, unless the unify parameter is True, in which case a field appearing in any table will be included in the result, with missing values for tables that do not contain the field. If a field appears in multiple tables with incompatible types, like arrays and strings, then an error will be raised.
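
For example, table1 and table2 are both keyed by ID of type int32 but have different row fields, so a plain union of the two would fail; with unify=True their fields are merged (a sketch):

>>> union_table2 = table1.union(table2, unify=True)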

Parameters:
  • tables (varargs of Table) – Tables to union.
  • unify (bool) – Attempt to unify table fields.
Returns:

Table – Table with all rows from each component table.

unpersist() → hail.table.Table[source]

Unpersists this table from memory/disk.

Notes

This function will have no effect on a table that was not previously persisted.

Returns:Table – Unpersisted table.
write(output: str, overwrite=False, stage_locally: bool = False, _codec_spec: Optional[str] = None)[source]

Write to disk.

Examples

>>> table1.write('output/table1.ht')

Warning

Do not write to a path that is being read from in the same computation.

Parameters:
  • output (str) – Path at which to write.
  • stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.
  • overwrite (bool) – If True, overwrite an existing file at the destination.