Table

class hail.Table(jt)[source]

Hail’s distributed implementation of a dataframe or SQL table.

Use read_table() to read a table that was written with Table.write(). Use to_spark() and Table.from_spark() to inter-operate with PySpark’s SQL and machine learning functionality.

Examples

The examples below use table1 and table2, which are imported from text files using import_table().

>>> table1 = hl.import_table('data/kt_example1.tsv', impute=True, key='ID')
>>> table1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
>>> table2 = hl.import_table('data/kt_example2.tsv', impute=True, key='ID')
>>> table2.show()
+-------+-------+--------+
|    ID |     A | B      |
+-------+-------+--------+
| int32 | int32 | str    |
+-------+-------+--------+
|     1 |    65 | cat    |
|     2 |    72 | dog    |
|     3 |    70 | mouse  |
|     4 |    60 | rabbit |
+-------+-------+--------+

Define new annotations:

>>> height_mean_m = 68
>>> height_sd_m = 3
>>> height_mean_f = 65
>>> height_sd_f = 2.5
>>>
>>> def get_z(height, sex):
...    return hl.cond(sex == 'M',
...                  (height - height_mean_m) / height_sd_m,
...                  (height - height_mean_f) / height_sd_f)
>>>
>>> table1 = table1.annotate(height_z = get_z(table1.HT, table1.SEX))
>>> table1 = table1.annotate_globals(global_field_1 = [1, 2, 3])

Filter rows of the table:

>>> table2 = table2.filter(table2.B != 'rabbit')

Compute global aggregation statistics:

>>> t1_stats = table1.aggregate(hl.struct(mean_c1 = agg.mean(table1.C1),
...                                       mean_c2 = agg.mean(table1.C2),
...                                       stats_c3 = agg.stats(table1.C3)))
>>> print(t1_stats)

Group by a field and aggregate to produce a new table:

>>> table3 = (table1.group_by(table1.SEX)
...                 .aggregate(mean_height_data = agg.mean(table1.HT)))
>>> table3.show()

Join tables together inside an annotation expression:

>>> table2 = table2.key_by('ID')
>>> table1 = table1.annotate(B = table2[table1.ID].B)
>>> table1.show()

Attributes

globals Returns a struct expression including all global fields.
key Row key struct.
row Returns a struct expression of all row-indexed fields, including keys.
row_value Returns a struct expression including all non-key row-indexed fields.

Methods

__init__ Initialize self.
add_index Add the integer index of each row as a new row field.
aggregate Aggregate over rows into a local value.
all Evaluate whether a boolean expression is true for all rows.
annotate Add new fields.
annotate_globals Add new global fields.
any Evaluate whether a Boolean expression is true for at least one row.
cache Persist this table in memory.
collect Collect the rows of the table into a local list.
collect_by_key Collect values for each unique key into an array.
count Count the number of rows in the table.
describe Print information about the fields in the table.
distinct Keep only one row for each unique key.
drop Drop fields from the table.
expand_types Expand complex types into structs and arrays.
explode Explode rows along a top-level field of the table.
export Export to a TSV file.
filter Filter rows.
flatten Flatten nested structs.
from_pandas Create a table from a Pandas DataFrame.
from_spark Convert PySpark SQL DataFrame to a table.
group_by Group by a new key for use with GroupedTable.aggregate().
head Subset table to first n rows.
index Expose the row values as if looked up in a dictionary, indexing with exprs.
index_globals Return this table’s global variables for use in another expression context.
join Join two tables together.
key_by Key table by a new set of fields.
n_partitions Returns the number of partitions in the table.
order_by Sort by the specified fields.
parallelize Parallelize a local array of structs into a distributed table.
persist Persist this table in memory or on disk.
rename Rename fields of the table.
repartition Change the number of distributed partitions.
sample Downsample the table by keeping each row with probability p.
select Select existing fields or create new fields by name, dropping the rest.
select_globals Select existing global fields or create new fields by name, dropping the rest.
show Print the first few rows of the table to the console.
take Collect the first n rows of the table into a local list.
to_matrix_table Construct a matrix table from a table in coordinate representation.
to_pandas Converts this table to a Pandas DataFrame.
to_spark Converts this table to a Spark DataFrame.
transmute Add new fields and drop fields referenced.
transmute_globals Similar to Table.annotate_globals(), but drops referenced fields.
union Union the rows of multiple tables.
unpersist Unpersists this table from memory/disk.
write Write to disk.
add_index(name='idx')[source]

Add the integer index of each row as a new row field.

Examples

>>> table_result = table1.add_index()
>>> table_result.show()
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |   idx |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int64 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |     0 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |     2 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |     3 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+

Notes

This method returns a table with a new field whose name is given by the name parameter, with type tint64. The value of this field is the integer index of each row, starting from 0. Methods that respect ordering (like Table.take() or Table.export()) will return rows in order.

This method is also helpful for creating a unique integer index for rows of a table so that more complex types can be encoded as a simple number for performance reasons.
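
As a minimal sketch of that pattern (the field name row_id here is arbitrary), the new index can serve as a simple integer key:

>>> table_indexed = table1.add_index('row_id')
>>> table_indexed = table_indexed.key_by('row_id')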

Parameters:name (str) – Name of index field.
Returns:Table – Table with a new index field.
aggregate(expr)[source]

Aggregate over rows into a local value.

Examples

Aggregate over rows:

>>> table1.aggregate(hl.struct(fraction_male=agg.fraction(table1.SEX == 'M'),
...                            mean_x=agg.mean(table1.X)))
Struct(fraction_male=0.5, mean_x=6.5)

Note

This method supports (and expects!) aggregation over rows.

Parameters:expr (Expression) – Aggregation expression.
Returns:any – Aggregated value dependent on expr.
all(expr)[source]

Evaluate whether a boolean expression is true for all rows.

Examples

Test whether C1 is equal to 5 in all rows of the table:

>>> if table1.all(table1.C1 == 5):
...     print("All rows have C1 equal 5.")
Parameters:expr (BooleanExpression) – Expression to test.
Returns:bool
annotate(**named_exprs)[source]

Add new fields.

Examples

Add field Y by computing the square of X:

>>> table_result = table1.annotate(Y = table1.X ** 2)

Add multiple fields simultaneously:

>>> table_result = table1.annotate(A = table1.X / 2,
...                                B = table1.X + 21)
Parameters:named_exprs (keyword args of Expression) – Expressions for new fields.
Returns:Table – Table with new fields.
annotate_globals(**named_exprs)[source]

Add new global fields.

Examples

Add a new global field:

>>> table_result = table1.annotate_globals(pops = ['EUR', 'AFR', 'EAS', 'SAS'])

Note

This method does not support aggregation.

Parameters:named_exprs (varargs of Expression) – Annotation expressions.
Returns:Table – Table with new global field(s).
any(expr)[source]

Evaluate whether a Boolean expression is true for at least one row.

Examples

Test whether C1 is equal to 5 in any row of the table:

>>> if table1.any(table1.C1 == 5):
...     print("At least one row has C1 equal 5.")
Parameters:expr (BooleanExpression) – Boolean expression.
Returns:bool – True if the predicate evaluated to True for any row, otherwise False.
cache()[source]

Persist this table in memory.

Examples

Persist the table in memory:

>>> table = table.cache() 

Notes

This method is an alias for persist("MEMORY_ONLY").
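
Equivalently:

>>> table = table.persist('MEMORY_ONLY') 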

Returns:Table – Cached table.
collect()[source]

Collect the rows of the table into a local list.

Examples

Collect a list of all X records:

>>> all_xs = [row['X'] for row in table1.select(table1.X).collect()]

Notes

This method returns a list whose elements are of type Struct. Fields of these structs can be accessed similarly to fields on a table, using dot methods (struct.foo) or string indexing (struct['foo']).
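
For example, both access styles on a collected row:

>>> first = table1.collect()[0]
>>> first.HT == first['HT']
True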

Warning

Using this method can cause out of memory errors. Only collect small tables.

Returns:list of Struct – List of rows.
collect_by_key(name: str = 'values') → hail.table.Table[source]

Collect values for each unique key into an array.

Note

Requires a keyed table.

Examples

>>> t1 = hl.Table.parallelize([
...     {'t': 'foo', 'x': 4, 'y': 'A'},
...     {'t': 'bar', 'x': 2, 'y': 'B'},
...     {'t': 'bar', 'x': -3, 'y': 'C'},
...     {'t': 'quam', 'x': 0, 'y': 'D'}],
...     hl.tstruct(t=hl.tstr, x=hl.tint32, y=hl.tstr),
...     key='t')
>>> t1.show()
+------+-------+-----+
| t    |     x | y   |
+------+-------+-----+
| str  | int32 | str |
+------+-------+-----+
| foo  |     4 | A   |
| bar  |     2 | B   |
| bar  |    -3 | C   |
| quam |     0 | D   |
+------+-------+-----+
>>> t1.collect_by_key().show()
+------+------------------------------------+
| t    | values                             |
+------+------------------------------------+
| str  | array<struct{x: int32, y: str}>    |
+------+------------------------------------+
| bar  | [{"x":2,"y":"B"},{"x":-3,"y":"C"}] |
| foo  | [{"x":4,"y":"A"}]                  |
| quam | [{"x":0,"y":"D"}]                  |
+------+------------------------------------+

Notes

The order of the values array is not guaranteed.
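
If a deterministic order is needed, one option (a sketch using hl.sorted) is to sort the collected array afterwards:

>>> t2 = t1.collect_by_key()
>>> t2 = t2.annotate(values = hl.sorted(t2.values, key=lambda v: v.x))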

Parameters:name (str) – Field name for all values per key.
Returns:Table
count()[source]

Count the number of rows in the table.

Examples

>>> table1.count()
4
Returns:int
describe(handler=<built-in function print>)[source]

Print information about the fields in the table.

distinct() → hail.table.Table[source]

Keep only one row for each unique key.

Note

Requires a keyed table.

Examples

>>> t1 = hl.Table.parallelize([
...     {'a': 'foo', 'b': 1},
...     {'a': 'bar', 'b': 5},
...     {'a': 'bar', 'b': 2}],
...     hl.tstruct(a=hl.tstr, b=hl.tint32),
...     key='a')
>>> t1.show()
+-----+-------+
| a   |     b |
+-----+-------+
| str | int32 |
+-----+-------+
| foo |     1 |
| bar |     5 |
| bar |     2 |
+-----+-------+
>>> t1.distinct().show()
+-----+-------+
| a   |     b |
+-----+-------+
| str | int32 |
+-----+-------+
| bar |     5 |
| foo |     1 |
+-----+-------+

Notes

The row chosen per distinct key is not guaranteed.
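
If a deterministic choice is required, one alternative (a sketch) is to aggregate explicitly, for example keeping the minimum b per key:

>>> t_min = (t1.group_by('a')
...             .aggregate(b = agg.min(t1.b)))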

Returns:Table
drop(*exprs)[source]

Drop fields from the table.

Examples

Drop fields C1 and C2 using strings:

>>> table_result = table1.drop('C1', 'C2')

Drop fields C1 and C2 using field references:

>>> table_result = table1.drop(table1.C1, table1.C2)

Drop a list of fields:

>>> fields_to_drop = ['C1', 'C2']
>>> table_result = table1.drop(*fields_to_drop)

Notes

This method can be used to drop global or row-indexed fields. The arguments can be either strings ('field'), or top-level field references (table.field or table['field']).

Parameters:exprs (varargs of str or Expression) – Names of fields to drop or field reference expressions.
Returns:Table – Table without specified fields.
expand_types()[source]

Expand complex types into structs and arrays.

Examples

>>> table_result = table1.expand_types()

Notes

Expands the following types: tlocus, tinterval, tset, tdict, ttuple.

The only types that will remain after this method are: tbool, tint32, tint64, tfloat64, tfloat32, tarray, tstruct.

Note, expand_types always returns an unkeyed table.

Returns:Table – Expanded table.
explode(field, name=None)[source]

Explode rows along a top-level field of the table.

Each row is copied for each element of field. The explode operation unpacks the elements in a field of type array or set, producing a separate row for each element. If an empty array or set is exploded, the entire row is removed from the table.

Examples

people_table is a Table with three fields: Name, Age and Children.

>>> people_table.show()
+----------+-------+--------------------------+
| Name     |   Age | Children                 |
+----------+-------+--------------------------+
| str      | int32 | array<str>               |
+----------+-------+--------------------------+
| Alice    |    34 | ["Dave","Ernie","Frank"] |
| Bob      |    51 | ["Gaby","Helen"]         |
| Caroline |    10 | []                       |
+----------+-------+--------------------------+

Table.explode() can be used to produce a distinct row for each element in the Children field:

>>> exploded = people_table.explode('Children')
>>> exploded.show()
+-------+-------+----------+
| Name  |   Age | Children |
+-------+-------+----------+
| str   | int32 | str      |
+-------+-------+----------+
| Alice |    34 | Dave     |
| Alice |    34 | Ernie    |
| Alice |    34 | Frank    |
| Bob   |    51 | Gaby     |
| Bob   |    51 | Helen    |
+-------+-------+----------+

The name parameter can be used to produce more appropriate field names:

>>> exploded = people_table.explode('Children', name='Child')
>>> exploded.show()
+-------+-------+-------+
| Name  |   Age | Child |
+-------+-------+-------+
| str   | int32 | str   |
+-------+-------+-------+
| Alice |    34 | Dave  |
| Alice |    34 | Ernie |
| Alice |    34 | Frank |
| Bob   |    51 | Gaby  |
| Bob   |    51 | Helen |
+-------+-------+-------+

Notes

Empty arrays or sets produce no rows in the resulting table. In the example above, notice that the name “Caroline” is not found in the exploded table.

Missing arrays or sets are treated as empty.

Parameters:
  • field (str or Expression) – Top-level field name or expression.
  • name (str or None) – If not None, rename the exploded field to name.
Returns:

Table – Table with exploded field.

export(output, types_file=None, header=True, parallel=None)[source]

Export to a TSV file.

Examples

Export to a tab-separated file:

>>> table1.export('output/table1.tsv.bgz')

Note

It is highly recommended to export large files with a .bgz extension, which will use a block gzipped compression codec. These files can be read natively with any Hail method, as well as with Python’s gzip.open and R’s read.table.

Parameters:
  • output (str) – URI at which to write exported file.
  • types_file (str, optional) – URI at which to write file containing field type information.
  • header (bool) – Include a header in the file.
  • parallel (str, optional) – If None, a single file is produced (this is slower than a parallel export); otherwise, a folder of file shards is produced. If ‘separate_header’, the header file is written separately from the file shards. If ‘header_per_shard’, each file shard has a header.
filter(expr, keep=True)[source]

Filter rows.

Examples

Keep rows where C1 equals 5:

>>> table_result = table1.filter(table1.C1 == 5)

Remove rows where C1 equals 10:

>>> table_result = table1.filter(table1.C1 == 10, keep=False)

Notes

The expression expr will be evaluated for every row of the table. If keep is True, then rows where expr evaluates to False will be removed (the filter keeps the rows where the predicate evaluates to True). If keep is False, then rows where expr evaluates to True will be removed (the filter keeps the rows where the predicate evaluates to False).

Warning

When expr evaluates to missing, the row will be removed regardless of keep.
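
If rows with a missing predicate should be kept, make that explicit in the expression; a sketch using hl.is_missing:

>>> table_result = table1.filter(hl.is_missing(table1.C1) | (table1.C1 == 5))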

Note

This method does not support aggregation.

Parameters:
  • expr (bool or BooleanExpression) – Filter expression.
  • keep (bool) – Keep rows where expr is true.
Returns:

Table – Filtered table.

flatten()[source]

Flatten nested structs.

Examples

Flatten table:

>>> table_result = table1.flatten()

Notes

Consider a table with signature

a: struct{
    p: int32,
    q: str
},
b: int32,
c: struct{
    x: str,
    y: array<struct{
        y: str,
        z: str
    }>
}

and key a. The result of flatten is

a.p: int32
a.q: str
b: int32
c.x: str
c.y: array<struct{
    y: str,
    z: str
}>

with key a.p, a.q.

Note, structures inside collections like arrays or sets will not be flattened.

Note, the result of flatten is always unkeyed.

Warning

Flattening a table produces fields that cannot be referenced using the table.<field> syntax, e.g. “a.b”. Reference these fields using square bracket lookups: table['a.b'].

Returns:Table – Table with a flat schema (no struct fields).
static from_pandas(df, key=[])[source]

Create a table from a Pandas DataFrame.

Examples

>>> t = hl.Table.from_pandas(df) 
Parameters:
  • df (pandas.DataFrame) – Pandas DataFrame.
  • key (str or list of str) – Key fields.
Returns:

Table

static from_spark(df, key=[])[source]

Convert PySpark SQL DataFrame to a table.

Examples

>>> t = Table.from_spark(df) 

Notes

Spark SQL data types are converted to Hail types as follows:

BooleanType => tbool
IntegerType => tint32
LongType => tint64
FloatType => tfloat32
DoubleType => tfloat64
StringType => tstr
BinaryType => tbinary
ArrayType => tarray
StructType => tstruct

Unlisted Spark SQL data types are currently unsupported.

Parameters:
  • df (pyspark.sql.DataFrame) – PySpark DataFrame.
  • key (str or list of str) – Key fields.
Returns:

Table – Table constructed from the Spark SQL DataFrame.

globals

Returns a struct expression including all global fields.

Examples

The data type of the globals struct:

>>> table1.globals.dtype
dtype('struct{}')

The number of global fields:

>>> len(table1.globals)
0
Returns:StructExpression – Struct of all global fields.
group_by(*exprs, **named_exprs) → hail.table.GroupedTable[source]

Group by a new key for use with GroupedTable.aggregate().

Examples

Compute the mean value of X and the sum of Z per unique ID:

>>> table_result = (table1.group_by(table1.ID)
...                       .aggregate(meanX = agg.mean(table1.X), sumZ = agg.sum(table1.Z)))

Group by a height bin and compute sex ratio per bin:

>>> table_result = (table1.group_by(height_bin = table1.HT // 20)
...                       .aggregate(fraction_female = agg.fraction(table1.SEX == 'F')))

Notes

This function is always followed by GroupedTable.aggregate(). Follow the link for documentation on the aggregation step.

Note

Using group_by

group_by and its sibling methods (MatrixTable.group_rows_by() and MatrixTable.group_cols_by()) accept both variable-length (f(x, y, z)) and keyword (f(a=x, b=y, c=z)) arguments.

Variable-length arguments can be either strings or expressions that reference a (possibly nested) field of the table. Keyword arguments can be arbitrary expressions.

The following three usages are all equivalent, producing a GroupedTable grouped by fields C1 and C2 of table1.

First, variable-length string arguments:

>>> table_result = (table1.group_by('C1', 'C2')
...                       .aggregate(meanX = agg.mean(table1.X)))

Second, field reference variable-length arguments:

>>> table_result = (table1.group_by(table1.C1, table1.C2)
...                       .aggregate(meanX = agg.mean(table1.X)))

Last, expression keyword arguments:

>>> table_result = (table1.group_by(C1 = table1.C1, C2 = table1.C2)
...                       .aggregate(meanX = agg.mean(table1.X)))

Additionally, the variable-length argument syntax also permits nested field references. Given the following struct field s:

>>> table3 = table1.annotate(s = hl.struct(x=table1.X, z=table1.Z))

The following two usages are equivalent, grouping by one field, x:

>>> table_result = (table3.group_by(table3.s.x)
...                       .aggregate(meanX = agg.mean(table3.X)))
>>> table_result = (table3.group_by(x = table3.s.x)
...                       .aggregate(meanX = agg.mean(table3.X)))

The keyword argument syntax permits arbitrary expressions:

>>> table_result = (table1.group_by(foo=table1.X ** 2 + 1)
...                       .aggregate(meanZ = agg.mean(table1.Z)))

These syntaxes can be mixed together, with the stipulation that all keyword arguments must come at the end due to Python language restrictions.

>>> table_result = (table1.group_by(table1.C1, 'C2', height_bin = table1.HT // 20)
...                       .aggregate(meanX = agg.mean(table1.X)))

Note

This method does not support aggregation in key expressions.

Parameters:
  • exprs (varargs of type str or Expression) – Field names or field reference expressions.
  • named_exprs (keyword args of type Expression) – Field names and expressions to compute them.
Returns:

GroupedTable – Grouped table; use GroupedTable.aggregate() to complete the aggregation.

head(n)[source]

Subset table to first n rows.

Examples

Subset to the first three rows:

>>> table_result = table1.head(3)
>>> table_result.count()
3

Notes

The number of partitions in the new table is equal to the number of partitions containing the first n rows.

Parameters:n (int) – Number of rows to include.
Returns:Table – Table including the first n rows.
index(*exprs)[source]

Expose the row values as if looked up in a dictionary, indexing with exprs.

Examples

In the example below, both table1 and table2 are keyed by one field ID of type int.

>>> table_result = table1.select(B = table2.index(table1.ID).B)
>>> table_result.B.show()
+-------+--------+
|    ID | B      |
+-------+--------+
| int32 | str    |
+-------+--------+
|     1 | cat    |
|     2 | dog    |
|     3 | mouse  |
|     4 | rabbit |
+-------+--------+

Using key as the sole index expression is equivalent to passing all key fields individually:

>>> table_result = table1.select(B = table2.index(table1.key).B)

It is also possible to use non-key fields or expressions as the index expressions:

>>> table_result = table1.select(B = table2.index(table1.C1 % 4).B)
>>> table_result.show()
+-------+-------+
|    ID | B     |
+-------+-------+
| int32 | str   |
+-------+-------+
|     1 | dog   |
|     2 | dog   |
|     3 | dog   |
|     4 | mouse |
+-------+-------+

Notes

Table.index() is used to expose one table’s fields for use in expressions involving another table’s or matrix table’s fields. The result of the method call is a struct expression that is usable in the same scope as exprs, just as if exprs were used to look up values of the table in a dictionary.

The type of the struct expression is the same as the indexed table’s row_value() (the key fields are removed, as they are available in the form of the index expressions).

Note

There is a shorthand syntax for Table.index() using square brackets (the Python __getitem__ syntax). This syntax is preferred.

>>> table_result = table1.select(B = table2[table1.ID].B)
Parameters:exprs (variable-length args of Expression) – Index expressions.
Returns:StructExpression
index_globals()[source]

Return this table’s global variables for use in another expression context.

Examples

>>> table_result = table2.annotate(C = table2.A * table1.index_globals().global_field_1)
Returns:StructExpression
join(right: Table, how='inner', _mangle: Callable[[str, int], str] = <function Table.<lambda>>) → Table[source]

Join two tables together.

Examples

Join table1 to table2 to produce table_joined:

>>> table_joined = table1.key_by('ID').join(table2.key_by('ID'))

Notes

Hail supports four types of joins specified by how:

  • inner – Key must be present in both the left and right tables.
  • outer – Key present in either the left or the right. For keys only in the left table, the right table’s fields will be missing. For keys only in the right table, the left table’s fields will be missing.
  • left – Key present in the left table. For keys not found on the right, the right table’s fields will be missing.
  • right – Key present in the right table. For keys not found on the left, the left table’s fields will be missing.

Both tables must have the same number of keys and the corresponding types of each key must be the same (order matters), but the key names can be different. For example, if table1 is keyed by fields ['a', 'b'], both of type int32, and table2 is keyed by fields ['c', 'd'], both of type int32, then the two tables can be joined (their rows will be joined where table1.a == table2.c and table1.b == table2.d).
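
As a sketch of joining on differently-named keys (the rename is only for illustration; the key types still match):

>>> t_right = table2.rename({'ID': 'identifier'}).key_by('identifier')
>>> table_joined = table1.key_by('ID').join(t_right)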

The key fields and order from the left table are preserved, while the key fields from the right table are not present in the result.

Note

These join methods implement a traditional Cartesian product join, and the number of records in the resulting table can be larger than the number of records on the left or right if duplicate keys are present.

Parameters:
  • right (Table) – Table with which to join.
  • how (str) – Join type. One of “inner”, “outer”, “left”, “right”.
Returns:

Table – Joined table.

key

Row key struct.

Examples

List of key field names:

>>> list(table1.key)
['ID']

Number of key fields:

>>> len(table1.key)
1
Returns:StructExpression
key_by(*keys, **named_keys) → Table[source]

Key table by a new set of fields.

Examples

Recall that table1 includes the fields C1, C2 and C3.

Changing key fields:

>>> table_result = table1.key_by('C2', 'C3')

This keys the table by ‘C2’ and ‘C3’, preserving old keys as value fields.

>>> table_result = table1.key_by(table1.C1)

This keys the table by ‘C1’, preserving old keys as value fields.

>>> table_result = table1.key_by(C1 = table1.C2, foo = table1.C1)

This keys the table by fields named ‘C1’ and ‘foo’, which have values corresponding to the original ‘C2’ and ‘C1’ fields respectively. The original ‘C1’ field has been overwritten by the new assignment, but the original ‘C2’ field is preserved as a value field.

Remove key:

>>> table_result = table1.key_by()

Notes

This method is used to specify all the fields of a new row key. The old key fields may be overwritten by newly-assigned fields, as described in Table.annotate(). If not overwritten, they are preserved as non-key fields.

See Table.select() for more information about how to define new key fields.

Parameters:
  • keys (varargs of str or Expression) – Field(s) to key by.
  • named_keys (keyword args of Expression) – Field names and expressions to compute them.
Returns:Table – Table with a new key.
n_partitions()[source]

Returns the number of partitions in the table.

Returns:int
order_by(*exprs)[source]

Sort by the specified fields. Unkeys the table, if keyed.

Examples

Four equivalent ways to order the table by field HT, ascending:

>>> sorted_table = table1.order_by(table1.HT)
>>> sorted_table = table1.order_by('HT')
>>> sorted_table = table1.order_by(hl.asc(table1.HT))
>>> sorted_table = table1.order_by(hl.asc('HT'))

Notes

Missing values are sorted after non-missing values. When multiple fields are passed, the table will be sorted first by the first argument, then the second, etc.
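
For example, to sort by SEX descending and break ties by HT ascending:

>>> sorted_table = table1.order_by(hl.desc(table1.SEX), table1.HT)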

Note

This method unkeys the table.

Parameters:exprs (varargs of Ascending or Descending or Expression or str) – Fields to sort by.
Returns:Table – Table sorted by the given fields.
classmethod parallelize(rows, schema=None, key=None, n_partitions=None)[source]

Parallelize a local array of structs into a distributed table.

Examples

Parallelize a list of dictionaries:

>>> a = [ {'a': 5, 'b': 10}, {'a': 0, 'b': 200} ]
>>> table = hl.Table.parallelize(hl.literal(a, 'array<struct{a: int, b: int}>'))
>>> table.show()

Warning

Parallelizing very large local arrays will be slow.

Parameters:
  • rows – List of row values, or expression of type array<struct{...}>.
  • schema (str or HailType, optional) – Value type.
  • key (Union[str, List[str]], optional) – Key field(s).
  • n_partitions (int, optional) – Number of partitions.
Returns:

Table

persist(storage_level='MEMORY_AND_DISK')[source]

Persist this table in memory or on disk.

Examples

Persist the table to both memory and disk:

>>> table = table.persist() 

Notes

The Table.persist() and Table.cache() methods store the current table on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitute for Table.write(), which stores a permanent file.

Most users should use the “MEMORY_AND_DISK” storage level. See the Spark documentation for a more in-depth discussion of persisting data.

Parameters:storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Returns:Table – Persisted table.
rename(mapping)[source]

Rename fields of the table.

Examples

Rename C1 to col1 and C2 to col2:

>>> table_result = table1.rename({'C1' : 'col1', 'C2' : 'col2'})
Parameters:mapping (dict of str, str) – Mapping from old field names to new field names.

Notes

Any field that does not appear as a key in mapping will not be renamed.

Returns:Table – Table with renamed fields.
repartition(n, shuffle=True)[source]

Change the number of distributed partitions.

Examples

Repartition to 10 partitions:

>>> table_result = table1.repartition(10)

Warning

When shuffle is False, repartition can only decrease the number of partitions and simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance and so can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.
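
A shuffle-free coalesce down to four partitions:

>>> table_result = table1.repartition(4, shuffle=False)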

Parameters:
  • n (int) – Desired number of partitions.
  • shuffle (bool) – If True, shuffle data. Otherwise, naively coalesce.
Returns:

Table – Repartitioned table.

row

Returns a struct expression of all row-indexed fields, including keys.

Examples

The data type of the row struct:

>>> table1.row.dtype
dtype('struct{ID: int32, HT: int32, SEX: str, X: int32, Z: int32, C1: int32, C2: int32, C3: int32}')

The number of row fields:

>>> len(table1.row)
8
Returns:StructExpression – Struct of all row fields, including key fields.
row_value

Returns a struct expression including all non-key row-indexed fields.

Examples

The data type of the row struct:

>>> table1.row_value.dtype
dtype('struct{HT: int32, SEX: str, X: int32, Z: int32, C1: int32, C2: int32, C3: int32}')

The number of row fields:

>>> len(table1.row_value)
7
Returns:StructExpression – Struct of all non-key row fields.
sample(p, seed=None)[source]

Downsample the table by keeping each row with probability p.

Examples

Downsample the table to approximately 1% of its rows.

>>> small_table1 = table1.sample(0.01)
Parameters:
  • p (float) – Probability of keeping each row.
  • seed (int) – Random seed.
Returns:

Table – Table with approximately p * n_rows rows.

select(*exprs, **named_exprs) → Table[source]

Select existing fields or create new fields by name, dropping the rest.

Examples

Select a few old fields and compute a new one:

>>> table_result = table1.select(table1.C1, Y=table1.Z - table1.X)

Notes

This method creates new row-indexed fields. If a created field shares its name with a global field of the table, the method will fail.

Note

Using select

Select and its sibling methods (Table.select_globals(), MatrixTable.select_globals(), MatrixTable.select_rows(), MatrixTable.select_cols(), and MatrixTable.select_entries()) accept both variable-length (f(x, y, z)) and keyword (f(a=x, b=y, c=z)) arguments.

Select methods will always preserve the key along that axis; e.g. for Table.select(), the table key will always be kept. To modify the key, use key_by().

Variable-length arguments can be either strings or expressions that reference a (possibly nested) field of the table. Keyword arguments can be arbitrary expressions.

The following three usages are all equivalent, producing a new table with fields C1 and C2 of table1, and the table key ID.

First, variable-length string arguments:

>>> table_result = table1.select('C1', 'C2')

Second, field reference variable-length arguments:

>>> table_result = table1.select(table1.C1, table1.C2)

Last, expression keyword arguments:

>>> table_result = table1.select(C1 = table1.C1, C2 = table1.C2)

Additionally, the variable-length argument syntax also permits nested field references. Given the following struct field s:

>>> table3 = table1.annotate(s = hl.struct(x=table1.X, z=table1.Z))

The following two usages are equivalent, producing a table with one field, x:

>>> table3_result = table3.select(table3.s.x)
>>> table3_result = table3.select(x = table3.s.x)

The keyword argument syntax permits arbitrary expressions:

>>> table_result = table1.select(foo=table1.X ** 2 + 1)

These syntaxes can be mixed together, with the stipulation that all keyword arguments must come at the end due to Python language restrictions.

>>> table_result = table1.select(table1.X, 'Z', bar = [table1.C1, table1.C2])

Note

This method does not support aggregation.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

Table – Table with specified fields.

select_globals(*exprs, **named_exprs)[source]

Select existing global fields or create new fields by name, dropping the rest.

Examples

Select one existing field and compute a new one:

>>> table_result = table1.select_globals(table1.global_field_1,
...                                      another_global=['AFR', 'EUR', 'EAS', 'AMR', 'SAS'])

Notes

This method creates new global fields. If a created field shares its name with a row-indexed field of the table, the method will fail.

Note

See Table.select() for more information about using select methods.

Note

This method does not support aggregation.

Parameters:
  • exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
  • named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns:

Table – Table with specified global fields.

show(n=10, width=90, truncate=None, types=True, handler=<built-in function print>)[source]

Print the first few rows of the table to the console.

Examples

Show the first lines of the table:

>>> table1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
Parameters:
  • n (int) – Maximum number of rows to show.
  • width (int) – Horizontal width at which to break fields.
  • truncate (int, optional) – Truncate each field to the given number of characters. If None, truncate fields to the given width.
  • types (bool) – Print an extra header line with the type of each field.
  • handler (Callable[[str], Any]) – Handler function for data string.
take(n)[source]

Collect the first n rows of the table into a local list.

Examples

Take the first three rows:

>>> first3 = table1.take(3)
>>> print(first3)
[Struct(HT=65, SEX='M', X=5, C3=5, C2=50, C1=2, Z=4, ID=1),
 Struct(HT=72, SEX='M', X=6, C3=1, C2=61, C1=2, Z=3, ID=2),
 Struct(HT=70, SEX='F', X=7, C3=-5, C2=81, C1=10, Z=3, ID=3)]

Notes

This method does not need to look at all the data in the table, and allows for fast queries of the start of the table.

This method is equivalent to Table.head() followed by Table.collect().
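
That equivalence, written out:

>>> first3 = table1.head(3).collect()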

Parameters:n (int) – Number of rows to take.
Returns:list of Struct – List of row structs.
to_matrix_table(row_key, col_key, row_fields=[], col_fields=[], n_partitions=None)[source]

Construct a matrix table from a table in coordinate representation.

Notes

Any row fields in the table that do not appear in one of the arguments to this method are assumed to be entry fields of the resulting matrix table.
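
As a sketch, assume a hypothetical coordinate table coord_table with row fields row_id, col_id, and value; value becomes the single entry field:

>>> mt = coord_table.to_matrix_table(row_key=['row_id'], col_key=['col_id']) 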

Parameters:
  • row_key (Sequence[str]) – Fields to be used as row key.
  • col_key (Sequence[str]) – Fields to be used as column key.
  • row_fields (Sequence[str]) – Fields to be stored once per row.
  • col_fields (Sequence[str]) – Fields to be stored once per column.
  • n_partitions (int or None) – Number of partitions.
Returns:

MatrixTable

to_pandas(flatten=True)[source]

Converts this table to a Pandas DataFrame.

Because conversion to Pandas is done through Spark, and Spark cannot represent complex types, types are expanded before flattening or conversion.
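
A minimal usage sketch:

>>> df = table1.to_pandas() 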

Parameters:flatten (bool) – If True, flatten() before converting to Pandas DataFrame.
Returns:pandas.DataFrame
to_spark(flatten=True)[source]

Converts this table to a Spark DataFrame.

Because Spark cannot represent complex types, types are expanded before flattening or conversion.

Parameters:flatten (bool) – If True, flatten() before converting to Spark DataFrame.
Returns:pyspark.sql.DataFrame
transmute(**named_exprs)[source]

Add new fields and drop fields referenced.

Examples

Create a single field from an expression of C1, C2, and C3.

>>> table4.show()
+-------+------+-------+-------+-------+-------+-------+-------+
|     A | B.B0 | B.B1  | C     | D.cat | D.dog |   E.A |   E.B |
+-------+------+-------+-------+-------+-------+-------+-------+
| int32 | bool | str   | bool  | int32 | int32 | int32 | int32 |
+-------+------+-------+-------+-------+-------+-------+-------+
|    32 | true | hello | false |     5 |     7 |     5 |     7 |
+-------+------+-------+-------+-------+-------+-------+-------+
>>> table_result = table4.transmute(F=table4.A + 2 * table4.E.B)
>>> table_result.show()
+------+-------+-------+-------+-------+-------+
| B.B0 | B.B1  | C     | D.cat | D.dog |     F |
+------+-------+-------+-------+-------+-------+
| bool | str   | bool  | int32 | int32 | int32 |
+------+-------+-------+-------+-------+-------+
| true | hello | false |     5 |     7 |    46 |
+------+-------+-------+-------+-------+-------+

Notes

This method creates new row-indexed fields and consumes the fields referenced in the expressions in named_exprs.

All row-indexed top-level fields found in an expression are dropped after the new fields are created.

Note

transmute() will not drop key fields.

Warning

References to fields inside a top-level struct will remove the entire struct, as field E was removed in the example above since E.B was referenced.

Note

This method does not support aggregation.

Parameters:named_exprs (keyword args of Expression) – New field expressions.
Returns:Table – Table with transmuted fields.
transmute_globals(**named_exprs)[source]

Similar to Table.annotate_globals(), but drops referenced fields.

Notes

This method adds new global fields according to named_exprs, and drops all global fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
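
A sketch using the global_field_1 annotation from the examples at the top of this page (the new field name is arbitrary); global_field_1 is dropped because it is referenced:

>>> table_result = table1.transmute_globals(global_field_2 = table1.global_field_1[0])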

Parameters:named_exprs (keyword args of Expression) – Annotation expressions.
Returns:Table
union(*tables)[source]

Union the rows of multiple tables.

Examples

Take the union of rows from two tables:

>>> union_table = table1.union(other_table)

Notes

If a row appears in both tables identically, it is duplicated in the result. The left and right tables must have the same schema and key.

Parameters:tables (varargs of Table) – Tables to union.
Returns:Table – Table with all rows from each component table.
unpersist()[source]

Unpersists this table from memory/disk.

Notes

This function will have no effect on a table that was not previously persisted.

Returns:Table – Unpersisted table.
write(output: str, overwrite=False, stage_locally: bool = False, _codec_spec: Union[str, NoneType] = None)[source]

Write to disk.

Examples

>>> table1.write('output/table1.ht')

Warning

Do not write to a path that is being read from in the same computation.

Parameters:
  • output (str) – Path at which to write.
  • stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.
  • overwrite (bool) – If True, overwrite an existing file at the destination.