KeyTable

class hail.KeyTable(hc, jkt)[source]

Hail’s version of a SQL table where columns can be designated as keys.

Key tables may be imported from a text file or Spark DataFrame with import_table() or from_dataframe(), or generated from a variant dataset with aggregate_by_key(), make_table(), samples_table(), or variants_table().

In the examples below, we have imported two key tables from text files (kt1 and kt2).

>>> kt1 = hc.import_table('data/kt_example1.tsv', impute=True)
ID HT SEX X Z C1 C2 C3
1 65 M 5 4 2 50 5
2 72 M 6 3 2 61 1
3 70 F 7 3 10 81 -5
4 60 F 8 2 11 90 -10
>>> kt2 = hc.import_table('data/kt_example2.tsv', impute=True)
ID A B
1 65 cat
2 72 dog
3 70 mouse
4 60 rabbit
Variables:hc (HailContext) – Hail Context

Attributes

columns Names of all columns.
key List of key columns.
num_columns Number of columns.
schema Table schema.

Methods

__init__ x.__init__(…) initializes x; see help(type(x)) for signature
aggregate_by_key Aggregate columns programmatically.
annotate Add new columns computed from existing columns.
cache Mark this key table to be cached in memory.
collect Collect table to a local list.
count Count the number of rows.
drop Drop columns.
exists Evaluate whether a boolean expression is true for at least one row.
expand_types Expand types Locus, Interval, AltAllele, Variant, Genotype, Char, Set and Dict.
explode Explode columns of this key table.
export Export to a TSV file.
export_cassandra Export to Cassandra.
export_elasticsearch Export to Elasticsearch.
export_mongodb Export to MongoDB.
export_solr Export to Solr.
filter Filter rows.
flatten Flatten nested Structs.
forall Evaluate whether a boolean expression is true for all rows.
from_dataframe Convert Spark SQL DataFrame to key table.
from_pandas Convert Pandas DataFrame to key table.
from_py
import_bed Import a UCSC .bed file as a key table.
import_fam Import PLINK .fam file into a key table.
import_interval_list Import an interval list file in the GATK standard format.
indexed Add the numerical index of each row as a new column.
join Join two key tables together.
key_by Change which columns are keys.
maximal_independent_set Compute a maximal independent set of vertices in an undirected graph whose edges are given by this key table.
num_partitions Returns the number of partitions in the key table.
order_by Sort by the specified columns.
persist Persist this key table to memory and/or disk.
query Performs aggregation queries over columns of the table, and returns Python object(s).
query_typed Performs aggregation queries over columns of the table, and returns Python object(s) and types.
range Construct a table with rows from 0 until n.
rename Rename columns of key table.
repartition Change the number of distributed partitions.
same Test whether two key tables are identical.
select Select a subset of columns.
show Show the first few rows of the table in human-readable format.
take Take a given number of rows from the head of the table.
to_dataframe Converts this key table to a Spark DataFrame.
to_pandas Converts this key table into a Pandas DataFrame.
union Union the rows of multiple tables.
unpersist Unpersists this table from memory/disk.
write Write as KT file.
aggregate_by_key(key_expr, agg_expr)[source]

Aggregate columns programmatically.

Examples

Compute mean height by sex:

>>> kt_ht_by_sex = kt1.aggregate_by_key("SEX = SEX", "MEAN_HT = HT.stats().mean")

The result of aggregate_by_key() is a key table kt_ht_by_sex with the following data:

SEX MEAN_HT
M 68.5
F 65

Notes

The scope for both key_expr and agg_expr is all column names in the input KeyTable.

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Parameters:
  • key_expr (str or list of str) – Named expression(s) for how to compute the keys of the new key table.
  • agg_expr (str or list of str) – Named aggregation expression(s).
Returns:

A new key table with the keys computed from the key_expr and the remaining columns computed from the agg_expr.

Return type:

KeyTable

annotate(expr)[source]

Add new columns computed from existing columns.

Examples

Add new column Y which is equal to 5 times X:

>>> kt_result = kt1.annotate("Y = 5 * X")

Notes

The scope for expr is all column names in the input KeyTable.

For more information, see the documentation on writing expressions and using the Hail Expression Language.
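
Since expr accepts a list, multiple columns can be added in one call; a brief sketch, where Y and W are hypothetical new column names:

>>> kt_result = kt1.annotate(["Y = 5 * X", "W = C2 - C3"])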

Parameters:expr (str or list of str) – Annotation expression or multiple annotation expressions.
Returns:Key table with new columns specified by expr.
Return type:KeyTable
cache()[source]

Mark this key table to be cached in memory.

cache() is the same as persist("MEMORY_ONLY").
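
A minimal sketch of marking a table for caching before repeated use; the follow-up action is illustrative:

>>> kt_cached = kt1.cache()
>>> kt_cached.count()  # the first action computes and caches the table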

Return type:KeyTable
collect()[source]

Collect table to a local list.

Examples

>>> id_to_sex = {row.ID : row.SEX for row in kt1.collect()}

Notes

This method should be used on very small tables and as a last resort. It is very slow to convert distributed Java objects to Python (especially serially), and the resulting list may be too large to fit in memory on one machine.

Return type:list of hail.representation.Struct
columns

Names of all columns.

>>> kt1.columns
[u'ID', u'HT', u'SEX', u'X', u'Z', u'C1', u'C2', u'C3']
Return type:list of str
count()[source]

Count the number of rows.

Examples

>>> kt1.count()
Return type:int
drop(column_names)[source]

Drop columns.

Examples

The table kt1, imported above, includes the columns C1, C2, and C3.

Drop columns:

>>> kt_result = kt1.drop('C1')
>>> kt_result = kt1.drop(['C1', 'C2'])
Parameters:column_names – List of columns to be dropped.
Type:str or list of str
Returns:Key table with dropped columns.
Return type:KeyTable
exists(expr)[source]

Evaluate whether a boolean expression is true for at least one row.

Examples

Test whether any row in the key table has the value of C1 equal to 5:

>>> if kt1.exists("C1 == 5"):
...     print("At least one row has C1 equal 5.")
Parameters:expr (str) – Boolean expression.
Return type:bool
expand_types()[source]

Expand types Locus, Interval, AltAllele, Variant, Genotype, Char, Set and Dict. Char is converted to String. Set is converted to Array. Dict[K, V] is converted to:

Array[Struct {
    key: K
    value: V
}]
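
A hedged sketch: assuming counter() yields a Dict-typed result, as suggested by the query() examples below, the Dict column produced here (HT_COUNTS is a hypothetical name) is expanded into an Array of key/value Structs:

>>> kt_counts = kt1.aggregate_by_key("SEX = SEX", "HT_COUNTS = HT.counter()")
>>> kt_expanded = kt_counts.expand_types()
>>> print(kt_expanded.schema)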
Returns:key table with signature containing only types: Boolean, Int, Long, Float, Double, String, Array and Struct
Return type:KeyTable
explode(column_names)[source]

Explode columns of this key table.

The explode operation unpacks the elements of a column of type Array or Set, producing one row per element. If an empty Array or Set is exploded, the entire row is removed from the KeyTable.

Examples

Assume kt3 is a KeyTable with three columns: c1, c2 and c3.

>>> kt3 = hc.import_table('data/kt_example3.tsv', impute=True,
...                       types={'c1': TString(), 'c2': TArray(TInt()), 'c3': TArray(TArray(TInt()))})

The types of each column are String, Array[Int], and Array[Array[Int]] respectively. c1 cannot be exploded because its type is not an Array or Set. c2 can only be exploded once because the type of c2 after the first explode operation is Int.

c1 c2 c3
a [1,2,NA] [[3,4], []]

Explode c2:

>>> kt3.explode('c2')
c1 c2 c3
a 1 [[3,4], []]
a 2 [[3,4], []]

Explode c2 once and c3 twice:

>>> kt3.explode(['c2', 'c3', 'c3'])
c1 c2 c3
a 1 3
a 2 3
a 1 4
a 2 4
Parameters:column_names (str or list of str) – Column name(s) to be exploded.
Returns:Key table with columns exploded.
Return type:KeyTable
export(output, types_file=None, header=True, parallel=False)[source]

Export to a TSV file.

Examples

Rename column names of key table and export to file:

>>> (kt1.rename({'HT' : 'Height'})
...     .export("output/kt1_renamed.tsv"))
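
A sketch of the optional arguments; the output paths are hypothetical, and with parallel=True the output is a directory of per-partition files rather than a single TSV:

>>> kt1.export("output/kt1_export.tsv", types_file="output/kt1_export.types", parallel=True)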
Parameters:
  • output (str) – Output file path.
  • types_file (str) – Output path of types file.
  • header (bool) – Write a header using the column names.
  • parallel (bool) – If true, writes a set of files (one per partition) rather than serially concatenating these files.
export_cassandra(address, keyspace, table, block_size=100, rate=1000)[source]

Export to Cassandra.

Warning

export_cassandra() is EXPERIMENTAL.

export_elasticsearch(host, port, index, index_type, block_size, config=None, verbose=True)[source]

Export to Elasticsearch.

Warning

export_elasticsearch() is EXPERIMENTAL.

export_mongodb(mode='append')[source]

Export to MongoDB.

Warning

export_mongodb() is EXPERIMENTAL.

export_solr(zk_host, collection, block_size=100)[source]

Export to Solr.

Warning

export_solr() is EXPERIMENTAL.

filter(expr, keep=True)[source]

Filter rows.

Examples

Keep rows where C1 equals 5:

>>> kt_result = kt1.filter("C1 == 5")

Remove rows where C1 equals 10:

>>> kt_result = kt1.filter("C1 == 10", keep=False)

Notes

The scope for expr is all column names in the input KeyTable.

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Caution

When expr evaluates to missing, the row will be removed regardless of whether keep=True or keep=False.

Parameters:
  • expr (str) – Boolean filter expression.
  • keep (bool) – Keep rows where expr is true.
Returns:

Filtered key table.

Return type:

KeyTable

flatten()[source]

Flatten nested Structs. Column names will be concatenated with dot (.).

Examples

Flatten Structs in key table:

>>> kt_result = kt3.flatten()

Consider a key table kt with signature

a: Struct {
    p: Int
    q: Double
}
b: Int
c: Struct {
    x: String
    y: Array[Struct {
        z: Map[Int]
    }]
}

and a single key column a. The result of flatten is

a.p: Int
a.q: Double
b: Int
c.x: String
c.y: Array[Struct {
    z: Map[Int]
}]

with key columns a.p, a.q.

Note that Structs nested inside non-Struct types (such as the Array of Structs above) are not flattened.

Returns:Key table with no columns of type Struct.
Return type:KeyTable
forall(expr)[source]

Evaluate whether a boolean expression is true for all rows.

Examples

Test whether all rows in the key table have the value of C1 equal to 5:

>>> if kt1.forall("C1 == 5"):
...     print("All rows have C1 equal 5.")
Parameters:expr (str) – Boolean expression.
Return type:bool
static from_dataframe(df, key=[])[source]

Convert Spark SQL DataFrame to key table.

Examples

>>> kt = KeyTable.from_dataframe(df) 

Notes

Spark SQL data types are converted to Hail types as follows:

BooleanType => Boolean
IntegerType => Int
LongType => Long
FloatType => Float
DoubleType => Double
StringType => String
BinaryType => Binary
ArrayType => Array
StructType => Struct

Unlisted Spark SQL data types are currently unsupported.
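
Key columns can be supplied at conversion time; a hedged sketch, where df is the PySpark DataFrame from the example above and 'ID' is a hypothetical column name:

>>> kt = KeyTable.from_dataframe(df, key='ID')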

Parameters:
  • df (DataFrame) – PySpark DataFrame.
  • key (str or list of str) – Key column(s).
Returns:

Key table constructed from the Spark SQL DataFrame.

Return type:

KeyTable

static from_pandas(df)[source]

Convert Pandas DataFrame to key table.

Examples

>>> KeyTable.from_pandas(KeyTable.range(10).to_pandas()).query('index.take(10)')
Parameters:df (DataFrame) – Pandas DataFrame.
Returns:Key table constructed from the Pandas DataFrame.
Return type:KeyTable
static from_py(hc, rows_py, schema, key_names=[], num_partitions=None)[source]
static import_bed(path)[source]

Import a UCSC .bed file as a key table.

Examples

Add the variant annotation va.cnvRegion: Boolean indicating inclusion in at least one interval of the three-column BED file file1.bed:

>>> bed = KeyTable.import_bed('data/file1.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvRegion')

Add a variant annotation va.cnvID (String) with value given by the fourth column of file2.bed:

>>> bed = KeyTable.import_bed('data/file2.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvID')

The file formats are

$ cat data/file1.bed
track name="BedTest"
20    1          14000000
20    17000000   18000000
...

$ cat data/file2.bed
track name="BedTest"
20    1          14000000  cnv1
20    17000000   18000000  cnv2
...

Notes

The key table produced by this method has one of two possible structures. If the .bed file has only three fields (chrom, chromStart, and chromEnd), then the produced key table has only one column:

  • interval (Interval) - Genomic interval.

If the .bed file has four or more columns, then Hail will store the fourth column in the table:

  • interval (Interval) - Genomic interval.
  • target (String) - Fourth column of .bed file.

UCSC bed files can have up to 12 fields, but Hail will only ever look at the first four. Hail ignores header lines in BED files.

Caution

UCSC BED files are 0-indexed and end-exclusive: the line “5 100 105” will contain locus 5:105 but not 5:100.

Parameters:path (str) – Path to .bed file.
Return type:KeyTable
static import_fam(path, quantitative=False, delimiter='\\s+', missing='NA')[source]

Import PLINK .fam file into a key table.

Examples

Import case-control phenotype data from a tab-separated PLINK .fam file into sample annotations:

>>> fam_kt = KeyTable.import_fam('data/myStudy.fam')

In Hail, unlike PLINK, the user must explicitly distinguish between case-control and quantitative phenotypes. Importing a quantitative phenotype without quantitative=True will return an error (unless all values happen to be 0, 1, 2, or -9):

>>> fam_kt = KeyTable.import_fam('data/myStudy.fam', quantitative=True)

Columns

The column names, types, and missing values are shown below.

  • ID (String) – Sample ID (key column)
  • famID (String) – Family ID (missing = “0”)
  • patID (String) – Paternal ID (missing = “0”)
  • matID (String) – Maternal ID (missing = “0”)
  • isFemale (Boolean) – Sex (missing = “NA”, “-9”, “0”)

One of:

  • isCase (Boolean) – Case-control phenotype (missing = “0”, “-9”, non-numeric, or the missing argument, if given).
  • qPheno (Double) – Quantitative phenotype (missing = “NA” or the missing argument, if given).
Parameters:
  • path (str) – Path to .fam file.
  • quantitative (bool) – If True, .fam phenotype is interpreted as quantitative.
  • delimiter (str) – .fam file field delimiter regex.
  • missing (str) – The string used to denote missing values. For case-control, 0, -9, and non-numeric are also treated as missing.
Returns:

Key table with information from .fam file.

Return type:

KeyTable

static import_interval_list(path)[source]

Import an interval list file in the GATK standard format.

>>> intervals = KeyTable.import_interval_list('data/capture_intervals.txt')
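
As with import_bed(), the resulting interval-keyed table can be used to annotate variants in a variant dataset; a hedged sketch, where va.captureRegion is a hypothetical annotation name:

>>> vds_result = vds.annotate_variants_table(intervals, root='va.captureRegion')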

The File Format

Hail expects an interval file to contain either three or five fields per line in the following formats:

  • contig:start-end
  • contig  start  end (tab-separated)
  • contig  start  end  direction  target (tab-separated)

A file in either of the first two formats produces a key table with one column:

  • interval (Interval), key column

A file in the third format (with a “target” column) produces a key table with two columns:

  • interval (Interval), key column
  • target (String)

Note

start and end match positions inclusively, i.e. start <= position <= end. parse() is exclusive of the end position.

Note

Hail uses the following ordering for contigs: 1-22 sorted numerically, then X, Y, MT, then alphabetically for any contig not matching the standard human chromosomes.

Caution

The interval parser for these files does not support the full range of formats supported by the python parser parse(). ‘k’, ‘m’, ‘start’, and ‘end’ are all invalid motifs in the contig:start-end format here.

Parameters:path (str) – Path to file.
Returns:Interval-keyed table.
Return type:KeyTable
indexed(name='index')[source]

Add the numerical index of each row as a new column.

Examples

>>> ind_kt = kt1.indexed()

Notes

This method returns a table with a new column whose name is given by the name parameter, with type Long. The value of this column is the numerical index of each row, starting from 0. Methods that respect ordering (like KeyTable.take() or KeyTable.export()) will return rows in order.

This method is helpful for creating a unique integer index for rows of a table, so that more complex types can be encoded as a simple number.
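
For example, the new index column can itself be made the key; a brief sketch, where rowIdx is a hypothetical column name:

>>> ind_kt = kt1.indexed('rowIdx')
>>> ind_kt = ind_kt.key_by('rowIdx')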

Parameters:name (str) – Name of index column.
Returns:Table with a new index column.
Return type:KeyTable
join(right, how='inner')[source]

Join two key tables together.

Examples

Join kt1 to kt2 to produce kt_result:

>>> kt_result = kt1.key_by('ID').join(kt2.key_by('ID'))

Notes

Hail supports four types of joins specified by how:

  • inner – Key must be present in both kt1 and kt2.
  • outer – Key present in kt1 or kt2. For keys only in kt1, the value of non-key columns from kt2 is set to missing. Likewise, for keys only in kt2, the value of non-key columns from kt1 is set to missing.
  • left – Key present in kt1. For keys only in kt1, the value of non-key columns from kt2 is set to missing.
  • right – Key present in kt2. For keys only in kt2, the value of non-key columns from kt1 is set to missing.

The non-key fields in kt2 must have non-overlapping column names with kt1.

Both key tables must have the same number of keys and the corresponding types of each key must be the same (order matters), but the key names can be different. For example, if kt1 has the key schema Struct{("a", Int), ("b", String)}, kt1 can be merged with a key table that has a key schema equal to Struct{("b", Int), ("c", String)} but cannot be merged to a key table with key schema Struct{("b", String), ("a", Int)}. The joined table will have the same key names and schema as kt1.
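
A brief sketch of the how argument: a left join keeps every row of kt1 and fills the non-key columns from kt2 with missing values where no ID matches:

>>> kt_left = kt1.key_by('ID').join(kt2.key_by('ID'), how='left')
>>> kt_outer = kt1.key_by('ID').join(kt2.key_by('ID'), how='outer')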

Parameters:
  • right (KeyTable) – Key table to join
  • how (str) – Method for joining two tables together. One of “inner”, “outer”, “left”, “right”.
Returns:

Key table that results from joining this key table with another.

Return type:

KeyTable

key

List of key columns.

>>> kt1.key
[u'ID']
Return type:list of str
key_by(key)[source]

Change which columns are keys.

Examples

The examples below use kt1, which is keyed by ID.

Change key columns:

>>> kt_result = kt1.key_by(['C2', 'C3'])
>>> kt_result = kt1.key_by('C2')

Set to no keys:

>>> kt_result = kt1.key_by([])
Parameters:key (str or list of str) – List of columns to be used as keys.
Returns:Key table whose key columns are given by key.
Return type:KeyTable
maximal_independent_set(i, j, tie_breaker=None)[source]

Compute a maximal independent set of vertices in an undirected graph whose edges are given by this key table.

Examples

Prune individuals from a dataset until no close relationships remain with respect to a PC-Relate measure of kinship.

>>> related_pairs = vds.pc_relate(2, 0.001).filter("kin > 0.125")
>>> related_samples = related_pairs.query('i.flatMap(i => [i,j]).collectAsSet()')
>>> related_samples_to_keep = related_pairs.maximal_independent_set("i", "j")
>>> related_samples_to_remove = related_samples - set(related_samples_to_keep)
>>> vds.filter_samples_list(list(related_samples_to_remove))

Prune individuals from a dataset, preferring to keep cases over controls.

>>> related_pairs = vds.pc_relate(2, 0.001).filter("kin > 0.125")
>>> related_samples = related_pairs.query('i.flatMap(i => [i,j]).collectAsSet()')
>>> related_nodes_to_keep = (related_pairs
...   .key_by("i").join(vds.samples_table()).annotate('iAndCase = { id: i, isCase: sa.isCase }')
...   .select(['j', 'iAndCase'])
...   .key_by("j").join(vds.samples_table()).annotate('jAndCase = { id: j, isCase: sa.isCase }')
...   .select(['iAndCase', 'jAndCase'])
...   .maximal_independent_set("iAndCase", "jAndCase",
...     'if (l.isCase && !r.isCase) -1 else if (!l.isCase && r.isCase) 1 else 0'))
>>> related_samples_to_remove = related_samples - {x.id for x in related_nodes_to_keep}
>>> vds.filter_samples_list(list(related_samples_to_remove))

Notes

The vertex set of the graph is implicitly all the values realized by i and j on the rows of this key table. Each row of the key table corresponds to an undirected edge between the vertices given by evaluating i and j on that row. An undirected edge may appear multiple times in the key table and will not affect the output. Vertices with self-edges are removed as they are not independent of themselves.

The expressions for i and j must have the same type.

This method implements a greedy algorithm which iteratively removes a vertex of highest degree until the graph contains no edges.

tie_breaker is a Hail expression that defines an ordering on nodes. It has two values in scope, l and r, that refer to the two nodes being compared. A pair of nodes can be ordered in one of three ways, and tie_breaker must encode the relationship as follows:

  • if l < r then tie_breaker evaluates to some negative integer
  • if l == r then tie_breaker evaluates to 0
  • if l > r then tie_breaker evaluates to some positive integer

For example, the usual ordering on the integers is defined by: l - r.

When multiple nodes have the same degree, this algorithm will order the nodes according to tie_breaker and remove the largest node.

Parameters:
  • i (str) – Expression to compute one endpoint of each edge.
  • j (str) – Expression to compute the other endpoint of each edge.
  • tie_breaker – Expression used to order nodes with equal degree.
Returns:

A list of vertices in a maximal independent set.

Return type:

list of elements with the same type as i and j

num_columns

Number of columns.

>>> kt1.num_columns
8
Return type:int
num_partitions()[source]

Returns the number of partitions in the key table.
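
For example, to check the current partitioning of kt1:

>>> kt1.num_partitions()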

Return type:int
order_by(*cols)[source]

Sort by the specified columns. Missing values are sorted after non-missing values. Sort by the first column, then the second, etc.
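
A brief sketch of sorting by one or two columns; the second line assumes a desc helper, as referenced in the Type line below, is in scope:

>>> kt_sorted = kt1.order_by('HT')
>>> kt_sorted = kt1.order_by(desc('HT'), 'ID')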

Parameters:cols – Columns to sort by.
Type:str or asc(str) or desc(str)
Returns:Key table sorted by cols.
Return type:KeyTable
persist(storage_level='MEMORY_AND_DISK')[source]

Persist this key table to memory and/or disk.

Examples

Persist the key table to both memory and disk:

>>> kt = kt.persist() 

Notes

The persist() and cache() methods allow you to store the current table on disk or in memory to avoid redundant computation and improve the performance of Hail pipelines.

cache() is an alias for persist("MEMORY_ONLY"). Most users will want “MEMORY_AND_DISK”. See the Spark documentation for a more in-depth discussion of persisting data.

Parameters:storage_level – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Return type:KeyTable
query(exprs)[source]

Performs aggregation queries over columns of the table, and returns Python object(s).

Examples

>>> mean_value = kt1.query('C1.stats().mean')
>>> [hist, counter] = kt1.query(['HT.hist(50, 80, 10)', 'SEX.counter()'])

Notes

This method evaluates Hail expressions over the rows of the key table. The exprs argument requires either a single string or a list of strings. If a single string is passed, a single result is returned; if a list is passed, a list is returned.

The namespace of the expressions includes one aggregable for each column of the key table. We use the example kt1 here, which contains columns ID, HT, SEX, X, Z, C1, C2, and C3. Queries of this key table have the following namespace:

  • ID: (Aggregable[Int])
  • HT: (Aggregable[Int])
  • SEX: (Aggregable[String])
  • X: (Aggregable[Int])
  • Z: (Aggregable[Int])
  • C1: (Aggregable[Int])
  • C2: (Aggregable[Int])
  • C3: (Aggregable[Int])

Map and filter expressions on these aggregables have the same additional scope, which is all the columns in the key table. In our example, this includes:

  • ID: (Int)
  • HT: (Int)
  • SEX: (String)
  • X: (Int)
  • Z: (Int)
  • C1: (Int)
  • C2: (Int)
  • C3: (Int)

This scope means that operations like the following are permitted:

>>> fraction_tall_male = kt1.query('HT.filter(x => SEX == "M").fraction(x => x > 70)')
>>> ids = kt1.query('ID.filter(x => C2 < C3).collect()')
Parameters:exprs (str or list of str) –
Return type:annotation or list of annotation
query_typed(exprs)[source]

Performs aggregation queries over columns of the table, and returns Python object(s) and types.

Examples

>>> mean_value, t = kt1.query_typed('C1.stats().mean')
>>> [hist, counter], [t1, t2] = kt1.query_typed(['HT.hist(50, 80, 10)', 'SEX.counter()'])

See KeyTable.query() for more information.

Parameters:exprs (str or list of str) –
Return type:(annotation or list of annotation, Type or list of Type)
static range(n, num_partitions=None)[source]

Construct a table with rows from 0 until n.

Examples

Construct a table with 100 rows:

>>> range_kt = KeyTable.range(100)

Construct a table with one million rows and twenty partitions:

>>> range_kt = KeyTable.range(1000000, num_partitions=20)

Notes

The resulting table has one column:

  • index (Int) – Unique row index from 0 until n
Parameters:
  • n (int) – Number of rows.
  • num_partitions (int or None) – Number of partitions.
Return type:

KeyTable

rename(column_names)[source]

Rename columns of key table.

column_names can be either a list of new names or a dict mapping old names to new names. If column_names is a list, its length must be the number of columns in this KeyTable.

Examples

Rename using a list:

>>> kt2.rename(['newColumn1', 'newColumn2', 'newColumn3'])

Rename using a dict:

>>> kt2.rename({'A' : 'C1'})
Parameters:column_names – list of new column names or a dict mapping old names to new names.
Returns:Key table with renamed columns.
Return type:KeyTable
repartition(n, shuffle=True)[source]

Change the number of distributed partitions.

Warning

When shuffle is False, repartition can only decrease the number of partitions and simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance and so can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.
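
A brief sketch; the partition counts are arbitrary:

>>> kt_result = kt1.repartition(8)
>>> kt_result = kt1.repartition(1, shuffle=False)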

Parameters:
  • n (int) – Desired number of partitions.
  • shuffle (bool) – Whether to shuffle or naively coalesce.
Return type:

KeyTable

same(other)[source]

Test whether two key tables are identical.

Examples

>>> if kt1.same(kt2):
...     print("KeyTables are the same!")
Parameters:other (KeyTable) – key table to compare against
Return type:bool
schema

Table schema.

Examples

>>> print(kt1.schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(kt1.schema)
Return type:TStruct
select(column_names)[source]

Select a subset of columns.

Examples

The table kt1, imported above, includes the columns C1, C2, and C3.

Select/drop columns:

>>> kt_result = kt1.select('C1')

Reorder the columns:

>>> kt_result = kt1.select(['C3', 'C1', 'C2'])

Drop all columns:

>>> kt_result = kt1.select([])
Parameters:column_names – List of columns to be selected.
Type:str or list of str
Returns:Key table with selected columns.
Return type:KeyTable
show(n=10, truncate_to=None, print_types=True)[source]

Show the first few rows of the table in human-readable format.

Examples

Show, with default parameters (10 rows, no truncation, and column types):

>>> kt1.show()

Truncate long columns to 25 characters and only write 5 rows:

>>> kt1.show(5, truncate_to=25)

Notes

If the truncate_to argument is None, then no truncation will occur. This is the default behavior. An integer argument will truncate each cell to the specified number of characters.

Parameters:
  • n (int) – Number of rows to show.
  • truncate_to (int or None) – Truncate columns to the desired number of characters.
  • print_types (bool) – Print a line with column types.
take(n)[source]

Take a given number of rows from the head of the table.

Examples

Take the first ten rows:

>>> first10 = kt1.take(10)

Notes

This method does not need to look at all the data, and allows for fast queries of the start of the table.

Parameters:n (int) – Number of rows to take.
Returns:Rows from the start of the table.
Return type:list of Struct
to_dataframe(expand=True, flatten=True)[source]

Converts this key table to a Spark DataFrame.
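
A minimal sketch; the resulting object is an ordinary PySpark DataFrame:

>>> df = kt1.to_dataframe()
>>> df.printSchema()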

Parameters:
  • expand (bool) – If true, expand_types before converting to DataFrame.
  • flatten (bool) – If true, flatten before converting to DataFrame. If both are true, flatten is run after expand so that expanded types are flattened.
Return type:

pyspark.sql.DataFrame

to_pandas(expand=True, flatten=True)[source]

Converts this key table into a Pandas DataFrame.
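
A minimal sketch; the resulting object is an ordinary Pandas DataFrame:

>>> pd_df = kt1.to_pandas()
>>> pd_df.head()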

Parameters:
  • expand (bool) – If true, expand_types before converting to Pandas DataFrame.
  • flatten (bool) – If true, flatten before converting to Pandas DataFrame. If both are true, flatten is run after expand so that expanded types are flattened.
Returns:

Pandas DataFrame constructed from the key table.

Return type:

pandas.DataFrame

union(*kts)[source]

Union the rows of multiple tables.

Examples

Take the union of rows from two tables:

>>> other = hc.import_table('data/kt_example1.tsv', impute=True)
>>> union_kt = kt1.union(other)

Notes

If a row appears in both tables identically, it is duplicated in the result. The left and right tables must have the same schema and key.

Parameters:kts (args of type KeyTable) – Tables to merge.
Returns:A table with all rows from the left and right tables.
Return type:KeyTable
unpersist()[source]

Unpersists this table from memory/disk.

Notes

This function will have no effect on a table that was not previously persisted.

You can continue to use a table after it has been unpersisted, but subsequent operations will recompute it from scratch, repeating all previous steps. Only unpersist a table when you are done with it.
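
A brief sketch of the cache/unpersist lifecycle:

>>> kt_cached = kt1.cache()
>>> kt_cached.count()
>>> kt_cached.unpersist()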

write(output, overwrite=False)[source]

Write as KT file.

Examples

>>> kt1.write('output/kt1.kt')

Note

The write path must end in “.kt”.

Parameters:
  • output (str) – Path of KT file to write.
  • overwrite (bool) – If True, overwrite any existing KT file. Cannot be used to read from and write to the same path.