KeyTable¶

class hail.KeyTable(hc, jkt)[source]¶

Hail’s version of a SQL table where columns can be designated as keys.

Key tables may be imported from a text file or Spark DataFrame with import_table() or from_dataframe(), or generated from a variant dataset with aggregate_by_key(), make_table(), samples_table(), or variants_table().

In the examples below, we have imported two key tables from text files (kt1 and kt2).

>>> kt1 = hc.import_table('data/kt_example1.tsv', impute=True)

ID  HT  SEX  X  Z   C1  C2  C3
1   65  M    5  4   2   50  5
2   72  M    6  3   2   61  1
3   70  F    7  3   10  81  -5
4   60  F    8  2   11  90  -10

>>> kt2 = hc.import_table('data/kt_example2.tsv', impute=True)

ID  A   B
1   65  cat
2   72  dog
3   70  mouse
4   60  rabbit

Variables: hc (HailContext) – Hail Context

Attributes
columns       Names of all columns.
key           List of key columns.
num_columns   Number of columns.
schema        Table schema.

Methods
__init__                  x.__init__(...) initializes x; see help(type(x)) for signature
aggregate_by_key          Aggregate columns programmatically.
annotate                  Add new columns computed from existing columns.
cache                     Mark this key table to be cached in memory.
collect                   Collect table to a local list.
count                     Count the number of rows.
drop                      Drop columns.
exists                    Evaluate whether a boolean expression is true for at least one row.
expand_types              Expand types Locus, Interval, AltAllele, Variant, Genotype, Char, Set and Dict.
explode                   Explode columns of this key table.
export                    Export to a TSV file.
export_cassandra          Export to Cassandra.
export_elasticsearch      Export to Elasticsearch.
export_mongodb            Export to MongoDB.
export_solr               Export to Solr.
filter                    Filter rows.
flatten                   Flatten nested Structs.
forall                    Evaluate whether a boolean expression is true for all rows.
from_dataframe            Convert Spark SQL DataFrame to key table.
from_pandas               Convert Pandas DataFrame to key table.
from_py
import_bed                Import a UCSC .bed file as a key table.
import_fam                Import PLINK .fam file into a key table.
import_interval_list      Import an interval list file in the GATK standard format.
indexed                   Add the numerical index of each row as a new column.
join                      Join two key tables together.
key_by                    Change which columns are keys.
maximal_independent_set   Compute a maximal independent set of vertices in an undirected graph whose edges are given by this key table.
num_partitions            Returns the number of partitions in the key table.
order_by                  Sort by the specified columns.
persist                   Persist this key table to memory and/or disk.
query                     Performs aggregation queries over columns of the table, and returns Python object(s).
query_typed               Performs aggregation queries over columns of the table, and returns Python object(s) and types.
range                     Construct a table with rows from 0 until n.
rename                    Rename columns of key table.
repartition               Change the number of distributed partitions.
same                      Test whether two key tables are identical.
select                    Select a subset of columns.
show                      Show the first few rows of the table in human-readable format.
take                      Take a given number of rows from the head of the table.
to_dataframe              Converts this key table to a Spark DataFrame.
to_pandas                 Converts this key table into a Pandas DataFrame.
union                     Union the rows of multiple tables.
unpersist                 Unpersists this table from memory/disk.
write                     Write as KT file.
aggregate_by_key(key_expr, agg_expr)[source]¶

Aggregate columns programmatically.

Examples

Compute mean height by sex:

>>> kt_ht_by_sex = kt1.aggregate_by_key("SEX = SEX", "MEAN_HT = HT.stats().mean")

The result of aggregate_by_key() is a key table kt_ht_by_sex with the following data:

SEX  MEAN_HT
M    68.5
F    65

Notes

The scope for both key_expr and agg_expr is all column names in the input KeyTable.

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Parameters:
- key_expr (str or list of str) – Named expression(s) for how to compute the keys of the new key table.
- agg_expr (str or list of str) – Named aggregation expression(s).

Returns: A new key table with the keys computed from the key_expr and the remaining columns computed from the agg_expr.
Return type: KeyTable
annotate(expr)[source]¶

Add new columns computed from existing columns.

Examples

Add a new column Y which is equal to 5 times X:

>>> kt_result = kt1.annotate("Y = 5 * X")

Notes

The scope for expr is all column names in the input KeyTable.

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Parameters: expr (str or list of str) – Annotation expression or multiple annotation expressions.
Returns: Key table with new columns specified by expr.
Return type: KeyTable
cache()[source]¶

Mark this key table to be cached in memory.

cache() is the same as persist("MEMORY_ONLY").

Return type: KeyTable
collect()[source]¶

Collect table to a local list.

Examples

>>> id_to_sex = {row.ID : row.SEX for row in kt1.collect()}

Notes

This method should be used on very small tables and as a last resort. It is very slow to convert distributed Java objects to Python (especially serially), and the resulting list may be too large to fit in memory on one machine.

Return type: list of hail.representation.Struct
columns¶

Names of all columns.

>>> kt1.columns
[u'ID', u'HT', u'SEX', u'X', u'Z', u'C1', u'C2', u'C3']

Return type: list of str
drop(column_names)[source]¶

Drop columns.

Examples

Assume kt1 is a KeyTable with three columns: C1, C2 and C3.

Drop columns:

>>> kt_result = kt1.drop('C1')

>>> kt_result = kt1.drop(['C1', 'C2'])

Parameters: column_names (str or list of str) – List of columns to be dropped.
Returns: Key table with dropped columns.
Return type: KeyTable
exists(expr)[source]¶

Evaluate whether a boolean expression is true for at least one row.

Examples

Test whether any row in the key table has the value of C1 equal to 5:

>>> if kt1.exists("C1 == 5"):
...     print("At least one row has C1 equal 5.")

Parameters: expr (str) – Boolean expression.
Return type: bool
expand_types()[source]¶

Expand types Locus, Interval, AltAllele, Variant, Genotype, Char, Set and Dict. Char is converted to String. Set is converted to Array. Dict[K, V] is converted to Array[Struct { key: K, value: V }].

Returns: Key table with signature containing only types: Boolean, Int, Long, Float, Double, Array and Struct.
Return type: KeyTable
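As a hedged sketch of typical usage (interval_kt is a hypothetical table with an Interval column, such as one produced by import_interval_list() further below), expand_types() is usually called before handing the table to a system that only understands primitive, Array and Struct types:

>>> # hypothetical: interval_kt has an Interval column that expand_types() rewrites as nested Structs
>>> expanded = interval_kt.expand_types()
>>> print(expanded.schema)                      # only primitive, Array and Struct types remain
>>> df = expanded.to_dataframe(expand=False)    # already expanded, so skip the implicit expansion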
explode(column_names)[source]¶

Explode columns of this key table.

The explode operation unpacks the elements in a column of type Array or Set into its own row. If an empty Array or Set is exploded, the entire row is removed from the KeyTable.

Examples

Assume kt3 is a KeyTable with three columns: c1, c2 and c3.

>>> kt3 = hc.import_table('data/kt_example3.tsv', impute=True,
...                       types={'c1': TString(), 'c2': TArray(TInt()), 'c3': TArray(TArray(TInt()))})

The types of each column are String, Array[Int], and Array[Array[Int]] respectively. c1 cannot be exploded because its type is not an Array or Set. c2 can only be exploded once because the type of c2 after the first explode operation is Int.

c1  c2        c3
a   [1,2,NA]  [[3,4], []]

Explode c2:

>>> kt3.explode('c2')

c1  c2  c3
a   1   [[3,4], []]
a   2   [[3,4], []]

Explode c2 once and c3 twice:

>>> kt3.explode(['c2', 'c3', 'c3'])

c1  c2  c3
a   1   3
a   2   3
a   1   4
a   2   4

Parameters: column_names (str or list of str) – Column name(s) to be exploded.
Returns: Key table with columns exploded.
Return type: KeyTable
export(output, types_file=None, header=True, parallel=False)[source]¶

Export to a TSV file.

Examples

Rename column names of key table and export to file:

>>> (kt1.rename({'HT' : 'Height'})
...     .export("output/kt1_renamed.tsv"))

Parameters:
- output (str) – Output file path.
- types_file (str) – Output path of types file.
- header (bool) – Write a header using the column names.
- parallel (bool) – If true, writes a set of files (one per partition) rather than serially concatenating these files.
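For large tables, a short sketch of a parallel export (the exact on-disk layout of the per-partition files may vary; the output path is illustrative):

>>> kt1.export("output/kt1_parallel.tsv", parallel=True)   # writes one file per partition instead of a single TSV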
export_cassandra(address, keyspace, table, block_size=100, rate=1000)[source]¶

Export to Cassandra.

Warning

export_cassandra() is EXPERIMENTAL.
export_elasticsearch(host, port, index, index_type, block_size, config=None, verbose=True)[source]¶

Export to Elasticsearch.

Warning

export_elasticsearch() is EXPERIMENTAL.
export_mongodb(mode='append')[source]¶

Export to MongoDB.

Warning

export_mongodb() is EXPERIMENTAL.
export_solr(zk_host, collection, block_size=100)[source]¶

Export to Solr.

Warning

export_solr() is EXPERIMENTAL.
filter(expr, keep=True)[source]¶

Filter rows.

Examples

Keep rows where C1 equals 5:

>>> kt_result = kt1.filter("C1 == 5")

Remove rows where C1 equals 10:

>>> kt_result = kt1.filter("C1 == 10", keep=False)

Notes

The scope for expr is all column names in the input KeyTable.

For more information, see the documentation on writing expressions and using the Hail Expression Language.

Caution

When expr evaluates to missing, the row will be removed regardless of whether keep=True or keep=False.

Parameters:
- expr (str) – Boolean filter expression.
- keep (bool) – Keep rows where expr is true.

Returns: Filtered key table.
Return type: KeyTable
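A sketch of one way to make the missing-value behavior explicit (assuming the expression-language isDefined() predicate, used here purely for illustration): test definedness inside the filter expression so rows with a missing C1 are handled deliberately rather than silently dropped.

>>> # keep rows where C1 is present and equals 5; rows with missing C1 are excluded explicitly
>>> kt_defined = kt1.filter("isDefined(C1) && C1 == 5")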
flatten()[source]¶

Flatten nested Structs. Column names will be concatenated with dot (.).

Examples

Flatten Structs in key table:

>>> kt_result = kt3.flatten()

Consider a key table kt with signature

a: Struct { p: Int, q: Double }
b: Int
c: Struct { x: String, y: Array[Struct { z: Map[Int] }] }

and a single key column a. The result of flatten is

a.p: Int
a.q: Double
b: Int
c.x: String
c.y: Array[Struct { z: Map[Int] }]

with key columns a.p, a.q.

Note that structures inside non-struct types (such as the Structs inside c.y above) will not be flattened.

Returns: Key table with no columns of type Struct.
Return type: KeyTable
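A minimal sketch of building a nested column and then flattening it (the struct literal follows the same { field: expr } expression syntax used in the maximal_independent_set() example below; the column names are arbitrary):

>>> # add a Struct column `s`, then flatten it into dotted columns `s.ht` and `s.x`
>>> kt_nested = kt1.annotate('s = {ht: HT, x: X}')
>>> kt_flat = kt_nested.flatten()
>>> kt_flat.columns   # now includes u's.ht' and u's.x' instead of u's'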
forall(expr)[source]¶

Evaluate whether a boolean expression is true for all rows.

Examples

Test whether all rows in the key table have the value of C1 equal to 5:

>>> if kt1.forall("C1 == 5"):
...     print("All rows have C1 equal 5.")

Parameters: expr (str) – Boolean expression.
Return type: bool
static from_dataframe(df, key=[])[source]¶

Convert Spark SQL DataFrame to key table.

Examples

>>> kt = KeyTable.from_dataframe(df)

Notes

Spark SQL data types are converted to Hail types as follows:

BooleanType => Boolean
IntegerType => Int
LongType => Long
FloatType => Float
DoubleType => Double
StringType => String
BinaryType => Binary
ArrayType => Array
StructType => Struct

Unlisted Spark SQL data types are currently unsupported.

Parameters:
- df (DataFrame) – PySpark DataFrame.
- key (str or list of str) – Key column(s).

Returns: Key table constructed from the Spark SQL DataFrame.
Return type: KeyTable
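A short sketch of supplying the key argument (here df is assumed to be a PySpark DataFrame that has an 'ID' column, mirroring the example tables):

>>> # hypothetical: key the resulting table by the DataFrame's ID column
>>> kt = KeyTable.from_dataframe(df, key='ID')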
static from_pandas(df)[source]¶

Convert Pandas DataFrame to key table.

Examples

>>> KeyTable.from_pandas(KeyTable.range(10).to_pandas()).query('index.take(10)')

Parameters: df (DataFrame) – Pandas DataFrame.
Returns: Key table constructed from the Pandas DataFrame.
Return type: KeyTable
static import_bed(path)[source]¶

Import a UCSC .bed file as a key table.

Examples

Add the variant annotation va.cnvRegion: Boolean indicating inclusion in at least one interval of the three-column BED file file1.bed:

>>> bed = KeyTable.import_bed('data/file1.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvRegion')

Add a variant annotation va.cnvID (String) with value given by the fourth column of file2.bed:

>>> bed = KeyTable.import_bed('data/file2.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvID')

The file formats are

$ cat data/file1.bed
track name="BedTest"
20 1 14000000
20 17000000 18000000
...

$ cat file2.bed
track name="BedTest"
20 1 14000000 cnv1
20 17000000 18000000 cnv2
...

Notes

The key table produced by this method has one of two possible structures. If the .bed file has only three fields (chrom, chromStart, and chromEnd), then the produced key table has only one column:

- interval (Interval) – Genomic interval.

If the .bed file has four or more columns, then Hail will store the fourth column in the table:

- interval (Interval) – Genomic interval.
- target (String) – Fourth column of .bed file.

UCSC bed files can have up to 12 fields, but Hail will only ever look at the first four. Hail ignores header lines in BED files.

Caution

UCSC BED files are 0-indexed and end-exclusive. The line “5 100 105” will contain locus 5:105 but not 5:100. Details here.

Parameters: path (str) – Path to .bed file.
Return type: KeyTable
static import_fam(path, quantitative=False, delimiter='\\\\s+', missing='NA')[source]¶

Import PLINK .fam file into a key table.

Examples

Import case-control phenotype data from a tab-separated PLINK .fam file into sample annotations:

>>> fam_kt = KeyTable.import_fam('data/myStudy.fam')

In Hail, unlike PLINK, the user must explicitly distinguish between case-control and quantitative phenotypes. Importing a quantitative phenotype without quantitative=True will return an error (unless all values happen to be 0, 1, 2, or -9):

>>> fam_kt = KeyTable.import_fam('data/myStudy.fam', quantitative=True)

Columns

The column names, types, and missing values are shown below.

- ID (String) – Sample ID (key column)
- famID (String) – Family ID (missing = “0”)
- patID (String) – Paternal ID (missing = “0”)
- matID (String) – Maternal ID (missing = “0”)
- isFemale (Boolean) – Sex (missing = “NA”, “-9”, “0”)

One of:

- isCase (Boolean) – Case-control phenotype (missing = “0”, “-9”, non-numeric, or the missing argument, if given)
- qPheno (Double) – Quantitative phenotype (missing = “NA” or the missing argument, if given)

Parameters:
- path (str) – Path to .fam file.
- quantitative (bool) – If True, .fam phenotype is interpreted as quantitative.
- delimiter (str) – .fam file field delimiter regex.
- missing (str) – The string used to denote missing values. For case-control, 0, -9, and non-numeric are also treated as missing.

Returns: Key table with information from .fam file.
Return type: KeyTable
static import_interval_list(path)[source]¶

Import an interval list file in the GATK standard format.

>>> intervals = KeyTable.import_interval_list('data/capture_intervals.txt')

The File Format

Hail expects an interval file to contain either three or five fields per line in the following formats:

- contig:start-end
- contig  start  end (tab-separated)
- contig  start  end  direction  target (tab-separated)

A file in either of the first two formats produces a key table with one column:

- interval (Interval), key column

A file in the third format (with a “target” column) produces a key table with two columns:

- interval (Interval), key column
- target (String)

Note

start and end match positions inclusively, e.g. start <= position <= end. parse() is exclusive of the end position.

Note

Hail uses the following ordering for contigs: 1-22 sorted numerically, then X, Y, MT, then alphabetically for any contig not matching the standard human chromosomes.

Caution

The interval parser for these files does not support the full range of formats supported by the python parser parse(). ‘k’, ‘m’, ‘start’, and ‘end’ are all invalid motifs in the contig:start-end format here.

Parameters: path (str) – Path to file.
Returns: Interval-keyed table.
Return type: KeyTable
indexed(name='index')[source]¶

Add the numerical index of each row as a new column.

Examples

>>> ind_kt = kt1.indexed()

Notes

This method returns a table with a new column whose name is given by the name parameter, with type Long. The value of this column is the numerical index of each row, starting from 0. Methods that respect ordering (like KeyTable.take() or KeyTable.export()) will return rows in order.

This method is helpful for creating a unique integer index for rows of a table, so that more complex types can be encoded as a simple number.

Parameters: name (str) – Name of index column.
Returns: Table with a new index column.
Return type: KeyTable
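A short sketch of the name parameter (the column name row_idx is an arbitrary choice for illustration):

>>> ind_kt = kt1.indexed(name='row_idx')   # index column is called 'row_idx' instead of 'index'
>>> first_rows = ind_kt.take(3)            # take() respects ordering, so row_idx is 0, 1, 2 here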
join(right, how='inner')[source]¶

Join two key tables together.

Examples

Join kt1 to kt2 to produce kt_result:

>>> kt_result = kt1.key_by('ID').join(kt2.key_by('ID'))

Notes

Hail supports four types of joins specified by how:

- inner – Key must be present in both kt1 and kt2.
- outer – Key present in kt1 or kt2. For keys only in kt1, the value of non-key columns from kt2 is set to missing. Likewise, for keys only in kt2, the value of non-key columns from kt1 is set to missing.
- left – Key present in kt1. For keys only in kt1, the value of non-key columns from kt2 is set to missing.
- right – Key present in kt2. For keys only in kt2, the value of non-key columns from kt1 is set to missing.

The non-key fields in kt2 must have non-overlapping column names with kt1.

Both key tables must have the same number of keys and the corresponding types of each key must be the same (order matters), but the key names can be different. For example, if kt1 has the key schema Struct{("a", Int), ("b", String)}, kt1 can be merged with a key table that has a key schema equal to Struct{("b", Int), ("c", String)} but cannot be merged with a key table with key schema Struct{("b", String), ("a", Int)}. The joined key table, kt_result, will have the same key names and schema as kt1.

Parameters:
- right (KeyTable) – Key table to join.
- how (str) – Method for joining two tables together. One of “inner”, “outer”, “left”, “right”.

Returns: Key table that results from joining this key table with another.
Return type: KeyTable
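A brief sketch of a non-default join type: an outer join of the two example tables on ID, where a key present in only one table yields missing values in the other table's non-key columns.

>>> kt_outer = kt1.key_by('ID').join(kt2.key_by('ID'), how='outer')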
key¶

List of key columns.

>>> kt1.key
[u'ID']

Return type: list of str
key_by(key)[source]¶

Change which columns are keys.

Examples

Assume kt is a KeyTable with three columns: c1, c2 and c3 and key c1.

Change key columns:

>>> kt_result = kt1.key_by(['C2', 'C3'])

>>> kt_result = kt1.key_by('C2')

Set to no keys:

>>> kt_result = kt1.key_by([])

Parameters: key (str or list of str) – List of columns to be used as keys.
Returns: Key table whose key columns are given by key.
Return type: KeyTable
maximal_independent_set(i, j, tie_breaker=None)[source]¶

Compute a maximal independent set of vertices in an undirected graph whose edges are given by this key table.

Examples

Prune individuals from a dataset until no close relationships remain with respect to a PC-Relate measure of kinship.

>>> related_pairs = vds.pc_relate(2, 0.001).filter("kin > 0.125")
>>> related_samples = related_pairs.query('i.flatMap(i => [i,j]).collectAsSet()')
>>> related_samples_to_keep = related_pairs.maximal_independent_set("i", "j")
>>> related_samples_to_remove = related_samples - set(related_samples_to_keep)
>>> vds.filter_samples_list(list(related_samples_to_remove))

Prune individuals from a dataset, preferring to keep cases over controls.

>>> related_pairs = vds.pc_relate(2, 0.001).filter("kin > 0.125")
>>> related_samples = related_pairs.query('i.flatMap(i => [i,j]).collectAsSet()')
>>> related_nodes_to_keep = (related_pairs
...     .key_by("i").join(vds.samples_table()).annotate('iAndCase = { id: i, isCase: sa.isCase }')
...     .select(['j', 'iAndCase'])
...     .key_by("j").join(vds.samples_table()).annotate('jAndCase = { id: j, isCase: sa.isCase }')
...     .select(['iAndCase', 'jAndCase'])
...     .maximal_independent_set("iAndCase", "jAndCase",
...         'if (l.isCase && !r.isCase) -1 else if (!l.isCase && r.isCase) 1 else 0'))
>>> related_samples_to_remove = related_samples - {x.id for x in related_nodes_to_keep}
>>> vds.filter_samples_list(list(related_samples_to_remove))

Notes

The vertex set of the graph is implicitly all the values realized by i and j on the rows of this key table. Each row of the key table corresponds to an undirected edge between the vertices given by evaluating i and j on that row. An undirected edge may appear multiple times in the key table and will not affect the output. Vertices with self-edges are removed as they are not independent of themselves.

The expressions for i and j must have the same type.

This method implements a greedy algorithm which iteratively removes a vertex of highest degree until the graph contains no edges.

tie_breaker is a Hail expression that defines an ordering on nodes. It has two values in scope, l and r, that refer to the two nodes being compared. A pair of nodes can be ordered in one of three ways, and tie_breaker must encode the relationship as follows:

- if l < r then tie_breaker evaluates to some negative integer
- if l == r then tie_breaker evaluates to 0
- if l > r then tie_breaker evaluates to some positive integer

For example, the usual ordering on the integers is defined by: l - r.

When multiple nodes have the same degree, this algorithm will order the nodes according to tie_breaker and remove the largest node.

Parameters:
- i (str) – expression to compute one endpoint.
- j (str) – expression to compute another endpoint.
- tie_breaker – Expression used to order nodes with equal degree.

Returns: a list of vertices in a maximal independent set.
Return type: list of elements with the same type as i and j
num_columns¶

Number of columns.

>>> kt1.num_columns
8

Return type: int
order_by(*cols)[source]¶

Sort by the specified columns. Missing values are sorted after non-missing values. Sort by the first column, then the second, etc.

Parameters: cols (str or asc(str) or desc(str)) – Columns to sort by.
Returns: Key table sorted by cols.
Return type: KeyTable
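A minimal sketch of mixed sort directions, assuming the asc and desc helpers suggested by the parameter type can be imported from the hail package (the exact import location may differ):

>>> from hail import asc, desc                     # assumed import location for the sort helpers
>>> kt_sorted = kt1.order_by(desc('HT'), asc('ID'))  # tallest first, ties broken by ascending ID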
persist(storage_level='MEMORY_AND_DISK')[source]¶

Persist this key table to memory and/or disk.

Examples

Persist the key table to both memory and disk:

>>> kt = kt.persist()

Notes

The persist() and cache() methods allow you to store the current table on disk or in memory to avoid redundant computation and improve the performance of Hail pipelines. cache() is an alias for persist("MEMORY_ONLY"). Most users will want “MEMORY_AND_DISK”. See the Spark documentation for a more in-depth discussion of persisting data.

Parameters: storage_level – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Return type: KeyTable
query(exprs)[source]¶

Performs aggregation queries over columns of the table, and returns Python object(s).

Examples

>>> mean_value = kt1.query('C1.stats().mean')

>>> [hist, counter] = kt1.query(['HT.hist(50, 80, 10)', 'SEX.counter()'])

Notes

This method evaluates Hail expressions over the rows of the key table. The exprs argument requires either a single string or a list of strings. If a single string was passed, then a single result is returned. If a list is passed, a list is returned.

The namespace of the expressions includes one aggregable for each column of the key table. We use the example kt1 here, which contains columns ID, HT, SEX, X, Z, C1, C2, and C3. Queries in this key table will contain the following namespace:

- ID: (Aggregable[Int])
- HT: (Aggregable[Int])
- SEX: (Aggregable[String])
- X: (Aggregable[Int])
- Z: (Aggregable[Int])
- C1: (Aggregable[Int])
- C2: (Aggregable[Int])
- C3: (Aggregable[Int])

Map and filter expressions on these aggregables have the same additional scope, which is all the columns in the key table. In our example, this includes:

- ID: (Int)
- HT: (Int)
- SEX: (String)
- X: (Int)
- Z: (Int)
- C1: (Int)
- C2: (Int)
- C3: (Int)

This scope means that operations like the below are permitted:

>>> fraction_tall_male = kt1.query('HT.filter(x => SEX == "M").fraction(x => x > 70)')

>>> ids = kt1.query('ID.filter(x => C2 < C3).collect()')

Parameters: exprs (str or list of str) – Query expression(s).
Return type: annotation or list of annotation
query_typed(exprs)[source]¶

Performs aggregation queries over columns of the table, and returns Python object(s) and types.

Examples

>>> mean_value, t = kt1.query_typed('C1.stats().mean')

>>> [hist, counter], [t1, t2] = kt1.query_typed(['HT.hist(50, 80, 10)', 'SEX.counter()'])

See KeyTable.query() for more information.

Parameters: exprs (str or list of str) – Query expression(s).
Return type: (annotation or list of annotation, Type or list of Type)
static range(n, num_partitions=None)[source]¶

Construct a table with rows from 0 until n.

Examples

Construct a table with 100 rows:

>>> range_kt = KeyTable.range(100)

Construct a table with one million rows and twenty partitions:

>>> range_kt = KeyTable.range(1000000, num_partitions=20)

Notes

The resulting table has one column:

- index (Int) – Unique row index from 0 until n

Parameters:
- n (int) – Number of rows.
- num_partitions (int or None) – Number of partitions.

Return type: KeyTable
rename(column_names)[source]¶

Rename columns of key table.

column_names can be either a list of new names or a dict mapping old names to new names. If column_names is a list, its length must be the number of columns in this KeyTable.

Examples

Rename using a list:

>>> kt2.rename(['newColumn1', 'newColumn2', 'newColumn3'])

Rename using a dict:

>>> kt2.rename({'A' : 'C1'})

Parameters: column_names – List of new column names or a dict mapping old names to new names.
Returns: Key table with renamed columns.
Return type: KeyTable
repartition(n, shuffle=True)[source]¶

Change the number of distributed partitions.

Warning

When shuffle is False, repartition can only decrease the number of partitions and simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance and so can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.

Parameters:
- n (int) – Desired number of partitions.
- shuffle (bool) – Whether to shuffle or naively coalesce.

Return type: KeyTable
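A short sketch contrasting the two modes (the partition counts are illustrative):

>>> kt_more = kt1.repartition(64)                 # shuffle=True (default): can increase partitions and rebalances rows
>>> kt_fewer = kt1.repartition(4, shuffle=False)  # no shuffle: only coalesces adjacent partitions, may be unbalanced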
same(other)[source]¶

Test whether two key tables are identical.

Examples

>>> if kt1.same(kt2):
...     print("KeyTables are the same!")

Parameters: other (KeyTable) – Key table to compare against.
Return type: bool
schema¶

Table schema.

Examples

>>> print(kt1.schema)

The pprint module can be used to print the schema in a more human-readable format:

>>> from pprint import pprint
>>> pprint(kt1.schema)

Return type: TStruct
select(column_names)[source]¶

Select a subset of columns.

Examples

Assume kt1 is a KeyTable with three columns: C1, C2 and C3.

Select/drop columns:

>>> kt_result = kt1.select('C1')

Reorder the columns:

>>> kt_result = kt1.select(['C3', 'C1', 'C2'])

Drop all columns:

>>> kt_result = kt1.select([])

Parameters: column_names (str or list of str) – List of columns to be selected.
Returns: Key table with selected columns.
Return type: KeyTable
show(n=10, truncate_to=None, print_types=True)[source]¶

Show the first few rows of the table in human-readable format.

Examples

Show, with default parameters (10 rows, no truncation, and column types):

>>> kt1.show()

Truncate long columns to 25 characters and only write 5 rows:

>>> kt1.show(5, truncate_to=25)

Notes

If the truncate_to argument is None, then no truncation will occur. This is the default behavior. An integer argument will truncate each cell to the specified number of characters.

Parameters:
- n (int) – Number of rows to show.
- truncate_to (int or None) – Truncate columns to the desired number of characters.
- print_types (bool) – Print a line with column types.
take(n)[source]¶

Take a given number of rows from the head of the table.

Examples

Take the first ten rows:

>>> first10 = kt1.take(10)

Notes

This method does not need to look at all the data, and allows for fast queries of the start of the table.

Parameters: n (int) – Number of rows to take.
Returns: Rows from the start of the table.
Return type: list of Struct
to_dataframe(expand=True, flatten=True)[source]¶

Converts this key table to a Spark DataFrame.

Parameters:
- expand (bool) – If true, expand_types before converting to DataFrame.
- flatten (bool) – If true, flatten before converting to DataFrame. If both are true, flatten is run after expand so that expanded types are flattened.

Return type: pyspark.sql.DataFrame
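A brief sketch of converting the example table and inspecting the result with ordinary PySpark DataFrame methods:

>>> df = kt1.to_dataframe()
>>> df.printSchema()   # columns were expanded and flattened by default
>>> df.show(5)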
to_pandas(expand=True, flatten=True)[source]¶

Converts this key table into a Pandas DataFrame.

Parameters:
- expand (bool) – If true, expand_types before converting to Pandas DataFrame.
- flatten (bool) – If true, flatten before converting to Pandas DataFrame. If both are true, flatten is run after expand so that expanded types are flattened.

Returns: Pandas DataFrame constructed from the key table.
Return type: pandas.DataFrame
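A short sketch; since the resulting DataFrame lives in local memory, the same smallness caveat as collect() applies:

>>> pd_df = kt1.to_pandas()
>>> pd_df.head()   # standard pandas inspection of the locally collected rows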
union(*kts)[source]¶

Union the rows of multiple tables.

Examples

Take the union of rows from two tables:

>>> other = hc.import_table('data/kt_example1.tsv', impute=True)
>>> union_kt = kt1.union(other)

Notes

If a row appears in both tables identically, it is duplicated in the result. The left and right tables must have the same schema and key.

Parameters: kts (args of type KeyTable) – Tables to merge.
Returns: A table with all rows from the left and right tables.
Return type: KeyTable
unpersist()[source]¶

Unpersists this table from memory/disk.

Notes

This function will have no effect on a table that was not previously persisted.

There’s nothing stopping you from continuing to use a table that has been unpersisted, but doing so will result in all previous steps taken to compute the table being performed again since the table must be recomputed. Only unpersist a table when you are done with it.
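A minimal sketch of the intended lifecycle, using only methods documented on this page (persist while the table is reused, then unpersist when finished):

>>> kt_cached = kt1.persist()
>>> n = kt_cached.count()                      # reuse the cached table across several actions
>>> tall = kt_cached.filter("HT > 70").count()
>>> kt_cached.unpersist()                      # free memory/disk once the cached table is no longer needed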
write(output, overwrite=False)[source]¶

Write as KT file.

Examples

>>> kt1.write('output/kt1.kt')

Note

The write path must end in “.kt”.

Parameters:
- output (str) – Path of KT file to write.
- overwrite (bool) – If True, overwrite any existing KT file. Cannot be used to read from and write to the same path.