utils

Interval(start, end[, includes_start, …])

An object representing a range of values between start and end.

Struct(**kwargs)

Nested annotation structure.

frozendict(d)

An object representing an immutable dictionary.

hadoop_open(path[, mode, buffer_size])

Open a file through the Hadoop filesystem API.

hadoop_copy(src, dest)

Copy a file through the Hadoop filesystem API.

hadoop_exists(path)

Returns True if path exists.

hadoop_is_file(path)

Returns True if path both exists and is a file.

hadoop_is_dir(path)

Returns True if path both exists and is a directory.

hadoop_stat(path)

Returns information about the file or directory at a given path.

hadoop_ls(path)

Returns information about files at path.

copy_log(path)

Attempt to copy the session log to a hadoop-API-compatible location.

range_table(n[, n_partitions])

Construct a table with the row index and no other fields.

range_matrix_table(n_rows, n_cols[, …])

Construct a matrix table with row and column indices and no entry fields.

get_1kg(output_dir[, overwrite])

Download a subset of the 1000 Genomes dataset and sample annotations.

get_movie_lens(output_dir[, overwrite])

Download the public MovieLens dataset.

class hail.utils.Interval(start, end, includes_start=True, includes_end=False, point_type=None)[source]

An object representing a range of values between start and end.

>>> interval2 = hl.Interval(3, 6)
Parameters
  • start (any type) – Object with type point_type.

  • end (any type) – Object with type point_type.

  • includes_start (bool) – Interval includes start.

  • includes_end (bool) – Interval includes end.

Note

This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.interval.take(5). This is rare; it is much more common to manipulate the IntervalExpression object, which is constructed using functions such as interval().
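For intuition, the default inclusion semantics (includes_start=True, includes_end=False) can be sketched in plain Python. The contains helper below is hypothetical, written only to illustrate the boundary behavior; it is not part of the Hail API:

```python
def contains(start, end, point, includes_start=True, includes_end=False):
    """Membership test matching Interval's default half-open semantics."""
    if point < start or point > end:
        return False
    if point == start:
        return includes_start
    if point == end:
        return includes_end
    return True

# hl.Interval(3, 6) is the half-open interval [3, 6): it contains 3 but not 6.
assert contains(3, 6, 3)
assert contains(3, 6, 5)
assert not contains(3, 6, 6)
```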

class hail.utils.Struct(**kwargs)[source]

Nested annotation structure.

>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})

Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:

>>> bar.foo
>>> bar['foo']

Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:

>>> bar['1kg']

The pprint module can be used to print nested Structs in a more human-readable fashion:

>>> from pprint import pprint
>>> pprint(bar)
Parameters

attributes – Field names and values.

Note

This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.info.take(5). This is rare; it is much more common to manipulate the StructExpression object, which is constructed using the struct() function.
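The item/attribute duality described above can be illustrated with a toy stand-in class. MiniStruct is hypothetical, not Hail's implementation; it only demonstrates why both access syntaxes work and why invalid identifiers require item access:

```python
class MiniStruct:
    """Toy stand-in for hail.utils.Struct: fields readable as items or attributes."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __getitem__(self, key):
        return self.__dict__[key]

bar = MiniStruct(**{'foo': 5, '1kg': 10})
assert bar.foo == bar['foo'] == 5
# '1kg' starts with a digit, so attribute syntax is impossible; item access works.
assert bar['1kg'] == 10
```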

class hail.utils.frozendict(d)[source]

An object representing an immutable dictionary.

>>> my_frozen_dict = hl.utils.frozendict({1:2, 7:5})

To get a normal python dictionary with the same elements from a frozendict:

>>> dict(frozendict({'a': 1, 'b': 2}))

Note

This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.my_dict.take(5). This is rare; it is much more common to manipulate the DictExpression object, which is constructed using dict(). This class is necessary because hail supports using dicts as keys to other dicts or as elements in sets, while python does not.
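A minimal sketch of why such a type is needed: a hashable dictionary can serve as a dict key or set element, which a plain Python dict cannot. FrozenDict below is a toy illustration, not Hail's frozendict:

```python
class FrozenDict(dict):
    """Toy sketch of an immutable, hashable dict usable as a key or set element."""
    def __hash__(self):
        return hash(frozenset(self.items()))

    def __setitem__(self, key, value):
        raise TypeError("FrozenDict is immutable")

k = FrozenDict({1: 2, 7: 5})
lookup = {k: 'found'}                       # a dict used as a dict key
assert lookup[FrozenDict({7: 5, 1: 2})] == 'found'
assert dict(k) == {1: 2, 7: 5}              # round-trip back to a plain dict
```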

hail.utils.hadoop_open(path, mode='r', buffer_size=8192)[source]

Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Warning

Due to an implementation limitation, hadoop_open() may be quite slow for large data sets (anything larger than 50 MB).

Examples

Write a Pandas DataFrame as a CSV directly into Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/df.csv', 'w') as f: 
...     pandas_df.to_csv(f)

Read and print the lines of a text file stored in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt') as f: 
...     for line in f:
...         print(line.strip())

Write two lines directly to a file in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f: 
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)

Unpack a packed Python struct directly from a file in Google Cloud Storage:

>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f: 
...     print(unpack('<f', bytearray(f.read())))

Notes

The supported modes include 'r' (readable text file, the default), 'w' (writable text file), and their binary counterparts 'rb' and 'wb', following the conventions of Python's built-in open().

The provided file path must be a URI (uniform resource identifier).

Caution

These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.

Parameters
  • path (str) – Path to file.

  • mode (str) – File access mode.

  • buffer_size (int) – Buffer size, in bytes.

Returns

Readable or writable file handle.
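Because the returned handle follows the standard Python file API, the usual with-block patterns apply unchanged. The sketch below uses built-in open() on a local temporary file as a stand-in, on the assumption that a hadoop_open handle behaves the same way:

```python
import os
import tempfile

# Local stand-in: open() here plays the role hadoop_open() would play
# against a gs:// or hdfs:// URI.
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

with open(path, 'w') as f:          # hadoop_open(path, 'w') follows the same pattern
    f.write('result1: 0.95\n')
    f.write('result2: 0.80\n')

with open(path) as f:               # default mode 'r'
    lines = [line.strip() for line in f]

assert lines == ['result1: 0.95', 'result2: 0.80']
```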

hail.utils.hadoop_copy(src, dest)[source]

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

Copy a file from Google Cloud Storage to a local file:

>>> hadoop_copy('gs://hail-common/LCR.interval_list',
...             'file:///mnt/data/LCR.interval_list') 

Notes

Try hadoop_open() first; it's simpler, but not well suited to large data. For example:

>>> with hadoop_open('gs://my_bucket/results.csv', 'w') as f: 
...     pandas_df.to_csv(f)

The provided source and destination file paths must be URIs (uniform resource identifiers).

Parameters
  • src (str) – Source file URI.

  • dest (str) – Destination file URI.
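The Caution under hadoop_open() recommends writing large files locally and then copying them in one step. A sketch of that workflow, with shutil.copy standing in for hadoop_copy() so the example runs on a local filesystem:

```python
import os
import shutil
import tempfile

src_dir, dest_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
local = os.path.join(src_dir, 'results.csv')

# 1. Write the file with fast standard Python I/O.
with open(local, 'w') as f:
    f.write('id,score\n1,0.9\n')

# 2. Ship it to the destination in a single copy. Against a real cluster this
#    step would be: hadoop_copy('file://' + local, 'gs://my-bucket/results.csv')
dest = os.path.join(dest_dir, 'results.csv')
shutil.copy(local, dest)

assert open(dest).read() == 'id,score\n1,0.9\n'
```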

hail.utils.hadoop_exists(path)[source]

Returns True if path exists.

Parameters

path (str)

Returns

bool

hail.utils.hadoop_is_file(path)[source]

Returns True if path both exists and is a file.

Parameters

path (str)

Returns

bool

hail.utils.hadoop_is_dir(path)[source]

Returns True if path both exists and is a directory.

Parameters

path (str)

Returns

bool
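These three predicates are often combined to classify a path before acting on it. The sketch below uses os.path on a local filesystem as a stand-in for the hadoop_* functions; describe is a hypothetical helper, not part of the Hail API:

```python
import os
import tempfile

def describe(exists, is_file, is_dir):
    """Classify a path from the three boolean predicate results
    (hadoop_exists, hadoop_is_file, hadoop_is_dir return exactly these)."""
    if not exists:
        return 'missing'
    return 'file' if is_file else 'directory' if is_dir else 'other'

d = tempfile.mkdtemp()
f = os.path.join(d, 'x.txt')
open(f, 'w').close()

# os.path stands in for the hadoop_* predicates on a local filesystem.
assert describe(os.path.exists(f), os.path.isfile(f), os.path.isdir(f)) == 'file'
assert describe(os.path.exists(d), os.path.isfile(d), os.path.isdir(d)) == 'directory'
assert describe(False, False, False) == 'missing'
```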

hail.utils.hadoop_stat(path)[source]

Returns information about the file or directory at a given path.

Notes

Raises an error if path does not exist.

The resulting dictionary contains the following data:

  • is_dir (bool) – Path is a directory.

  • size_bytes (int) – Size in bytes.

  • size (str) – Size as a readable string.

  • modification_time (str) – Time of last file modification.

  • owner (str) – Owner.

  • path (str) – Path.

Parameters

path (str)

Returns

dict

hail.utils.hadoop_ls(path)[source]

Returns information about files at path.

Notes

Raises an error if path does not exist.

If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).

Each dict element of the result list contains the following data:

  • is_dir (bool) – Path is a directory.

  • size_bytes (int) – Size in bytes.

  • size (str) – Size as a readable string.

  • modification_time (str) – Time of last file modification.

  • owner (str) – Owner.

  • path (str) – Path.

Parameters

path (str)

Returns

list [dict]
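A sketch of consuming the result list, using a hypothetical hadoop_ls return value with the documented keys (every value below is made up for illustration):

```python
# Hypothetical hadoop_ls('gs://bucket/') output with the documented keys.
entries = [
    {'path': 'gs://bucket/a.txt', 'is_dir': False, 'size_bytes': 1024,
     'size': '1.0K', 'modification_time': '2024-01-01 00:00', 'owner': 'me'},
    {'path': 'gs://bucket/sub',   'is_dir': True,  'size_bytes': 0,
     'size': '0B',   'modification_time': '2024-01-01 00:00', 'owner': 'me'},
    {'path': 'gs://bucket/b.txt', 'is_dir': False, 'size_bytes': 2048,
     'size': '2.0K', 'modification_time': '2024-01-01 00:00', 'owner': 'me'},
]

# Keep only regular files and total their sizes.
files = [e for e in entries if not e['is_dir']]
total_bytes = sum(e['size_bytes'] for e in files)
assert total_bytes == 3072
```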

hail.utils.copy_log(path)[source]

Attempt to copy the session log to a hadoop-API-compatible location.

Examples

Specify a manual path:

>>> hl.copy_log('gs://my-bucket/analysis-10-jan19.log')  
INFO: copying log to 'gs://my-bucket/analysis-10-jan19.log'...

Copy to a directory:

>>> hl.copy_log('gs://my-bucket/')  
INFO: copying log to 'gs://my-bucket/hail-20180924-2018-devel-46e5fad57524.log'...

Notes

Since Hail cannot currently log directly to distributed file systems, this function is provided as a utility for offloading logs from ephemeral nodes.

If path is a directory, then the log file will be copied to that directory using its base name (e.g. /home/hail.log would be copied as gs://my-bucket/hail.log if path is gs://my-bucket).

Parameters

path (str)

hail.utils.range_table(n, n_partitions=None)[source]

Construct a table with the row index and no other fields.

Examples

>>> df = hl.utils.range_table(100)
>>> df.count()
100

Notes

The resulting table contains one field:

  • idx (tint32) - Row index (key).

This method is meant for testing and learning, and is not optimized for production performance.

Parameters
  • n (int) – Number of rows.

  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns

Table
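The rows of range_table(n) can be modeled in plain Python as n structs carrying a single idx field. This is a sketch of the resulting data only, not of Hail's implementation:

```python
def range_rows(n):
    """Plain-Python model of the rows of hl.utils.range_table(n)."""
    return [{'idx': i} for i in range(n)]

rows = range_rows(100)
assert len(rows) == 100          # mirrors df.count() == 100 in the example above
assert rows[0] == {'idx': 0}
assert rows[-1] == {'idx': 99}
```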

hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None)[source]

Construct a matrix table with row and column indices and no entry fields.

Examples

>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)
>>> range_ds.count_rows()
100
>>> range_ds.count_cols()
10

Notes

The resulting matrix table contains the following fields:

  • row_idx (tint32) - Row index (row key).

  • col_idx (tint32) - Column index (column key).

It contains no entry fields.

This method is meant for testing and learning, and is not optimized for production performance.

Parameters
  • n_rows (int) – Number of rows.

  • n_cols (int) – Number of columns.

  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns

MatrixTable

hail.utils.get_1kg(output_dir, overwrite=False)[source]

Download a subset of the 1000 Genomes dataset and sample annotations.

Notes

The download is about 15 MB.

Parameters
  • output_dir – Directory in which to write data.

  • overwrite – If True, overwrite any existing files/directories at output_dir.

hail.utils.get_movie_lens(output_dir, overwrite=False)[source]

Download the public MovieLens dataset.

Notes

The download is about 6 MB.

See the MovieLens website for more information about this dataset.

Parameters
  • output_dir – Directory in which to write data.

  • overwrite – If True, overwrite any existing files/directories at output_dir.