utils

Interval(start, end[, includes_start, …])

An object representing a range of values between start and end.

Struct(**kwargs)

Nested annotation structure.

hadoop_open(path[, mode, buffer_size])

Open a file through the Hadoop filesystem API.

hadoop_copy(src, dest)

Copy a file through the Hadoop filesystem API.

hadoop_exists(path)

Returns True if path exists.

hadoop_is_file(path)

Returns True if path both exists and is a file.

hadoop_is_dir(path)

Returns True if path both exists and is a directory.

hadoop_stat(path)

Returns information about the file or directory at a given path.

hadoop_ls(path)

Returns information about files at path.

copy_log(path)

Attempt to copy the session log to a Hadoop-API-compatible location.

range_table(n[, n_partitions])

Construct a table with the row index and no other fields.

range_matrix_table(n_rows, n_cols[, …])

Construct a matrix table with row and column indices and no entry fields.

get_1kg(output_dir[, overwrite])

Download a subset of the 1000 Genomes dataset and sample annotations.

get_movie_lens(output_dir[, overwrite])

Download the public MovieLens dataset.

class hail.utils.Interval(start, end, includes_start=True, includes_end=False, point_type=None)[source]

An object representing a range of values between start and end.

>>> interval2 = hl.Interval(3, 6)

Parameters
  • start (any type) – Object with type point_type.

  • end (any type) – Object with type point_type.

  • includes_start (bool) – Interval includes start.

  • includes_end (bool) – Interval includes end.

Note

This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.interval.take(5). This is rare; it is much more common to manipulate the IntervalExpression object, which is constructed with functions such as interval().
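
A minimal sketch of this distinction (the table and the field name i are illustrative): collecting an IntervalExpression produces Python Interval values.

>>> t = hl.utils.range_table(1)
>>> t = t.annotate(i=hl.interval(3, 6))
>>> intervals = t.i.take(1)  # a list containing one Python Interval object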

class hail.utils.Struct(**kwargs)[source]

Nested annotation structure.

>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})

Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:

>>> bar.foo
>>> bar['foo']

Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:

>>> bar['1kg']

The pprint module can be used to print nested Structs in a more human-readable fashion:

>>> from pprint import pprint
>>> pprint(bar)

Parameters

**kwargs – Field names and values.

Note

This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.info.take(5). This is rare; it is much more common to manipulate the StructExpression object, which is constructed using the struct() function.
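
A minimal sketch of this distinction: struct() builds a StructExpression, and evaluating it with eval() yields a Python Struct.

>>> s = hl.struct(a=5, b='foo')  # a StructExpression
>>> hl.eval(s)
Struct(a=5, b='foo')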

hail.utils.hadoop_open(path, mode='r', buffer_size=8192)[source]

Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Warning

Due to an implementation limitation, hadoop_open() may be quite slow for large data sets (anything larger than 50 MB).

Examples

Write a Pandas DataFrame as a CSV directly into Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/df.csv', 'w') as f: 
...     pandas_df.to_csv(f)

Read and print the lines of a text file stored in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt') as f: 
...     for line in f:
...         print(line.strip())

Write two lines directly to a file in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f: 
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)

Unpack a packed Python struct directly from a file in Google Cloud Storage:

>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f: 
...     print(unpack('<f', bytearray(f.read())))

Notes

The supported modes include 'r' and 'w' for reading and writing text, and 'rb' and 'wb' for reading and writing bytes.

The provided file path must be a URI (uniform resource identifier).

Caution

These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.
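
A sketch of that pattern, using a hypothetical local path and bucket: write with standard Python I/O, then move the result with hadoop_copy().

>>> with open('/tmp/results.csv', 'w') as f:  
...     pandas_df.to_csv(f)
>>> hadoop_copy('file:///tmp/results.csv',
...             'gs://my-bucket/results.csv')  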

Parameters
  • path (str) – Path to file.

  • mode (str) – File access mode.

  • buffer_size (int) – Buffer size, in bytes.

Returns

Readable or writable file handle.

hail.utils.hadoop_copy(src, dest)[source]

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

Copy a file from Google Cloud Storage to a local file:

>>> hadoop_copy('gs://hail-common/LCR.interval_list',
...             'file:///mnt/data/LCR.interval_list') 

Notes

Try hadoop_open() first; it's simpler, but not great for large data! For example:

>>> with hadoop_open('gs://my_bucket/results.csv', 'w') as f: 
...     pandas_df.to_csv(f)

The provided source and destination file paths must be URIs (uniform resource identifiers).

Parameters
  • src (str) – Source file URI.

  • dest (str) – Destination file URI.

hail.utils.hadoop_exists(path)[source]

Returns True if path exists.

Parameters

path (str)

Returns

bool

hail.utils.hadoop_is_file(path)[source]

Returns True if path both exists and is a file.

Parameters

path (str)

Returns

bool

hail.utils.hadoop_is_dir(path)[source]

Returns True if path both exists and is a directory.
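
A minimal sketch of the three predicates together, assuming a hypothetical file gs://my-bucket/notes.txt exists:

>>> hadoop_exists('gs://my-bucket/notes.txt')  
True
>>> hadoop_is_file('gs://my-bucket/notes.txt')  
True
>>> hadoop_is_dir('gs://my-bucket/notes.txt')  
False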

Parameters

path (str)

Returns

bool

hail.utils.hadoop_stat(path)[source]

Returns information about the file or directory at a given path.

Notes

Raises an error if path does not exist.

The resulting dictionary contains the following data:

  • is_dir (bool) – Path is a directory.

  • size_bytes (int) – Size in bytes.

  • size (str) – Size as a readable string.

  • modification_time (str) – Time of last file modification.

  • owner (str) – Owner.

  • path (str) – Path.
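
For example, a hedged sketch against a hypothetical file:

>>> stat = hadoop_stat('gs://my-bucket/notes.txt')  
>>> print(stat['size'], stat['modification_time'])  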

Parameters

path (str)

Returns

dict

hail.utils.hadoop_ls(path)[source]

Returns information about files at path.

Notes

Raises an error if path does not exist.

If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).

Each dict element of the result list contains the following data:

  • is_dir (bool) – Path is a directory.

  • size_bytes (int) – Size in bytes.

  • size (str) – Size as a readable string.

  • modification_time (str) – Time of last file modification.

  • owner (str) – Owner.

  • path (str) – Path.
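
For example, a hedged sketch listing a hypothetical bucket directory:

>>> for entry in hadoop_ls('gs://my-bucket/'):  
...     print(entry['path'], entry['size'])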

Parameters

path (str)

Returns

list [dict]

hail.utils.copy_log(path)[source]

Attempt to copy the session log to a Hadoop-API-compatible location.

Examples

Specify a manual path:

>>> hl.copy_log('gs://my-bucket/analysis-10-jan19.log')  
INFO: copying log to 'gs://my-bucket/analysis-10-jan19.log'...

Copy to a directory:

>>> hl.copy_log('gs://my-bucket/')  
INFO: copying log to 'gs://my-bucket/hail-20180924-2018-devel-46e5fad57524.log'...

Notes

Since Hail cannot currently log directly to distributed file systems, this function is provided as a utility for offloading logs from ephemeral nodes.

If path is a directory, then the log file will be copied using its base name to the directory (e.g. /home/hail.log would be copied as gs://my-bucket/hail.log if path is gs://my-bucket).

Parameters

path (str)

hail.utils.range_table(n, n_partitions=None)[source]

Construct a table with the row index and no other fields.

Examples

>>> df = hl.utils.range_table(100)
>>> df.count()
100

Notes

The resulting table contains one field:

  • idx (tint32) - Row index (key).

This method is meant for testing and learning, and is not optimized for production performance.
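
For example, a small learning sketch (the field name squared is illustrative):

>>> t = hl.utils.range_table(5)
>>> t = t.annotate(squared=t.idx * t.idx)
>>> t.squared.collect()
[0, 1, 4, 9, 16]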

Parameters
  • n (int) – Number of rows.

  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns

Table

hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None)[source]

Construct a matrix table with row and column indices and no entry fields.

Examples

>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)
>>> range_ds.count_rows()
100
>>> range_ds.count_cols()
10

Notes

The resulting matrix table contains the following fields:

  • row_idx (tint32) - Row index (row key).

  • col_idx (tint32) - Column index (column key).

It contains no entry fields.

This method is meant for testing and learning, and is not optimized for production performance.
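
For example, a small sketch (the entry field x is illustrative):

>>> mt = hl.utils.range_matrix_table(n_rows=3, n_cols=3)
>>> mt = mt.annotate_entries(x=mt.row_idx * mt.col_idx)
>>> mt.aggregate_entries(hl.agg.sum(mt.x))
9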

Parameters
  • n_rows (int) – Number of rows.

  • n_cols (int) – Number of columns.

  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns

MatrixTable

hail.utils.get_1kg(output_dir, overwrite=False)[source]

Download a subset of the 1000 Genomes dataset and sample annotations.

Notes

The download is about 15 MB.
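
A minimal usage sketch (the output directory is hypothetical):

>>> hl.utils.get_1kg('data/')  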

Parameters
  • output_dir – Directory in which to write data.

  • overwrite – If True, overwrite any existing files/directories at output_dir.

hail.utils.get_movie_lens(output_dir, overwrite=False)[source]

Download the public MovieLens dataset.

Notes

The download is about 6 MB.

See the MovieLens website for more information about this dataset.

Parameters
  • output_dir – Directory in which to write data.

  • overwrite – If True, overwrite any existing files/directories at output_dir.