utils

Interval(start, end[, includes_start, …])

An object representing a range of values between start and end.

Struct(**kwargs)

Nested annotation structure.

hadoop_open(path[, mode, buffer_size])

Open a file through the Hadoop filesystem API.

hadoop_copy(src, dest)

Copy a file through the Hadoop filesystem API.

hadoop_exists(path)

Returns True if path exists.

hadoop_is_file(path)

Returns True if path both exists and is a file.

hadoop_is_dir(path)

Returns True if path both exists and is a directory.

hadoop_stat(path)

Returns information about the file or directory at a given path.

hadoop_ls(path)

Returns information about files at path.

copy_log(path)

Attempt to copy the session log to a hadoop-API-compatible location.

range_table(n[, n_partitions])

Construct a table with the row index and no other fields.

range_matrix_table(n_rows, n_cols[, …])

Construct a matrix table with row and column indices and no entry fields.

get_1kg(output_dir[, overwrite])

Download a subset of the 1000 Genomes dataset and sample annotations.

get_movie_lens(output_dir[, overwrite])

Download the public MovieLens dataset.

class hail.utils.Interval(start, end, includes_start=True, includes_end=False, point_type=None)[source]

An object representing a range of values between start and end.

>>> interval2 = hl.Interval(3, 6)

Parameters
  • start (any type) – Object with type point_type.

  • end (any type) – Object with type point_type.

  • includes_start (bool) – Interval includes start.

  • includes_end (bool) – Interval includes end.

  • point_type (HailType, optional) – Type of the interval's start and end points.

contains(value)[source]

True if value is contained within the interval.

Examples

>>> interval2.contains(5)
True
>>> interval2.contains(6)
False

Parameters

value – Object with type point_type().

Returns

bool

end

End point of the interval.

Examples

>>> interval2.end
6

Returns

Object with type point_type()

includes_end

True if interval is inclusive of end.

Examples

>>> interval2.includes_end
False

Returns

bool

includes_start

True if interval is inclusive of start.

Examples

>>> interval2.includes_start
True

Returns

bool

overlaps(interval)[source]

True if the supplied interval contains any value in common with this one.
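
Examples

A minimal usage sketch with interval2 from above; the second interval is constructed here only for illustration:

>>> interval2.overlaps(hl.Interval(5, 10))
True
>>> interval2.overlaps(hl.Interval(6, 10))
False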

Parameters

interval (Interval) – Interval object with the same point type.

Returns

bool

point_type

Type of each element in the interval.

Examples

>>> interval2.point_type
dtype('int32')

Returns

Type

start

Start point of the interval.

Examples

>>> interval2.start
3

Returns

Object with type point_type()

class hail.utils.Struct(**kwargs)[source]

Nested annotation structure.

>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})

Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:

>>> bar.foo
>>> bar['foo']

Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:

>>> bar['1kg']

The pprint module can be used to print nested Structs in a more human-readable fashion:

>>> from pprint import pprint
>>> pprint(bar)

Parameters

kwargs (keyword args) – Field names and values.

annotate(**kwargs)[source]

Add new fields or recompute existing fields.
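
Examples

A minimal sketch using the bar struct defined above; qux is a hypothetical new field:

>>> s = bar.annotate(foo=6, qux='new')
>>> s.foo
6
>>> s.qux
'new'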

Notes

If an expression in kwargs shares a name with a field of the struct, then that field will be replaced but keep its position in the struct. New fields will be appended to the end of the struct.

Parameters

kwargs (keyword args) – Fields to add.

Returns

Struct – Struct with new or updated fields.

drop(*args)[source]

Drop fields from the struct.
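
Examples

A minimal sketch using the bar struct defined above:

>>> s = bar.drop('foo')
>>> 'foo' in s
False
>>> s['1kg']
10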

Parameters

fields (varargs of str) – Fields to drop.

Returns

Struct – Struct without certain fields.

select(*fields, **kwargs)[source]

Select existing fields and compute new ones.
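
Examples

A minimal sketch using the bar struct defined above; baz is a hypothetical computed field:

>>> s = bar.select('foo', baz=bar.foo + 1)
>>> s.foo
5
>>> s.baz
6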

Notes

The fields argument is a list of field names to keep. These fields will appear in the resulting struct in the order they appear in fields.

The kwargs arguments are new fields to add.

Parameters
  • fields (varargs of str) – Field names to keep.

  • kwargs (keyword args) – New fields to add.

Returns

Struct – Struct containing specified existing fields and computed fields.

hail.utils.hadoop_open(path: str, mode: str = 'r', buffer_size: int = 8192)[source]

Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Warning

Due to an implementation limitation, hadoop_open() may be quite slow for large data sets (anything larger than 50 MB).

Examples

Write a Pandas DataFrame as a CSV directly into Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/df.csv', 'w') as f: # doctest: +SKIP
...     pandas_df.to_csv(f)

Read and print the lines of a text file stored in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt') as f: # doctest: +SKIP
...     for line in f:
...         print(line.strip())

Write two lines directly to a file in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f: # doctest: +SKIP
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)

Unpack a packed Python struct directly from a file in Google Cloud Storage:

>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f: # doctest: +SKIP
...     print(unpack('<f', bytearray(f.read())))

Notes

The supported modes are:

  • 'r' – Readable text file (io.TextIOWrapper). Default behavior.

  • 'w' – Writable text file (io.TextIOWrapper).

  • 'x' – Exclusive writable text file (io.TextIOWrapper). Throws an error if a file already exists at the path.

  • 'rb' – Readable binary file (io.BufferedReader).

  • 'wb' – Writable binary file (io.BufferedWriter).

  • 'xb' – Exclusive writable binary file (io.BufferedWriter). Throws an error if a file already exists at the path.

The provided destination file path must be a URI (uniform resource identifier).

Caution

These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.

Parameters
  • path (str) – Path to file.

  • mode (str) – File access mode.

  • buffer_size (int) – Buffer size, in bytes.

Returns

Readable or writable file handle.

hail.utils.hadoop_copy(src, dest)[source]

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

Copy a file from Google Cloud Storage to a local file:

>>> hadoop_copy('gs://hail-common/LCR.interval_list',
...             'file:///mnt/data/LCR.interval_list') # doctest: +SKIP

Notes

Try using hadoop_open() first; it’s simpler, but not great for large data! For example:

>>> with hadoop_open('gs://my_bucket/results.csv', 'w') as f: # doctest: +SKIP
...     pandas_df.to_csv(f)

The provided source and destination file paths must be URIs (uniform resource identifiers).

Parameters
  • src (str) – Source file URI.

  • dest (str) – Destination file URI.

hail.utils.hadoop_exists(path: str) → bool[source]

Returns True if path exists.
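
Examples

A minimal usage sketch; the path below is hypothetical:

>>> hadoop_exists('gs://my-bucket/notes.txt')  # doctest: +SKIP
True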

Parameters

path (str)

Returns

bool

hail.utils.hadoop_is_file(path: str) → bool[source]

Returns True if path both exists and is a file.
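
Examples

A minimal usage sketch; the path below is hypothetical:

>>> hadoop_is_file('gs://my-bucket/notes.txt')  # doctest: +SKIP
True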

Parameters

path (str)

Returns

bool

hail.utils.hadoop_is_dir(path: str) → bool[source]

Returns True if path both exists and is a directory.
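
Examples

A minimal usage sketch; the path below is hypothetical:

>>> hadoop_is_dir('gs://my-bucket/')  # doctest: +SKIP
True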

Parameters

path (str)

Returns

bool

hail.utils.hadoop_stat(path: str) → Dict[source]

Returns information about the file or directory at a given path.
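
Examples

A minimal usage sketch; the path is hypothetical, and the key accessed is one of the fields listed in the Notes below:

>>> stat = hadoop_stat('gs://my-bucket/notes.txt')  # doctest: +SKIP
>>> stat['is_dir']  # doctest: +SKIP
False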

Notes

Raises an error if path does not exist.

The resulting dictionary contains the following data:

  • is_dir (bool) – Path is a directory.

  • size_bytes (int) – Size in bytes.

  • size (str) – Size as a readable string.

  • modification_time (str) – Time of last file modification.

  • owner (str) – Owner.

  • path (str) – Path.

Parameters

path (str)

Returns

Dict

hail.utils.hadoop_ls(path: str) → List[Dict][source]

Returns information about files at path.
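
Examples

A minimal usage sketch; the path is hypothetical, and the keys accessed are fields listed in the Notes below:

>>> for entry in hadoop_ls('gs://my-bucket/'):  # doctest: +SKIP
...     print(entry['path'], entry['size'])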

Notes

Raises an error if path does not exist.

If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).

Each dict element of the result list contains the following data:

  • is_dir (bool) – Path is a directory.

  • size_bytes (int) – Size in bytes.

  • size (str) – Size as a readable string.

  • modification_time (str) – Time of last file modification.

  • owner (str) – Owner.

  • path (str) – Path.

Parameters

path (str)

Returns

List[Dict]

hail.utils.copy_log(path: str) → None[source]

Attempt to copy the session log to a hadoop-API-compatible location.

Examples

Specify a manual path:

>>> hl.copy_log('gs://my-bucket/analysis-10-jan19.log')  # doctest: +SKIP
INFO: copying log to 'gs://my-bucket/analysis-10-jan19.log'...

Copy to a directory:

>>> hl.copy_log('gs://my-bucket/')  # doctest: +SKIP
INFO: copying log to 'gs://my-bucket/hail-20180924-2018-devel-46e5fad57524.log'...

Notes

Since Hail cannot currently log directly to distributed file systems, this function is provided as a utility for offloading logs from ephemeral nodes.

If path is a directory, then the log file will be copied using its base name to the directory (e.g. /home/hail.log would be copied as gs://my-bucket/hail.log if path is gs://my-bucket).

Parameters

path (str)

hail.utils.range_table(n, n_partitions=None) → hail.Table[source]

Construct a table with the row index and no other fields.

Examples

>>> df = hl.utils.range_table(100)
>>> df.count()
100

Notes

The resulting table contains one field:

  • idx (tint32) - Row index (key).

This method is meant for testing and learning, and is not optimized for production performance.

Parameters
  • n (int) – Number of rows.

  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns

Table

hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None) → hail.MatrixTable[source]

Construct a matrix table with row and column indices and no entry fields.

Examples

>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)
>>> range_ds.count_rows()
100
>>> range_ds.count_cols()
10

Notes

The resulting matrix table contains the following fields:

  • row_idx (tint32) - Row index (row key).

  • col_idx (tint32) - Column index (column key).

It contains no entry fields.

This method is meant for testing and learning, and is not optimized for production performance.

Parameters
  • n_rows (int) – Number of rows.

  • n_cols (int) – Number of columns.

  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns

MatrixTable

hail.utils.get_1kg(output_dir, overwrite: bool = False)[source]

Download a subset of the 1000 Genomes dataset and sample annotations.
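
Examples

A minimal usage sketch; 'data/' is a hypothetical output directory:

>>> hl.utils.get_1kg('data/')  # doctest: +SKIP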

Notes

The download is about 15 MB.

Parameters
  • output_dir – Directory in which to write data.

  • overwrite – If True, overwrite any existing files/directories at output_dir.

hail.utils.get_movie_lens(output_dir, overwrite: bool = False)[source]

Download the public MovieLens dataset.
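
Examples

A minimal usage sketch; 'data/' is a hypothetical output directory:

>>> hl.utils.get_movie_lens('data/')  # doctest: +SKIP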

Notes

The download is about 6 MB.

See the MovieLens website for more information about this dataset.

Parameters
  • output_dir – Directory in which to write data.

  • overwrite – If True, overwrite any existing files/directories at output_dir.