utils

Interval(start, end[, includes_start, …]) An object representing a range of values between start and end.
Struct(**kwargs) Nested annotation structure.
hadoop_open(path, mode, buffer_size) Open a file through the Hadoop filesystem API.
hadoop_copy(src, dest) Copy a file through the Hadoop filesystem API.
hadoop_exists(path) Returns True if path exists.
hadoop_is_file(path) Returns True if path both exists and is a file.
hadoop_is_dir(path) Returns True if path both exists and is a directory.
hadoop_stat(path) Returns information about the file or directory at a given path.
hadoop_ls(path) Returns information about files at path.
copy_log(path) Attempt to copy the session log to a hadoop-API-compatible location.
range_table(n[, n_partitions]) Construct a table with the row index and no other fields.
range_matrix_table(n_rows, n_cols[, …]) Construct a matrix table with row and column indices and no entry fields.
get_1kg(output_dir, overwrite) Download subset of the 1000 Genomes dataset and sample annotations.
get_movie_lens(output_dir, overwrite) Download public MovieLens dataset.
class hail.utils.Interval(start, end, includes_start=True, includes_end=False)[source]

An object representing a range of values between start and end.

>>> interval2 = hl.Interval(3, 6)
Parameters:
  • start (any type) – Object with type point_type.
  • end (any type) – Object with type point_type.
  • includes_start (bool) – Interval includes start.
  • includes_end (bool) – Interval includes end.
contains(value)[source]

True if value is contained within the interval.

Examples

>>> interval2.contains(5)
True
>>> interval2.contains(6)
False
Parameters:value – Object with type point_type().
Returns:bool
end

End point of the interval.

Examples

>>> interval2.end
6
Returns:Object with type point_type()
includes_end

True if interval is inclusive of end.

Examples

>>> interval2.includes_end
False
Returns:bool
includes_start

True if interval is inclusive of start.

Examples

>>> interval2.includes_start
True
Returns:bool
overlaps(interval)[source]

True if the supplied interval contains any value in common with this one.
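
Examples

A minimal sketch using interval2 from above (assuming the default boundary semantics, under which interval2 excludes its end point 6):

>>> interval2.overlaps(hl.Interval(5, 10))
True
>>> interval2.overlaps(hl.Interval(6, 10))
False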

Parameters:interval (Interval) – Interval object with the same point type.
Returns:bool
point_type

Type of each element in the interval.

Examples

>>> interval2.point_type
dtype('int32')
Returns:Type
start

Start point of the interval.

Examples

>>> interval2.start
3
Returns:Object with type point_type()
class hail.utils.Struct(**kwargs)[source]

Nested annotation structure.

>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})

Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:

>>> bar.foo
>>> bar['foo']

Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:

>>> bar['1kg']

The pprint module can be used to print nested Structs in a more human-readable fashion:

>>> from pprint import pprint
>>> pprint(bar)
Parameters:attributes – Field names and values.
annotate(**kwargs)[source]

Add new fields or recompute existing fields.
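
Examples

A minimal sketch (the field names a, b, and c are illustrative):

>>> s = hl.Struct(a=1, b=2)
>>> s2 = s.annotate(b=5, c=3)
>>> s2.b, s2.c
(5, 3)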

Notes

If an expression in kwargs shares a name with a field of the struct, then that field will be replaced but keep its position in the struct. New fields will be appended to the end of the struct.

Parameters:kwargs (keyword args) – Fields to add.
Returns:Struct – Struct with new or updated fields.
drop(*args)[source]

Drop fields from the struct.
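
Examples

A minimal sketch (the field names are illustrative):

>>> s = hl.Struct(a=1, b=2, c=3)
>>> s3 = s.drop('b')
>>> s3.a, s3.c
(1, 3)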

Parameters:fields (varargs of str) – Fields to drop.
Returns:Struct – Struct without certain fields.
select(*fields, **kwargs)[source]

Select existing fields and compute new ones.
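
Examples

A minimal sketch (the field names are illustrative):

>>> s = hl.Struct(a=1, b=2, c=3)
>>> s4 = s.select('c', 'a', d=10)
>>> s4.c, s4.a, s4.d
(3, 1, 10)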

Notes

The fields argument is a list of field names to keep. These fields will appear in the resulting struct in the order they appear in fields.

The kwargs arguments are new fields to add.

Parameters:
  • fields (varargs of str) – Field names to keep.
  • kwargs (keyword args) – New fields to add.
Returns:

Struct – Struct containing specified existing fields and computed fields.

hail.utils.hadoop_open(path: str, mode: str = 'r', buffer_size: int = 8192)[source]

Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Warning

Due to an implementation limitation, hadoop_open() may be quite slow for large data sets (anything larger than 50 MB).

Examples

Write a Pandas DataFrame as a CSV directly into Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/df.csv', 'w') as f: 
...     pandas_df.to_csv(f)

Read and print the lines of a text file stored in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt') as f: 
...     for line in f:
...         print(line.strip())

Write two lines directly to a file in Google Cloud Storage:

>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f: 
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)

Unpack a packed Python struct directly from a file in Google Cloud Storage:

>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f: 
...     print(unpack('<f', bytearray(f.read())))

Notes

The supported modes are:

  • 'r' – Readable text file (io.TextIOWrapper). Default behavior.
  • 'w' – Writable text file (io.TextIOWrapper).
  • 'x' – Exclusive writable text file (io.TextIOWrapper). Throws an error if a file already exists at the path.
  • 'rb' – Readable binary file (io.BufferedReader).
  • 'wb' – Writable binary file (io.BufferedWriter).
  • 'xb' – Exclusive writable binary file (io.BufferedWriter). Throws an error if a file already exists at the path.

The provided file path must be a URI (uniform resource identifier).

Caution

These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.
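
For example, a hedged sketch of that pattern (the local path and bucket path are illustrative placeholders):

>>> with open('/tmp/output.tsv', 'w') as f:
...     pandas_df.to_csv(f)
>>> hadoop_copy('file:///tmp/output.tsv', 'gs://my-bucket/output.tsv')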

Parameters:
  • path (str) – Path to file.
  • mode (str) – File access mode.
  • buffer_size (int) – Buffer size, in bytes.
Returns:

Readable or writable file handle.

hail.utils.hadoop_copy(src, dest)[source]

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

Copy a file from Google Cloud Storage to a local file:

>>> hadoop_copy('gs://hail-common/LCR.interval_list',
...             'file:///mnt/data/LCR.interval_list') 

Notes

Try using hadoop_open() first; it’s simpler, but not great for large data. For example:

>>> with hadoop_open('gs://my_bucket/results.csv', 'w') as f: 
...     pandas_df.to_csv(f)

The provided source and destination file paths must be URIs (uniform resource identifiers).

Parameters:
  • src (str) – Source file URI.
  • dest (str) – Destination file URI.
hail.utils.hadoop_exists(path: str) → bool[source]

Returns True if path exists.

Parameters:path (str)
Returns:bool
hail.utils.hadoop_is_file(path: str) → bool[source]

Returns True if path both exists and is a file.

Parameters:path (str)
Returns:bool
hail.utils.hadoop_is_dir(path) → bool[source]

Returns True if path both exists and is a directory.

Parameters:path (str)
Returns:bool
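
These predicates compose naturally with hadoop_open(). A hedged sketch (the path is an illustrative placeholder):

>>> p = 'gs://my-bucket/notes.txt'
>>> if hadoop_exists(p) and hadoop_is_file(p):
...     with hadoop_open(p) as f:
...         print(f.read())
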
hail.utils.hadoop_stat(path: str) → Dict[source]

Returns information about the file or directory at a given path.
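
Examples

A hedged usage sketch (the path is an illustrative placeholder):

>>> stats = hadoop_stat('gs://my-bucket/notes.txt')
>>> print(stats['size'], stats['modification_time'])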

Notes

Raises an error if path does not exist.

The resulting dictionary contains the following data:

  • is_dir (bool) – Path is a directory.
  • size_bytes (int) – Size in bytes.
  • size (str) – Size as a readable string.
  • modification_time (str) – Time of last file modification.
  • owner (str) – Owner.
  • path (str) – Path.
Parameters:path (str)
Returns:Dict
hail.utils.hadoop_ls(path: str) → List[Dict][source]

Returns information about files at path.
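
Examples

A hedged usage sketch that prints only the non-directory entries (the bucket path is an illustrative placeholder):

>>> for entry in hadoop_ls('gs://my-bucket/'):
...     if not entry['is_dir']:
...         print(entry['path'], entry['size'])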

Notes

Raises an error if path does not exist.

If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).

Each dict element of the result list contains the following data:

  • is_dir (bool) – Path is a directory.
  • size_bytes (int) – Size in bytes.
  • size (str) – Size as a readable string.
  • modification_time (str) – Time of last file modification.
  • owner (str) – Owner.
  • path (str) – Path.
Parameters:path (str)
Returns:List[Dict]
hail.utils.copy_log(path: str) → None[source]

Attempt to copy the session log to a hadoop-API-compatible location.

Examples

Specify a manual path:

>>> hl.copy_log('gs://my-bucket/analysis-10-jan19.log')  
INFO: copying log to 'gs://my-bucket/analysis-10-jan19.log'...

Copy to a directory:

>>> hl.copy_log('gs://my-bucket/')  
INFO: copying log to 'gs://my-bucket/hail-20180924-2018-devel-46e5fad57524.log'...

Notes

Since Hail cannot currently log directly to distributed file systems, this function is provided as a utility for offloading logs from ephemeral nodes.

If path is a directory, then the log file will be copied into that directory using its base name (e.g. /home/hail.log would be copied as gs://my-bucket/hail.log if path is gs://my-bucket).

Parameters:path (str)
hail.utils.range_table(n, n_partitions=None) → hail.Table[source]

Construct a table with the row index and no other fields.

Examples

>>> df = hl.utils.range_table(100)
>>> df.count()
100

Notes

The resulting table contains one field:

  • idx (tint32) - Row index (key).

This method is meant for testing and learning, and is not optimized for production performance.
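
A common pattern is to use range_table() as a scaffold for computed fields; a minimal sketch (the field name x is illustrative):

>>> df = hl.utils.range_table(5)
>>> df = df.annotate(x = df.idx * 2)
>>> df.x.collect()
[0, 2, 4, 6, 8]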

Parameters:
  • n (int) – Number of rows.
  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).
Returns:

Table

hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None) → hail.MatrixTable[source]

Construct a matrix table with row and column indices and no entry fields.

Examples

>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)
>>> range_ds.count_rows()
100
>>> range_ds.count_cols()
10

Notes

The resulting matrix table contains the following fields:

  • row_idx (tint32) - Row index (row key).
  • col_idx (tint32) - Column index (column key).

It contains no entry fields.

This method is meant for testing and learning, and is not optimized for production performance.
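
Entry fields can be added afterwards, for example with MatrixTable.annotate_entries(); a minimal sketch (the field name x is illustrative):

>>> range_ds = range_ds.annotate_entries(x = range_ds.row_idx * range_ds.col_idx)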

Parameters:
  • n_rows (int) – Number of rows.
  • n_cols (int) – Number of columns.
  • n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).
Returns:

MatrixTable

hail.utils.get_1kg(output_dir, overwrite: bool = False)[source]

Download subset of the 1000 Genomes dataset and sample annotations.
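
Examples

A hedged usage sketch; it assumes, following the Hail GWAS tutorial, that the matrix table is written to 1kg.mt inside output_dir:

>>> hl.utils.get_1kg('data/')
>>> mt = hl.read_matrix_table('data/1kg.mt')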

Notes

The download is about 15 MB.

Parameters:
  • output_dir – Directory in which to write data.
  • overwrite – If True, overwrite any existing files/directories at output_dir.
hail.utils.get_movie_lens(output_dir, overwrite: bool = False)[source]

Download the public MovieLens dataset.

Notes

The download is about 6 MB.

See the MovieLens website for more information about this dataset.

Parameters:
  • output_dir – Directory in which to write data.
  • overwrite – If True, overwrite any existing files/directories at output_dir.