utils
- Interval – An object representing a range of values between start and end.
- Struct – Nested annotation structure.
- frozendict – An object representing an immutable dictionary.
- hadoop_open – Open a file through the Hadoop filesystem API.
- hadoop_copy – Copy a file through the Hadoop filesystem API.
- hadoop_exists – Returns True if path exists.
- hadoop_is_file – Returns True if path both exists and is a file.
- hadoop_is_dir – Returns True if path both exists and is a directory.
- hadoop_stat – Returns information about the file or directory at a given path.
- hadoop_ls – Returns information about files at path.
- hadoop_scheme_supported – Returns True if the Hadoop filesystem supports URLs with the given scheme.
- copy_log – Attempt to copy the session log to a hadoop-API-compatible location.
- range_table – Construct a table with the row index and no other fields.
- range_matrix_table – Construct a matrix table with row and column indices and no entry fields.
- get_1kg – Download a subset of the 1000 Genomes dataset and sample annotations.
- get_hgdp – Download a subset of the Human Genome Diversity Panel dataset and sample annotations.
- get_movie_lens – Download the public MovieLens dataset.
- ANY_REGION – Built-in mutable sequence.
- class hail.utils.Interval(start, end, includes_start=True, includes_end=False, point_type=None)[source]
An object representing a range of values between start and end.
>>> interval2 = hl.Interval(3, 6)
- Parameters:
  - start – Object with type point_type.
  - end – Object with type point_type.
  - includes_start (bool) – Whether the interval includes start.
  - includes_end (bool) – Whether the interval includes end.
  - point_type (HailType, optional) – Point type, inferred from start and end if not specified.
Note
This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.interval.take(5). This is rare; it is much more common to manipulate the IntervalExpression object, which is constructed using the following functions: interval(), locus_interval(), and parse_locus_interval().
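For the expression side, here is a minimal sketch (assuming an initialized Hail session) of building an IntervalExpression with interval() and testing membership with eval(); the variable names are illustrative:
>>> import hail as hl
>>> iv = hl.interval(3, 6)  # start inclusive, end exclusive by default
>>> hl.eval(iv.contains(3))
True
>>> hl.eval(iv.contains(6))
False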
- class hail.utils.Struct(**kwargs)[source]
Nested annotation structure.
>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})
Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:
>>> bar.foo
>>> bar['foo']
Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:
>>> bar['1kg']
The pprint module can be used to print nested Structs in a more human-readable fashion:
>>> from pprint import pprint
>>> pprint(bar)
- Parameters:
attributes – Field names and values.
Note
This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.info.take(5). This is rare; it is much more common to manipulate the StructExpression object, which is constructed using the struct() function.
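As a complement, a short sketch showing that evaluating a struct() expression yields one of these Struct values (field names here are made up for illustration):
>>> import hail as hl
>>> s = hl.eval(hl.struct(foo=5, bar='hello'))  # evaluating a StructExpression yields a Struct
>>> s.foo, s['bar']
(5, 'hello')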
- class hail.utils.frozendict(d)[source]
An object representing an immutable dictionary.
>>> my_frozen_dict = hl.utils.frozendict({1:2, 7:5})
To get a normal python dictionary with the same elements from a frozendict:
>>> dict(frozendict({'a': 1, 'b': 2}))
Note
This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.my_dict.take(5). This is rare; it is much more common to manipulate the DictExpression object, which is constructed using dict(). This class is necessary because Hail supports using dicts as keys of other dicts or as elements in sets, while Python does not.
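A small sketch of the hashability that motivates this class, assuming content-based equality (which is the point of the type):
>>> from hail.utils import frozendict
>>> key = frozendict({'a': 1})
>>> lookup = {key: 'first'}           # usable as a dict key, unlike a plain dict
>>> lookup[frozendict({'a': 1})]      # equal contents compare and hash equal
'first'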
- hail.utils.hadoop_open(path, mode='r', buffer_size=8192)[source]
Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Warning
Due to an implementation limitation, hadoop_open() may be quite slow for large data sets (anything larger than 50 MB).
Examples
Write a Pandas DataFrame as a CSV directly into Google Cloud Storage:
>>> with hadoop_open('gs://my-bucket/df.csv', 'w') as f:
...     pandas_df.to_csv(f)
Read and print the lines of a text file stored in Google Cloud Storage:
>>> with hadoop_open('gs://my-bucket/notes.txt') as f:
...     for line in f:
...         print(line.strip())
Write two lines directly to a file in Google Cloud Storage:
>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f:
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)
Unpack a packed Python struct directly from a file in Google Cloud Storage:
>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f:
...     print(unpack('<f', bytearray(f.read())))
Notes
The supported modes are:
- 'r' – Readable text file (io.TextIOWrapper). Default behavior.
- 'w' – Writable text file (io.TextIOWrapper).
- 'x' – Exclusive writable text file (io.TextIOWrapper). Throws an error if a file already exists at the path.
- 'rb' – Readable binary file (io.BufferedReader).
- 'wb' – Writable binary file (io.BufferedWriter).
- 'xb' – Exclusive writable binary file (io.BufferedWriter). Throws an error if a file already exists at the path.
The provided destination file path must be a URI (uniform resource identifier).
Caution
These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.
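Following that caution, a hedged sketch of the faster pattern for large files (the bucket and local paths are hypothetical):
>>> import pandas as pd
>>> from hail.utils import hadoop_copy
>>> big_df = pd.DataFrame({'x': range(10_000_000)})               # hypothetical large frame
>>> big_df.to_csv('/tmp/big.csv', index=False)                    # fast local write with standard I/O
>>> hadoop_copy('file:///tmp/big.csv', 'gs://my-bucket/big.csv')  # then move it to cloud storage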
- hail.utils.hadoop_copy(src, dest)[source]
Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
Copy a file from Google Cloud Storage to a local file:
>>> hadoop_copy('gs://hail-common/LCR.interval_list',
...             'file:///mnt/data/LCR.interval_list')
Notes
Try using hadoop_open() first; it's simpler, but not great for large data! For example:
>>> with hadoop_open('gs://my_bucket/results.csv', 'w') as f:
...     pandas_df.to_csv(f)
The provided source and destination file paths must be URIs (uniform resource identifiers).
- hail.utils.hadoop_stat(path)[source]
Returns information about the file or directory at a given path.
Notes
Raises an error if path does not exist.
The result is a dictionary of metadata about the file or directory.
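A usage sketch (the path is hypothetical; printing the whole dictionary avoids assuming specific key names):
>>> import hail as hl
>>> print(hl.utils.hadoop_stat('gs://my-bucket/notes.txt'))  # prints the metadata dict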
- hail.utils.hadoop_ls(path)[source]
Returns information about files at path.
Notes
Raises an error if path does not exist.
If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).
Each element of the result list is a dict of metadata about a single file.
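A usage sketch (the directory path is hypothetical):
>>> import hail as hl
>>> for entry in hl.utils.hadoop_ls('gs://my-bucket/data/'):
...     print(entry)  # one metadata dict per file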
- hail.utils.hadoop_scheme_supported(scheme)[source]
Returns True if the Hadoop filesystem supports URLs with the given scheme.
Examples
>>> hadoop_scheme_supported('gs')
Notes
URLs with the https scheme are only supported if they are specifically Azure Blob Storage URLs of the form https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH>.
- hail.utils.copy_log(path)[source]
Attempt to copy the session log to a hadoop-API-compatible location.
Examples
Specify a manual path:
>>> hl.copy_log('gs://my-bucket/analysis-10-jan19.log')
INFO: copying log to 'gs://my-bucket/analysis-10-jan19.log'...
Copy to a directory:
>>> hl.copy_log('gs://my-bucket/')
INFO: copying log to 'gs://my-bucket/hail-20180924-2018-devel-46e5fad57524.log'...
Notes
Since Hail cannot currently log directly to distributed file systems, this function is provided as a utility for offloading logs from ephemeral nodes.
If path is a directory, then the log file will be copied using its base name to the directory (e.g. /home/hail.log would be copied as gs://my-bucket/hail.log if path is gs://my-bucket).
- Parameters:
  path (str)
- hail.utils.range_table(n, n_partitions=None)[source]
Construct a table with the row index and no other fields.
Examples
>>> df = hl.utils.range_table(100)
>>> df.count()
100
Notes
The resulting table contains one field:
- idx (tint32) – Row index (key).
This method is meant for testing and learning, and is not optimized for production performance.
- Parameters:
n (int) – Number of rows.
n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).
- Returns:
  Table
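A hedged sketch of a common use: deriving deterministic test data from the idx key (the field name squared is made up for illustration):
>>> ht = hl.utils.range_table(10, n_partitions=2)
>>> ht = ht.annotate(squared=ht.idx ** 2)  # derive a new field from the row index
>>> ht.show(3)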
- hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None)[source]
Construct a matrix table with row and column indices and no entry fields.
Examples
>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)
>>> range_ds.count_rows()
100
>>> range_ds.count_cols()
10
Notes
The resulting matrix table contains the following fields:
- row_idx (tint32) – Row index (row key).
- col_idx (tint32) – Column index (column key).
It contains no entry fields.
This method is meant for testing and learning, and is not optimized for production performance.
- Parameters:
  - n_rows (int) – Number of rows.
  - n_cols (int) – Number of columns.
  - n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).
- Returns:
  MatrixTable
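A minimal sketch building deterministic entries from the two index fields (the entry field name e is made up for illustration):
>>> mt = hl.utils.range_matrix_table(n_rows=4, n_cols=3)
>>> mt = mt.annotate_entries(e=mt.row_idx * 10 + mt.col_idx)  # entry derived from both keys
>>> mt.e.show()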
- hail.utils.get_1kg(output_dir, overwrite=False)[source]
Download a subset of the 1000 Genomes dataset and sample annotations.
Notes
The download is about 15 MB.
- Parameters:
output_dir – Directory in which to write data.
overwrite – If True, overwrite any existing files/directories at output_dir.
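A hedged sketch of downloading and then reading the data; the output file name 1kg.mt follows the Hail GWAS tutorial and is an assumption here:
>>> hl.utils.get_1kg('data/')
>>> mt = hl.read_matrix_table('data/1kg.mt')  # file name assumed from the tutorial
>>> mt.count()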
- hail.utils.get_hgdp(output_dir, overwrite=False)[source]
Download a subset of the Human Genome Diversity Panel dataset and sample annotations.
Notes
The download is about 30 MB.
- Parameters:
output_dir – Directory in which to write data.
overwrite – If True, overwrite any existing files/directories at output_dir.
- hail.utils.get_movie_lens(output_dir, overwrite=False)[source]
Download the public MovieLens dataset.
Notes
The download is about 6 MB.
See the MovieLens website for more information about this dataset.
- Parameters:
output_dir – Directory in which to write data.
overwrite – If True, overwrite existing files/directories at those locations.
- hail.utils.ANY_REGION
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.