utils

- ANY_REGION – Built-in mutable sequence.
- Interval – An object representing a range of values between start and end.
- Struct – Nested annotation structure.
- frozendict – An object representing an immutable dictionary.
- hadoop_open – Open a file through the Hadoop filesystem API.
- hadoop_copy – Copy a file through the Hadoop filesystem API.
- hadoop_exists – Returns True if path exists.
- hadoop_is_file – Returns True if path is a file.
- hadoop_is_dir – Returns True if path is a directory.
- hadoop_stat – Returns information about the file or directory at a given path.
- hadoop_ls – Returns information about files at path.
- hadoop_scheme_supported – Returns True if the Hadoop filesystem supports URLs with the given scheme.
- copy_log – Attempt to copy the session log to a hadoop-API-compatible location.
- range_table – Construct a table with the row index and no other fields.
- range_matrix_table – Construct a matrix table with row and column indices and no entry fields.
- get_1kg – Download subset of the 1000 Genomes dataset and sample annotations.
- get_hgdp – Download subset of the Human Genome Diversity Panel dataset and sample annotations.
- get_movie_lens – Download public MovieLens dataset.
- class hail.utils.Interval(start, end, includes_start=True, includes_end=False, point_type=None)[source]
An object representing a range of values between start and end.
>>> interval2 = hl.Interval(3, 6)
- Parameters:
start – Start point of the interval.
end – End point of the interval.
includes_start (bool) – Whether the interval includes start.
includes_end (bool) – Whether the interval includes end.
point_type (HailType, optional) – Type of the interval's endpoints.
Note
This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.interval.take(5). This is rare; it is much more common to manipulate the IntervalExpression object, which is constructed using the interval(), locus_interval(), and parse_locus_interval() functions.
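For instance, a minimal expression-side sketch (hl.eval is used here only to force evaluation):
>>> iv = hl.interval(3, 6)
>>> hl.eval(iv.contains(4))
True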
- class hail.utils.Struct(**kwargs)[source]
Nested annotation structure.
>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})
Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:
>>> bar.foo
>>> bar['foo']
Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:
>>> bar['1kg']
The pprint module can be used to print nested Structs in a more human-readable fashion:
>>> from pprint import pprint
>>> pprint(bar)
- Parameters:
attributes – Field names and values.
Note
This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.info.take(5). This is rare; it is much more common to manipulate the StructExpression object, which is constructed using the struct() function.
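A corresponding expression-side sketch (hl.eval used only to materialize the value):
>>> s = hl.struct(foo=5, bar='hello')
>>> hl.eval(s.foo)
5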
- class hail.utils.frozendict(d)[source]
An object representing an immutable dictionary.
>>> my_frozen_dict = hl.utils.frozendict({1:2, 7:5})
To get a normal Python dictionary with the same elements from a frozendict:
>>> dict(frozendict({'a': 1, 'b': 2}))
Note
This object refers to the Python value returned by taking or collecting Hail expressions, e.g. mt.my_dict.take(5). This is rare; it is much more common to manipulate the DictExpression object, which is constructed using dict(). This class is necessary because Hail supports using dicts as keys to other dicts or as elements in sets, while Python does not.
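As a brief sketch of the dict-as-key use case described above:
>>> counts = {hl.utils.frozendict({'a': 1}): 10}
>>> counts[hl.utils.frozendict({'a': 1})]
10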
- hail.utils.hadoop_open(path, mode='r', buffer_size=8192)[source]
Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Warning
Due to an implementation limitation, hadoop_open() may be quite slow for large data sets (anything larger than 50 MB).
Examples
Write a Pandas DataFrame as a CSV directly into Google Cloud Storage:
>>> with hadoop_open('gs://my-bucket/df.csv', 'w') as f:
...     pandas_df.to_csv(f)
Read and print the lines of a text file stored in Google Cloud Storage:
>>> with hadoop_open('gs://my-bucket/notes.txt') as f:
...     for line in f:
...         print(line.strip())
Write two lines directly to a file in Google Cloud Storage:
>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f:
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)
Unpack a packed Python struct directly from a file in Google Cloud Storage:
>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f:
...     print(unpack('<f', bytearray(f.read())))
Notes
The supported modes are:
- 'r' – Readable text file (io.TextIOWrapper). Default behavior.
- 'w' – Writable text file (io.TextIOWrapper).
- 'x' – Exclusive writable text file (io.TextIOWrapper). Throws an error if a file already exists at the path.
- 'rb' – Readable binary file (io.BufferedReader).
- 'wb' – Writable binary file (io.BufferedWriter).
- 'xb' – Exclusive writable binary file (io.BufferedWriter). Throws an error if a file already exists at the path.
The provided destination file path must be a URI (uniform resource identifier).
Caution
These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.
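A minimal sketch of that workaround (the bucket path is illustrative):
>>> with open('/tmp/results.csv', 'w') as f:  # fast local write
...     f.write('id,score\n0,1.5\n')
>>> hadoop_copy('file:///tmp/results.csv',
...             'gs://my-bucket/results.csv')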
- hail.utils.hadoop_copy(src, dest)[source]
Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
Copy a file from Google Cloud Storage to a local file:
>>> hadoop_copy('gs://hail-common/LCR.interval_list',
...             'file:///mnt/data/LCR.interval_list')
Notes
Try using hadoop_open() first; it's simpler, but not great for large data! For example:
>>> with hadoop_open('gs://my_bucket/results.csv', 'w') as f:
...     pandas_df.to_csv(f)
The provided source and destination file paths must be URIs (uniform resource identifiers).
- hail.utils.hadoop_stat(path)[source]
Returns information about the file or directory at a given path.
Notes
Raises an error if path does not exist.
The resulting dictionary contains the following data:
path (str) – Path of the file or directory.
size_bytes (int) – Size in bytes.
size (str) – Human-readable size.
is_dir (bool) – Whether the path is a directory.
modification_time (str) – Time of last modification.
owner (str) – Owner of the file or directory.
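A short usage sketch (the path is illustrative and assumed to name a file, not a directory):
>>> stats = hadoop_stat('gs://my-bucket/df.csv')
>>> stats['is_dir']
False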
- hail.utils.hadoop_ls(path)[source]
Returns information about files at path.
Notes
Raises an error if path does not exist.
If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).
Each dict element of the result list contains the following data:
path (str) – Path of the file or directory.
size_bytes (int) – Size in bytes.
size (str) – Human-readable size.
is_dir (bool) – Whether the path is a directory.
modification_time (str) – Time of last modification.
owner (str) – Owner of the file or directory.
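For example, a sketch that keeps only the files (not subdirectories) under a prefix (path illustrative):
>>> files = [x['path'] for x in hadoop_ls('gs://my-bucket/data/')
...          if not x['is_dir']]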
- hail.utils.hadoop_scheme_supported(scheme)[source]
Returns True if the Hadoop filesystem supports URLs with the given scheme.
Examples
>>> hadoop_scheme_supported('gs')
Notes
URLs with the https scheme are only supported if they are specifically Azure Blob Storage URLs of the form https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH>
- hail.utils.copy_log(path)[source]
Attempt to copy the session log to a hadoop-API-compatible location.
Examples
Specify a manual path:
>>> hl.copy_log('gs://my-bucket/analysis-10-jan19.log')
INFO: copying log to 'gs://my-bucket/analysis-10-jan19.log'...
Copy to a directory:
>>> hl.copy_log('gs://my-bucket/')
INFO: copying log to 'gs://my-bucket/hail-20180924-2018-devel-46e5fad57524.log'...
Notes
Since Hail cannot currently log directly to distributed file systems, this function is provided as a utility for offloading logs from ephemeral nodes.
If path is a directory, then the log file will be copied using its base name to the directory (e.g. /home/hail.log would be copied as gs://my-bucket/hail.log if path is gs://my-bucket).
- Parameters:
path (str)
- hail.utils.range_table(n, n_partitions=None)[source]
Construct a table with the row index and no other fields.
Examples
>>> df = hl.utils.range_table(100)
>>> df.count()
100
Notes
The resulting table contains one field:
idx (tint32) – Row index (key).
This method is meant for testing and learning, and is not optimized for production performance.
- Parameters:
n (int) – Number of rows.
n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).
- Returns:
Table
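As a small illustration of the typical pattern of annotating the index (assuming an initialized Hail session):
>>> t = hl.utils.range_table(5)
>>> t = t.annotate(doubled=t.idx * 2)
>>> t.doubled.collect()
[0, 2, 4, 6, 8]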
- hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None)[source]
Construct a matrix table with row and column indices and no entry fields.
Examples
>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)
>>> range_ds.count_rows()
100
>>> range_ds.count_cols()
10
Notes
The resulting matrix table contains the following fields:
row_idx (tint32) – Row index (row key).
col_idx (tint32) – Column index (column key).
It contains no entry fields.
This method is meant for testing and learning, and is not optimized for production performance.
- Parameters:
n_rows (int) – Number of rows.
n_cols (int) – Number of columns.
n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).
- Returns:
MatrixTable
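A brief sketch of adding an entry field to the otherwise empty matrix table (assuming an initialized Hail session):
>>> mt = hl.utils.range_matrix_table(n_rows=3, n_cols=2)
>>> mt = mt.annotate_entries(product=mt.row_idx * mt.col_idx)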
- hail.utils.get_1kg(output_dir, overwrite=False)[source]
Download subset of the 1000 Genomes dataset and sample annotations.
Notes
The download is about 15 MB.
- Parameters:
output_dir – Directory in which to write data.
overwrite – If True, overwrite any existing files/directories at output_dir.
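For example, a sketch of downloading and then reading the dataset; the 1kg.mt filename follows the Hail GWAS tutorial and is an assumption here:
>>> hl.utils.get_1kg('data/')
>>> mt = hl.read_matrix_table('data/1kg.mt')  # filename as in the Hail GWAS tutorial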
- hail.utils.get_hgdp(output_dir, overwrite=False)[source]
Download subset of the Human Genome Diversity Panel dataset and sample annotations.
Notes
The download is about 30 MB.
- Parameters:
output_dir – Directory in which to write data.
overwrite – If True, overwrite any existing files/directories at output_dir.
- hail.utils.get_movie_lens(output_dir, overwrite=False)[source]
Download the public MovieLens dataset.
Notes
The download is about 6 MB.
See the MovieLens website for more information about this dataset.
- Parameters:
output_dir – Directory in which to write data.
overwrite – If True, overwrite existing files/directories at those locations.
- hail.utils.ANY_REGION
A list constant used to indicate that jobs may run in any available cloud region.