utils

hail.utils.hadoop_copy(src, dest)

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
>>> hadoop_copy('gs://hail-common/LCR.interval_list', 'file:///mnt/data/LCR.interval_list')
Notes
The provided source and destination file paths must be URIs (uniform resource identifiers).
Parameters:
- src (str) – Source file URI.
- dest (str) – Destination file URI.
hail.utils.hadoop_read(path, buffer_size=8192)

Open a readable file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
>>> with hadoop_read('gs://my-bucket/notes.txt') as f:
...     for line in f:
...         print(line.strip())
Notes
The provided source file path must be a URI (uniform resource identifier).
Caution
These file handles are slower than standard Python file handles. If you are reading a file larger than ~50 MB, it will be faster to use hadoop_copy() to copy the file locally, then read it with standard Python I/O tools.

Parameters:
- path (str) – Source file URI.
- buffer_size (int) – Size of internal buffer.

Returns: Iterable file reader.

Return type: io.BufferedReader
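The copy-then-read pattern recommended in the caution above can be sketched as follows. This is a local stand-in sketch: shutil.copy plays the role of hadoop_copy (which requires a running Hail context), and all paths are hypothetical placeholders.

```python
import shutil
import tempfile

# Create a stand-in "remote" file; with Hail this would live at a URI
# such as 'gs://my-bucket/big.tsv' (hypothetical).
src = tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt')
src.write('line1\nline2\n')
src.close()

# Copy it locally first. With Hail, this step would be:
#   hadoop_copy('gs://my-bucket/big.tsv', 'file:///tmp/big.tsv')
local_dest = src.name + '.copy'
shutil.copy(src.name, local_dest)

# Then read the local copy with fast standard Python I/O.
with open(local_dest) as f:
    lines = [line.strip() for line in f]
print(lines)
```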
hail.utils.hadoop_write(path, buffer_size=8192)

Open a writable file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
>>> with hadoop_write('gs://my-bucket/notes.txt') as f:
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)
Notes
The provided destination file path must be a URI (uniform resource identifier).
Caution
These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and then use hadoop_copy() to move your file to a distributed file system.

Parameters:
- path (str) – Destination file URI.
- buffer_size (int) – Size of internal buffer.

Returns: File writer object.

Return type: io.BufferedWriter
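The write-locally-then-copy pattern from the caution above can be sketched like this. Again a local stand-in: shutil.copy substitutes for hadoop_copy (which needs a Hail context), and the destination path is a hypothetical placeholder.

```python
import shutil
import tempfile

# Write results to a local file with fast standard Python I/O.
tmp = tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt')
tmp.write('result1: %s\n' % 42)
tmp.write('result2: %s\n' % 'ok')
tmp.close()

# Then move the finished file to the distributed file system.
# With Hail, this step would be:
#   hadoop_copy('file://' + tmp.name, 'gs://my-bucket/notes.txt')
dest = tmp.name + '.uploaded'
shutil.copy(tmp.name, dest)

with open(dest) as f:
    contents = f.read()
print(contents)
```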
class hail.utils.Summary

Class holding summary statistics about a dataset.
Variables:
- samples (int) – Number of samples.
- variants (int) – Number of variants.
- call_rate (float) – Fraction of all genotypes called.
- contigs (list of str) – Unique contigs found in dataset.
- multiallelics (int) – Number of multiallelic variants.
- snps (int) – Number of SNP alternate alleles.
- mnps (int) – Number of MNP alternate alleles.
- insertions (int) – Number of insertion alternate alleles.
- deletions (int) – Number of deletion alternate alleles.
- complex (int) – Number of complex alternate alleles.
- star (int) – Number of star (upstream deletion) alternate alleles.
- max_alleles (int) – Highest number of alleles at any variant.
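To illustrate how these fields relate, here is a hypothetical pure-Python container with a subset of the variables above; the real Summary is produced by Hail itself, not constructed by hand, and the sample numbers below are invented for the example.

```python
from dataclasses import dataclass


# Hypothetical stand-in for hail.utils.Summary (subset of fields).
@dataclass
class DatasetSummary:
    samples: int        # number of samples
    variants: int       # number of variants
    call_rate: float    # fraction of all genotypes called
    contigs: list       # unique contigs found in the dataset
    multiallelics: int  # number of multiallelic variants
    max_alleles: int    # highest number of alleles at any variant


# Invented example: 100 samples x 1,000 variants with 98,500 called
# genotypes, so call_rate = called / (samples * variants).
called = 98_500
total = 100 * 1_000
s = DatasetSummary(samples=100, variants=1_000,
                   call_rate=called / total,
                   contigs=['1', '2', 'X'],
                   multiallelics=12, max_alleles=4)
print(s.call_rate)  # 0.985
```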