utils

hail.utils.hadoop_copy(src, dest)[source]

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

>>> hadoop_copy('gs://hail-common/LCR.interval_list', 'file:///mnt/data/LCR.interval_list') 

Notes

The provided source and destination file paths must be URIs (uniform resource identifiers).

Parameters:
  • src (str) – Source file URI.
  • dest (str) – Destination file URI.

hail.utils.hadoop_read(path, buffer_size=8192)[source]

Open a readable file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

>>> with hadoop_read('gs://my-bucket/notes.txt') as f:
...     for line in f:
...         print(line.strip())

Notes

The provided source file path must be a URI (uniform resource identifier).

Caution

These file handles are slower than standard Python file handles. If you are reading a file larger than ~50M, it will be faster to use hadoop_copy() to copy the file locally, then read it with standard Python I/O tools.
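The copy-then-read pattern recommended above can be sketched locally; here `shutil.copy` stands in for `hadoop_copy()`, and the temporary paths and file contents are hypothetical placeholders for the real URIs:

```python
import os
import shutil
import tempfile

# Create a throwaway "remote" source file (stand-in for e.g. a gs://
# object; the interval-list contents here are made up for illustration).
fd, src_path = tempfile.mkstemp(suffix=".interval_list")
with os.fdopen(fd, "w") as src:
    src.write("chr1:100-200\nchr2:300-400\n")

# Step 1: copy the file to local storage (hadoop_copy() plays this role
# when the source lives on a distributed file system).
local_path = src_path + ".local"
shutil.copy(src_path, local_path)

# Step 2: read the local copy with standard Python I/O, which is faster
# than iterating a Hadoop file handle for large files.
with open(local_path) as f:
    intervals = [line.strip() for line in f]

print(intervals)

os.remove(src_path)
os.remove(local_path)
```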

Parameters:
  • path (str) – Source file URI.
  • buffer_size (int) – Size of internal buffer.
Returns:

Iterable file reader.

Return type:

io.BufferedReader
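Since the return type is io.BufferedReader, a purely local sketch shows how such a handle behaves; the in-memory buffer below stands in for the remote file, and note that in Python 3 a buffered reader yields bytes lines:

```python
import io

# In-memory stand-in for the remote file; hadoop_read() returns a
# comparable io.BufferedReader over e.g. gs:// data.
reader = io.BufferedReader(io.BytesIO(b"line one\nline two\n"))

# Iterating an io.BufferedReader yields bytes lines in Python 3,
# so decode before treating them as text.
lines = [line.decode("utf-8").strip() for line in reader]
print(lines)
```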

hail.utils.hadoop_write(path, buffer_size=8192)[source]

Open a writable file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.

Examples

>>> with hadoop_write('gs://my-bucket/notes.txt') as f:
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)

Notes

The provided destination file path must be a URI (uniform resource identifier).

Caution

These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50M), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.

Parameters:
  • path (str) – Destination file URI.
  • buffer_size (int) – Size of internal buffer.
Returns:

File writer object.

Return type:

io.BufferedWriter

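Because the return type is io.BufferedWriter, a local sketch illustrates the handle's behavior; the in-memory buffer below stands in for the destination URI, and in Python 3 such a writer takes bytes and buffers them until flushed:

```python
import io

# In-memory stand-in for the destination; hadoop_write() returns a
# comparable io.BufferedWriter targeting e.g. a gs:// URI.
raw = io.BytesIO()
writer = io.BufferedWriter(raw)

writer.write(b"result1: 42\n")  # bytes, since the handle is binary
writer.write(b"result2: ok\n")
writer.flush()  # the writer buffers internally; flush to push bytes through

print(raw.getvalue().decode("utf-8"))
```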
class hail.utils.Summary[source]

Class holding summary statistics about a dataset.

Variables:
  • samples (int) – Number of samples.
  • variants (int) – Number of variants.
  • call_rate (float) – Fraction of all genotypes called.
  • contigs (list of str) – Unique contigs found in dataset.
  • multiallelics (int) – Number of multiallelic variants.
  • snps (int) – Number of SNP alternate alleles.
  • mnps (int) – Number of MNP alternate alleles.
  • insertions (int) – Number of insertion alternate alleles.
  • deletions (int) – Number of deletion alternate alleles.
  • complex (int) – Number of complex alternate alleles.
  • star (int) – Number of star (upstream deletion) alternate alleles.
  • max_alleles (int) – Highest number of alleles at any variant.
report()[source]

Print the summary information.