Miscellaneous

grep(regex, path[, max_count, show, force, ...]) – Searches given paths for all lines containing regex matches.

maximal_independent_set(i, j[, keep, ...]) – Return a table containing the vertices in a near maximal independent set of an undirected graph whose edges are given by a two-column table.

rename_duplicates(dataset[, name]) – Rename duplicate column keys.

segment_intervals(ht, points) – Segment the interval keys of ht at a given set of points.

hail.methods.grep(regex, path, max_count=100, *, show=True, force=False, force_bgz=False)[source]

Searches given paths for all lines containing regex matches.

Examples

Print all lines containing the string hello in file.txt:

>>> hl.grep('hello', 'data/file.txt')

Print all lines containing digits in file1.txt and file2.txt:

>>> hl.grep('\\d', ['data/file1.txt', 'data/file2.txt'])

Notes

grep() mimics the basic functionality of Unix grep in parallel, printing results to the screen. This command is provided as a convenience to those in the statistical genetics community who often search enormous text files like VCFs. Hail uses Java regular expression patterns. The RegExr sandbox may be helpful.

Parameters:
  • regex (str) – The regular expression to match.

  • path (str or list of str) – The files to search.

  • max_count (int) – The maximum number of matches to return.

  • show (bool) – When True, show the values on stdout. When False, return a dictionary mapping file names to lines, as sketched below.

  • force_bgz (bool) – If True, read files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not '.bgz', but the file is blocked gzip, so that the file can be read in parallel and not on a single node.

  • force (bool) – If True, read gzipped files serially on one core. This should be used only when absolutely necessary, as processing time will be increased due to lack of parallelism.

Returns:

dict of str to list of str
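
As a hedged sketch of the show=False and max_count options described above, the following captures matches as a dictionary instead of printing them; the variable names and file paths are placeholders, not part of the API:

>>> matches = hl.grep('\\d', ['data/file1.txt', 'data/file2.txt'],
...                   max_count=10, show=False)
>>> for file_name, lines in matches.items():
...     print(file_name, len(lines))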

hail.methods.maximal_independent_set(i, j, keep=True, tie_breaker=None, keyed=True)[source]

Return a table containing the vertices in a near maximal independent set of an undirected graph whose edges are given by a two-column table.

Examples

Run PC-relate and compute pairs of closely related individuals:

>>> pc_rel = hl.pc_relate(dataset.GT, 0.001, k=2, statistics='kin')
>>> pairs = pc_rel.filter(pc_rel['kin'] > 0.125)

Starting from the above pairs, prune individuals from a dataset until no close relationships remain:

>>> related_samples_to_remove = hl.maximal_independent_set(pairs.i, pairs.j, False)
>>> result = dataset.filter_cols(
...     hl.is_defined(related_samples_to_remove[dataset.col_key]), keep=False)

Starting from the above pairs, prune individuals from a dataset until no close relationships remain, preferring to keep cases over controls:

>>> samples = dataset.cols()
>>> pairs_with_case = pairs.key_by(
...     i=hl.struct(id=pairs.i, is_case=samples[pairs.i].is_case),
...     j=hl.struct(id=pairs.j, is_case=samples[pairs.j].is_case))
>>> def tie_breaker(l, r):
...     return hl.if_else(l.is_case & ~r.is_case, -1,
...                       hl.if_else(~l.is_case & r.is_case, 1, 0))
>>> related_samples_to_remove = hl.maximal_independent_set(
...     pairs_with_case.i, pairs_with_case.j, False, tie_breaker)
>>> result = dataset.filter_cols(hl.is_defined(
...     related_samples_to_remove.key_by(
...         s=related_samples_to_remove.node.id.s)[dataset.col_key]), keep=False)

Notes

The vertex set of the graph is implicitly all the values realized by i and j on the rows of this table. Each row of the table corresponds to an undirected edge between the vertices given by evaluating i and j on that row. An undirected edge may appear multiple times in the table and will not affect the output. Vertices with self-edges are removed as they are not independent of themselves.

The expressions for i and j must have the same type.

The value of keep determines whether the vertices returned are those in the maximal independent set, or those in the complement of this set. This is useful if you need to filter a table without removing vertices that don’t appear in the graph at all.

This method implements a greedy algorithm which iteratively removes a vertex of highest degree until the graph contains no edges. The greedy algorithm always returns an independent set, but the set may not always be perfectly maximal.

tie_breaker is a Python function taking two arguments—say l and r—each of which is an Expression of the same type as i and j. tie_breaker returns a NumericExpression, which defines an ordering on nodes. A pair of nodes can be ordered in one of three ways, and tie_breaker must encode the relationship as follows:

  • if l < r then tie_breaker evaluates to some negative integer

  • if l == r then tie_breaker evaluates to 0

  • if l > r then tie_breaker evaluates to some positive integer

For example, the usual ordering on the integers is defined by: l - r.

The tie_breaker function must satisfy the following property: tie_breaker(l, r) == -tie_breaker(r, l).

When multiple nodes have the same degree, this algorithm will order the nodes according to tie_breaker and remove the largest node.
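
As a small, hedged illustration of the l - r ordering discussed above, a numeric tie-breaker can be supplied as a plain function or lambda; the two-column integer table below is purely illustrative and not part of the example dataset:

>>> edges = hl.Table.parallelize(
...     [{'i': 0, 'j': 1}, {'i': 1, 'j': 2}],
...     hl.tstruct(i=hl.tint32, j=hl.tint32))
>>> independent = hl.maximal_independent_set(
...     edges.i, edges.j, keep=True, tie_breaker=lambda l, r: l - r)

Under this ordering, nodes of equal degree are compared as integers and the larger id is removed first.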

If keyed is False, then a node may appear twice in the resulting table.

Parameters:
  • i (Expression) – Expression to compute one endpoint of an edge.

  • j (Expression) – Expression to compute the other endpoint of an edge.

  • keep (bool) – If True, return vertices in set. If False, return vertices removed.

  • tie_breaker (function) – Function used to order nodes with equal degree.

  • keyed (bool) – If True, key the resulting table by the node field; this requires a sort.

Returns:

Table – Table with the set of independent vertices. The table schema is one row field node which has the same type as input expressions i and j.

hail.methods.rename_duplicates(dataset, name='unique_id')[source]

Rename duplicate column keys.

Note

Requires the column key to be one field of type tstr.

Examples

>>> renamed = hl.rename_duplicates(dataset).cols()
>>> duplicate_samples = (renamed.filter(renamed.s != renamed.unique_id)
...                             .select()
...                             .collect())

Notes

This method produces a new column field from the string column key by appending a unique suffix _N as necessary. For example, if the column key “NA12878” appears three times in the dataset, the first will produce “NA12878”, the second will produce “NA12878_1”, and the third will produce “NA12878_2”. The name of this new field is parameterized by name.
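
For instance, the new field can be given a custom name; this is a hedged sketch and the field name sample_uid is arbitrary:

>>> renamed = hl.rename_duplicates(dataset, name='sample_uid')
>>> renamed.cols().select('sample_uid').show()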

Parameters:
  • dataset (MatrixTable) – Dataset.

  • name (str) – Name of new field.

Returns:

MatrixTable

hail.methods.segment_intervals(ht, points)[source]

Segment the interval keys of ht at a given set of points.
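
As a hedged sketch of one possible use, assuming a table keyed by a single integer interval (the construction of ht and the chosen points are illustrative only):

>>> ht = hl.utils.range_table(1)
>>> ht = ht.key_by(interval=hl.interval(0, 10))
>>> segmented = hl.segment_intervals(ht, hl.literal([3, 7]))

Each interval key should be split at the supplied points that fall inside it, so [0, 10) would become [0, 3), [3, 7), and [7, 10).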

Parameters:
  • ht (Table) – Table with interval keys.

  • points (Table or ArrayExpression) – Points at which to segment the intervals, a table or an array.

Returns:

Table