StringExpression

class hail.expr.StringExpression[source]

Expression of type tstr.

>>> s = hl.literal('The quick brown fox')

Attributes

dtype

The data type of the expression.

Methods

`contains`	Returns whether substr is contained in the string.
`endswith`	Returns whether substr is a suffix of the string.
`find`	Return the lowest index in the string where substring sub is found within the slice s[start:end].
`first_match_in`	Returns an array containing the capture groups of the first match of regex in the given character sequence.
`join`	Returns a string which is the concatenation of the strings in collection separated by the string providing this method.
`length`	Returns the length of the string.
`lower`	Returns a copy of the string, but with upper case letters converted to lower case.
`matches`	Returns `True` if the string contains a match for the given regex, or, when full_match is true, if the whole string matches it.
`replace`	Replace substrings matching pattern1 with pattern2 using regex.
`reverse`	Returns the reversed value.
`split`	Returns an array of strings generated by splitting the string at delim.
`startswith`	Returns whether substr is a prefix of the string.
`strip`	Returns a copy of the string with whitespace removed from the start and end.
`translate`	Translates characters of the string using mapping.
`upper`	Returns a copy of the string, but with lower case letters converted to upper case.

__add__(other)[source]

Concatenate strings.

Examples

>>> hl.eval(s + ' jumped over the lazy dog')
'The quick brown fox jumped over the lazy dog'

Parameters:: other (StringExpression) – String to concatenate.
Returns:: StringExpression – Concatenated string.

__eq__(other)

Returns True if the two expressions are equal.

Examples

>>> x = hl.literal(5)
>>> y = hl.literal(5)
>>> z = hl.literal(1)

>>> hl.eval(x == y)
True

>>> hl.eval(x == z)
False

Notes

This method will fail with an error if the two expressions are not of comparable types.

Parameters:: other (Expression) – Expression for equality comparison.
Returns:: BooleanExpression – True if the two expressions are equal.

__ge__(other): Return self>=value.

__getitem__(item)[source]

Slice or index into the string.

Examples

>>> hl.eval(s[:15])
'The quick brown'

>>> hl.eval(s[0])
'T'

Parameters:: item (slice or Expression of type tint32) – Slice or character index.
Returns:: StringExpression – Substring or character at index item.

__gt__(other): Return self>value.

__le__(other): Return self<=value.

__lt__(other): Return self<value.

__ne__(other)

Returns True if the two expressions are not equal.

Examples

>>> x = hl.literal(5)
>>> y = hl.literal(5)
>>> z = hl.literal(1)

>>> hl.eval(x != y)
False

>>> hl.eval(x != z)
True

Notes

This method will fail with an error if the two expressions are not of comparable types.

Parameters:: other (Expression) – Expression for inequality comparison.
Returns:: BooleanExpression – True if the two expressions are not equal.

collect(_localize=True)

Collect all records of an expression into a local list.

Examples

Collect all the values from C1:

>>> table1.C1.collect()
[2, 2, 10, 11]

Warning

Extremely experimental.

Warning

The list of records may be very large.

Returns:: list

contains(substr)[source]

Returns whether substr is contained in the string.

Examples

>>> hl.eval(s.contains('fox'))
True

>>> hl.eval(s.contains('dog'))
False

Note

This method is case-sensitive.

Parameters:: substr (StringExpression)
Returns:: BooleanExpression

describe(handler=<built-in function print>): Print information about type, index, and dependencies.

property dtype

The data type of the expression.

Returns:: HailType

endswith(substr)[source]

Returns whether substr is a suffix of the string.

Examples

>>> hl.eval(s.endswith('fox'))
True

Note

This method is case-sensitive.

Parameters:: substr (StringExpression)
Returns:: BooleanExpression

export(path, delimiter='\t', missing='NA', header=True)

Export a field to a text file.

Examples

>>> small_mt.GT.export('output/gt.tsv')
>>> with open('output/gt.tsv', 'r') as f:
...     for line in f:
...         print(line, end='')
locus   alleles 0       1       2       3
1:1     ["A","C"]       0/1     0/0     0/1     0/0
1:2     ["A","C"]       1/1     0/1     0/1     0/1
1:3     ["A","C"]       0/0     0/1     0/0     0/0
1:4     ["A","C"]       0/1     1/1     0/1     0/1

>>> small_mt.GT.export('output/gt-no-header.tsv', header=False)
>>> with open('output/gt-no-header.tsv', 'r') as f:
...     for line in f:
...         print(line, end='')
1:1     ["A","C"]       0/1     0/0     0/1     0/0
1:2     ["A","C"]       1/1     0/1     0/1     0/1
1:3     ["A","C"]       0/0     0/1     0/0     0/0
1:4     ["A","C"]       0/1     1/1     0/1     0/1

>>> small_mt.pop.export('output/pops.tsv')
>>> with open('output/pops.tsv', 'r') as f:
...     for line in f:
...         print(line, end='')
sample_idx      pop
0       1
1       2
2       2
3       2

>>> small_mt.ancestral_af.export('output/ancestral_af.tsv')
>>> with open('output/ancestral_af.tsv', 'r') as f:
...     for line in f:
...         print(line, end='')
locus   alleles ancestral_af
1:1     ["A","C"]       3.8152e-01
1:2     ["A","C"]       7.0588e-01
1:3     ["A","C"]       4.9991e-01
1:4     ["A","C"]       3.9616e-01

>>> small_mt.bn.export('output/bn.tsv')
>>> with open('output/bn.tsv', 'r') as f:
...     for line in f:
...         print(line, end='')
bn
{"n_populations":3,"n_samples":4,"n_variants":4,"n_partitions":4,"pop_dist":[1,1,1],"fst":[0.1,0.1,0.1],"mixture":false}

Notes

For entry-indexed expressions, if there is one column key field, the result of calling str() on that field is used as the column header. Otherwise, each compound column key is converted to JSON and used as a column header. For example:

>>> small_mt = small_mt.key_cols_by(s=small_mt.sample_idx, family='fam1')
>>> small_mt.GT.export('output/gt-no-header.tsv')
>>> with open('output/gt-no-header.tsv', 'r') as f:
...     for line in f:
...         print(line, end='')
locus   alleles {"s":0,"family":"fam1"} {"s":1,"family":"fam1"} {"s":2,"family":"fam1"} {"s":3,"family":"fam1"}
1:1     ["A","C"]       0/1     0/0     0/1     0/0
1:2     ["A","C"]       1/1     0/1     0/1     0/1
1:3     ["A","C"]       0/0     0/1     0/0     0/0
1:4     ["A","C"]       0/1     1/1     0/1     0/1

Parameters:

path (str) – The path to which to export.
delimiter (str) – The string for delimiting columns.
missing (str) – The string to output for missing values.
header (bool) – When True include a header line.
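
The column-header rule described in the Notes above can be sketched in plain Python. This is an illustration only, not Hail's implementation: `column_header` is a hypothetical helper, and the compact JSON separators are an assumption chosen to reproduce the header shown in the example.

```python
import json

def column_header(col_key):
    """Hypothetical helper: build a header cell from a column key given as a dict."""
    if len(col_key) == 1:
        # Single key field: use str() of its value.
        (value,) = col_key.values()
        return str(value)
    # Compound key: serialize the whole key as compact JSON.
    return json.dumps(col_key, separators=(',', ':'))

single = column_header({'sample_idx': 0})
compound = column_header({'s': 0, 'family': 'fam1'})
print(single)
print(compound)
```

With a single key field the header is just the key value; with two fields it becomes a JSON object such as `{"s":0,"family":"fam1"}`, matching the header row in the example.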

find(sub, start=None, end=None)[source]

Return the lowest index in the string where substring sub is found within the slice s[start:end]. Optional arguments start and end are interpreted as in slice notation. Evaluates to -1 if sub is not found.

Examples

>>> a = hl.str('hello, world')
>>> hl.eval(a.find('world'))
7

>>> hl.eval(a.find('hail'))
-1

Parameters:

sub (StringExpression) – substring to find
start (Int32Expression) – optional slice start index
end (Int32Expression) – optional slice end index

Returns:

Int32Expression – lowest index in the string where substring sub is found or -1.
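
The effect of start and end may be easier to picture with a plain-Python analogy. The sketch below uses Python's built-in `str.find`, whose description this method mirrors. It is not Hail code, and it assumes Hail follows the same slice conventions.

```python
# Plain-Python analogy (assumption: Hail's find follows str.find conventions).
text = 'hello, world'

# Without bounds, the first occurrence is reported.
first_o = text.find('o')

# With start=5 the search begins at index 5, but the result is still
# an index into the whole string, not into the slice.
second_o = text.find('o', 5)

# With start=5 and end=8 the searched slice is ', w', which has no 'o'.
missing = text.find('o', 5, 8)

print(first_o, second_o, missing)
```

The three calls report 4, 8, and -1 respectively.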

first_match_in(regex)[source]

Returns an array containing the capture groups of the first match of regex in the given character sequence.

Examples

>>> hl.eval(s.first_match_in(r"The quick (\w+) fox"))
['brown']

>>> hl.eval(s.first_match_in(r"The (\w+) (\w+) (\w+)"))
['quick', 'brown', 'fox']

>>> hl.eval(s.first_match_in(r"(\w+) (\w+)"))
['The', 'quick']

Parameters:: regex (StringExpression)
Returns:: ArrayExpression with element type tstr
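
As a rough analogy, the same "capture groups of the first match" idea can be written with Python's `re` module. The helper `first_match_groups` is hypothetical, and returning `None` when nothing matches is a choice made for this sketch, not a statement about Hail's behavior. Java and Python regex syntax also differ in places, so only simple patterns are used.

```python
import re

def first_match_groups(pattern, text):
    """Hypothetical helper: capture groups of the first match, or None."""
    match = re.search(pattern, text)
    return list(match.groups()) if match else None

text = 'The quick brown fox'
one_group = first_match_groups(r'The quick (\w+) fox', text)
two_groups = first_match_groups(r'(\w+) (\w+)', text)
print(one_group)
print(two_groups)
```

Only the first match contributes: the second pattern could also match 'brown fox', but the groups come from 'The quick'.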

join(collection)[source]

Returns a string which is the concatenation of the strings in collection separated by the string providing this method. Raises TypeError if the element type of collection is not tstr.

Examples

>>> a = ['Bob', 'Charlie', 'Alice', 'Bob', 'Bob']

>>> hl.eval(hl.str(',').join(a))
'Bob,Charlie,Alice,Bob,Bob'

Parameters:: collection (ArrayExpression or SetExpression) – Collection.
Returns:: StringExpression – Joined string expression.
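
For comparison, Python's own `str.join` has the same shape: the separator is the object the method is called on, and non-string elements are rejected. This is plain Python, shown only as an analogy to the Hail method.

```python
names = ['Bob', 'Charlie', 'Alice', 'Bob', 'Bob']

# The separator string provides the method; the collection is the argument.
joined = ','.join(names)
print(joined)

# Elements that are not strings raise TypeError in plain Python as well.
try:
    ','.join([1, 2, 3])
    raised_type_error = False
except TypeError:
    raised_type_error = True
print(raised_type_error)
```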

length()[source]

Returns the length of the string.

Examples

>>> hl.eval(s.length())
19

Returns:: Expression of type tint32 – Length of the string.

lower()[source]

Returns a copy of the string, but with upper case letters converted to lower case.

Examples

>>> hl.eval(s.lower())
'the quick brown fox'

Returns:: StringExpression

matches(regex, full_match=False)[source]

If full_match is False, returns True if the string contains any match for the given regex. If full_match is True, returns True only if the whole string matches the regex.

Examples

The regex parameter does not need to match the entire string if full_match is False:

>>> string = hl.literal('NA12878')
>>> hl.eval(string.matches('12'))
True

The regex parameter needs to match the entire string if full_match is True:

>>> string = hl.literal('NA12878')
>>> hl.eval(string.matches('12', True))
False

>>> string = hl.literal('3412878')
>>> hl.eval(string.matches('^[0-9]*$'))
True

Regex motifs can be used to match sequences of characters:

>>> string = hl.literal('NA12878')
>>> hl.eval(string.matches(r'NA\d+'))
True

Notes

The regex argument is a regular expression, and uses Java regex syntax.

Parameters:

regex (StringExpression) – Pattern to match.
full_match (bool) – If True, the function considers whether the whole string matches the regex. If False, the function considers whether the string has a partial match for that regex.

Returns:

BooleanExpression – If full_match is False, True if the string contains any match for the regex, otherwise False. If full_match is True, True if the whole string matches the regex, otherwise False.
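
The partial versus full distinction corresponds roughly to `re.search` versus `re.fullmatch` in Python. The sketch below is an analogy in plain Python, not Hail code. The function `matches` here is a hypothetical stand-in, and Python's regex dialect is not identical to Java's.

```python
import re

def matches(text, regex, full_match=False):
    """Hypothetical stand-in for the partial/full match distinction."""
    if full_match:
        # The whole string must be consumed by the pattern.
        return re.fullmatch(regex, text) is not None
    # Any substring matching the pattern is enough.
    return re.search(regex, text) is not None

partial = matches('NA12878', '12')
full = matches('NA12878', '12', full_match=True)
full_id = matches('NA12878', r'NA\d+', full_match=True)
print(partial, full, full_id)
```

Anchoring a pattern with `^` and `$`, as in `'^[0-9]*$'`, gives a similar effect to a full match without setting full_match.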

replace(pattern1, pattern2)[source]

Replace substrings matching pattern1 with pattern2 using regex.

Examples

Replace spaces with underscores in a Hail string:

>>> hl.eval(hl.str("The quick  brown fox").replace(' ', '_'))
'The_quick__brown_fox'

Remove the leading zero in contigs in variant strings in a table:

>>> t = hl.import_table('data/leading-zero-variants.txt')
>>> t.show()
+----------------+
| variant        |
+----------------+
| str            |
+----------------+
| "01:1000:A:T"  |
| "01:10001:T:G" |
| "02:99:A:C"    |
| "02:893:G:C"   |
| "22:100:A:T"   |
| "X:10:C:A"     |
+----------------+

>>> t = t.annotate(variant = t.variant.replace("^0([0-9])", "$1"))
>>> t.show()
+---------------+
| variant       |
+---------------+
| str           |
+---------------+
| "1:1000:A:T"  |
| "1:10001:T:G" |
| "2:99:A:C"    |
| "2:893:G:C"   |
| "22:100:A:T"  |
| "X:10:C:A"    |
+---------------+

Notes

The patterns should follow Java regex syntax. In a Java replacement string, a capture group is referenced with a dollar sign ($1 for the first group), not with the backslash form (\1) used by many other regex dialects.

Parameters:

pattern1 (str or StringExpression)
pattern2 (str or StringExpression)
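
To make the group-reference difference concrete, here is the leading-zero example redone with Python's `re.sub`, where the first group is written `\1` instead of Java's `$1`. This is a plain-Python analogy only. The list `variants` is sample data copied from the table above.

```python
import re

variants = ['01:1000:A:T', '02:99:A:C', '22:100:A:T', 'X:10:C:A']

# Java-style replacement "$1" corresponds to r'\1' in Python's re module.
cleaned = [re.sub(r'^0([0-9])', r'\1', v) for v in variants]
print(cleaned)

# Every match is replaced, so two adjacent spaces become two underscores.
underscored = re.sub(' ', '_', 'The quick  brown fox')
print(underscored)
```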

reverse()[source]

Returns the reversed value.

Examples

>>> string = hl.literal('ATGCC')
>>> hl.eval(string.reverse())
'CCGTA'

Returns:: StringExpression

show(n=None, width=None, truncate=None, types=True, handler=None, n_rows=None, n_cols=None)

Print the first few records of the expression to the console.

If the expression refers to a value on a keyed axis of a table or matrix table, then the accompanying keys will be shown along with the records.

Examples

>>> table1.SEX.show()
+-------+-----+
|    ID | SEX |
+-------+-----+
| int32 | str |
+-------+-----+
|     1 | "M" |
|     2 | "M" |
|     3 | "F" |
|     4 | "F" |
+-------+-----+

>>> hl.literal(123).show()
+--------+
| <expr> |
+--------+
|  int32 |
+--------+
|    123 |
+--------+

Notes

The output can be piped to another output source using the handler argument:

>>> ht.foo.show(handler=lambda x: logging.info(x))  

Parameters:

n (int) – Maximum number of rows to show.
width (int) – Horizontal width at which to break columns.
truncate (int, optional) – Truncate each field to the given number of characters. If None, truncate fields to the given width.
types (bool) – Print an extra header line with the type of each field.

split(delim, n=None)[source]

Returns an array of strings generated by splitting the string at delim.

Examples

>>> hl.eval(s.split(r'\s+'))
['The', 'quick', 'brown', 'fox']

>>> hl.eval(s.split(r'\s+', 2))
['The', 'quick brown fox']

Notes

The delimiter is a regular expression in Java regex syntax. To split on a special character, escape it with a double backslash (\\).

Parameters:

delim (str or StringExpression) – Delimiter regex.
n (Expression of type tint32, optional) – Maximum number of elements in the resulting array; as in the example above, n=2 yields two strings.

Returns:

ArrayExpression – Array of split strings.
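
The interaction between a regex delimiter and the limit n can be sketched with Python's `re.split`. This is an analogy, not Hail code. The helper `split_limited` is hypothetical, it only covers positive n, and it assumes n caps the number of resulting pieces, as the second example above suggests.

```python
import re

def split_limited(text, delim, n=None):
    """Hypothetical helper: split text on regex delim into at most n pieces."""
    if n is None:
        return re.split(delim, text)
    if n < 1:
        raise ValueError('this sketch only covers positive n')
    if n == 1:
        # re.split treats maxsplit=0 as "no limit", so handle this directly.
        return [text]
    # n pieces means at most n - 1 cuts.
    return re.split(delim, text, maxsplit=n - 1)

s = 'The quick brown fox'
unlimited = split_limited(s, r'\s+')
limited = split_limited(s, r'\s+', 2)

# '|' is a regex metacharacter, so it is escaped; '\\|' and r'\|' are the same string.
fields = split_limited('chr1|100|A', '\\|')
print(unlimited, limited, fields)
```

Writing the delimiter as a raw string (`r'\|'`) avoids doubling the backslash in Python source.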

startswith(substr)[source]

Returns whether substr is a prefix of the string.

Examples

>>> hl.eval(s.startswith('The'))
True

>>> hl.eval(s.startswith('the'))
False

Note

This method is case-sensitive.

Parameters:: substr (StringExpression)
Returns:: BooleanExpression

strip()[source]

Returns a copy of the string with whitespace removed from the start and end.

Examples

>>> s2 = hl.str('  once upon a time\n')
>>> hl.eval(s2.strip())
'once upon a time'

Returns:: StringExpression

summarize(handler=None): Compute and print summary information about the expression.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

take(n, _localize=True)

Collect the first n records of an expression.

Examples

Take the first three rows:

>>> table1.X.take(3)
[5, 6, 7]

Warning

Extremely experimental.

Parameters:: n (int) – Number of records to take.
Returns:: list

translate(mapping)[source]

Translates characters of the string using mapping.

Examples

>>> string = hl.literal('ATTTGCA')
>>> hl.eval(string.translate({'T': 'U'}))
'AUUUGCA'

Parameters:: mapping (DictExpression) – Dictionary of character-character translations.
Returns:: StringExpression
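
A character-to-character mapping of this kind has a close counterpart in plain Python, shown here only as an analogy. The sketch assumes, as the example above suggests, that characters absent from the mapping are left unchanged.

```python
sequence = 'ATTTGCA'

# str.maketrans accepts a dict of single-character keys to replacement strings.
table = str.maketrans({'T': 'U'})
rna = sequence.translate(table)
print(rna)
```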