BlockMatrix

class hail.linalg.BlockMatrix(bmir)
Hail’s block-distributed matrix of tfloat64 elements.

Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
A block matrix is a distributed analogue of a two-dimensional NumPy ndarray with shape (n_rows, n_cols) and NumPy dtype float64. Import the class with:

>>> from hail.linalg import BlockMatrix

Under the hood, block matrices are partitioned like a checkerboard into square blocks with side length a common block size. Blocks in the final row or column of blocks may be truncated, so block size need not evenly divide the matrix dimensions. Block size defaults to the value given by default_block_size().

Operations and broadcasting
The core operations are consistent with NumPy: +, -, *, and / for element-wise addition, subtraction, multiplication, and division; @ for matrix multiplication; T for transpose; and ** for element-wise exponentiation to a scalar power.

For element-wise binary operations, each operand may be a block matrix, an ndarray, or a scalar (int or float). For matrix multiplication, each operand may be a block matrix or an ndarray. If either operand is a block matrix, the result is a block matrix. Binary operations between block matrices require that both operands have the same block size.

To interoperate with block matrices, ndarray operands must be one or two dimensional with dtype convertible to float64. One-dimensional ndarrays of shape (n) are promoted to two-dimensional ndarrays of shape (1, n), i.e. a single row.

Block matrices support broadcasting of +, -, *, and / between matrices of different shapes, consistent with the NumPy broadcasting rules. There is one exception: block matrices do not currently support element-wise “outer product” of a single row and a single column, although the same effect can be achieved for * by using @.

Warning
For binary operations, if the first operand is an ndarray and the second operand is a block matrix, the result will be an ndarray of block matrices. To achieve the desired behavior for + and *, place the block matrix operand first; for -, /, and @, first convert the ndarray to a block matrix using from_numpy().

Warning
Block matrix multiplication requires special care due to each block of each operand being a dependency of multiple blocks in the product.

The \((i, j)\)-block in the product a @ b is computed by summing the products of corresponding blocks in block row \(i\) of a and block column \(j\) of b. So overall, in addition to this multiplication and addition, the evaluation of a @ b realizes each block of a as many times as the number of block columns of b and realizes each block of b as many times as the number of block rows of a.

This becomes a performance and resilience issue whenever a or b is defined in terms of pending transformations (such as linear algebra operations). For example, evaluating a @ (c @ d) will effectively evaluate c @ d as many times as the number of block rows in a.

To limit recomputation, write or cache transformed block matrix operands before feeding them into matrix multiplication:

>>> c = BlockMatrix.read('c.bm')  # doctest: +SKIP
>>> d = BlockMatrix.read('d.bm')  # doctest: +SKIP
>>> (c @ d).write('cd.bm')  # doctest: +SKIP
>>> a = BlockMatrix.read('a.bm')  # doctest: +SKIP
>>> e = a @ BlockMatrix.read('cd.bm')  # doctest: +SKIP
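The counting argument in the warning above can be checked with a small NumPy-only sketch. This is not how Hail schedules the work; the helpers `split_blocks` and `block_matmul` are hypothetical names introduced here, and plain in-memory arrays stand in for distributed blocks.

```python
import numpy as np

def split_blocks(m, block_size):
    """Split a 2-D array into a grid of blocks; edge blocks may be truncated."""
    n_rows, n_cols = m.shape
    return [[m[r:r + block_size, c:c + block_size]
             for c in range(0, n_cols, block_size)]
            for r in range(0, n_rows, block_size)]

def block_matmul(a_blocks, b_blocks):
    """Multiply two block grids and count how often each operand block is used."""
    n_i, n_k, n_j = len(a_blocks), len(b_blocks), len(b_blocks[0])
    a_reads = np.zeros((n_i, n_k), dtype=int)
    b_reads = np.zeros((n_k, n_j), dtype=int)
    product = []
    for i in range(n_i):
        row = []
        for j in range(n_j):
            # The (i, j)-block is the sum over k of a[i][k] @ b[k][j].
            acc = 0
            for k in range(n_k):
                acc = acc + a_blocks[i][k] @ b_blocks[k][j]
                a_reads[i, k] += 1
                b_reads[k, j] += 1
            row.append(acc)
        product.append(row)
    return product, a_reads, b_reads

a = np.arange(30.0).reshape(6, 5)   # 3 block rows, 3 block columns at block size 2
b = np.arange(15.0).reshape(5, 3)   # 3 block rows, 2 block columns at block size 2
product, a_reads, b_reads = block_matmul(split_blocks(a, 2), split_blocks(b, 2))
result = np.block(product)
```

In this sketch every block of `a` is used twice (the number of block columns of `b`) and every block of `b` three times (the number of block rows of `a`), which is why a lazily defined operand would be recomputed that many times.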
Indexing and slicing

Block matrices also support NumPy-style 2-dimensional indexing and slicing, with two differences. First, slices start:stop:step must be non-empty with positive step. Second, even if only one index is a slice, the resulting block matrix is still 2-dimensional.

For example, for a block matrix bm with 10 rows and 10 columns:

- bm[0, 0] is the element in row 0 and column 0 of bm.
- bm[0:1, 0] is a block matrix with 1 row, 1 column, and element bm[0, 0].
- bm[2, :] is a block matrix with 1 row, 10 columns, and elements from row 2 of bm.
- bm[:3, -1] is a block matrix with 3 rows, 1 column, and the first 3 elements of the last column of bm.
- bm[::2, ::2] is a block matrix with 5 rows, 5 columns, and all evenly-indexed elements of bm.
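The shape rule in the list above differs from plain NumPy, where an integer index drops a dimension. A minimal NumPy sketch of the rule as described here follows; `index_2d` is a hypothetical helper written for illustration, not part of Hail.

```python
import numpy as np

def index_2d(nd, row_idx, col_idx):
    """Emulate the indexing rule: two ints give a scalar, anything else stays 2-D."""
    if isinstance(row_idx, int) and isinstance(col_idx, int):
        return nd[row_idx, col_idx]

    def keep_dim(idx, size):
        # Turn an integer index into a length-1 slice so the axis is kept.
        if isinstance(idx, int):
            idx = idx % size  # map negative indices to their position
            return slice(idx, idx + 1)
        return idx

    return nd[keep_dim(row_idx, nd.shape[0]), keep_dim(col_idx, nd.shape[1])]

nd = np.arange(100.0).reshape(10, 10)
element = index_2d(nd, 0, 0)                          # scalar
one_by_one = index_2d(nd, slice(0, 1), 0)             # shape (1, 1)
row_two = index_2d(nd, 2, slice(None))                # shape (1, 10)
last_col_head = index_2d(nd, slice(None, 3), -1)      # shape (3, 1)
evens = index_2d(nd, slice(None, None, 2), slice(None, None, 2))  # shape (5, 5)
```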
Use filter(), filter_rows(), and filter_cols() to subset to non-slice subsets of rows and columns, e.g. to rows [0, 2, 5].

Block-sparse representation
By default, block matrices compute and store all blocks explicitly. However, some applications involve block matrices in which:

- some blocks consist entirely of zeroes.
- some blocks are not of interest.

For example, statistical geneticists often want to compute and manipulate a banded correlation matrix capturing “linkage disequilibrium” between nearby variants along the genome. In this case, working with the full correlation matrix for tens of millions of variants would be prohibitively expensive, and in any case, entries far from the diagonal are either not of interest or ought to be zeroed out before downstream linear algebra.

To enable such computations, block matrices do not require that all blocks be realized explicitly. Implicit (dropped) blocks behave as blocks of zeroes, so we refer to a block matrix in which at least one block is implicitly zero as a block-sparse matrix. Otherwise, we say the matrix is block-dense. The property is_sparse() encodes this state.

Dropped blocks are not stored in memory or on write(). In fact, blocks that are dropped prior to an action like export() or to_numpy() are never computed in the first place, nor are any blocks of upstream operands on which only dropped blocks depend! In addition, linear algebra is accelerated by avoiding, for example, explicit addition of or multiplication by blocks of zeroes.

Block-sparse matrices may be created with sparsify_band(), sparsify_rectangles(), sparsify_row_intervals(), and sparsify_triangle().

The following methods naturally propagate block-sparsity:

- Addition and subtraction “union” realized blocks.
- Element-wise multiplication “intersects” realized blocks.
- Transpose “transposes” realized blocks.
- abs() and sqrt() preserve the realized blocks.
- sum() along an axis realizes those blocks for which at least one block summand is realized.

The following methods always result in a block-dense matrix:

- fill()
- Addition or subtraction of a scalar or broadcasted vector.
- Matrix multiplication, @.
- Matrix slicing, and more generally filter(), filter_rows(), and filter_cols().

The following methods fail if any operand is block-sparse, but can be forced by first applying densify().

- Element-wise division between two block matrices.
- Multiplication by a scalar or broadcasted vector which includes an infinite or nan value.
- Division by a scalar or broadcasted vector which includes a zero, infinite or nan value.
- Division of a scalar or broadcasted vector by a block matrix.
- Element-wise exponentiation by a negative exponent.
- Natural logarithm, log().
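The propagation rules listed above can be pictured with boolean masks over the grid of blocks, where True marks a realized block. This is only a NumPy illustration of the rules as stated; the mask representation and the helper name `mask_is_sparse` are assumptions made here and say nothing about Hail's internal data structures.

```python
import numpy as np

# Realized-block masks for two hypothetical 2 x 2 block grids (True = block is stored).
a_realized = np.array([[True, False],
                       [False, True]])
b_realized = np.array([[True, True],
                       [False, False]])

# Addition and subtraction "union" realized blocks.
sum_realized = a_realized | b_realized

# Element-wise multiplication "intersects" realized blocks.
product_realized = a_realized & b_realized

# Transpose "transposes" realized blocks.
transpose_realized = b_realized.T

# Summing over rows (axis=0) realizes a result block when any block above it is realized.
column_sums_realized = b_realized.any(axis=0, keepdims=True)

def mask_is_sparse(realized):
    """A matrix is block-sparse when at least one block is dropped."""
    return not realized.all()
```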
Attributes

T             Matrix transpose.
block_size    Block size.
element_type
is_sparse     Returns True if block-sparse.
n_cols        Number of columns.
n_rows        Number of rows.
shape         Shape of matrix.

Methods

__init__                   Initialize self.
abs                        Element-wise absolute value.
cache                      Persist this block matrix in memory.
ceil                       Element-wise ceiling.
default_block_size         Default block side length.
densify                    Restore all dropped blocks as explicit blocks of zeros.
diagonal                   Extracts diagonal elements as a row vector.
entries                    Returns a table with the indices and value of each block matrix entry.
export                     Exports a stored block matrix as a delimited text file.
export_blocks              Export each block of the block matrix as its own delimited text or binary file.
export_rectangles          Export rectangular regions from a block matrix to delimited text or binary files.
fill                       Creates a block matrix with all elements the same value.
filter                     Filters matrix rows and columns.
filter_cols                Filters matrix columns.
filter_rows                Filters matrix rows.
floor                      Element-wise floor.
from_entry_expr            Creates a block matrix using a matrix table entry expression.
from_numpy                 Distributes a NumPy ndarray as a block matrix.
fromfile                   Creates a block matrix from a binary file.
log                        Element-wise natural logarithm.
persist                    Persists this block matrix in memory or on disk.
random                     Creates a block matrix with standard normal or uniform random entries.
read                       Reads a block matrix.
rectangles_to_numpy        Instantiates a NumPy ndarray from files of rectangles written out using export_rectangles() or export_blocks().
sparsify_band              Filter to a diagonal band.
sparsify_rectangles        Filter to blocks overlapping the union of rectangular regions.
sparsify_row_intervals     Creates a block-sparse matrix by filtering to an interval for each row.
sparsify_triangle          Filter to the upper or lower triangle.
sqrt                       Element-wise square root.
sum                        Sums array elements over one or both axes.
svd                        Computes the reduced singular value decomposition.
to_matrix_table_row_major  Returns a matrix table with row key of row_idx and col key col_idx, whose entries are structs of a single field element.
to_numpy                   Collects the block matrix into a NumPy ndarray.
to_table_row_major         Returns a table where each row represents a row in the block matrix.
tofile                     Collects and writes data to a binary file.
unpersist                  Unpersists this block matrix from memory/disk.
write                      Writes the block matrix.
write_from_entry_expr      Writes a block matrix from a matrix table entry expression.

T
Matrix transpose.
Returns: BlockMatrix

abs()
Element-wise absolute value.
Returns: BlockMatrix

block_size
Block size.
Returns: int

cache()
Persist this block matrix in memory.

Notes

This method is an alias for persist("MEMORY_ONLY").

Returns: BlockMatrix – Cached block matrix.

ceil()
Element-wise ceiling.
Returns: BlockMatrix

densify()
Restore all dropped blocks as explicit blocks of zeros.
Returns: BlockMatrix

diagonal()
Extracts diagonal elements as a row vector.
Returns: BlockMatrix

element_type

entries(keyed=True)
Returns a table with the indices and value of each block matrix entry.

Examples

>>> import numpy as np
>>> block_matrix = BlockMatrix.from_numpy(np.array([[5, 7], [2, 8]]), 2)
>>> entries_table = block_matrix.entries()
>>> entries_table.show()
+-------+-------+----------+
|     i |     j |    entry |
+-------+-------+----------+
| int64 | int64 |  float64 |
+-------+-------+----------+
|     0 |     0 | 5.00e+00 |
|     0 |     1 | 7.00e+00 |
|     1 |     0 | 2.00e+00 |
|     1 |     1 | 8.00e+00 |
+-------+-------+----------+

Notes

The resulting table may be filtered, aggregated, and queried, but should only be directly exported to disk if the block matrix is very small.

For block-sparse matrices, only realized blocks are included. To force inclusion of zeroes in dropped blocks, apply densify() first.

The resulting table has the following fields:

- i (tint64, key field) – Row index.
- j (tint64, key field) – Column index.
- entry (tfloat64) – Value of entry.

Returns: Table – Table with a row for each entry.

static export(path_in, path_out, delimiter='\t', header=None, add_index=False, parallel=None, partition_size=None, entries='full')
Exports a stored block matrix as a delimited text file.

Examples

Consider the following matrix.

>>> import numpy as np
>>> nd = np.array([[1.0, 0.8, 0.7],
...                [0.8, 1.0, 0.3],
...                [0.7, 0.3, 1.0]])
>>> BlockMatrix.from_numpy(nd).write('output/example.bm', overwrite=True, force_row_major=True)

Export the full matrix as a file with tab-separated values:

>>> BlockMatrix.export('output/example.bm', 'output/example.tsv')

Export the upper-triangle of the matrix as a block gzipped file of comma-separated values.

>>> BlockMatrix.export(path_in='output/example.bm',
...                    path_out='output/example.csv.bgz',
...                    delimiter=',',
...                    entries='upper')

Export the full matrix with row indices in parallel as a folder of gzipped files, each with a header line for columns idx, A, B, and C.

>>> BlockMatrix.export(path_in='output/example.bm',
...                    path_out='output/example.gz',
...                    header=' '.join(['idx', 'A', 'B', 'C']),
...                    add_index=True,
...                    parallel='header_per_shard',
...                    partition_size=2)

This produces two compressed files which uncompress to:

idx A B C
0 1.0 0.8 0.7
1 0.8 1.0 0.3

idx A B C
2 0.7 0.3 1.0

Warning
The block matrix must be stored in row-major format, as results from BlockMatrix.write() with force_row_major=True and from BlockMatrix.write_from_entry_expr(). Otherwise, export() will fail.

Notes
The five options for entries are illustrated below.

Full:

1.0 0.8 0.7
0.8 1.0 0.3
0.7 0.3 1.0

Lower triangle:

1.0
0.8 1.0
0.7 0.3 1.0

Strict lower triangle:

0.8
0.7 0.3

Upper triangle:

1.0 0.8 0.7
1.0 0.3
1.0

Strict upper triangle:

0.8 0.7
0.3
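The five layouts illustrated above can be reproduced with a short NumPy sketch. `exported_rows` is a hypothetical helper written for this illustration; it only shows which values each option keeps per row. Rows that end up empty are omitted here to match the illustration, and whether the real export writes a blank line for them is not stated in this documentation.

```python
import numpy as np

nd = np.array([[1.0, 0.8, 0.7],
               [0.8, 1.0, 0.3],
               [0.7, 0.3, 1.0]])

def exported_rows(nd, entries='full'):
    """Return, for each matrix row, the values kept by the given `entries` option."""
    n_cols = nd.shape[1]
    # Column range [start, stop) kept in row i for each option.
    bounds = {
        'full': lambda i: (0, n_cols),
        'lower': lambda i: (0, i + 1),
        'strict_lower': lambda i: (0, i),
        'upper': lambda i: (i, n_cols),
        'strict_upper': lambda i: (i + 1, n_cols),
    }[entries]
    rows = [nd[i, slice(*bounds(i))].tolist() for i in range(nd.shape[0])]
    return [row for row in rows if row]  # drop rows with no kept values

for option in ['full', 'lower', 'strict_lower', 'upper', 'strict_upper']:
    print(option)
    for row in exported_rows(nd, option):
        print(' '.join(str(value) for value in row))
```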
The number of columns must be less than \(2^{31}\).

The number of partitions (file shards) exported equals the ceiling of n_rows / partition_size. By default, there is one partition per row of blocks in the block matrix. The number of partitions should be at least the number of cores for efficient parallelism. Setting the partition size to an exact (rather than approximate) divisor or multiple of the block size reduces superfluous shuffling of data.

If parallel is None, these file shards are then serially concatenated by one core into one file, a slow process. See other options below.

It is highly recommended to export large files with a .bgz extension, which will use a block gzipped compression codec. These files can be read natively with Python’s gzip.open and R’s read.table.

Parameters:
- path_in (str) – Path to input block matrix, stored row-major on disk.
- path_out (str) – Path for export. Use extension .gz for gzip or .bgz for block gzip.
- delimiter (str) – Column delimiter.
- header (str, optional) – If provided, header is prepended before the first row of data.
- add_index (bool) – If True, add an initial column with the absolute row index.
- parallel (str, optional) – If 'header_per_shard', create a folder with one file per partition, each with a header if provided. If 'separate_header', create a folder with one file per partition without a header; write the header, if provided, in a separate file. If None, serially concatenate the header and all partitions into one file; export will be slower. If header is None then 'header_per_shard' and 'separate_header' are equivalent.
- partition_size (int, optional) – Number of rows to group per partition for export. Default given by block size of the block matrix.
- entries (str) – Describes which entries to export. One of: 'full', 'lower', 'strict_lower', 'upper', 'strict_upper'.

export_blocks(path_out, delimiter='\t', binary=False)
Export each block of the block matrix as its own delimited text or binary file. This is a special case of export_rectangles().

Examples

Consider the following block matrix:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0],
...                [ 4.0,  5.0,  6.0],
...                [ 7.0,  8.0,  9.0]])

>>> BlockMatrix.from_numpy(nd, block_size=2).export_blocks('output/example')

This produces four files in the folder output/example.

The first file is rect-0_0-2-0-2:

1.0 2.0
4.0 5.0

The second file is rect-1_0-2-2-3:

3.0
6.0

The third file is rect-2_2-3-0-2:

7.0 8.0

And the fourth file is rect-3_2-3-2-3:

9.0
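Since this method is described as a special case of export_rectangles(), the four regions in the example can be derived from the matrix shape and block size. The sketch below is an assumption-laden illustration: `block_rectangles` is a hypothetical helper, and the row-by-row ordering of blocks is inferred from the example rather than documented.

```python
import math

def block_rectangles(n_rows, n_cols, block_size):
    """List [row_start, row_stop, col_start, col_stop] for each block, row by row."""
    n_block_rows = math.ceil(n_rows / block_size)
    n_block_cols = math.ceil(n_cols / block_size)
    rectangles = []
    for block_row in range(n_block_rows):
        for block_col in range(n_block_cols):
            rectangles.append([
                block_row * block_size,
                min((block_row + 1) * block_size, n_rows),  # truncate the last block row
                block_col * block_size,
                min((block_col + 1) * block_size, n_cols),  # truncate the last block column
            ])
    return rectangles

# A 3 x 3 matrix with block size 2, as in the example above.
rectangles = block_rectangles(3, 3, 2)
```

Under these assumptions the truncated final block covers only row 2 and column 2, which holds the single value 9.0.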
Notes

This method does not have any matrix size limitations.

If exporting to binary files, note that they are not platform independent. No byte-order or data-type information is saved.

See also

Parameters:
- path_out (str) – Path for folder of exported files.
- delimiter (str) – Column delimiter.
- binary (bool) – If true, export elements as raw bytes in row major order.

export_rectangles(path_out, rectangles, delimiter='\t', binary=False)
Export rectangular regions from a block matrix to delimited text or binary files.

Examples

Consider the following block matrix:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0,  4.0],
...                [ 5.0,  6.0,  7.0,  8.0],
...                [ 9.0, 10.0, 11.0, 12.0],
...                [13.0, 14.0, 15.0, 16.0]])

Filter to the three rectangles and export as TSV files.

>>> rectangles = [[0, 1, 0, 1], [0, 3, 0, 2], [1, 2, 0, 4]]
>>>
>>> (BlockMatrix.from_numpy(nd)
...     .export_rectangles('output/example.bm', rectangles))

This produces three files in the folder output/example.bm.

The first file is rect-0_0-1-0-1:

1.0

The second file is rect-1_0-3-0-2:

1.0 2.0
5.0 6.0
9.0 10.0

The third file is rect-2_1-2-0-4:

5.0 6.0 7.0 8.0
Notes

This method exports rectangular regions of a stored block matrix to delimited text or binary files, in parallel by region.

Each rectangle is encoded as a list of length four of the form [row_start, row_stop, col_start, col_stop], where starts are inclusive and stops are exclusive. These must satisfy 0 <= row_start <= row_stop <= n_rows and 0 <= col_start <= col_stop <= n_cols.

For example [0, 2, 1, 3] corresponds to the row-index range [0, 2) and column-index range [1, 3), i.e. the elements at positions (0, 1), (0, 2), (1, 1), and (1, 2).

Each file name encodes the index of the rectangle in rectangles and the bounds as formatted in the example.

The block matrix can be sparse provided all blocks overlapping the rectangles are present, i.e. this method does not currently support implicit zeros.

If binary is true, each element is exported as 8 bytes, in row major order with no delimiting, new lines, or shape information. Such files can instantiate, for example, NumPy ndarrays using fromfile and reshape. Note however that these binary files are not platform independent; in particular, no byte-order or data-type information is saved.

The number of rectangles must be less than \(2^{29}\).
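The rectangle encoding in the notes above maps directly onto NumPy half-open slices. The following sketch uses a hypothetical helper, `extract_rectangle`, to select the same three regions as the example from an in-memory array; it does not write any files.

```python
import numpy as np

nd = np.arange(1.0, 17.0).reshape(4, 4)
rectangles = [[0, 1, 0, 1], [0, 3, 0, 2], [1, 2, 0, 4]]

def extract_rectangle(nd, rectangle):
    """Return the region [row_start, row_stop) x [col_start, col_stop) of a 2-D array."""
    row_start, row_stop, col_start, col_stop = rectangle
    n_rows, n_cols = nd.shape
    # Enforce the bounds stated in the notes: starts inclusive, stops exclusive.
    if not (0 <= row_start <= row_stop <= n_rows):
        raise ValueError(f'invalid row bounds in {rectangle}')
    if not (0 <= col_start <= col_stop <= n_cols):
        raise ValueError(f'invalid column bounds in {rectangle}')
    return nd[row_start:row_stop, col_start:col_stop]

regions = [extract_rectangle(nd, rectangle) for rectangle in rectangles]
```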
Parameters:
- path_out (str) – Path for folder of exported files.
- rectangles (list of list of int) – List of rectangles of the form [row_start, row_stop, col_start, col_stop].
- delimiter (str) – Column delimiter.
- binary (bool) – If true, export elements as raw bytes in row major order.

classmethod fill(n_rows, n_cols, value, block_size=None)
Creates a block matrix with all elements the same value.

Examples

Create a block matrix with 10 rows, 20 columns, and all elements equal to 1.0:

>>> bm = BlockMatrix.fill(10, 20, 1.0)

Parameters:
- n_rows (int) – Number of rows.
- n_cols (int) – Number of columns.
- value (float) – Value of all elements.
- block_size (int, optional) – Block size. Default given by default_block_size().

Returns: BlockMatrix

filter(rows_to_keep, cols_to_keep)
Filters matrix rows and columns.

Notes

This method has the same effect as BlockMatrix.filter_cols() followed by BlockMatrix.filter_rows() (or vice versa), but filters the block matrix in a single pass which may be more efficient.

Parameters:
- rows_to_keep (list of int) – Indices of rows to keep. Must be non-empty and increasing.
- cols_to_keep (list of int) – Indices of columns to keep. Must be non-empty and increasing.

Returns: BlockMatrix
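The equivalence stated in the notes for filter() has a direct NumPy analogue. In the sketch below, `check_indices` is a hypothetical helper, and reading "increasing" as strictly increasing is an assumption of this example rather than something the documentation states.

```python
import numpy as np

def check_indices(indices):
    """Validate that indices are non-empty and (assumed strictly) increasing."""
    if len(indices) == 0:
        raise ValueError('indices must be non-empty')
    if any(later <= earlier for earlier, later in zip(indices, indices[1:])):
        raise ValueError('indices must be increasing')
    return list(indices)

nd = np.arange(100.0).reshape(10, 10)
rows_to_keep = check_indices([0, 2, 5])
cols_to_keep = check_indices([1, 3])

# Single pass: select rows and columns together.
filtered = nd[np.ix_(rows_to_keep, cols_to_keep)]

# Two passes: filter columns, then rows. The result is the same.
two_pass = nd[:, cols_to_keep][rows_to_keep, :]
```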

filter_cols(cols_to_keep)
Filters matrix columns.

Parameters: cols_to_keep (list of int) – Indices of columns to keep. Must be non-empty and increasing.

Returns: BlockMatrix

filter_rows(rows_to_keep)
Filters matrix rows.

Parameters: rows_to_keep (list of int) – Indices of rows to keep. Must be non-empty and increasing.

Returns: BlockMatrix

floor()
Element-wise floor.
Returns: BlockMatrix

classmethod from_entry_expr(entry_expr, mean_impute=False, center=False, normalize=False, axis='rows', block_size=None)
Creates a block matrix using a matrix table entry expression.

Examples

>>> mt = hl.balding_nichols_model(3, 25, 50)
>>> bm = BlockMatrix.from_entry_expr(mt.GT.n_alt_alleles())

Notes

This convenience method writes the block matrix to a temporary file on persistent disk and then reads the file. If you want to store the resulting block matrix, use write_from_entry_expr() directly to avoid writing the result twice. See write_from_entry_expr() for further documentation.

Warning
If the rows of the matrix table have been filtered to a small fraction, then call MatrixTable.repartition() before this method to improve performance.

If you encounter a Hadoop write/replication error, increase the number of persistent workers or the disk size per persistent worker, or use write_from_entry_expr() to write to external storage.

This method opens n_cols / block_size files concurrently per task. To not blow out memory when the number of columns is very large, limit the Hadoop write buffer size; e.g. on GCP, set this property on cluster startup (the default is 64MB): --properties 'core:fs.gs.io.buffersize.write=1048576'.

Parameters:
- entry_expr (Float64Expression) – Entry expression for numeric matrix entries.
- mean_impute (bool) – If true, set missing values to the row mean before centering or normalizing. If false, missing values will raise an error.
- center (bool) – If true, subtract the row mean.
- normalize (bool) – If true and center=False, divide by the row magnitude. If true and center=True, divide the centered value by the centered row magnitude.
- axis (str) – One of “rows” or “cols”: axis by which to normalize or center.
- block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size().
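The interaction of mean_impute, center, and normalize described in the parameter list can be sketched on a small in-memory array. This is a NumPy illustration for axis='rows' only, under two assumptions made for the example: missing values are represented as NaN, and "magnitude" means the Euclidean norm of the row. `standardize_rows` is a hypothetical helper, not a Hail function.

```python
import numpy as np

def standardize_rows(x, mean_impute=False, center=False, normalize=False):
    """Apply row-wise imputation, centering, and normalization in that order."""
    x = np.array(x, dtype=np.float64)
    missing = np.isnan(x)
    if missing.any():
        if not mean_impute:
            raise ValueError('missing values found and mean_impute is False')
        # Replace each missing value by the mean of the non-missing values in its row.
        row_means = np.nanmean(x, axis=1, keepdims=True)
        x = np.where(missing, row_means, x)
    if center:
        x = x - x.mean(axis=1, keepdims=True)
    if normalize:
        # Divide by the (possibly centered) row magnitude, assumed Euclidean here.
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x

raw = [[1.0, 2.0, 3.0],
       [4.0, np.nan, 8.0]]
imputed = standardize_rows(raw, mean_impute=True)
standardized = standardize_rows(raw, mean_impute=True, center=True, normalize=True)
```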

classmethod from_numpy(ndarray, block_size=None)
Distributes a NumPy ndarray as a block matrix.

Examples

>>> import numpy as np
>>> a = np.random.rand(10, 20)
>>> bm = BlockMatrix.from_numpy(a)

Notes

The ndarray must have two dimensions, each of non-zero size.

The number of entries must be less than \(2^{31}\).

Parameters:
- ndarray (numpy.ndarray) – ndarray with two dimensions, each of non-zero size.
- block_size (int, optional) – Block size. Default given by default_block_size().

Returns: BlockMatrix

classmethod fromfile(uri, n_rows, n_cols, block_size=None)
Creates a block matrix from a binary file.

Examples

>>> import numpy as np
>>> a = np.random.rand(10, 20)
>>> a.tofile('/local/file')  # doctest: +SKIP

To create a block matrix of the same dimensions:

>>> bm = BlockMatrix.fromfile('file:///local/file', 10, 20)  # doctest: +SKIP

Notes

This method, analogous to numpy.fromfile, reads a binary file of float64 values in row-major order, such as that produced by numpy.tofile or BlockMatrix.tofile().

Binary files produced and consumed by tofile() and fromfile() are not platform independent, so should only be used for interoperating with NumPy, not storage. Use BlockMatrix.write() and BlockMatrix.read() to save and load block matrices, since these methods write and read blocks in parallel and are platform independent.

A NumPy ndarray must have type float64 for the output of numpy.tofile to be a valid binary input to fromfile(). This is not checked.

The number of entries must be less than \(2^{31}\).

Parameters:
- uri (str, optional) – URI of binary input file.
- n_rows (int) – Number of rows.
- n_cols (int) – Number of columns.
- block_size (int, optional) – Block size. Default given by default_block_size().

See also
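The binary layout described in the notes (raw float64 values in row-major order, with no shape or type information) can be demonstrated with NumPy alone. The sketch writes to a temporary directory; the file name a.bin is arbitrary. It also shows why the float64 requirement matters: a file written from float32 data is silently misread.

```python
import os
import tempfile

import numpy as np

a = np.random.rand(10, 20)  # float64 by default

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'a.bin')

    # Round trip: the file holds only raw values, so the shape must be supplied again.
    a.tofile(path)
    restored = np.fromfile(path, dtype=np.float64).reshape(10, 20)

    # Writing float32 data and reading it as float64 raises no error,
    # but yields half as many values, none of them meaningful.
    a.astype(np.float32).tofile(path)
    misread = np.fromfile(path, dtype=np.float64)
```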

is_sparse
Returns True if block-sparse.

Notes

A block matrix is block-sparse if at least one of its blocks is dropped, i.e. implicitly a block of zeros.

Returns: bool

log()
Element-wise natural logarithm.
Returns: BlockMatrix

n_cols
Number of columns.
Returns: int

n_rows
Number of rows.
Returns: int

persist(storage_level='MEMORY_AND_DISK')
Persists this block matrix in memory or on disk.

Notes

The BlockMatrix.persist() and BlockMatrix.cache() methods store the current block matrix on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitute for BlockMatrix.write(), which stores a permanent file.

Most users should use the “MEMORY_AND_DISK” storage level. See the Spark documentation for a more in-depth discussion of persisting data.

Parameters: storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP

Returns: BlockMatrix – Persisted block matrix.

classmethod random(n_rows, n_cols, block_size=None, seed=0, gaussian=True)
Creates a block matrix with standard normal or uniform random entries.

Examples

Create a block matrix with 10 rows, 20 columns, and standard normal entries:

>>> bm = BlockMatrix.random(10, 20)

Parameters:
- n_rows (int) – Number of rows.
- n_cols (int) – Number of columns.
- block_size (int, optional) – Block size. Default given by default_block_size().
- seed (int) – Random seed.
- gaussian (bool) – If True, entries are drawn from the standard normal distribution. If False, entries are drawn from the uniform distribution on [0,1].

Returns: BlockMatrix

classmethod read(path)
Reads a block matrix.

Parameters: path (str) – Path to input file.

Returns: BlockMatrix

static rectangles_to_numpy(path, binary=False)
Instantiates a NumPy ndarray from files of rectangles written out using export_rectangles() or export_blocks(). For any given dimension, the ndarray will have length equal to the upper bound of that dimension across the union of the rectangles. Entries not covered by any rectangle will be initialized to 0.

Examples

Consider the following:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0],
...                [ 4.0,  5.0,  6.0],
...                [ 7.0,  8.0,  9.0]])

>>> BlockMatrix.from_numpy(nd).export_rectangles('output/example', [[0, 3, 0, 1], [1, 2, 0, 2]])
>>> BlockMatrix.rectangles_to_numpy('output/example')

This would produce the following NumPy ndarray:

1.0 0.0
4.0 5.0
7.0 0.0
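The sizing and zero-filling rule described above can be sketched without touching the file system. In this NumPy illustration, `assemble_rectangles` is a hypothetical helper that takes the rectangle bounds and their values directly instead of reading exported files.

```python
import numpy as np

def assemble_rectangles(rectangles, regions):
    """Place each region at its bounds in a zero-initialized array."""
    # Each dimension extends to the largest stop index across all rectangles.
    n_rows = max(rectangle[1] for rectangle in rectangles)
    n_cols = max(rectangle[3] for rectangle in rectangles)
    out = np.zeros((n_rows, n_cols))
    for (row_start, row_stop, col_start, col_stop), region in zip(rectangles, regions):
        out[row_start:row_stop, col_start:col_stop] = region
    return out

nd = np.arange(1.0, 10.0).reshape(3, 3)
rectangles = [[0, 3, 0, 1], [1, 2, 0, 2]]
regions = [nd[r0:r1, c0:c1] for r0, r1, c0, c1 in rectangles]
assembled = assemble_rectangles(rectangles, regions)
```

The result has 3 rows and 2 columns because no rectangle reaches column 2, and the two entries not covered by either rectangle stay at 0.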
Notes

If exporting to binary files, note that they are not platform independent. No byte-order or data-type information is saved.

See also

Parameters:
- path (str) – Path to directory where rectangles were written.
- binary (bool) – If true, reads the files as binary, otherwise as text delimited.

Returns: numpy.ndarray

shape
Shape of matrix.
Returns: (int, int) – Number of rows and number of columns.

sparsify_band(lower=0, upper=0, blocks_only=False)
Filter to a diagonal band.

Examples

Consider the following block matrix:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0,  4.0],
...                [ 5.0,  6.0,  7.0,  8.0],
...                [ 9.0, 10.0, 11.0, 12.0],
...                [13.0, 14.0, 15.0, 16.0]])
>>> bm = BlockMatrix.from_numpy(nd, block_size=2)

Filter to a band from one below the diagonal to two above the diagonal and collect to NumPy:

>>> bm.sparsify_band(lower=-1, upper=2).to_numpy()  # doctest: +NOTEST
array([[ 1.,  2.,  3.,  0.],
       [ 5.,  6.,  7.,  8.],
       [ 0., 10., 11., 12.],
       [ 0.,  0., 15., 16.]])

Set all blocks fully outside the diagonal to zero and collect to NumPy:

>>> bm.sparsify_band(lower=0, upper=0, blocks_only=True).to_numpy()  # doctest: +NOTEST
array([[ 1.,  2.,  0.,  0.],
       [ 5.,  6.,  0.,  0.],
       [ 0.,  0., 11., 12.],
       [ 0.,  0., 15., 16.]])

Notes

This method creates a block-sparse matrix by zeroing out all blocks which are disjoint from a diagonal band. By default, all elements outside the band but inside blocks that overlap the band are set to zero as well.

The band is defined in terms of inclusive lower and upper indices relative to the diagonal. For example, the indices -1, 0, and 1 correspond to the sub-diagonal, diagonal, and super-diagonal, respectively. The diagonal band contains the elements at positions \((i, j)\) such that

\[\mathrm{lower} \leq j - i \leq \mathrm{upper}.\]

lower must be less than or equal to upper, but their values may exceed the dimensions of the matrix, the band need not include the diagonal, and the matrix need not be square.
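The band condition on \(j - i\) and the effect of blocks_only can be reproduced on a dense NumPy array. This sketch only mimics the resulting values; `band_keep_mask` is a hypothetical helper, and nothing here reflects how Hail drops blocks internally.

```python
import numpy as np

def band_keep_mask(shape, lower, upper, block_size, blocks_only=False):
    """Boolean mask of the elements that survive a band filter."""
    i, j = np.indices(shape)
    in_band = (lower <= j - i) & (j - i <= upper)
    if not blocks_only:
        return in_band
    # With blocks_only, keep every block that overlaps the band in full.
    keep = np.zeros(shape, dtype=bool)
    for r in range(0, shape[0], block_size):
        for c in range(0, shape[1], block_size):
            if in_band[r:r + block_size, c:c + block_size].any():
                keep[r:r + block_size, c:c + block_size] = True
    return keep

nd = np.arange(1.0, 17.0).reshape(4, 4)

# One below the diagonal to two above it.
banded = np.where(band_keep_mask(nd.shape, -1, 2, block_size=2), nd, 0.0)

# Diagonal only, but at block granularity.
diagonal_blocks = np.where(
    band_keep_mask(nd.shape, 0, 0, block_size=2, blocks_only=True), nd, 0.0)
```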
Parameters:
- lower (int) – Index of lowest band relative to the diagonal.
- upper (int) – Index of highest band relative to the diagonal.
- blocks_only (bool) – If False, set all elements outside the band to zero. If True, only set all blocks outside the band to blocks of zeros; this is more efficient.

Returns: BlockMatrix – Sparse block matrix.

sparsify_rectangles(rectangles)
Filter to blocks overlapping the union of rectangular regions.

Examples

Consider the following block matrix:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0,  4.0],
...                [ 5.0,  6.0,  7.0,  8.0],
...                [ 9.0, 10.0, 11.0, 12.0],
...                [13.0, 14.0, 15.0, 16.0]])
>>> bm = BlockMatrix.from_numpy(nd, block_size=2)

Filter to blocks covering three rectangles and collect to NumPy:

>>> bm.sparsify_rectangles([[0, 1, 0, 1], [0, 3, 0, 2], [1, 2, 0, 4]]).to_numpy()  # doctest: +NOTEST
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10.,  0.,  0.],
       [13., 14.,  0.,  0.]])

Notes

This method creates a block-sparse matrix by zeroing out (dropping) all blocks which are disjoint from the union of a set of rectangular regions. Partially overlapping blocks are not modified.

Each rectangle is encoded as a list of length four of the form [row_start, row_stop, col_start, col_stop], where starts are inclusive and stops are exclusive. These must satisfy 0 <= row_start <= row_stop <= n_rows and 0 <= col_start <= col_stop <= n_cols.

For example [0, 2, 1, 3] corresponds to the row-index range [0, 2) and column-index range [1, 3), i.e. the elements at positions (0, 1), (0, 2), (1, 1), and (1, 2).

The number of rectangles must be less than \(2^{29}\).
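Because partially overlapping blocks are left unmodified, the result depends on the block size as well as on the rectangles. A dense NumPy sketch of that behaviour follows; `rectangles_keep_mask` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def rectangles_keep_mask(shape, rectangles, block_size):
    """Boolean mask keeping every block that overlaps the union of the rectangles."""
    covered = np.zeros(shape, dtype=bool)
    for row_start, row_stop, col_start, col_stop in rectangles:
        covered[row_start:row_stop, col_start:col_stop] = True
    keep = np.zeros(shape, dtype=bool)
    for r in range(0, shape[0], block_size):
        for c in range(0, shape[1], block_size):
            # A block is kept whole as soon as one of its elements is covered.
            if covered[r:r + block_size, c:c + block_size].any():
                keep[r:r + block_size, c:c + block_size] = True
    return keep

nd = np.arange(1.0, 17.0).reshape(4, 4)
rectangles = [[0, 1, 0, 1], [0, 3, 0, 2], [1, 2, 0, 4]]
sparsified = np.where(rectangles_keep_mask(nd.shape, rectangles, block_size=2), nd, 0.0)
```

Only the bottom-right block is dropped here, since none of the three rectangles reaches rows 2 to 3 and columns 2 to 3 at the same time.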
Parameters: rectangles (list of list of int) – List of rectangles of the form [row_start, row_stop, col_start, col_stop].

Returns: BlockMatrix – Sparse block matrix.

sparsify_row_intervals(starts, stops, blocks_only=False)
Creates a block-sparse matrix by filtering to an interval for each row.

Examples

Consider the following block matrix:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0,  4.0],
...                [ 5.0,  6.0,  7.0,  8.0],
...                [ 9.0, 10.0, 11.0, 12.0],
...                [13.0, 14.0, 15.0, 16.0]])
>>> bm = BlockMatrix.from_numpy(nd, block_size=2)

Set all elements outside the given row intervals to zero and collect to NumPy:

>>> (bm.sparsify_row_intervals(starts=[1, 0, 2, 2],
...                            stops= [2, 0, 3, 4])
...    .to_numpy())  # doctest: +NOTEST
array([[ 0.,  2.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0., 11.,  0.],
       [ 0.,  0., 15., 16.]])

Set all blocks fully outside the given row intervals to blocks of zeros and collect to NumPy:

>>> (bm.sparsify_row_intervals(starts=[1, 0, 2, 2],
...                            stops= [2, 0, 3, 4],
...                            blocks_only=True)
...    .to_numpy())  # doctest: +NOTEST
array([[ 1.,  2.,  0.,  0.],
       [ 5.,  6.,  0.,  0.],
       [ 0.,  0., 11., 12.],
       [ 0.,  0., 15., 16.]])

Notes

This method creates a block-sparse matrix by zeroing out all blocks which are disjoint from all row intervals. By default, all elements outside the row intervals but inside blocks that overlap the row intervals are set to zero as well.

starts and stops must both have length equal to the number of rows. The interval for row i is [starts[i], stops[i]). In particular, 0 <= starts[i] <= stops[i] <= n_cols is required for all i.

This method requires the number of rows to be less than \(2^{31}\).
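The per-row interval rule, including the empty interval for row 1 in the example, can be checked with a dense NumPy sketch. `row_interval_keep_mask` is a hypothetical helper and mirrors only the resulting values.

```python
import numpy as np

def row_interval_keep_mask(shape, starts, stops, block_size, blocks_only=False):
    """Boolean mask of elements kept when row i is restricted to [starts[i], stops[i])."""
    n_rows, n_cols = shape
    if len(starts) != n_rows or len(stops) != n_rows:
        raise ValueError('starts and stops must have one entry per row')
    cols = np.arange(n_cols)
    starts = np.asarray(starts)[:, None]
    stops = np.asarray(stops)[:, None]
    in_interval = (starts <= cols) & (cols < stops)
    if not blocks_only:
        return in_interval
    # With blocks_only, keep every block that overlaps some row interval in full.
    keep = np.zeros(shape, dtype=bool)
    for r in range(0, n_rows, block_size):
        for c in range(0, n_cols, block_size):
            if in_interval[r:r + block_size, c:c + block_size].any():
                keep[r:r + block_size, c:c + block_size] = True
    return keep

nd = np.arange(1.0, 17.0).reshape(4, 4)
starts, stops = [1, 0, 2, 2], [2, 0, 3, 4]
by_element = np.where(row_interval_keep_mask(nd.shape, starts, stops, 2), nd, 0.0)
by_block = np.where(
    row_interval_keep_mask(nd.shape, starts, stops, 2, blocks_only=True), nd, 0.0)
```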
Parameters:
- starts (list of int, or ndarray of int32 or int64) – Start indices for each row (inclusive).
- stops (list of int, or ndarray of int32 or int64) – Stop indices for each row (exclusive).
- blocks_only (bool) – If False, set all elements outside row intervals to zero. If True, only set all blocks outside row intervals to blocks of zeros; this is more efficient.

Returns: BlockMatrix – Sparse block matrix.

sparsify_triangle(lower=False, blocks_only=False)
Filter to the upper or lower triangle.

Examples

Consider the following block matrix:

>>> import numpy as np
>>> nd = np.array([[ 1.0,  2.0,  3.0,  4.0],
...                [ 5.0,  6.0,  7.0,  8.0],
...                [ 9.0, 10.0, 11.0, 12.0],
...                [13.0, 14.0, 15.0, 16.0]])
>>> bm = BlockMatrix.from_numpy(nd, block_size=2)

Filter to the upper triangle and collect to NumPy:

>>> bm.sparsify_triangle().to_numpy()  # doctest: +NOTEST
array([[ 1.,  2.,  3.,  4.],
       [ 0.,  6.,  7.,  8.],
       [ 0.,  0., 11., 12.],
       [ 0.,  0.,  0., 16.]])

Set all blocks fully outside the upper triangle to zero and collect to NumPy:

>>> bm.sparsify_triangle(blocks_only=True).to_numpy()  # doctest: +NOTEST
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 0.,  0., 11., 12.],
       [ 0.,  0., 15., 16.]])

Notes

This method creates a block-sparse matrix by zeroing out all blocks which are disjoint from the (non-strict) upper or lower triangle. By default, all elements outside the triangle but inside blocks that overlap the triangle are set to zero as well.

Parameters:
- lower (bool) – If False, keep the upper triangle. If True, keep the lower triangle.
- blocks_only (bool) – If False, set all elements outside the triangle to zero. If True, only set all blocks outside the triangle to blocks of zeros; this is more efficient.

Returns: BlockMatrix – Sparse block matrix.

sqrt()
Elementwise square root.
Returns: BlockMatrix

sum(axis=None)
Sums array elements over one or both axes.
Examples
>>> import numpy as np
>>> nd = np.array([[ 1.0, 2.0, 3.0],
...                [ 4.0, 5.0, 6.0]])
>>> bm = BlockMatrix.from_numpy(nd)
>>> bm.sum()
21.0
>>> bm.sum(axis=0).to_numpy()
array([[5., 7., 9.]])
>>> bm.sum(axis=1).to_numpy()
array([[ 6.],
       [15.]])
Parameters: axis (int, optional) – Axis over which to sum. By default, sum all elements. If 0, sum over rows. If 1, sum over columns.
Returns: float or BlockMatrix – If None, returns a float. If 0, returns a block matrix with a single row. If 1, returns a block matrix with a single column.
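For readers used to NumPy, the shapes returned for each axis correspond to `ndarray.sum` with `keepdims=True`. This is only an analogy on a local ndarray, shown as a sketch; it does not call Hail.

```python
import numpy as np

nd = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])

total = nd.sum()                             # axis=None: a single float
row_of_sums = nd.sum(axis=0, keepdims=True)  # axis=0: one row, shape (1, 3)
col_of_sums = nd.sum(axis=1, keepdims=True)  # axis=1: one column, shape (2, 1)
```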

svd(compute_uv=True, complexity_bound=8192)
Computes the reduced singular value decomposition.
Examples
>>> import numpy as np
>>> x = BlockMatrix.from_numpy(np.array([[-2.0, 0.0, 3.0],
...                                      [-1.0, 2.0, 4.0]]))
>>> x.svd()
(array([[-0.60219551, -0.79834865],
        [-0.79834865,  0.60219551]]),
 array([5.61784832, 1.56197958]),
 array([[ 0.35649586, -0.28421866, -0.89001711],
        [ 0.6366932 ,  0.77106707,  0.00879404]]))
Notes
This method leverages distributed matrix multiplication to compute reduced singular value decomposition (SVD) for matrices that would otherwise be too large to work with locally, provided that at least one dimension is less than or equal to 46300.
Let \(X\) be an \(n \times m\) matrix and let \(r = \min(n, m)\). In particular, \(X\) can have at most \(r\) nonzero singular values. The reduced SVD of \(X\) has the form
\[X = U \Sigma V^T\]
where
 \(U\) is an \(n \times r\) matrix whose columns are (orthonormal) left singular vectors,
 \(\Sigma\) is an \(r \times r\) diagonal matrix of nonnegative singular values in descending order,
 \(V^T\) is an \(r \times m\) matrix whose rows are (orthonormal) right singular vectors.
If the singular values in \(\Sigma\) are distinct, then the decomposition is unique up to multiplication of corresponding left and right singular vectors by -1. The computational complexity of SVD is roughly \(nmr\).
We now describe the implementation in more detail. If \(\sqrt[3]{nmr}\) is less than or equal to complexity_bound, then \(X\) is localized to an ndarray on which scipy.linalg.svd() is called. In this case, all components are returned as ndarrays.
If \(\sqrt[3]{nmr}\) is greater than complexity_bound, then the reduced SVD is computed via the smaller gramian matrix of \(X\). For \(n > m\), the three stages are:
 Compute (and localize) the gramian matrix \(X^T X\),
 Compute the eigenvalues and right singular vectors via the symmetric eigendecomposition \(X^T X = V S V^T\) with numpy.linalg.eigh() or scipy.linalg.eigh(),
 Compute the singular values as \(\Sigma = S^\frac{1}{2}\) and the left singular vectors as the block matrix \(U = X V \Sigma^{-1}\).
In this case, since block matrix multiplication is lazy, it is efficient to subsequently slice \(U\) (e.g. based on the singular values), or discard \(U\) entirely.
If \(n \leq m\), the three stages instead use the gramian \(X X^T = U S U^T\) and return \(V^T\) as the block matrix \(\Sigma^{-1} U^T X\).
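The three stages can be traced on a local ndarray. The following is a minimal NumPy sketch of the gramian route for a full-rank input; it is not Hail's code, the function name `gramian_svd` is hypothetical, and every product here is an ordinary in-memory matrix multiplication rather than a lazy block matrix operation.

```python
import numpy as np

def gramian_svd(x):
    """Hypothetical sketch: reduced SVD of a full-rank matrix via its smaller gramian."""
    x = np.asarray(x, dtype=np.float64)
    n, m = x.shape
    if n > m:
        # Stage 1 and 2: eigendecomposition of X^T X = V S V^T (eigh returns ascending order).
        evals, v = np.linalg.eigh(x.T @ x)
        evals, v = evals[::-1], v[:, ::-1]       # reorder to descending
        # Stage 3: Sigma = S^(1/2); clip tiny negative eigenvalues to zero first.
        s = np.sqrt(np.clip(evals, 0.0, None))
        u = (x @ v) / s                          # U = X V Sigma^{-1} (scales column j by 1/s[j])
        return u, s, v.T
    # n <= m: use X X^T = U S U^T and recover V^T = Sigma^{-1} U^T X.
    evals, u = np.linalg.eigh(x @ x.T)
    evals, u = evals[::-1], u[:, ::-1]
    s = np.sqrt(np.clip(evals, 0.0, None))
    vt = (u.T @ x) / s[:, None]                  # scales row i by 1/s[i]
    return u, s, vt

rng = np.random.default_rng(0)
tall = rng.standard_normal((50, 4))
u, s, vt = gramian_svd(tall)
reconstruction_ok = np.allclose((u * s) @ vt, tall)
```

The division by `s` is the step that fails for rank-deficient input, which is the situation the warning below addresses.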
Warning
Computing reduced SVD via the gramian presents an added wrinkle when \(X\) is not full rank, as the block-matrix-side null-basis is not computable by the formula in the third stage. Furthermore, due to finite precision, the zero eigenvalues of \(X^T X\) or \(X X^T\) will only be approximately zero.
If the rank is not known in advance, examining the relative sizes of the trailing singular values should reveal where the spectrum switches from non-zero to “zero” eigenvalues. With 64-bit floating point, zero eigenvalues are typically about 1e-16 times the largest eigenvalue. The corresponding singular vectors should be sliced away before an action which realizes the block-matrix-side singular vectors.
svd() sets the singular values corresponding to negative eigenvalues to exactly 0.0.
Warning
The first and third stages invoke distributed matrix multiplication with parallelism bounded by the number of resulting blocks, whereas the second stage is executed on the master node. For matrices of large minimum dimension, it may be preferable to run these stages separately.
The performance of the second stage depends critically on the number of master cores and the NumPy / SciPy configuration, viewable with np.show_config(). For Intel machines, we recommend installing the MKL package for Anaconda, as is done by cloudtools.
Consequently, the optimal value of complexity_bound is highly configuration-dependent.
Parameters:
 compute_uv (bool) – If False, only compute the singular values (or eigenvalues).
 complexity_bound (int) – Maximum value of \(\sqrt[3]{nmr}\) for which scipy.linalg.svd() is used.
Returns:
 u (ndarray or BlockMatrix) – Left singular vectors \(U\), as a block matrix if \(n > m\) and \(\sqrt[3]{nmr}\) exceeds complexity_bound. Only returned if compute_uv is True.
 s (ndarray) – Singular values from \(\Sigma\) in descending order.
 vt (ndarray or BlockMatrix) – Right singular vectors \(V^T\), as a block matrix if \(n \leq m\) and \(\sqrt[3]{nmr}\) exceeds complexity_bound. Only returned if compute_uv is True.
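The return types documented above follow a simple rule in the matrix dimensions. The sketch below restates that rule as plain Python; the function name `svd_return_kinds` is hypothetical and the code only mirrors the documented criterion, it does not inspect or call Hail.

```python
def svd_return_kinds(n, m, complexity_bound=8192):
    """Hypothetical helper: kinds of (u, s, vt) per the documented rule, with compute_uv=True."""
    r = min(n, m)
    # cbrt(n * m * r) <= complexity_bound, compared as exact integers
    # to avoid floating-point cube-root rounding at the boundary.
    if n * m * r <= complexity_bound ** 3:
        return ("ndarray", "ndarray", "ndarray")
    if n > m:
        return ("BlockMatrix", "ndarray", "ndarray")
    return ("ndarray", "ndarray", "BlockMatrix")

small = svd_return_kinds(2, 3)
tall = svd_return_kinds(10**6, 10**4)
wide = svd_return_kinds(10**4, 10**6)
```

Only the factor on the larger side is ever a block matrix; the singular values and the smaller factor are always local ndarrays.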

to_matrix_table_row_major(n_partitions=None)
Returns a matrix table with row key of row_idx and col key col_idx, whose entries are structs of a single field element.
Parameters: n_partitions (int or None) – Number of partitions of the matrix table.
Notes
Does not support block-sparse matrices.
Returns: MatrixTable – Matrix table where each entry corresponds to an entry in the block matrix.

to_numpy(_force_blocking=False)
Collects the block matrix into a NumPy ndarray.
Examples
>>> bm = BlockMatrix.random(10, 20)
>>> a = bm.to_numpy()
Notes
The resulting ndarray will have the same shape as the block matrix.
Returns: numpy.ndarray

to_table_row_major(n_partitions=None)
Returns a table where each row represents a row in the block matrix.
The resulting table has the following fields:
 row_idx (tint64, key field) – Row index
 entries (tarray) – Entries for the row
Examples
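>>> import numpy as np
>>> block_matrix = BlockMatrix.from_numpy(np.array([[1, 2], [3, 4], [5, 6]]), 2)
>>> t = block_matrix.to_table_row_major()
>>> t.show()
+---------+---------------------+
| row_idx | entries             |
+---------+---------------------+
|   int64 | array<float64>      |
+---------+---------------------+
|       0 | [1.00e+00,2.00e+00] |
|       1 | [3.00e+00,4.00e+00] |
|       2 | [5.00e+00,6.00e+00] |
+---------+---------------------+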
Parameters: n_partitions (int or None) – Number of partitions of the table.
Notes
Does not support block-sparse matrices.
Returns: Table – Table where each row corresponds to a row in the block matrix.

tofile(uri)
Collects and writes data to a binary file.
Examples
>>> import numpy as np
>>> bm = BlockMatrix.random(10, 20)
>>> bm.tofile('file:///local/file')  # doctest: +SKIP
To create a numpy.ndarray of the same dimensions:
>>> a = np.fromfile('/local/file').reshape((10, 20))  # doctest: +SKIP
Notes
This method, analogous to numpy.tofile, produces a binary file of float64 values in row-major order, which can be read by functions such as numpy.fromfile (if a local file) and BlockMatrix.fromfile().
Binary files produced and consumed by tofile() and fromfile() are not platform independent, so should only be used for interoperating with NumPy, not storage. Use BlockMatrix.write() and BlockMatrix.read() to save and load block matrices, since these methods write and read blocks in parallel and are platform independent.
The number of entries must be less than \(2^{31}\).
Parameters: uri (str, optional) – URI of binary output file.

unpersist()
Unpersists this block matrix from memory/disk.
Notes
This function will have no effect on a block matrix that was not previously persisted.
Returns: BlockMatrix – Unpersisted block matrix.

write(path, overwrite=False, force_row_major=False, stage_locally=False)
Writes the block matrix.
Parameters:
 path (str) – Path for output file.
 overwrite (bool) – If True, overwrite an existing file at the destination.
 force_row_major (bool) – If True, transform blocks in column-major format to row-major format before writing. If False, write blocks in their current format.
 stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.

static write_from_entry_expr(entry_expr, path, overwrite=False, mean_impute=False, center=False, normalize=False, axis='rows', block_size=None)
Writes a block matrix from a matrix table entry expression.
Examples
>>> mt = hl.balding_nichols_model(3, 25, 50)
>>> BlockMatrix.write_from_entry_expr(mt.GT.n_alt_alleles(),
...                                   'output/model.bm')
Notes
The resulting file can be loaded with BlockMatrix.read(). Blocks are stored row-major.
If a pipelined transformation significantly downsamples the rows of the underlying matrix table, then repartitioning the matrix table ahead of this method will greatly improve its performance.
By default, this method will fail if any values are missing (to be clear, special float values like nan are not missing values).
 Set mean_impute to replace missing values with the row mean before possibly centering or normalizing. If all values are missing, the row mean is nan.
 Set center to shift each row to have mean zero before possibly normalizing.
 Set normalize to normalize each row to have unit length.
To standardize each row, regarded as an empirical distribution, to have mean 0 and variance 1, set center and normalize and then multiply the result by sqrt(n_cols).
Warning
If the rows of the matrix table have been filtered to a small fraction, then call MatrixTable.repartition() before this method to improve performance.
This method opens n_cols / block_size files concurrently per task. To not blow out memory when the number of columns is very large, limit the Hadoop write buffer size; e.g. on GCP, set this property on cluster startup (the default is 64MB): --properties 'core:fs.gs.io.buffersize.write=1048576'.
Parameters:
 entry_expr (Float64Expression) – Entry expression for numeric matrix entries.
 path (str) – Path for output.
 overwrite (bool) – If True, overwrite an existing file at the destination.
 mean_impute (bool) – If true, set missing values to the row mean before centering or normalizing. If false, missing values will raise an error.
 center (bool) – If true, subtract the row mean.
 normalize (bool) – If true and center=False, divide by the row magnitude. If true and center=True, divide the centered value by the centered row magnitude.
 axis (str) – One of “rows” or “cols”: axis by which to normalize or center.
 block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size().
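The order in which mean_impute, center and normalize are applied can be illustrated on a local ndarray. The following NumPy sketch covers only the row-wise case (axis='rows'); the function name `preprocess_rows` and the boolean `missing` mask are assumptions made for illustration, since missing values in Hail are distinct from nan and have no direct ndarray counterpart.

```python
import numpy as np

def preprocess_rows(a, missing=None, mean_impute=False, center=False, normalize=False):
    """Hypothetical row-wise emulation: impute, then center, then normalize."""
    a = np.array(a, dtype=np.float64)
    missing = np.zeros(a.shape, dtype=bool) if missing is None else np.asarray(missing, dtype=bool)
    if missing.any():
        if not mean_impute:
            raise ValueError("missing values present; set mean_impute=True")
        counts = (~missing).sum(axis=1, keepdims=True)
        sums = np.where(missing, 0.0, a).sum(axis=1, keepdims=True)
        with np.errstate(invalid="ignore", divide="ignore"):
            means = sums / counts          # nan for a row whose values are all missing
        a = np.where(missing, means, a)
    if center:
        a = a - a.mean(axis=1, keepdims=True)
    if normalize:
        # Divide each row by its Euclidean length (after centering, if requested).
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
    return a

a = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
# Center, normalize, then scale by sqrt(n_cols) to standardize each row.
standardized = preprocess_rows(a, center=True, normalize=True) * np.sqrt(a.shape[1])
imputed = preprocess_rows([[1.0, 0.0, 3.0]], missing=[[False, True, False]], mean_impute=True)
```

A centered row of unit length has squared entries summing to 1, so multiplying by sqrt(n_cols) makes them sum to n_cols, which gives each row mean 0 and (population) variance 1.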