LinearMixedModel

class hail.stats.LinearMixedModel(py, px, s, y=None, x=None, p_path=None)[source]

Class representing a linear mixed model.

Danger

This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.

LinearMixedModel represents a linear model of the form

\[y \sim \mathrm{N}(X \beta, \, \sigma^2 K + \tau^2 I)\]

where

  • \(\mathrm{N}\) is an \(n\)-dimensional normal distribution.
  • \(y\) is a known vector of \(n\) observations.
  • \(X\) is a known \(n \times p\) design matrix for \(p\) fixed effects.
  • \(K\) is a known \(n \times n\) positive semi-definite kernel.
  • \(I\) is the \(n \times n\) identity matrix.
  • \(\beta\) is a \(p\)-parameter vector of fixed effects.
  • \(\sigma^2\) is the variance parameter on \(K\).
  • \(\tau^2\) is the variance parameter on \(I\).

In particular, the residuals for the \(i^\mathit{th}\) and \(j^\mathit{th}\) observations have covariance \(\sigma^2 K_{ij}\) for \(i \neq j\).

This model is equivalent to a mixed model of the form

\[y = X \beta + Z u + \epsilon\]

by setting \(K = ZZ^T\) where

  • \(Z\) is a known \(n \times r\) design matrix for \(r\) random effects.
  • \(u\) is an \(r\)-vector of random effects drawn from \(\mathrm{N}(0, \sigma^2 I)\).
  • \(\epsilon\) is an \(n\)-vector of random errors drawn from \(\mathrm{N}(0, \tau^2 I)\).

However, LinearMixedModel does not itself realize \(K\) as a linear kernel with respect to random effects, nor does it take \(K\) explicitly as input. Rather, via the eigendecomposition \(K = U S U^T\), the class leverages a third, decorrelated form of the model

\[Py \sim \mathrm{N}(PX \beta, \, \tau^2 (\gamma S + I))\]

where

  • \(P = U^T: \mathbb{R}^n \rightarrow \mathbb{R}^n\) is an orthonormal transformation that decorrelates the observations. The rows of \(P\) are an eigenbasis for \(K\).
  • \(S\) is the \(n \times n\) diagonal matrix of corresponding eigenvalues.
  • \(\gamma = \frac{\sigma^2}{\tau^2}\) is the ratio of variance parameters.

Hence, the triple \((Py, PX, S)\) determines the probability of the observations for any choice of model parameters, and is therefore sufficient for inference. This triple, with \(S\) encoded as a vector, is the default (“full-rank”) initialization of the class.
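For example, the triple can be formed directly with NumPy (a minimal sketch, assuming y, x, and the kernel k are ndarrays; from_kinship() below packages these steps):

>>> import numpy as np
>>> from hail.stats import LinearMixedModel
>>> s, u = np.linalg.eigh(k)    # k = u @ np.diag(s) @ u.T, eigenvalues ascending
>>> s, u = s[::-1], u[:, ::-1]  # reorder eigenvalues in descending order
>>> p = u.T                     # rows of p are an eigenbasis for k
>>> model = LinearMixedModel(p @ y, p @ x, s)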

LinearMixedModel also provides an efficient strategy to fit the model above with \(K\) replaced by its rank-\(r\) approximation \(K_r = P_r^T S_r P_r\) where

  • \(P_r: \mathbb{R}^n \rightarrow \mathbb{R}^r\) has orthonormal rows consisting of the top \(r\) eigenvectors of \(K\).
  • \(S_r\) is the \(r \times r\) diagonal matrix of corresponding non-zero eigenvalues.

For this low-rank model, the quintuple \((P_r y, P_r X, S_r, y, X)\) is similarly sufficient for inference and corresponds to the “low-rank” initialization of the class. Morally, \(y\) and \(X\) are required for low-rank inference because the diagonal \(\gamma S + I\) is always full-rank.

If \(K\) actually has rank \(r\), then \(K = K_r\) and the low-rank and full-rank models are equivalent. Hence low-rank inference provides a more efficient, equally exact algorithm for fitting the full-rank model. This situation arises, for example, when \(K\) is the linear kernel of a mixed model with fewer random effects than observations.

Even when \(K\) has full rank, using a lower-rank approximation may be an effective form of regularization, in addition to boosting computational efficiency.
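For example, given the top \(r\) eigenvectors of \(K\) as the rows of p_r and the corresponding eigenvalues s_r (a sketch, assuming y and x are ndarrays; see also the truncation examples under from_kinship() and from_random_effects() below):

>>> model_r = LinearMixedModel(p_r @ y, p_r @ x, s_r, y, x)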

Initialization

The class may be initialized directly or with one of two methods:

  • from_kinship() takes \(y\), \(X\), and \(K\) as ndarrays. The model is always full-rank.
  • from_random_effects() takes \(y\) and \(X\) as ndarrays and \(Z\) as an ndarray or block matrix. The model is full-rank if and only if \(n \leq m\), where \(m\) is the number of random effects (the columns of \(Z\)).

Direct full-rank initialization takes \(Py\), \(PX\), and \(S\) as ndarrays. The following class attributes are set:

Attribute  Type     Value
---------  -------  -----
low_rank   bool     False
n          int      Number of observations \(n\)
f          int      Number of fixed effects \(p\)
r          int      Effective number of random effects, must equal \(n\)
py         ndarray  Rotated response vector \(P y\) with shape \((n)\)
px         ndarray  Rotated design matrix \(P X\) with shape \((n, p)\)
s          ndarray  Eigenvalues vector \(S\) of \(K\) with shape \((n)\)
p_path     str      Path at which \(P\) is stored as a block matrix

Direct low-rank initialization takes \(P_r y\), \(P_r X\), \(S_r\), \(y\), and \(X\) as ndarrays. The following class attributes are set:

Attribute  Type     Value
---------  -------  -----
low_rank   bool     True
n          int      Number of observations \(n\)
f          int      Number of fixed effects \(p\)
r          int      Effective number of random effects, must be less than \(n\)
py         ndarray  Projected response vector \(P_r y\) with shape \((r)\)
px         ndarray  Projected design matrix \(P_r X\) with shape \((r, p)\)
s          ndarray  Eigenvalues vector \(S_r\) of \(K_r\) with shape \((r)\)
y          ndarray  Response vector with shape \((n)\)
x          ndarray  Design matrix with shape \((n, p)\)
p_path     str      Path at which \(P\) is stored as a block matrix

Fitting the model

fit() uses restricted maximum likelihood (REML) to estimate \((\beta, \sigma^2, \tau^2)\).

This is done by numerical optimization of the univariate function compute_neg_log_reml(), which itself optimizes REML constrained to a fixed ratio of variance parameters. Each evaluation of compute_neg_log_reml() has computational complexity

\[\mathit{O}(rp^2 + p^3).\]

fit() adds the following attributes at this estimate.

Attribute            Type     Value
-------------------  -------  -----
beta                 ndarray  \(\beta\)
sigma_sq             float    \(\sigma^2\)
tau_sq               float    \(\tau^2\)
gamma                float    \(\gamma = \frac{\sigma^2}{\tau^2}\)
log_gamma            float    \(\log{\gamma}\)
h_sq                 float    \(\mathit{h}^2 = \frac{\sigma^2}{\sigma^2 + \tau^2}\)
h_sq_standard_error  float    Asymptotic estimate of \(\mathit{h}^2\) standard error
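For example, after full- or low-rank initialization (a minimal sketch; the estimates are set as attributes rather than returned):

>>> model.fit()
>>> model.h_sq, model.h_sq_standard_error
>>> model.beta, model.sigma_sq, model.tau_sq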

Testing alternative models

The model is also equivalent to its augmentation

\[y \sim \mathrm{N}\left(x_\star\beta_\star + X \beta, \, \sigma^2 K + \tau^2 I\right)\]

by an additional covariate of interest \(x_\star\) under the null hypothesis that the corresponding fixed effect parameter \(\beta_\star\) is zero. As with initialization, full-rank testing of the alternative hypothesis \(\beta_\star \neq 0\) requires \(P x_\star\), whereas low-rank testing requires \(P_r x_\star\) and \(x_\star\).

After running fit() to fit the null model, one can test each of a collection of alternatives using either of two implementations of the likelihood ratio test:

  • fit_alternatives_numpy() takes one or two ndarrays. It is a pure Python method that evaluates alternatives serially on master.
  • fit_alternatives() takes one or two paths to block matrices. It evaluates alternatives in parallel on the workers.

Per alternative, both have computational complexity

\[\mathit{O}(rp + p^3).\]
Parameters:
  • py (ndarray) – Projected response vector \(P_r y\) with shape \((r)\).
  • px (ndarray) – Projected design matrix \(P_r X\) with shape \((r, p)\).
  • s (ndarray) – Eigenvalues vector \(S\) with shape \((r)\).
  • y (ndarray, optional) – Response vector with shape \((n)\). Include for low-rank inference.
  • x (ndarray, optional) – Design matrix with shape \((n, p)\). Include for low-rank inference.
  • p_path (str, optional) – Path at which \(P\) has been stored as a block matrix.

Methods

__init__ Initialize self.
compute_neg_log_reml Compute negative log REML constrained to a fixed value of \(\log{\gamma}\).
fit Find the triple \((\beta, \sigma^2, \tau^2)\) maximizing REML.
fit_alternatives Fit and test alternative model for each augmented design matrix in parallel.
fit_alternatives_numpy Fit and test alternative model for each augmented design matrix.
from_kinship Initializes a model from \(y\), \(X\), and \(K\).
from_random_effects Initializes a model from \(y\), \(X\), and \(Z\).
h_sq_normalized_lkhd Estimate the normalized likelihood of \(\mathit{h}^2\) over the discrete grid of percentiles.
compute_neg_log_reml(log_gamma, return_parameters=False)[source]

Compute negative log REML constrained to a fixed value of \(\log{\gamma}\).

This function computes the triple \((\beta, \sigma^2, \tau^2)\) with \(\gamma = \frac{\sigma^2}{\tau^2}\) at which the restricted likelihood is maximized and returns the negative of the restricted log likelihood at these parameters (shifted by the constant defined below).

The implementation has complexity \(\mathit{O}(rp^2 + p^3)\) and is inspired by FaST linear mixed models for genome-wide association studies (2011).

The formulae follow from Bayesian Inference for Variance Components Using Only Error Contrasts (1974). Harville derives that for fixed covariance \(V\), the restricted likelihood of the variance parameter \(V\) in the model

\[y \sim \mathrm{N}(X \beta, \, V)\]

is given by

\[(2\pi)^{-\frac{1}{2}(n - p)} \det(X^T X)^\frac{1}{2} \det(V)^{-\frac{1}{2}} \det(X^T V^{-1} X)^{-\frac{1}{2}} e^{-\frac{1}{2}(y - X\hat\beta)^T V^{-1}(y - X\hat\beta)}\]

with

\[\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y.\]

In our case, the variance is

\[V = \sigma^2 K + \tau^2 I = \sigma^2 (K + \gamma^{-1} I)\]

which is determined up to scale by any fixed value of the ratio \(\gamma\). So for input \(\log \gamma\), the negative restricted log likelihood is minimized at \((\hat\beta, \hat\sigma^2)\) with \(\hat\beta\) as above and

\[\hat\sigma^2 = \frac{1}{n - p}(y - X\hat\beta)^T (K + \gamma^{-1} I)^{-1}(y - X\hat\beta).\]

For \(\hat V\) at this \((\hat\beta, \hat\sigma^2, \gamma)\), the exponent in the likelihood reduces to \(-\frac{1}{2}(n-p)\), so the negative restricted log likelihood may be expressed as

\[\frac{1}{2}\left(\log \det(\hat V) + \log\det(X^T \hat V^{-1} X)\right) + C\]

where

\[C = \frac{1}{2}\left(n - p + (n - p)\log(2\pi) - \log\det(X^T X)\right)\]

only depends on \(X\). compute_neg_log_reml() returns the value of the first term, omitting the constant term.
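To make this concrete, the following NumPy sketch evaluates the same (shifted) value for a full-rank model in the decorrelated coordinates, where the covariance is \(\tau^2\,\mathrm{diag}(\gamma s + 1)\). This illustrates the formulae above and is not Hail's implementation; log_gamma is assumed given:

>>> gamma = np.exp(log_gamma)
>>> d = gamma * model.s + 1                             # rotated covariance is tau^2 * diag(d)
>>> xdx = (model.px / d[:, None]).T @ model.px          # equals tau^2 * (X^T V^{-1} X)
>>> beta = np.linalg.solve(xdx, (model.px / d[:, None]).T @ model.py)
>>> resid = model.py - model.px @ beta
>>> tau_sq = (resid / d) @ resid / (model.n - model.f)  # tau^2 = sigma^2 / gamma
>>> logdet_v = model.n * np.log(tau_sq) + np.sum(np.log(d))
>>> logdet_xvx = np.linalg.slogdet(xdx)[1] - model.f * np.log(tau_sq)
>>> neg_log_reml = 0.5 * (logdet_v + logdet_xvx)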

Parameters:
  • log_gamma (float) – Value of \(\log{\gamma}\).
  • return_parameters (bool) – If True, also return \(\beta\), \(\sigma^2\), and \(\tau^2\).
Returns:

float or (float, ndarray, float, float) – If return_parameters is False, returns (shifted) negative log REML. Otherwise, returns (shifted) negative log REML, \(\beta\), \(\sigma^2\), and \(\tau^2\).

fit(log_gamma=None, bounds=(-8.0, 8.0), tol=1e-08, maxiter=500)[source]

Find the triple \((\beta, \sigma^2, \tau^2)\) maximizing REML.

This method sets the attributes beta, sigma_sq, tau_sq, gamma, log_gamma, h_sq, and h_sq_standard_error as described in the top-level class documentation.

If log_gamma is provided, fit() finds the REML solution with \(\log{\gamma}\) constrained to this value. In this case, h_sq_standard_error is None since h_sq is not estimated.

Otherwise, fit() searches for the value of \(\log{\gamma}\) that minimizes compute_neg_log_reml(), and also sets the attribute optimize_result of type scipy.optimize.OptimizeResult.

Parameters:
  • log_gamma (float, optional) – If provided, the solution is constrained to have this value of \(\log{\gamma}\).
  • bounds (float, float) – Lower and upper bounds for \(\log{\gamma}\).
  • tol (float) – Absolute tolerance for optimizing \(\log{\gamma}\).
  • maxiter (int) – Maximum number of iterations for optimizing \(\log{\gamma}\).
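For example, to evaluate the model at a fixed ratio of variance parameters (a sketch):

>>> model.fit(log_gamma=0.0)   # constrain gamma to 1
>>> model.h_sq_standard_error  # None, since h_sq is not estimated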
fit_alternatives(pa_t_path, a_t_path=None, partition_size=None)[source]

Fit and test alternative model for each augmented design matrix in parallel.

Notes

The alternative model is fit using REML constrained to the value of \(\gamma\) set by fit().

The likelihood ratio test of fixed effect parameter \(\beta_\star\) uses (non-restricted) maximum likelihood:

\[\chi^2 = 2 \log\left(\frac{\max_{\beta_\star, \beta, \sigma^2} \mathrm{N}\left(y \,\middle|\, x_\star \beta_\star + X \beta; \, \sigma^2(K + \gamma^{-1}I)\right)}{\max_{\beta, \sigma^2} \mathrm{N}\left(y \,\middle|\, x_\star \cdot 0 + X \beta; \, \sigma^2(K + \gamma^{-1}I)\right)}\right)\]

The p-value is given by the tail probability under a chi-squared distribution with one degree of freedom.

The resulting table has the following fields:

Field     Type     Value
--------  -------  -----
idx       int64    Index of augmented design matrix
beta      float64  \(\beta_\star\)
sigma_sq  float64  \(\sigma^2\)
chi_sq    float64  \(\chi^2\)
p_value   float64  p-value

\((P_r A)^T\) and \(A^T\) (if given) must have the same number of rows (augmentations). These rows are grouped into partitions for parallel processing. The number of partitions equals the ceiling of n_rows / partition_size, and should be at least the number of cores to make use of all cores. By default, there is one partition per row of blocks in \((P_r A)^T\). Setting the partition size to an exact (rather than approximate) divisor or multiple of the block size reduces superfluous shuffling of data.

The number of columns in each block matrix must be less than \(2^{31}\).

Warning

The block matrices must be stored in row-major format, as produced by BlockMatrix.write() with force_row_major=True and by BlockMatrix.write_from_entry_expr(). Otherwise, this method will fail.
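A minimal usage sketch, assuming pa is an ndarray of projected alternatives \(P_r A\) with shape \((r, m)\) and output/pa_t.bm is a hypothetical path:

>>> from hail.linalg import BlockMatrix
>>> BlockMatrix.from_numpy(pa.T).write('output/pa_t.bm', force_row_major=True)
>>> results = model.fit_alternatives('output/pa_t.bm')
>>> results.show()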

Parameters:
  • pa_t_path (str) – Path to block matrix \((P_r A)^T\) with shape \((m, r)\). Each row is a projected augmentation \(P_r x_\star\) of \(P_r X\).
  • a_t_path (str, optional) – Path to block matrix \(A^T\) with shape \((m, n)\). Each row is an augmentation \(x_\star\) of \(X\). Include for low-rank inference.
  • partition_size (int, optional) – Number of rows to process per partition. Default given by block size of \((P_r A)^T\).
Returns:

Table – Table of results for each augmented design matrix.

fit_alternatives_numpy(pa, a=None, return_pandas=False)[source]

Fit and test alternative model for each augmented design matrix.

Notes

This Python-only implementation runs serially on master. See the scalable implementation fit_alternatives() for documentation of the returned table.
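For example, with a full-rank model and the eigenbasis p returned by from_kinship(), one can test a hypothetical ndarray a of \(m\) candidate covariates with shape \((n, m)\) (a sketch):

>>> pa = p @ a                                  # project the alternatives
>>> results = model.fit_alternatives_numpy(pa)  # also pass a for low-rank inference
>>> results.show()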

Parameters:
  • pa (ndarray) – Projected matrix \(P_r A\) of alternatives with shape \((r, m)\). Each column is a projected augmentation \(P_r x_\star\) of \(P_r X\).
  • a (ndarray, optional) – Matrix \(A\) of alternatives with shape \((n, m)\). Each column is an augmentation \(x_\star\) of \(X\). Required for low-rank inference.
  • return_pandas (bool) – If True, return a pandas DataFrame; if False, return a Hail Table.
Returns:

Table or pandas.DataFrame – Table of results for each augmented design matrix.

classmethod from_kinship(y, x, k, p_path=None, overwrite=False)[source]

Initializes a model from \(y\), \(X\), and \(K\).

Examples

>>> from hail.stats import LinearMixedModel
>>> y = np.array([0.0, 1.0, 8.0, 9.0])
>>> x = np.array([[1.0, 0.0],
...               [1.0, 2.0],
...               [1.0, 1.0],
...               [1.0, 4.0]])
>>> k = np.array([[ 1.        , -0.8727875 ,  0.96397335,  0.94512946],
...               [-0.8727875 ,  1.        , -0.93036112, -0.97320323],
...               [ 0.96397335, -0.93036112,  1.        ,  0.98294169],
...               [ 0.94512946, -0.97320323,  0.98294169,  1.        ]])
>>> model, p = LinearMixedModel.from_kinship(y, x, k)
>>> model.fit()
>>> model.h_sq
0.2525148830695317
>>> model.s
array([3.83501295, 0.13540343, 0.02454114, 0.00504248])

Truncate to a rank \(r=2\) model:

>>> r = 2
>>> s_r = model.s[:r]
>>> p_r = p[:r, :]
>>> model_r = LinearMixedModel(p_r @ y, p_r @ x, s_r, y, x)
>>> model_r.fit()
>>> model_r.h_sq
0.25193197591429695

Notes

This method eigendecomposes \(K = P^T S P\) on the master and returns LinearMixedModel(p @ y, p @ x, s) and p.

The performance of eigendecomposition depends critically on the number of master cores and the NumPy / SciPy configuration, viewable with np.show_config(). For Intel machines, we recommend installing the MKL package for Anaconda, as is done by cloudtools.

k must be positive semi-definite; symmetry is not checked as only the lower triangle is used.

Parameters:
  • y (ndarray) – \(n\) vector of observations.
  • x (ndarray) – \(n \times p\) matrix of fixed effects.
  • k (ndarray) – \(n \times n\) positive semi-definite kernel \(K\).
  • p_path (str, optional) – Path at which to write \(P\) as a block matrix.
  • overwrite (bool) – If True, overwrite an existing file at p_path.
Returns:

  • model (LinearMixedModel) – Model constructed from \(y\), \(X\), and \(K\).
  • p (ndarray) – Matrix \(P\) whose rows are the eigenvectors of \(K\).

classmethod from_random_effects(y, x, z, p_path=None, overwrite=False, max_condition_number=1e-10, complexity_bound=8192)[source]

Initializes a model from \(y\), \(X\), and \(Z\).

Examples

>>> from hail.stats import LinearMixedModel
>>> y = np.array([0.0, 1.0, 8.0, 9.0])
>>> x = np.array([[1.0, 0.0],
...               [1.0, 2.0],
...               [1.0, 1.0],
...               [1.0, 4.0]])
>>> z = np.array([[0.0, 0.0, 1.0],
...               [0.0, 1.0, 2.0],
...               [1.0, 2.0, 4.0],
...               [2.0, 4.0, 8.0]])
>>> model, p = LinearMixedModel.from_random_effects(y, x, z)
>>> model.fit()
>>> model.h_sq
0.38205307244271675

Notes

If \(n \leq m\), the returned model is full rank.

If \(n > m\), the returned model is low rank. In this case only, eigenvalues less than or equal to max_condition_number times the top eigenvalue are dropped from \(S\), with the corresponding eigenvectors dropped from \(P\). This guards against precision loss on left eigenvectors computed via the right gramian \(Z^T Z\) in BlockMatrix.svd().
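In NumPy terms, the dropping rule amounts to the following sketch, assuming the eigenvalues s are sorted in descending order with corresponding eigenvectors as the rows of p:

>>> keep = s > max_condition_number * s[0]
>>> s, p = s[keep], p[keep, :]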

In either case, one can truncate to a rank \(r\) model as follows. If p is an ndarray:

>>> p_r = p[:r, :]     
>>> s_r = model.s[:r]  
>>> model_r = LinearMixedModel(p_r @ y, p_r @ x, s_r, y, x)  

If p is a block matrix:

>>> p[:r, :].write(p_r_path)          
>>> p_r = BlockMatrix.read(p_r_path)  
>>> s_r = model.s[:r]                 
>>> model_r = LinearMixedModel(p_r @ y, p_r @ x, s_r, y, x, p_r_path)  

This method applies no standardization to z.

Warning

If z is a block matrix, then ideally z should be the result of directly reading from disk (and possibly a transpose). This is most critical if \(n > m\), because in this case multiplication by z will result in all preceding transformations being repeated n / block_size times, as explained in BlockMatrix.

At least one dimension must be less than or equal to 46300. See the warning in BlockMatrix.svd() for performance considerations.

Parameters:
  • y (ndarray) – \(n\) vector of observations \(y\).
  • x (ndarray) – \(n \times p\) matrix of fixed effects \(X\).
  • z (ndarray or BlockMatrix) – \(n \times m\) matrix of random effects \(Z\).
  • p_path (str, optional) – Path at which to write \(P\) as a block matrix. Required if z is a block matrix.
  • overwrite (bool) – If True, overwrite an existing file at p_path.
  • max_condition_number (float) – Maximum condition number. Must be greater than 1e-16.
  • complexity_bound (int) – Complexity bound for BlockMatrix.svd() when z is a block matrix.
Returns:

  • model (LinearMixedModel) – Model constructed from \(y\), \(X\), and \(Z\).
  • p (ndarray or BlockMatrix) – Matrix \(P\) whose rows are the eigenvectors of \(K\). The type is block matrix if z is a block matrix and BlockMatrix.svd() of z returns \(U\) as a block matrix.

h_sq_normalized_lkhd()[source]

Estimate the normalized likelihood of \(\mathit{h}^2\) over the discrete grid of percentiles.

Examples

Plot the estimated normalized likelihood function:

>>> import matplotlib.pyplot as plt                     
>>> plt.plot(range(101), model.h_sq_normalized_lkhd())  

Notes

This method may be used to visualize the approximate posterior on \(\mathit{h}^2\) under a flat prior.

The resulting ndarray a has length 101 with a[i] equal to the maximum likelihood over all \(\beta\) and \(\sigma^2\) with \(\mathit{h}^2\) constrained to i / 100. The values for 1 <= i <= 99 are normalized to sum to 1, and a[0] and a[100] are set to nan.
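For example, a grid estimate of \(\mathit{h}^2\) can be read off the returned array (a sketch; np.nanargmax skips the nan endpoints):

>>> a = model.h_sq_normalized_lkhd()
>>> np.nanargmax(a) / 100   # grid point with maximum normalized likelihood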

Returns:

ndarray of float64 – Normalized likelihood values for \(\mathit{h}^2\).