Statistics

`linear_mixed_model`(y, x[, z_t, k, p_path, ...])	Initialize a linear mixed model from a matrix table.
`linear_mixed_regression_rows`(entry_expr, model)	For each row, test an input variable for association using a linear mixed model.
`linear_regression_rows`(y, x, covariates[, ...])	For each row, test an input variable for association with response variables using linear regression.
`logistic_regression_rows`(test, y, x, covariates)	For each row, test an input variable for association with a binary response variable using logistic regression.
`poisson_regression_rows`(test, y, x, covariates)	For each row, test an input variable for association with a count response variable using Poisson regression.
`pca`(entry_expr[, k, compute_loadings])	Run principal component analysis (PCA) on numeric columns derived from a matrix table.
`row_correlation`(entry_expr[, block_size])	Computes the correlation matrix between row vectors.

hail.methods.linear_mixed_model(y, x, z_t=None, k=None, p_path=None, overwrite=False, standardize=True, mean_impute=True)[source]: Initialize a linear mixed model from a matrix table.

Warning

This functionality is no longer implemented/supported as of Hail 0.2.94.

hail.methods.linear_mixed_regression_rows(entry_expr, model, pa_t_path=None, a_t_path=None, mean_impute=True, partition_size=None, pass_through=())[source]: For each row, test an input variable for association using a linear mixed model.

Warning

This functionality is no longer implemented/supported as of Hail 0.2.94.

hail.methods.linear_regression_rows(y, x, covariates, block_size=16, pass_through=(), *, weights=None)[source]

For each row, test an input variable for association with response variables using linear regression.

Examples

>>> result_ht = hl.linear_regression_rows(
...     y=dataset.pheno.height,
...     x=dataset.GT.n_alt_alleles(),
...     covariates=[1, dataset.pheno.age, dataset.pheno.is_female])

Warning

As in the example, the intercept covariate 1 must be included explicitly if desired.

Warning

If y is a single value or a list, linear_regression_rows() considers the same set of columns (i.e., samples, points) for every response variable and row, namely those columns for which all response variables and covariates are defined.

If y is a list of lists, then each inner list is treated as an independent group, subsetting columns for missingness separately.

Notes

With the default root and y a single expression, the following row-indexed fields are added.

<row key fields> (Any) – Row key fields.
<pass_through fields> (Any) – Row fields in pass_through.
n (tint32) – Number of columns used.
sum_x (tfloat64) – Sum of input values x.
y_transpose_x (tfloat64) – Dot product of response vector y with the input vector x.
beta (tfloat64) – Fit effect coefficient of x, \(\hat\beta_1\) below.
standard_error (tfloat64) – Estimated standard error, \(\widehat{\mathrm{se}}_1\).
t_stat (tfloat64) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}_1\).
p_value (tfloat64) – \(p\)-value.

If y is a list of expressions, then the last five fields instead have type tarray of tfloat64, with corresponding indexing of the list and each array.

If y is a list of lists of expressions, then n and sum_x are of type array<float64>, and the last five fields are of type array<array<float64>>. Index into these arrays with a[index_in_outer_list, index_in_inner_list]. For example, if y=[[a], [b, c]] then the p-value for b is p_value[1][0].

In the statistical genetics example above, the input variable x encodes genotype as the number of alternate alleles (0, 1, or 2). For each variant (row), genotype is tested for association with height controlling for age and sex, by fitting the linear regression model:

\[\mathrm{height} = \beta_0 + \beta_1 \, \mathrm{genotype} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is\_female} + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]

Boolean covariates like \(\mathrm{is\_female}\) are encoded as 1 for True and 0 for False. The null model sets \(\beta_1 = 0\).

The standard least-squares linear regression model is derived in Section 3.2 of The Elements of Statistical Learning, 2nd Edition. See equation 3.12 for the t-statistic which follows the t-distribution with \(n - k - 1\) degrees of freedom, under the null hypothesis of no effect, with \(n\) samples and \(k\) covariates in addition to x.

Note

Use the pass_through parameter to include additional row fields from matrix table underlying x. For example, to include an “rsid” field, set pass_through=['rsid'] or pass_through=[mt.rsid].

Parameters:

y (Float64Expression or list of Float64Expression) – One or more column-indexed response expressions.
x (Float64Expression) – Entry-indexed expression for input variable.
covariates (list of Float64Expression) – List of column-indexed covariate expressions.
block_size (int) – Number of row regressions to perform simultaneously per core. Larger blocks require more memory but may improve performance.
pass_through (list of str or Expression) – Additional row fields to include in the resulting table.
weights (Float64Expression or list of Float64Expression) – Optional column-indexed weighting for doing weighted least squares regression. Specify a single weight if a single y or list of ys is specified. If a list of lists of ys is specified, specify one weight per inner list.

Returns:

Table

hail.methods.logistic_regression_rows(test, y, x, covariates, pass_through=(), *, max_iterations=None, tolerance=None)[source]

For each row, test an input variable for association with a binary response variable using logistic regression.

Examples

Run the logistic regression Wald test per variant using a Boolean phenotype, intercept and two covariates stored in column-indexed fields:

>>> result_ht = hl.logistic_regression_rows(
...     test='wald',
...     y=dataset.pheno.is_case,
...     x=dataset.GT.n_alt_alleles(),
...     covariates=[1, dataset.pheno.age, dataset.pheno.is_female])

Run the logistic regression Wald test per variant using a list of binary (0/1) phenotypes, intercept and two covariates stored in column-indexed fields:

>>> result_ht = hl.logistic_regression_rows(
...     test='wald',
...     y=[dataset.pheno.is_case, dataset.pheno.is_case],  # where pheno values are 0, 1, or missing
...     x=dataset.GT.n_alt_alleles(),
...     covariates=[1, dataset.pheno.age, dataset.pheno.is_female])

As above but with at most 100 Newton iterations and a stricter-than-default tolerance of 1e-8:

>>> result_ht = hl.logistic_regression_rows(
...     test='wald',
...     y=[dataset.pheno.is_case, dataset.pheno.is_case],  # where pheno values are 0, 1, or missing
...     x=dataset.GT.n_alt_alleles(),
...     covariates=[1, dataset.pheno.age, dataset.pheno.is_female],
...     max_iterations=100,
...     tolerance=1e-8)

Warning

logistic_regression_rows() considers the same set of columns (i.e., samples, points) for every row, namely those columns for which all response variables and covariates are defined. For each row, missing values of x are mean-imputed over these columns. As in the example, the intercept covariate 1 must be included explicitly if desired.

Notes

This method performs, for each row, a significance test of the input variable in predicting a binary (case-control) response variable based on the logistic regression model. The response variable type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.

Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’). Hail only includes columns for which the response variable and all covariates are defined. For each row, Hail imputes missing input values as the mean of the non-missing values.

The example above considers a model of the form

\[\mathrm{Prob}(\mathrm{is\_case}) = \mathrm{sigmoid}(\beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is\_female} + \varepsilon), \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]

where \(\mathrm{sigmoid}\) is the sigmoid function, the genotype \(\mathrm{gt}\) is coded as 0 for HomRef, 1 for Het, and 2 for HomVar, and the Boolean covariate \(\mathrm{is\_female}\) is coded as for True (female) and 0 for False (male). The null model sets \(\beta_1 = 0\).

The structure of the emitted row field depends on the test statistic as shown in the tables below.

Test	Field	Type	Value
Wald	beta	float64	fit effect coefficient, \(\hat\beta_1\)
Wald	standard_error	float64	estimated standard error, \(\widehat{\mathrm{se}}\)
Wald	z_stat	float64	Wald \(z\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
Wald	p_value	float64	Wald p-value testing \(\beta_1 = 0\)
LRT, Firth	beta	float64	fit effect coefficient, \(\hat\beta_1\)
LRT, Firth	chi_sq_stat	float64	deviance statistic
LRT, Firth	p_value	float64	LRT / Firth p-value testing \(\beta_1 = 0\)
Score	chi_sq_stat	float64	score statistic
Score	p_value	float64	score p-value testing \(\beta_1 = 0\)

For the Wald and likelihood ratio tests, Hail fits the logistic model for each row using Newton iteration and only emits the above fields when the maximum likelihood estimate of the coefficients converges. The Firth test uses a modified form of Newton iteration. To help diagnose convergence issues, Hail also emits three fields which summarize the iterative fitting process:

Test	Field	Type	Value
Wald, LRT, Firth	fit.n_iterations	int32	number of iterations until convergence, explosion, or reaching the max (by default, 25 for Wald, LRT; 100 for Firth)
Wald, LRT, Firth	fit.converged	bool	`True` if iteration converged
Wald, LRT, Firth	fit.exploded	bool	`True` if iteration exploded

We consider iteration to have converged when every coordinate of \(\beta\) changes by less than \(10^{-6}\) by default. For Wald and LRT, up to 25 iterations are attempted by default; in testing we find 4 or 5 iterations nearly always suffice. Convergence may also fail due to explosion, which refers to low-level numerical linear algebra exceptions caused by manipulating ill-conditioned matrices. Explosion may result from (nearly) linearly dependent covariates or complete separation.

A more common situation in genetics is quasi-complete seperation, e.g. variants that are observed only in cases (or controls). Such variants inevitably arise when testing millions of variants with very low minor allele count. The maximum likelihood estimate of \(\beta\) under logistic regression is then undefined but convergence may still occur after a large number of iterations due to a very flat likelihood surface. In testing, we find that such variants produce a secondary bump from 10 to 15 iterations in the histogram of number of iterations per variant. We also find that this faux convergence produces large standard errors and large (insignificant) p-values. To not miss such variants, consider using Firth logistic regression, linear regression, or group-based tests.

Here’s a concrete illustration of quasi-complete seperation in R. Suppose we have 2010 samples distributed as follows for a particular variant:

Status	HomRef	Het	HomVar
Case	1000	10	0
Control	1000	0	0

The following R code fits the (standard) logistic, Firth logistic, and linear regression models to this data, where x is genotype, y is phenotype, and logistf is from the logistf package:

x <- c(rep(0,1000), rep(1,1000), rep(1,10)
y <- c(rep(0,1000), rep(0,1000), rep(1,10))
logfit <- glm(y ~ x, family=binomial())
firthfit <- logistf(y ~ x)
linfit <- lm(y ~ x)

The resulting p-values for the genotype coefficient are 0.991, 0.00085, and 0.0016, respectively. The erroneous value 0.991 is due to quasi-complete separation. Moving one of the 10 hets from case to control eliminates this quasi-complete separation; the p-values from R are then 0.0373, 0.0111, and 0.0116, respectively, as expected for a less significant association.

The Firth test reduces bias from small counts and resolves the issue of separation by penalizing maximum likelihood estimation by the Jeffrey’s invariant prior. This test is slower, as both the null and full model must be fit per variant, and convergence of the modified Newton method is linear rather than quadratic. For Firth, 100 iterations are attempted by default for the null model and, if that is successful, for the full model as well. In testing we find 20 iterations nearly always suffices. If the null model fails to converge, then the logreg.fit fields reflect the null model; otherwise, they reflect the full model.

See Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants for an empirical comparison of the logistic Wald, LRT, score, and Firth tests. The theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes Statistical Theory. Firth introduced his approach in Bias reduction of maximum likelihood estimates, 1993. Heinze and Schemper further analyze Firth’s approach in A solution to the problem of separation in logistic regression, 2002.

Hail’s logistic regression tests correspond to the b.wald, b.lrt, and b.score tests in EPACTS. For each variant, Hail imputes missing input values as the mean of non-missing input values, whereas EPACTS subsets to those samples with called genotypes. Hence, Hail and EPACTS results will currently only agree for variants with no missing genotypes.

Note

Use the pass_through parameter to include additional row fields from matrix table underlying x. For example, to include an “rsid” field, set pass_through=['rsid'] or pass_through=[mt.rsid].

Parameters:

test ({‘wald’, ‘lrt’, ‘score’, ‘firth’}) – Statistical test.
y (Float64Expression or list of Float64Expression) – One or more column-indexed response expressions. All non-missing values must evaluate to 0 or 1. Note that a BooleanExpression will be implicitly converted to a Float64Expression with this property.
x (Float64Expression) – Entry-indexed expression for input variable.
covariates (list of Float64Expression) – Non-empty list of column-indexed covariate expressions.
pass_through (list of str or Expression) – Additional row fields to include in the resulting table.
max_iterations (int) – The maximum number of iterations.
tolerance (float, optional) – The iterative fit of this model is considered “converged” if the change in the estimated beta is smaller than tolerance. By default the tolerance is 1e-6.

Returns:

Table

hail.methods.poisson_regression_rows(test, y, x, covariates, pass_through=(), *, max_iterations=25, tolerance=None)[source]

For each row, test an input variable for association with a count response variable using Poisson regression.

Notes

See logistic_regression_rows() for more info on statistical tests of general linear models.

Note

Use the pass_through parameter to include additional row fields from matrix table underlying x. For example, to include an “rsid” field, set pass_through=['rsid'] or pass_through=[mt.rsid].

Parameters:

y (Float64Expression) – Column-indexed response expression. All non-missing values must evaluate to a non-negative integer.
x (Float64Expression) – Entry-indexed expression for input variable.
covariates (list of Float64Expression) – Non-empty list of column-indexed covariate expressions.
pass_through (list of str or Expression) – Additional row fields to include in the resulting table.
tolerance (float, optional) – The iterative fit of this model is considered “converged” if the change in the estimated beta is smaller than tolerance. By default the tolerance is 1e-6.

Returns:

Table

hail.methods.pca(entry_expr, k=10, compute_loadings=False)[source]

Run principal component analysis (PCA) on numeric columns derived from a matrix table.

Examples

For a matrix table with variant rows, sample columns, and genotype entries, compute the top 2 PC sample scores and eigenvalues of the matrix of 0s and 1s encoding missingness of genotype calls.

>>> eigenvalues, scores, _ = hl.pca(hl.int(hl.is_defined(dataset.GT)),
...                                 k=2)

Warning

This method does not automatically mean-center or normalize each column. If desired, such transformations should be incorporated in entry_expr.

Hail will return an error if entry_expr evaluates to missing, nan, or infinity on any entry.

Notes

PCA is run on the columns of the numeric matrix obtained by evaluating entry_expr on each entry of the matrix table, or equivalently on the rows of the transposed numeric matrix \(M\) referenced below.

PCA computes the SVD

\[M = USV^T\]

where columns of \(U\) are left singular vectors (orthonormal in \(\mathbb{R}^n\)), columns of \(V\) are right singular vectors (orthonormal in \(\mathbb{R}^m\)), and \(S=\mathrm{diag}(s_1, s_2, \ldots)\) with ordered singular values \(s_1 \ge s_2 \ge \cdots \ge 0\). Typically one computes only the first \(k\) singular vectors and values, yielding the best rank \(k\) approximation \(U_k S_k V_k^T\) of \(M\); the truncations \(U_k\), \(S_k\) and \(V_k\) are \(n \times k\), \(k \times k\) and \(m \times k\) respectively.

From the perspective of the rows of \(M\) as samples (data points), \(V_k\) contains the loadings for the first \(k\) PCs while \(MV_k = U_k S_k\) contains the first \(k\) PC scores of each sample. The loadings represent a new basis of features while the scores represent the projected data on those features. The eigenvalues of the Gramian \(MM^T\) are the squares of the singular values \(s_1^2, s_2^2, \ldots\), which represent the variances carried by the respective PCs. By default, Hail only computes the loadings if the loadings parameter is specified.

Scores are stored in a Table with the column key of the matrix table as key and a field scores of type array<float64> containing the principal component scores.

Loadings are stored in a Table with the row key of the matrix table as key and a field loadings of type array<float64> containing the principal component loadings.

The eigenvalues are returned in descending order, with scores and loadings given the corresponding array order.

Parameters:

entry_expr (Expression) – Numeric expression for matrix entries.
k (int) – Number of principal components.
compute_loadings (bool) – If True, compute row loadings.

Returns:

(list of float, Table, Table) – List of eigenvalues, table with column scores, table with row loadings.

hail.methods.row_correlation(entry_expr, block_size=None)[source]

Computes the correlation matrix between row vectors.

Examples

Consider the following dataset with three variants and four samples:

>>> data = [{'v': '1:1:A:C', 's': 'a', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C', 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C', 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:1:A:C', 's': 'd', 'GT': hl.Call([1, 1])},
...         {'v': '1:2:G:T', 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '1:2:G:T', 's': 'b', 'GT': hl.Call([1, 1])},
...         {'v': '1:2:G:T', 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:2:G:T', 's': 'd', 'GT': hl.Call([0, 0])},
...         {'v': '1:3:C:G', 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '1:3:C:G', 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '1:3:C:G', 's': 'c', 'GT': hl.Call([1, 1])},
...         {'v': '1:3:C:G', 's': 'd', 'GT': hl.missing(hl.tcall)}]
>>> ht = hl.Table.parallelize(data, hl.dtype('struct{v: str, s: str, GT: call}'))
>>> mt = ht.to_matrix_table(row_key=['v'], col_key=['s'])

Compute genotype correlation between all pairs of variants:

>>> ld = hl.row_correlation(mt.GT.n_alt_alleles())
>>> ld.to_numpy()
array([[ 1.        , -0.85280287,  0.42640143],
       [-0.85280287,  1.        , -0.5       ],
       [ 0.42640143, -0.5       ,  1.        ]])

Compute genotype correlation between consecutively-indexed variants:

>>> ld.sparsify_band(lower=0, upper=1).to_numpy()
array([[ 1.        , -0.85280287,  0.        ],
       [ 0.        ,  1.        , -0.5       ],
       [ 0.        ,  0.        ,  1.        ]])

Warning

Rows with a constant value (i.e., zero variance) will result nan correlation values. To avoid this, first check that all rows vary or filter out constant rows (for example, with the help of aggregators.stats()).

Notes

In this method, each row of entries is regarded as a vector with elements defined by entry_expr and missing values mean-imputed per row. The (i, j) element of the resulting block matrix is the correlation between rows i and j (as 0-indexed by order in the matrix table; see add_row_index()).

The correlation of two vectors is defined as the Pearson correlation coeffecient between the corresponding empirical distributions of elements, or equivalently as the cosine of the angle between the vectors.

This method has two stages:

writing the row-normalized block matrix to a temporary file on persistent disk with BlockMatrix.from_entry_expr(). The parallelism is n_rows / block_size.
reading and multiplying this block matrix by its transpose. The parallelism is (n_rows / block_size)^2 if all blocks are computed.

Warning

See all warnings on BlockMatrix.from_entry_expr(). In particular, for large matrices, it may be preferable to run the two stages separately, saving the row-normalized block matrix to a file on external storage with BlockMatrix.write_from_entry_expr().

The resulting number of matrix elements is the square of the number of rows in the matrix table, so computing the full matrix may be infeasible. For example, ten million rows would produce 800TB of float64 values. The block-sparse representation on BlockMatrix may be used to work efficiently with regions of such matrices, as in the second example above and ld_matrix().

To prevent excessive re-computation, be sure to write and read the (possibly block-sparsified) result before multiplication by another matrix.

Parameters:

entry_expr (Float64Expression) – Entry-indexed numeric expression on matrix table.
block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size().

Returns:

BlockMatrix – Correlation matrix between row vectors. Row and column indices correspond to matrix table row index.