ldscsim

Models for SNP effects:

Infinitesimal (can simulate n correlated traits)
Spike & slab (can simulate 2 correlated traits)
Annotation-informed

Features:

Field aggregation tools for annotation-informed model and population stratification with many covariates.
Automatic adjustment of genetic correlation parameters to allow for the joint simulation of up to 100 randomly correlated phenotypes.
Methods for binarizing phenotypes to have a certain prevalence and for adding ascertainment bias to binarized phenotypes.

`simulate_phenotypes`(mt, genotype, h2[, pi, ...])	Simulate phenotypes for testing LD score regression.
`make_betas`(mt, h2[, pi, annot, rg])	Generates betas under different models.
`multitrait_inf`(mt[, h2, rg, cov_matrix, seed])	Generates correlated betas for multi-trait infinitesimal simulations for any number of phenotypes.
`multitrait_ss`(mt, h2, pi[, rg, seed])	Generates spike & slab betas for simulation of two correlated phenotypes.
`get_cov_matrix`(h2, rg[, psd_rg])	Creates covariance matrix for simulating correlated SNP effects.
`calculate_phenotypes`(mt, genotype, beta, h2)	Calculates phenotypes by multiplying genotypes and betas.
`normalize_genotypes`(genotype)	Normalizes genotypes to have mean 0 and variance 1 at each SNP
`annotate_all`(mt[, row_exprs, col_exprs, ...])	Equivalent of _annotate_all, but checks source MatrixTable of exprs
`ascertainment_bias`(mt, y, P)	Adds ascertainment bias to a binary phenotype to give it a sample prevalence of P = cases/(cases+controls).
`binarize`(mt, y, K[, exact])	Binarize phenotype y such that it has prevalence K = cases/(cases+controls) Uses inverse CDF of Gaussian to set binarization threshold when exact = False, otherwise uses ranking to determine threshold.
`agg_fields`(tb[, coef_dict, str_expr, axis])	Aggregates by linear combination fields matching either keys in coef_dict or str_expr.
`get_coef_dict`(tb[, str_expr, ref_coef_dict, ...])	Gets either col or row fields matching str_expr and take intersection with keys in coefficient reference dict.

hail.experimental.ldscsim.simulate_phenotypes(mt, genotype, h2, pi=None, rg=None, annot=None, popstrat=None, popstrat_var=None, exact_h2=False)[source]

Simulate phenotypes for testing LD score regression.

Simulates betas (SNP effects) under the infinitesimal, spike & slab, or annotation-informed models, depending on parameters passed. Optionally adds population stratification.

Parameters:

mt (MatrixTable) – MatrixTable containing genotypes to be used. Also should contain variant annotations as row fields if running the annotation-informed model or covariates as column fields if adding population stratification.
genotype (Expression or CallExpression) – Entry field containing genotypes of individuals to be used for the simulation.
h2 (float or int or list or numpy.ndarray) – SNP-based heritability of simulated trait.
pi (float or int or list or numpy.ndarray, optional) – Probability of SNP being causal when simulating under the spike & slab model.
rg (float or int or list or numpy.ndarray, optional) – Genetic correlation between traits.
annot (Expression, optional) – Row field to use as our aggregated annotations.
popstrat (Expression, optional) – Column field to use as our aggregated covariates for adding population stratification.
exact_h2 (bool, optional) – Whether to exactly simulate ratio of variance of genetic component of phenotype to variance of phenotype to be h2. If False, ratio will be h2 in expectation. Observed h2 in the simulation will be close to expected h2 for large-scale simulations.

Returns:

MatrixTable – MatrixTable with simulated betas and phenotypes, simulated according to specified model.

hail.experimental.ldscsim.make_betas(mt, h2, pi=None, annot=None, rg=None)[source]

Generates betas under different models.

Simulates betas (SNP effects) under the infinitesimal, spike & slab, or annotation-informed models, depending on parameters passed.

Parameters:

mt (MatrixTable) – MatrixTable containing genotypes to be used. Also should contain variant annotations as row fields if running the annotation-informed model or covariates as column fields if adding population stratification.
h2 (float or int or list or numpy.ndarray) – SNP-based heritability of simulated trait(s).
pi (float or int or list or numpy.ndarray, optional) – Probability of SNP being causal when simulating under the spike & slab model. If doing two-trait spike & slab pi is a list of probabilities for overlapping causal SNPs (see docstring of multitrait_ss())
annot (Expression, optional) – Row field of aggregated annotations for annotation-informed model.
rg (float or int or list or numpy.ndarray, optional) – Genetic correlation between traits.

Returns:

mt (MatrixTable) – MatrixTable with betas as a row field, simulated according to specified model.
pi (list) – Probability of a SNP being causal for different traits, possibly altered from input pi if covariance matrix for multitrait simulation was not positive semi-definite.
rg (list) – Genetic correlation between traits, possibly altered from input rg if covariance matrix for multitrait simulation was not positive semi-definite.

hail.experimental.ldscsim.multitrait_inf(mt, h2=None, rg=None, cov_matrix=None, seed=None)[source]

Generates correlated betas for multi-trait infinitesimal simulations for any number of phenotypes.

Parameters:

mt (MatrixTable) – MatrixTable for simulated phenotype.
h2 (float or int or list or numpy.ndarray, optional) – Desired SNP-based heritability (\(h^2\)) of simulated traits. If h2 is None, \(h^2\) is based on diagonal of cov_matrix.
rg (float or int or list or numpy.ndarray, optional) – Desired genetic correlation (\(r_g\)) between simulated traits. If simulating more than two correlated traits, rg should be a list of \(rg\) values corresponding to the upper right triangle of the covariance matrix. If rg is None and cov_matrix is None, \(r_g\) is assumed to be 0 between traits. If rg and cov_matrix are both not None, \(r_g\) values from cov_matrix take precedence.
cov_matrix (numpy.ndarray, optional) – Covariance matrix for traits, unscaled by :math:`M`, the number of SNPs. Overrides h2 and rg even when h2 or rg are not None.
seed (int, optional) – Seed for random number generator. If seed is None, seed is set randomly.

Returns:

mt (MatrixTable) – MatrixTable with simulated SNP effects as a row field of arrays.
rg (list) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.

hail.experimental.ldscsim.multitrait_ss(mt, h2, pi, rg=0, seed=None)[source]

Generates spike & slab betas for simulation of two correlated phenotypes.

Parameters:

mt (MatrixTable) – MatrixTable for simulated phenotype.
h2 (list or numpy.ndarray) – Desired SNP-based heritability of simulated traits.
pi (list or numpy.ndarray) – List of proportion of SNPs: \(p_{TT}\), \(p_{TF}\), \(p_{FT}\) \(p_{TT}\) is the proportion of SNPs that are causal for both traits, \(p_{TF}\) is the proportion of SNPs that are causal for trait 1 but not trait 2, \(p_{FT}\) is the proportion of SNPs that are causal for trait 2 but not trait 1.
rg (float or int) – Genetic correlation between traits.
seed (int, optional) – Seed for random number generator. If seed is None, seed is set randomly.

Warning

May give inaccurate results if chosen parameters make the covariance matrix not positive semi-definite. Covariance matrix is likely to not be positive semi-definite when \(p_{TT}\) is small and rg is large.

Returns:

mt (MatrixTable) – MatrixTable with simulated SNP effects as a row field of arrays.
pi (list or numpy.ndarray) – List of proportion of SNPs: \(p_{TT}\), \(p_{TF}\), \(p_{FT}\). Possibly altered if covariance matrix of traits was not positive semi-definite.
rg (list) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.

hail.experimental.ldscsim.get_cov_matrix(h2, rg, psd_rg=False)[source]

Creates covariance matrix for simulating correlated SNP effects.

Given a list of heritabilities and a list of genetic correlations, get_cov_matrix() constructs the covariance matrix necessary to draw from a multivariate normal distribution to generate correlated SNP effects.

Examples

Suppose we have three traits enumerated as trait 1, trait 2, and trait 3. Each trait has a heritability: \(h^2_1\),:math:h^2_2,:math:h^2_3 Traits have the following genetic correlations: \(r_{g, 12}\),:math:r_{g, 13}, \(r_{g, 23}\) The ordering of indices in the subscript is arbitrary (e.g. \(r_{g, 12}\) = \(r_{g, 21}\)) as both values are the genetic correlation between trait 1 and trait 2. We can calculate \(\rho_{g,ab}\), the genetic covariance between two traits \(a\) and \(b\), as \(\rho_{g,ab}=r_{g,ab}\sqrt{h^2_a\cdot h^2_b}\). The covariance matrix is thus:

\[\begin{pmatrix} h^2_1 & r_{g, 12}\sqrt{h^2_1\cdot h^2_2} & r_{g, 13}\sqrt{h^2_1\cdot h^2_3} \\ r_{g, 12}\sqrt{h^2_1\cdot h^2_2} & h^2_2 & r_{g, 23}\sqrt{h^2_2\cdot h^2_3} \\ r_{g, 13}\sqrt{h^2_1\cdot h^2_3} & r_{g, 23}*\sqrt{h^2_2\cdot h^2_3} & h^2_3 \end{pmatrix}\]

Now suppose we have four traits with the following heritabilities (\(h^2\)): 0.1, 0.3, 0.2, 0.6. That is, trait 1 has an \(h^2\) of 0.1, trait 2 has an \(h^2\) of 0.3 and so on. Suppose the genetic correlations (\(r_g\)) between traits are the following: trait 1 & trait 2 \(r_g\) = 0.4 trait 1 & trait 3 \(r_g\) = 0.3 trait 1 & trait 4 \(r_g\) = 0.1 trait 2 & trait 3 \(r_g\) = 0.2 trait 2 & trait 4 \(r_g\) = 0.15 trait 3 & trait 4 \(r_g\) = 0.6 To obtain the covariance matrix corresponding to this scenario \(h^2\) values are ordered according to user specification and \(r_g\) values are ordered by the order in which the corresponding genetic covariance terms will appear in the covariance matrix, reading lines in the upper triangular matrix from left to right, top to bottom (read first row left to right, read second row left to right, etc.), exluding the diagonal.

>>> cov_matrix, rg = get_cov_matrix(h2=[0.1, 0.3, 0.2, 0.6], rg=[0.4, 0.3, 0.1, 0.2, 0.15, 0.6])
>>> cov_matrix
array([[0.1       , 0.06928203, 0.04242641, 0.0244949 ],
       [0.06928203, 0.3       , 0.04898979, 0.06363961],
       [0.04242641, 0.04898979, 0.2       , 0.2078461 ],
       [0.0244949 , 0.06363961, 0.2078461 , 0.6       ]])

The diagonal corresponds directly to h2, the list of h2 values for all traits. In the upper triangular matrix, excluding the diagonal, the entry \((a, b)\), where \(a\) and \(b\) are in \({1,2,3,4}\), is the genetic covariance (\(\rho_g\)) between traits \(a\) and \(b\). Genetic covariance is calculated as \(\rho_g= r_g*\sqrt{h^2_a*h^2_b}\) where \(r_g\) is the genetic correlation between traits \(a\) and \(b\) and \(h^2_a\) and \(h^2_b\) are heritabilities corresponding to traits \(a\) and \(b\).

Notes

Covariance matrix is not scaled by number of SNPs.

If the h2 and rg parameters passed cause the resulting covariance matrix to not be positive semidefinite, this may cause the distribution of SNP effects generated by this covariance matrix to not have the properties specified by the h2 and rg parameters. To automatically adjust rg values so that the covariance matrix is positive semidefinite, set psd_rg = True.

Parameters:

h2 (list or numpy.ndarray) – \(h^2\) values for traits. \(h^2\) values in list should be ordered by their order in the diagonal of the covariance array, reading from top left to bottom right.
rg (list or numpy.ndarray) – \(r_g\) values for traits. \(r_g\) values should be ordered in the order they appear in the upper triangle of the covariance matrix, from left to right, top to bottom.
psd_rg (bool) – Whether to automatically adjust rg values to get a positive semi-definite covariance matrix, which ensures that SNP effects simulated with that covariance matrix have the desired variance and correlation properties specified by the h2 and rg parameters.

Returns:

cov_matrix (numpy.ndarray) – Covariance matrix calculated using h2 and (possibly altered) rg values.
rg (list) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.

hail.experimental.ldscsim.calculate_phenotypes(mt, genotype, beta, h2, popstrat=None, popstrat_var=None, exact_h2=False)[source]

Calculates phenotypes by multiplying genotypes and betas.

Parameters:

mt (MatrixTable) – MatrixTable with all relevant fields passed as parameters.
genotype (Expression or CallExpression) – Entry field of genotypes.
beta (Expression) – Row field of SNP effects.
h2 (float or int or list or numpy.ndarray) – SNP-based heritability (\(h^2\)) of simulated trait. Can only be None if running annotation-informed model.
popstrat (Expression, optional) – Column field containing population stratification term.
popstrat_var (float or int) – Variance of population stratification term.
exact_h2 (bool) – Whether to exactly simulate ratio of variance of genetic component of phenotype to variance of phenotype to be h2. If False, ratio will be h2 in expectation. Observed h2 in the simulation will be close to expected h2 for large-scale simulations.

Returns:

MatrixTable – MatrixTable with simulated phenotype as column field.

hail.experimental.ldscsim.normalize_genotypes(genotype)[source]

Normalizes genotypes to have mean 0 and variance 1 at each SNP

Parameters:: genotype (Expression or CallExpression) – Entry field of genotypes.
Returns:: MatrixTable – MatrixTable with normalized genotypes.

hail.experimental.ldscsim.annotate_all(mt, row_exprs={}, col_exprs={}, entry_exprs={}, global_exprs={})[source]: Equivalent of _annotate_all, but checks source MatrixTable of exprs

hail.experimental.ldscsim.ascertainment_bias(mt, y, P)[source]

Adds ascertainment bias to a binary phenotype to give it a sample prevalence of P = cases/(cases+controls).

Parameters:

mt (MatrixTable) – MatrixTable containing binary phenotype to be used.
y (Expression) – Column field of binary phenotype.
P (int or float) – Desired “sample prevalence” of phenotype.

Returns:

MatrixTable – MatrixTable containing binary phenotype with prevalence of approx. P

hail.experimental.ldscsim.binarize(mt, y, K, exact=False)[source]

Binarize phenotype y such that it has prevalence K = cases/(cases+controls) Uses inverse CDF of Gaussian to set binarization threshold when exact = False, otherwise uses ranking to determine threshold.

Parameters:

mt (MatrixTable) – MatrixTable containing phenotype to be binarized.
y (Expression) – Column field of phenotype.
K (int or float) – Desired “population prevalence” of phenotype.
exact (bool) – Whether to get prevalence as close as possible to K (does not use inverse CDF)

Returns:

MatrixTable – MatrixTable containing binary phenotype with prevalence of approx. K

hail.experimental.ldscsim.agg_fields(tb, coef_dict=None, str_expr=None, axis='rows')[source]

Aggregates by linear combination fields matching either keys in coef_dict or str_expr. Outputs the aggregation in a MatrixTable or Table as a new row field “agg_annot” or a new column field “agg_cov”.

Parameters:

tb (MatrixTable or Table) – MatrixTable or Table containing fields to be aggregated.
coef_dict (dict, optional) – Coefficients to multiply each field. The coefficients are specified by coef_dict value, the row (or col) field name is specified by coef_dict key. If not included, coefficients are assumed to be 1.
str_expr (str, optional) – String expression to match against row (or col) field names.
axis (str) – Either ‘rows’ or ‘cols’. If ‘rows’, this aggregates across row fields. If ‘cols’, this aggregates across col fields. If tb is a Table, axis = ‘rows’.

Returns:

MatrixTable or Table – MatrixTable or Table containing aggregation field.

hail.experimental.ldscsim.get_coef_dict(tb, str_expr=None, ref_coef_dict=None, axis='rows')[source]

Gets either col or row fields matching str_expr and take intersection with keys in coefficient reference dict.

Parameters:

tb (MatrixTable or Table) – MatrixTable or Table containing row (or col) for coef_dict.
str_expr (str, optional) – String expression pattern to match against row (or col) fields. If left unspecified, the intersection of field names is only between existing row (or col) fields in mt and keys of ref_coef_dict.
ref_coef_dict (dict, optional) – Reference coefficient dictionary with keys that are row (or col) field names from which to subset. If not included, coefficients are assumed to be 1.
axis (str) – Field type in which to search for field names. Options: ‘rows’, ‘cols’

Returns:

coef_dict (dict) – Coefficients to multiply each field. The coefficients are specified by coef_dict value, the row (or col) field name is specified by coef_dict key.