ldscsim
- Models for SNP effects:
Infinitesimal (can simulate n correlated traits)
Spike & slab (can simulate 2 correlated traits)
Annotation-informed
- Features:
Field aggregation tools for annotation-informed model and population stratification with many covariates.
Automatic adjustment of genetic correlation parameters to allow for the joint simulation of up to 100 randomly correlated phenotypes.
Methods for binarizing phenotypes to have a certain prevalence and for adding ascertainment bias to binarized phenotypes.
|
Simulate phenotypes for testing LD score regression. |
|
Generates betas under different models. |
|
Generates correlated betas for multi-trait infinitesimal simulations for any number of phenotypes. |
|
Generates spike & slab betas for simulation of two correlated phenotypes. |
|
Creates covariance matrix for simulating correlated SNP effects. |
|
Calculates phenotypes by multiplying genotypes and betas. |
|
Normalizes genotypes to have mean 0 and variance 1 at each SNP |
|
Equivalent of _annotate_all, but checks source MatrixTable of exprs |
|
Adds ascertainment bias to a binary phenotype to give it a sample prevalence of P = cases/(cases+controls). |
|
Binarize phenotype y such that it has prevalence K = cases/(cases+controls) Uses inverse CDF of Gaussian to set binarization threshold when exact = False, otherwise uses ranking to determine threshold. |
|
Aggregates by linear combination fields matching either keys in coef_dict or str_expr. |
|
Gets either col or row fields matching str_expr and take intersection with keys in coefficient reference dict. |
- hail.experimental.ldscsim.simulate_phenotypes(mt, genotype, h2, pi=None, rg=None, annot=None, popstrat=None, popstrat_var=None, exact_h2=False)[source]
Simulate phenotypes for testing LD score regression.
Simulates betas (SNP effects) under the infinitesimal, spike & slab, or annotation-informed models, depending on parameters passed. Optionally adds population stratification.
- Parameters:
mt (
MatrixTable
) –MatrixTable
containing genotypes to be used. Also should contain variant annotations as row fields if running the annotation-informed model or covariates as column fields if adding population stratification.genotype (
Expression
orCallExpression
) – Entry field containing genotypes of individuals to be used for the simulation.h2 (
float
orint
orlist
ornumpy.ndarray
) – SNP-based heritability of simulated trait.pi (
float
orint
orlist
ornumpy.ndarray
, optional) – Probability of SNP being causal when simulating under the spike & slab model.rg (
float
orint
orlist
ornumpy.ndarray
, optional) – Genetic correlation between traits.annot (
Expression
, optional) – Row field to use as our aggregated annotations.popstrat (
Expression
, optional) – Column field to use as our aggregated covariates for adding population stratification.exact_h2 (
bool
, optional) – Whether to exactly simulate ratio of variance of genetic component of phenotype to variance of phenotype to be h2. If False, ratio will be h2 in expectation. Observed h2 in the simulation will be close to expected h2 for large-scale simulations.
- Returns:
MatrixTable
–MatrixTable
with simulated betas and phenotypes, simulated according to specified model.
- hail.experimental.ldscsim.make_betas(mt, h2, pi=None, annot=None, rg=None)[source]
Generates betas under different models.
Simulates betas (SNP effects) under the infinitesimal, spike & slab, or annotation-informed models, depending on parameters passed.
- Parameters:
mt (
MatrixTable
) – MatrixTable containing genotypes to be used. Also should contain variant annotations as row fields if running the annotation-informed model or covariates as column fields if adding population stratification.h2 (
float
orint
orlist
ornumpy.ndarray
) – SNP-based heritability of simulated trait(s).pi (
float
orint
orlist
ornumpy.ndarray
, optional) – Probability of SNP being causal when simulating under the spike & slab model. If doing two-trait spike & slab pi is a list of probabilities for overlapping causal SNPs (see docstring ofmultitrait_ss()
)annot (
Expression
, optional) – Row field of aggregated annotations for annotation-informed model.rg (
float
orint
orlist
ornumpy.ndarray
, optional) – Genetic correlation between traits.
- Returns:
mt (
MatrixTable
) –MatrixTable
with betas as a row field, simulated according to specified model.pi (
list
) – Probability of a SNP being causal for different traits, possibly altered from input pi if covariance matrix for multitrait simulation was not positive semi-definite.rg (
list
) – Genetic correlation between traits, possibly altered from input rg if covariance matrix for multitrait simulation was not positive semi-definite.
- hail.experimental.ldscsim.multitrait_inf(mt, h2=None, rg=None, cov_matrix=None, seed=None)[source]
Generates correlated betas for multi-trait infinitesimal simulations for any number of phenotypes.
- Parameters:
mt (
MatrixTable
) – MatrixTable for simulated phenotype.h2 (
float
orint
orlist
ornumpy.ndarray
, optional) – Desired SNP-based heritability (\(h^2\)) of simulated traits. If h2 isNone
, \(h^2\) is based on diagonal of cov_matrix.rg (
float
orint
orlist
ornumpy.ndarray
, optional) – Desired genetic correlation (\(r_g\)) between simulated traits. If simulating more than two correlated traits, rg should be a list of \(rg\) values corresponding to the upper right triangle of the covariance matrix. If rg isNone
and cov_matrix isNone
, \(r_g\) is assumed to be 0 between traits. If rg and cov_matrix are both not None, \(r_g\) values from cov_matrix take precedence.cov_matrix (
numpy.ndarray
, optional) – Covariance matrix for traits, unscaled by :math:`M`, the number of SNPs. Overrides h2 and rg even when h2 or rg are notNone
.seed (
int
, optional) – Seed for random number generator. If seed isNone
, seed is set randomly.
- Returns:
mt (
MatrixTable
) –MatrixTable
with simulated SNP effects as a row field of arrays.rg (
list
) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.
- hail.experimental.ldscsim.multitrait_ss(mt, h2, pi, rg=0, seed=None)[source]
Generates spike & slab betas for simulation of two correlated phenotypes.
- Parameters:
mt (
MatrixTable
) –MatrixTable
for simulated phenotype.h2 (
list
ornumpy.ndarray
) – Desired SNP-based heritability of simulated traits.pi (
list
ornumpy.ndarray
) – List of proportion of SNPs: \(p_{TT}\), \(p_{TF}\), \(p_{FT}\) \(p_{TT}\) is the proportion of SNPs that are causal for both traits, \(p_{TF}\) is the proportion of SNPs that are causal for trait 1 but not trait 2, \(p_{FT}\) is the proportion of SNPs that are causal for trait 2 but not trait 1.seed (
int
, optional) – Seed for random number generator. If seed isNone
, seed is set randomly.
Warning
May give inaccurate results if chosen parameters make the covariance matrix not positive semi-definite. Covariance matrix is likely to not be positive semi-definite when \(p_{TT}\) is small and rg is large.
- Returns:
mt (
MatrixTable
) –MatrixTable
with simulated SNP effects as a row field of arrays.pi (
list
ornumpy.ndarray
) – List of proportion of SNPs: \(p_{TT}\), \(p_{TF}\), \(p_{FT}\). Possibly altered if covariance matrix of traits was not positive semi-definite.rg (
list
) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.
- hail.experimental.ldscsim.get_cov_matrix(h2, rg, psd_rg=False)[source]
Creates covariance matrix for simulating correlated SNP effects.
Given a list of heritabilities and a list of genetic correlations,
get_cov_matrix()
constructs the covariance matrix necessary to draw from a multivariate normal distribution to generate correlated SNP effects.Examples
Suppose we have three traits enumerated as trait 1, trait 2, and trait 3. Each trait has a heritability: \(h^2_1\),:math:h^2_2,:math:h^2_3 Traits have the following genetic correlations: \(r_{g, 12}\),:math:r_{g, 13}, \(r_{g, 23}\) The ordering of indices in the subscript is arbitrary (e.g. \(r_{g, 12}\) = \(r_{g, 21}\)) as both values are the genetic correlation between trait 1 and trait 2. We can calculate \(\rho_{g,ab}\), the genetic covariance between two traits \(a\) and \(b\), as \(\rho_{g,ab}=r_{g,ab}\sqrt{h^2_a\cdot h^2_b}\). The covariance matrix is thus:
\[\begin{pmatrix} h^2_1 & r_{g, 12}\sqrt{h^2_1\cdot h^2_2} & r_{g, 13}\sqrt{h^2_1\cdot h^2_3} \\ r_{g, 12}\sqrt{h^2_1\cdot h^2_2} & h^2_2 & r_{g, 23}\sqrt{h^2_2\cdot h^2_3} \\ r_{g, 13}\sqrt{h^2_1\cdot h^2_3} & r_{g, 23}*\sqrt{h^2_2\cdot h^2_3} & h^2_3 \end{pmatrix}\]Now suppose we have four traits with the following heritabilities (\(h^2\)): 0.1, 0.3, 0.2, 0.6. That is, trait 1 has an \(h^2\) of 0.1, trait 2 has an \(h^2\) of 0.3 and so on. Suppose the genetic correlations (\(r_g\)) between traits are the following: trait 1 & trait 2 \(r_g\) = 0.4 trait 1 & trait 3 \(r_g\) = 0.3 trait 1 & trait 4 \(r_g\) = 0.1 trait 2 & trait 3 \(r_g\) = 0.2 trait 2 & trait 4 \(r_g\) = 0.15 trait 3 & trait 4 \(r_g\) = 0.6 To obtain the covariance matrix corresponding to this scenario \(h^2\) values are ordered according to user specification and \(r_g\) values are ordered by the order in which the corresponding genetic covariance terms will appear in the covariance matrix, reading lines in the upper triangular matrix from left to right, top to bottom (read first row left to right, read second row left to right, etc.), exluding the diagonal.
>>> cov_matrix, rg = get_cov_matrix(h2=[0.1, 0.3, 0.2, 0.6], rg=[0.4, 0.3, 0.1, 0.2, 0.15, 0.6]) >>> cov_matrix array([[0.1 , 0.06928203, 0.04242641, 0.0244949 ], [0.06928203, 0.3 , 0.04898979, 0.06363961], [0.04242641, 0.04898979, 0.2 , 0.2078461 ], [0.0244949 , 0.06363961, 0.2078461 , 0.6 ]])
The diagonal corresponds directly to h2, the list of h2 values for all traits. In the upper triangular matrix, excluding the diagonal, the entry \((a, b)\), where \(a\) and \(b\) are in \({1,2,3,4}\), is the genetic covariance (\(\rho_g\)) between traits \(a\) and \(b\). Genetic covariance is calculated as \(\rho_g= r_g*\sqrt{h^2_a*h^2_b}\) where \(r_g\) is the genetic correlation between traits \(a\) and \(b\) and \(h^2_a\) and \(h^2_b\) are heritabilities corresponding to traits \(a\) and \(b\).
Notes
Covariance matrix is not scaled by number of SNPs.
If the h2 and rg parameters passed cause the resulting covariance matrix to not be positive semidefinite, this may cause the distribution of SNP effects generated by this covariance matrix to not have the properties specified by the h2 and rg parameters. To automatically adjust rg values so that the covariance matrix is positive semidefinite, set psd_rg = True.
- Parameters:
h2 (
list
ornumpy.ndarray
) – \(h^2\) values for traits. \(h^2\) values in list should be ordered by their order in the diagonal of the covariance array, reading from top left to bottom right.rg (
list
ornumpy.ndarray
) – \(r_g\) values for traits. \(r_g\) values should be ordered in the order they appear in the upper triangle of the covariance matrix, from left to right, top to bottom.psd_rg (
bool
) – Whether to automatically adjust rg values to get a positive semi-definite covariance matrix, which ensures that SNP effects simulated with that covariance matrix have the desired variance and correlation properties specified by the h2 and rg parameters.
- Returns:
cov_matrix (
numpy.ndarray
) – Covariance matrix calculated using h2 and (possibly altered) rg values.rg (
list
) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.
- hail.experimental.ldscsim.calculate_phenotypes(mt, genotype, beta, h2, popstrat=None, popstrat_var=None, exact_h2=False)[source]
Calculates phenotypes by multiplying genotypes and betas.
- Parameters:
mt (
MatrixTable
) –MatrixTable
with all relevant fields passed as parameters.genotype (
Expression
orCallExpression
) – Entry field of genotypes.beta (
Expression
) – Row field of SNP effects.h2 (
float
orint
orlist
ornumpy.ndarray
) – SNP-based heritability (\(h^2\)) of simulated trait. Can only beNone
if running annotation-informed model.popstrat (
Expression
, optional) – Column field containing population stratification term.popstrat_var (
float
orint
) – Variance of population stratification term.exact_h2 (
bool
) – Whether to exactly simulate ratio of variance of genetic component of phenotype to variance of phenotype to be h2. If False, ratio will be h2 in expectation. Observed h2 in the simulation will be close to expected h2 for large-scale simulations.
- Returns:
MatrixTable
–MatrixTable
with simulated phenotype as column field.
- hail.experimental.ldscsim.normalize_genotypes(genotype)[source]
Normalizes genotypes to have mean 0 and variance 1 at each SNP
- Parameters:
genotype (
Expression
orCallExpression
) – Entry field of genotypes.- Returns:
MatrixTable
–MatrixTable
with normalized genotypes.
- hail.experimental.ldscsim.annotate_all(mt, row_exprs={}, col_exprs={}, entry_exprs={}, global_exprs={})[source]
Equivalent of _annotate_all, but checks source MatrixTable of exprs
- hail.experimental.ldscsim.ascertainment_bias(mt, y, P)[source]
Adds ascertainment bias to a binary phenotype to give it a sample prevalence of P = cases/(cases+controls).
- Parameters:
mt (
MatrixTable
) –MatrixTable
containing binary phenotype to be used.y (
Expression
) – Column field of binary phenotype.P (
int
orfloat
) – Desired “sample prevalence” of phenotype.
- Returns:
MatrixTable
–MatrixTable
containing binary phenotype with prevalence of approx. P
- hail.experimental.ldscsim.binarize(mt, y, K, exact=False)[source]
Binarize phenotype y such that it has prevalence K = cases/(cases+controls) Uses inverse CDF of Gaussian to set binarization threshold when exact = False, otherwise uses ranking to determine threshold.
- Parameters:
mt (
MatrixTable
) –MatrixTable
containing phenotype to be binarized.y (
Expression
) – Column field of phenotype.K (
int
orfloat
) – Desired “population prevalence” of phenotype.exact (
bool
) – Whether to get prevalence as close as possible to K (does not use inverse CDF)
- Returns:
MatrixTable
–MatrixTable
containing binary phenotype with prevalence of approx. K
- hail.experimental.ldscsim.agg_fields(tb, coef_dict=None, str_expr=None, axis='rows')[source]
Aggregates by linear combination fields matching either keys in coef_dict or str_expr. Outputs the aggregation in a
MatrixTable
orTable
as a new row field “agg_annot” or a new column field “agg_cov”.- Parameters:
tb (
MatrixTable
orTable
) –MatrixTable
orTable
containing fields to be aggregated.coef_dict (
dict
, optional) – Coefficients to multiply each field. The coefficients are specified by coef_dict value, the row (or col) field name is specified by coef_dict key. If not included, coefficients are assumed to be 1.str_expr (
str
, optional) – String expression to match against row (or col) field names.axis (
str
) – Either ‘rows’ or ‘cols’. If ‘rows’, this aggregates across row fields. If ‘cols’, this aggregates across col fields. If tb is a Table, axis = ‘rows’.
- Returns:
MatrixTable
orTable
–MatrixTable
orTable
containing aggregation field.
- hail.experimental.ldscsim.get_coef_dict(tb, str_expr=None, ref_coef_dict=None, axis='rows')[source]
Gets either col or row fields matching str_expr and take intersection with keys in coefficient reference dict.
- Parameters:
tb (
MatrixTable
orTable
) –MatrixTable
orTable
containing row (or col) for coef_dict.str_expr (
str
, optional) – String expression pattern to match against row (or col) fields. If left unspecified, the intersection of field names is only between existing row (or col) fields in mt and keys of ref_coef_dict.ref_coef_dict (
dict
, optional) – Reference coefficient dictionary with keys that are row (or col) field names from which to subset. If not included, coefficients are assumed to be 1.axis (
str
) – Field type in which to search for field names. Options: ‘rows’, ‘cols’
- Returns:
coef_dict (
dict
) – Coefficients to multiply each field. The coefficients are specified by coef_dict value, the row (or col) field name is specified by coef_dict key.