# ldscsim¶

Models for SNP effects:
• Infinitesimal (can simulate n correlated traits)

• Spike & slab (can simulate 2 correlated traits)

• Annotation-informed

Features:
• Field aggregation tools for annotation-informed model and population stratification with many covariates.

• Automatic adjustment of genetic correlation parameters to allow for the joint simulation of up to 100 randomly correlated phenotypes.

• Methods for binarizing phenotypes to have a certain prevalence and for adding ascertainment bias to binarized phenotypes.

 simulate_phenotypes(mt, genotype, h2[, pi, …]) Simulate phenotypes for testing LD score regression. make_betas(mt, h2[, pi, annot, rg]) Generates betas under different models. multitrait_inf(mt[, h2, rg, cov_matrix, seed]) Generates correlated betas for multi-trait infinitesimal simulations for any number of phenotypes. multitrait_ss(mt, h2, pi[, rg, seed]) Generates spike & slab betas for simulation of two correlated phenotypes. get_cov_matrix(h2, rg[, psd_rg]) Creates covariance matrix for simulating correlated SNP effects. calculate_phenotypes(mt, genotype, beta, h2) Calculates phenotypes by multiplying genotypes and betas. normalize_genotypes(genotype) Normalizes genotypes to have mean 0 and variance 1 at each SNP annotate_all(mt[, row_exprs, col_exprs, …]) Equivalent of _annotate_all, but checks source MatrixTable of exprs ascertainment_bias(mt, y, P) Adds ascertainment bias to a binary phenotype to give it a sample prevalence of P = cases/(cases+controls). binarize(mt, y, K[, exact]) Binarize phenotype y such that it has prevalence K = cases/(cases+controls) Uses inverse CDF of Gaussian to set binarization threshold when exact = False, otherwise uses ranking to determine threshold. agg_fields(tb[, coef_dict, str_expr, axis]) Aggregates by linear combination fields matching either keys in coef_dict or str_expr. get_coef_dict(tb[, str_expr, ref_coef_dict, …]) Gets either col or row fields matching str_expr and take intersection with keys in coefficient reference dict.
hail.experimental.ldscsim.simulate_phenotypes(mt, genotype, h2, pi=None, rg=None, annot=None, popstrat=None, popstrat_var=None, exact_h2=False)[source]

Simulate phenotypes for testing LD score regression.

Simulates betas (SNP effects) under the infinitesimal, spike & slab, or annotation-informed models, depending on parameters passed. Optionally adds population stratification.

Parameters
Returns

MatrixTableMatrixTable with simulated betas and phenotypes, simulated according to specified model.

hail.experimental.ldscsim.make_betas(mt, h2, pi=None, annot=None, rg=None)[source]

Generates betas under different models.

Simulates betas (SNP effects) under the infinitesimal, spike & slab, or annotation-informed models, depending on parameters passed.

Parameters
Returns

• mt (MatrixTable) – MatrixTable with betas as a row field, simulated according to specified model.

• pi (list) – Probability of a SNP being causal for different traits, possibly altered from input pi if covariance matrix for multitrait simulation was not positive semi-definite.

• rg (list) – Genetic correlation between traits, possibly altered from input rg if covariance matrix for multitrait simulation was not positive semi-definite.

hail.experimental.ldscsim.multitrait_inf(mt, h2=None, rg=None, cov_matrix=None, seed=None)[source]

Generates correlated betas for multi-trait infinitesimal simulations for any number of phenotypes.

Parameters
• mt (MatrixTable) – MatrixTable for simulated phenotype.

• h2 (float or int or list or numpy.ndarray, optional) – Desired SNP-based heritability ($$h^2$$) of simulated traits. If h2 is None, $$h^2$$ is based on diagonal of cov_matrix.

• rg (float or int or list or numpy.ndarray, optional) – Desired genetic correlation ($$r_g$$) between simulated traits. If simulating more than two correlated traits, rg should be a list of $$rg$$ values corresponding to the upper right triangle of the covariance matrix. If rg is None and cov_matrix is None, $$r_g$$ is assumed to be 0 between traits. If rg and cov_matrix are both not None, $$r_g$$ values from cov_matrix take precedence.

• cov_matrix (numpy.ndarray, optional) – Covariance matrix for traits, unscaled by :math:M, the number of SNPs. Overrides h2 and rg even when h2 or rg are not None.

• seed (int, optional) – Seed for random number generator. If seed is None, seed is set randomly.

Returns

hail.experimental.ldscsim.multitrait_ss(mt, h2, pi, rg=0, seed=None)[source]

Generates spike & slab betas for simulation of two correlated phenotypes.

Parameters

Warning

May give inaccurate results if chosen parameters make the covariance matrix not positive semi-definite. Covariance matrix is likely to not be positive semi-definite when $$p_{TT}$$ is small and rg is large.

Returns

hail.experimental.ldscsim.get_cov_matrix(h2, rg, psd_rg=False)[source]

Creates covariance matrix for simulating correlated SNP effects.

Given a list of heritabilities and a list of genetic correlations, get_cov_matrix() constructs the covariance matrix necessary to draw from a multivariate normal distribution to generate correlated SNP effects.

Examples

Suppose we have three traits enumerated as trait 1, trait 2, and trait 3. Each trait has a heritability: $$h^2_1$$,:math:h^2_2,:math:h^2_3 Traits have the following genetic correlations: $$r_{g, 12}$$,:math:r_{g, 13}, $$r_{g, 23}$$ The ordering of indices in the subscript is arbitrary (e.g. $$r_{g, 12}$$ = $$r_{g, 21}$$) as both values are the genetic correlation between trait 1 and trait 2. We can calculate $$\rho_{g,ab}$$, the genetic covariance between two traits $$a$$ and $$b$$, as $$\rho_{g,ab}=r_{g,ab}\sqrt{h^2_a\cdot h^2_b}$$. The covariance matrix is thus:

$\begin{pmatrix} h^2_1 & r_{g, 12}\sqrt{h^2_1\cdot h^2_2} & r_{g, 13}\sqrt{h^2_1\cdot h^2_3} \\ r_{g, 12}\sqrt{h^2_1\cdot h^2_2} & h^2_2 & r_{g, 23}\sqrt{h^2_2\cdot h^2_3} \\ r_{g, 13}\sqrt{h^2_1\cdot h^2_3} & r_{g, 23}*\sqrt{h^2_2\cdot h^2_3} & h^2_3 \end{pmatrix}$

Now suppose we have four traits with the following heritabilities ($$h^2$$): 0.1, 0.3, 0.2, 0.6. That is, trait 1 has an $$h^2$$ of 0.1, trait 2 has an $$h^2$$ of 0.3 and so on. Suppose the genetic correlations ($$r_g$$) between traits are the following: trait 1 & trait 2 $$r_g$$ = 0.4 trait 1 & trait 3 $$r_g$$ = 0.3 trait 1 & trait 4 $$r_g$$ = 0.1 trait 2 & trait 3 $$r_g$$ = 0.2 trait 2 & trait 4 $$r_g$$ = 0.15 trait 3 & trait 4 $$r_g$$ = 0.6 To obtain the covariance matrix corresponding to this scenario $$h^2$$ values are ordered according to user specification and $$r_g$$ values are ordered by the order in which the corresponding genetic covariance terms will appear in the covariance matrix, reading lines in the upper triangular matrix from left to right, top to bottom (read first row left to right, read second row left to right, etc.), exluding the diagonal.

>>> cov_matrix, rg = get_cov_matrix(h2=[0.1, 0.3, 0.2, 0.6], rg=[0.4, 0.3, 0.1, 0.2, 0.15, 0.6])
>>> cov_matrix
array([[0.1       , 0.06928203, 0.04242641, 0.0244949 ],
[0.06928203, 0.3       , 0.04898979, 0.06363961],
[0.04242641, 0.04898979, 0.2       , 0.2078461 ],
[0.0244949 , 0.06363961, 0.2078461 , 0.6       ]])


The diagonal corresponds directly to h2, the list of h2 values for all traits. In the upper triangular matrix, excluding the diagonal, the entry $$(a, b)$$, where $$a$$ and $$b$$ are in $${1,2,3,4}$$, is the genetic covariance ($$\rho_g$$) between traits $$a$$ and $$b$$. Genetic covariance is calculated as $$\rho_g= r_g*\sqrt{h^2_a*h^2_b}$$ where $$r_g$$ is the genetic correlation between traits $$a$$ and $$b$$ and $$h^2_a$$ and $$h^2_b$$ are heritabilities corresponding to traits $$a$$ and $$b$$.

Notes

Covariance matrix is not scaled by number of SNPs.

If the h2 and rg parameters passed cause the resulting covariance matrix to not be positive semidefinite, this may cause the distribution of SNP effects generated by this covariance matrix to not have the properties specified by the h2 and rg parameters. To automatically adjust rg values so that the covariance matrix is positive semidefinite, set psd_rg = True.

Parameters
• h2 (list or numpy.ndarray) – $$h^2$$ values for traits. $$h^2$$ values in list should be ordered by their order in the diagonal of the covariance array, reading from top left to bottom right.

• rg (list or numpy.ndarray) – $$r_g$$ values for traits. $$r_g$$ values should be ordered in the order they appear in the upper triangle of the covariance matrix, from left to right, top to bottom.

• psd_rg (bool) – Whether to automatically adjust rg values to get a positive semi-definite covariance matrix, which ensures that SNP effects simulated with that covariance matrix have the desired variance and correlation properties specified by the h2 and rg parameters.

Returns

• cov_matrix (numpy.ndarray) – Covariance matrix calculated using h2 and (possibly altered) rg values.

• rg (list) – Genetic correlation between traits, possibly altered from input rg if covariance matrix was not positive semi-definite.

hail.experimental.ldscsim.calculate_phenotypes(mt, genotype, beta, h2, popstrat=None, popstrat_var=None, exact_h2=False)[source]

Calculates phenotypes by multiplying genotypes and betas.

Parameters
Returns

MatrixTableMatrixTable with simulated phenotype as column field.

hail.experimental.ldscsim.normalize_genotypes(genotype)[source]

Normalizes genotypes to have mean 0 and variance 1 at each SNP

Parameters

genotype (Expression or CallExpression) – Entry field of genotypes.

Returns

MatrixTableMatrixTable with normalized genotypes.

hail.experimental.ldscsim.annotate_all(mt, row_exprs={}, col_exprs={}, entry_exprs={}, global_exprs={})[source]

Equivalent of _annotate_all, but checks source MatrixTable of exprs

hail.experimental.ldscsim.ascertainment_bias(mt, y, P)[source]

Adds ascertainment bias to a binary phenotype to give it a sample prevalence of P = cases/(cases+controls).

Parameters
Returns

MatrixTableMatrixTable containing binary phenotype with prevalence of approx. P

hail.experimental.ldscsim.binarize(mt, y, K, exact=False)[source]

Binarize phenotype y such that it has prevalence K = cases/(cases+controls) Uses inverse CDF of Gaussian to set binarization threshold when exact = False, otherwise uses ranking to determine threshold.

Parameters
Returns

MatrixTableMatrixTable containing binary phenotype with prevalence of approx. K

hail.experimental.ldscsim.agg_fields(tb, coef_dict=None, str_expr=None, axis='rows')[source]

Aggregates by linear combination fields matching either keys in coef_dict or str_expr. Outputs the aggregation in a MatrixTable or Table as a new row field “agg_annot” or a new column field “agg_cov”.

Parameters
Returns

MatrixTable or TableMatrixTable or Table containing aggregation field.

hail.experimental.ldscsim.get_coef_dict(tb, str_expr=None, ref_coef_dict=None, axis='rows')[source]

Gets either col or row fields matching str_expr and take intersection with keys in coefficient reference dict.

Parameters
Returns

coef_dict (dict) – Coefficients to multiply each field. The coefficients are specified by coef_dict value, the row (or col) field name is specified by coef_dict key.