Statistical functions

chi_squared_test(c1, c2, c3, c4)

Performs chi-squared test of independence on a 2x2 contingency table.

fisher_exact_test(c1, c2, c3, c4)

Calculates the p-value, odds ratio, and 95% confidence interval using Fisher's exact test for a 2x2 table.

contingency_table_test(c1, c2, c3, c4, ...)

Performs chi-squared or Fisher's exact test of independence on a 2x2 contingency table.

cochran_mantel_haenszel_test(a, b, c, d)

Perform the Cochran-Mantel-Haenszel test for association.

dbeta(x, a, b)

Returns the probability density at x of a beta distribution with parameters a (alpha) and b (beta).

dchisq(x, df[, ncp, log_p])

Compute the probability density at x of a chi-squared distribution with df degrees of freedom.

dnorm(x[, mu, sigma, log_p])

Compute the probability density at x of a normal distribution with mean mu and standard deviation sigma.

dpois(x, lamb[, log_p])

Compute the (log) probability density at x of a Poisson distribution with rate parameter lamb.

hardy_weinberg_test(n_hom_ref, n_het, n_hom_var)

Performs test of Hardy-Weinberg equilibrium.

binom_test(x, n, p, alternative)

Performs a binomial test on p given x successes in n trials.

pchisqtail(x, df[, ncp, lower_tail, log_p])

Returns the probability under the right-tail starting at x for a chi-squared distribution with df degrees of freedom.

pgenchisq(x, w, k, lam, mu, sigma, *[, ...])

The cumulative probability function of a generalized chi-squared distribution.

pnorm(x[, mu, sigma, lower_tail, log_p])

The cumulative probability function of a normal distribution with mean mu and standard deviation sigma.

pT(x, n[, lower_tail, log_p])

The cumulative probability function of a t-distribution with n degrees of freedom.

pF(x, df1, df2[, lower_tail, log_p])

The cumulative probability function of a F-distribution with parameters df1 and df2.

ppois(x, lamb[, lower_tail, log_p])

The cumulative probability function of a Poisson distribution.

qchisqtail(p, df[, ncp, lower_tail, log_p])

The quantile function of a chi-squared distribution with df degrees of freedom, inverts pchisqtail().

qnorm(p[, mu, sigma, lower_tail, log_p])

The quantile function of a normal distribution with mean mu and standard deviation sigma, inverts pnorm().

qpois(p, lamb[, lower_tail, log_p])

The quantile function of a Poisson distribution with rate parameter lamb, inverts ppois().

hail.expr.functions.chi_squared_test(c1, c2, c3, c4)[source]

Performs chi-squared test of independence on a 2x2 contingency table.

Examples

>>> hl.eval(hl.chi_squared_test(10, 10, 10, 10))
Struct(p_value=1.0, odds_ratio=1.0)
>>> hl.eval(hl.chi_squared_test(51, 43, 22, 92))
Struct(p_value=1.4626257805267089e-07, odds_ratio=4.959830866807611)

Notes

The odds ratio is given by (c1 / c2) / (c3 / c4).

Returned fields may be nan or inf.

Parameters:
Returns:

StructExpression – A tstruct expression with two fields, p_value (tfloat64) and odds_ratio (tfloat64).

hail.expr.functions.fisher_exact_test(c1, c2, c3, c4)[source]

Calculates the p-value, odds ratio, and 95% confidence interval using Fisher’s exact test for a 2x2 table.

Examples

>>> hl.eval(hl.fisher_exact_test(10, 10, 10, 10))
Struct(p_value=1.0000000000000002, odds_ratio=1.0,
       ci_95_lower=0.24385796914260355, ci_95_upper=4.100747675033819)
>>> hl.eval(hl.fisher_exact_test(51, 43, 22, 92))
Struct(p_value=2.1564999740157304e-07, odds_ratio=4.918058171469967,
       ci_95_lower=2.5659373368248444, ci_95_upper=9.677929632035475)

Notes

This method is identical to the version implemented in R with default parameters (two-sided, alpha = 0.05, null hypothesis that the odds ratio equals 1).

Returned fields may be nan or inf.

Parameters:
Returns:

StructExpression – A tstruct expression with four fields, p_value (tfloat64), odds_ratio (tfloat64), ci_95_lower (:py:data:.tfloat64`), and ci_95_upper (tfloat64).

hail.expr.functions.contingency_table_test(c1, c2, c3, c4, min_cell_count)[source]

Performs chi-squared or Fisher’s exact test of independence on a 2x2 contingency table.

Examples

>>> hl.eval(hl.contingency_table_test(51, 43, 22, 92, min_cell_count=22))
Struct(p_value=1.4626257805267089e-07, odds_ratio=4.959830866807611)
>>> hl.eval(hl.contingency_table_test(51, 43, 22, 92, min_cell_count=23))
Struct(p_value=2.1564999740157304e-07, odds_ratio=4.918058171469967)

Notes

If all cell counts are at least min_cell_count, the chi-squared test is used. Otherwise, Fisher’s exact test is used.

Returned fields may be nan or inf.

Parameters:
Returns:

StructExpression – A tstruct expression with two fields, p_value (tfloat64) and odds_ratio (tfloat64).

hail.expr.functions.cochran_mantel_haenszel_test(a, b, c, d)[source]

Perform the Cochran-Mantel-Haenszel test for association.

Examples

>>> a = [56, 61, 73, 71]
>>> b = [69, 257, 65, 48]
>>> c = [40, 57, 71, 55]
>>> d = [77, 301, 79, 48]
>>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
Struct(test_statistic=5.0496881823306765, p_value=0.024630370456863417)
>>> mt = ds.filter_rows(mt.locus == hl.Locus(20, 10633237))
>>> mt.count_rows()
1
>>> a, b, c, d = mt.aggregate_entries(
...     hl.tuple([
...         hl.array([hl.agg.count_where(mt.GT.is_non_ref() & mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(mt.GT.is_non_ref() & mt.pheno.is_case & ~mt.pheno.is_female)]),
...         hl.array([hl.agg.count_where(mt.GT.is_non_ref() & ~mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(mt.GT.is_non_ref() & ~mt.pheno.is_case & ~mt.pheno.is_female)]),
...         hl.array([hl.agg.count_where(~mt.GT.is_non_ref() & mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(~mt.GT.is_non_ref() & mt.pheno.is_case & ~mt.pheno.is_female)]),
...         hl.array([hl.agg.count_where(~mt.GT.is_non_ref() & ~mt.pheno.is_case & mt.pheno.is_female), hl.agg.count_where(~mt.GT.is_non_ref() & ~mt.pheno.is_case & ~mt.pheno.is_female)])
...     ])
... )
>>> hl.eval(hl.cochran_mantel_haenszel_test(a, b, c, d))
Struct(test_statistic=0.2188830334629822, p_value=0.6398923118508772)

Notes

See the Wikipedia article for more details.

Parameters:
Returns:

StructExpression – A tstruct expression with two fields, test_statistic (tfloat64) and p_value (tfloat64).

hail.expr.functions.dbeta(x, a, b)[source]

Returns the probability density at x of a beta distribution with parameters a (alpha) and b (beta).

Examples

>>> hl.eval(hl.dbeta(.2, 5, 20))
4.900377563180943
Parameters:
  • x (float or Expression of type tfloat64) – Point in [0,1] at which to sample. If a < 1 then x must be positive. If b < 1 then x must be less than 1.

  • a (float or Expression of type tfloat64) – The alpha parameter in the beta distribution. The result is undefined for non-positive a.

  • b (float or Expression of type tfloat64) – The beta parameter in the beta distribution. The result is undefined for non-positive b.

Returns:

Float64Expression

hail.expr.functions.dchisq(x, df, ncp=None, log_p=False)[source]

Compute the probability density at x of a chi-squared distribution with df degrees of freedom.

Examples

>>> hl.eval(hl.dchisq(1, 2))
0.3032653298563167
>>> hl.eval(hl.dchisq(1, 2, ncp=2))
0.17472016746112667
>>> hl.eval(hl.dchisq(1, 2, log_p=True))
-1.1931471805599454
Parameters:
  • x (float or Expression of type tfloat64) – Non-negative number at which to compute the probability density.

  • df (float or Expression of type tfloat64) – Degrees of freedom.

  • ncp (float or Expression of type tfloat64) – Noncentrality parameter, defaults to 0 if unspecified.

  • log_p (bool or BooleanExpression) – If True, the natural logarithm of the probability density is returned.

Returns:

Expression of type tfloat64 – The probability density.

hail.expr.functions.dnorm(x, mu=0, sigma=1, log_p=False)[source]

Compute the probability density at x of a normal distribution with mean mu and standard deviation sigma. Returns density of standard normal distribution by default.

Examples

>>> hl.eval(hl.dnorm(1))
0.24197072451914337
>>> hl.eval(hl.dnorm(1, mu=1, sigma=2))
0.19947114020071635
>>> hl.eval(hl.dnorm(1, log_p=True))
-1.4189385332046727
Parameters:
Returns:

Expression of type tfloat64 – The probability density.

hail.expr.functions.dpois(x, lamb, log_p=False)[source]

Compute the (log) probability density at x of a Poisson distribution with rate parameter lamb.

Examples

>>> hl.eval(hl.dpois(5, 3))
0.10081881344492458
Parameters:
Returns:

Expression of type tfloat64 – The (log) probability density.

hail.expr.functions.hardy_weinberg_test(n_hom_ref, n_het, n_hom_var, one_sided=False)[source]

Performs test of Hardy-Weinberg equilibrium.

Examples

>>> hl.eval(hl.hardy_weinberg_test(250, 500, 250))
Struct(het_freq_hwe=0.5002501250625313, p_value=0.9747844394217698)
>>> hl.eval(hl.hardy_weinberg_test(37, 200, 85))
Struct(het_freq_hwe=0.48964964307448583, p_value=1.1337210383168987e-06)

Notes

By default, this method performs a two-sided exact test with mid-p-value correction of Hardy-Weinberg equilibrium via an efficient implementation of the Levene-Haldane distribution, which models the number of heterozygous individuals under equilibrium.

The mean of this distribution is (n_ref * n_var) / (2n - 1), where n_ref = 2*n_hom_ref + n_het is the number of reference alleles, n_var = 2*n_hom_var + n_het is the number of variant alleles, and n = n_hom_ref + n_het + n_hom_var is the number of individuals. So the expected frequency of heterozygotes under equilibrium, het_freq_hwe, is this mean divided by n.

To perform one-sided exact test of excess heterozygosity with mid-p-value correction instead, set one_sided=True and the p-value returned will be from the one-sided exact test.

Parameters:
  • n_hom_ref (int or Expression of type tint32) – Number of homozygous reference genotypes.

  • n_het (int or Expression of type tint32) – Number of heterozygous genotypes.

  • n_hom_var (int or Expression of type tint32) – Number of homozygous variant genotypes.

  • one_sided (bool) – False by default. When True, perform one-sided test for excess heterozygosity.

Returns:

StructExpression – A struct expression with two fields, het_freq_hwe (tfloat64) and p_value (tfloat64).

hail.expr.functions.binom_test(x, n, p, alternative)[source]

Performs a binomial test on p given x successes in n trials.

Returns the p-value from the exact binomial test of the null hypothesis that success has probability p, given x successes in n trials.

The alternatives are interpreted as follows: - 'less': a one-tailed test of the significance of x or fewer successes, - 'greater': a one-tailed test of the significance of x or more successes, and - 'two-sided': a two-tailed test of the significance of x or any equivalent or more unlikely outcome.

Examples

All the examples below use a fair coin as the null hypothesis. Zero is interpreted as tail and one as heads.

Test if a coin is biased towards heads or tails after observing two heads out of ten flips:

>>> hl.eval(hl.binom_test(2, 10, 0.5, 'two-sided'))
0.10937499999999994

Test if a coin is biased towards tails after observing four heads out of ten flips:

>>> hl.eval(hl.binom_test(4, 10, 0.5, 'less'))
0.3769531250000001

Test if a coin is biased towards heads after observing thirty-two heads out of fifty flips:

>>> hl.eval(hl.binom_test(32, 50, 0.5, 'greater'))
0.03245432353613613
Parameters:
  • x (int or Expression of type tint32) – Number of successes.

  • n (int or Expression of type tint32) – Number of trials.

  • p (float or Expression of type tfloat64) – Probability of success, between 0 and 1.

  • alternative – : One of, “two-sided”, “greater”, “less”, (deprecated: “two.sided”).

Returns:

Expression of type tfloat64 – p-value.

hail.expr.functions.pchisqtail(x, df, ncp=None, lower_tail=False, log_p=False)[source]

Returns the probability under the right-tail starting at x for a chi-squared distribution with df degrees of freedom.

Examples

>>> hl.eval(hl.pchisqtail(5, 1))
0.025347318677468304
>>> hl.eval(hl.pchisqtail(5, 1, ncp=2))
0.20571085634347097
>>> hl.eval(hl.pchisqtail(5, 1, lower_tail=True))
0.9746526813225317
>>> hl.eval(hl.pchisqtail(5, 1, log_p=True))
-3.6750823266311876
Parameters:
  • x (float or Expression of type tfloat64) – The value at which to evaluate the CDF.

  • df (float or Expression of type tfloat64) – Degrees of freedom.

  • ncp (float or Expression of type tfloat64) – Noncentrality parameter, defaults to 0 if unspecified.

  • lower_tail (bool or BooleanExpression) – If True, compute the probability of an outcome at or below x, otherwise greater than x.

  • log_p (bool or BooleanExpression) – Return the natural logarithm of the probability.

Returns:

Expression of type tfloat64

hail.expr.functions.pgenchisq(x, w, k, lam, mu, sigma, *, max_iterations=None, min_accuracy=None)[source]

The cumulative probability function of a generalized chi-squared distribution.

The generalized chi-squared distribution has many interpretations. We share here four interpretations of the values of this distribution:

  1. A linear combination of normal variables and squares of normal variables.

  2. A weighted sum of sums of squares of normally distributed values plus a normally distributed value.

  3. A weighted sum of chi-squared distributed values plus a normally distributed value.

  4. A “quadratic form” in a vector of uncorrelated standard normal values.

The parameters of this function correspond to the parameters of the third interpretation.

\[\begin{aligned} w &: R^n \quad k : Z^n \quad lam : R^n \quad mu : R \quad sigma : R \\ \\ x &\sim N(mu, sigma^2) \\ y_i &\sim \mathrm{NonCentralChiSquared}(k_i, lam_i) \\ \\ Z &= x + w y^T \\ &= x + \sum_i w_i y_i \\ Z &\sim \mathrm{GeneralizedNonCentralChiSquared}(w, k, lam, mu, sigma) \end{aligned}\]

The generalized chi-squared distribution often arises when working on linear models with standard normal noise because the sum of the squares of the residuals should follow a generalized chi-squared distribution.

Examples

The following plot shows three examples of the generalized chi-squared cumulative distribution function.

Plots of examples of the generalized chi-square cumulative distribution function. Created by Dvidby0.

The following examples are chosen from the three instances shown above. The curves appear in the same order as the legend of the plot: blue, red, yellow.

>>> hl.eval(hl.pgenchisq(-80, w=[1, 2], k=[1, 4], lam=[1, 1], mu=0, sigma=0).value)
0.0
>>> hl.eval(hl.pgenchisq(-20, w=[1, 2], k=[1, 4], lam=[1, 1], mu=0, sigma=0).value)
0.0
>>> hl.eval(hl.pgenchisq(10 , w=[1, 2], k=[1, 4], lam=[1, 1], mu=0, sigma=0).value)
0.4670012373599629
>>> hl.eval(hl.pgenchisq(40 , w=[1, 2], k=[1, 4], lam=[1, 1], mu=0, sigma=0).value)
0.9958803111156718
>>> hl.eval(hl.pgenchisq(-80, w=[-2, -1], k=[5, 2], lam=[3, 1], mu=-3, sigma=0).value)
9.227056966837344e-05
>>> hl.eval(hl.pgenchisq(-20, w=[-2, -1], k=[5, 2], lam=[3, 1], mu=-3, sigma=0).value)
0.516439358616939
>>> hl.eval(hl.pgenchisq(10 , w=[-2, -1], k=[5, 2], lam=[3, 1], mu=-3, sigma=0).value)
1.0
>>> hl.eval(hl.pgenchisq(40 , w=[-2, -1], k=[5, 2], lam=[3, 1], mu=-3, sigma=0).value)
1.0
>>> hl.eval(hl.pgenchisq(-80, w=[1, -10, 2], k=[1, 2, 3], lam=[2, 3, 7], mu=-10, sigma=0).value)
0.14284718767288906
>>> hl.eval(hl.pgenchisq(-20, w=[1, -10, 2], k=[1, 2, 3], lam=[2, 3, 7], mu=-10, sigma=0).value)
0.5950150356303258
>>> hl.eval(hl.pgenchisq(10 , w=[1, -10, 2], k=[1, 2, 3], lam=[2, 3, 7], mu=-10, sigma=0).value)
0.923219534175858
>>> hl.eval(hl.pgenchisq(40 , w=[1, -10, 2], k=[1, 2, 3], lam=[2, 3, 7], mu=-10, sigma=0).value)
0.9971746768781656

Notes

We follow Wikipedia’s notational conventions. Some texts refer to the weight vector (our w) as \(\lambda\) or lb and the non-centrality vector (our lam) as nc.

We use the Davies’ algorithm which was published as:

Davies included Fortran source code in the original publication. Davies also released a C language port. Hail’s implementation is a fairly direct port of the C implementation to Scala. Davies provides 39 test cases with the source code. The Hail tests include all 39 test cases as well as a few additional tests.

Davies’ website cautions:

The method works well in most situations if you want only modest accuracy, say 0.0001. But problems may arise if the sum is dominated by one or two terms with a total of only one or two degrees of freedom and x is small.

For an accessible introduction the Generalized Chi-Squared Distribution, we strongly recommend the introduction of this paper:

Parameters:
  • x (float or Expression of type tfloat64) – The value at which to evaluate the cumulative distribution function (CDF).

  • w (list of float or Expression of type tarray of tfloat64) – A weight for each non-central chi-square term.

  • k (list of int or Expression of type tarray of tint32) – A degrees of freedom parameter for each non-central chi-square term.

  • lam (list of float or Expression of type tarray of tfloat64) – A non-centrality parameter for each non-central chi-square term. We use lam instead of lambda because the latter is a reserved word in Python.

  • mu (float or Expression of type tfloat64) – The standard deviation of the normal term.

  • sigma (float or Expression of type tfloat64) – The standard deviation of the normal term.

  • max_iterations (int or Expression of type tint32) – The maximum number of iterations of the numerical integration before raising an error. The default maximum number of iterations is 1e5.

  • min_accuracy (int or Expression of type tint32) – The minimum accuracy of the returned value. If the minimum accuracy is not achieved, this function will raise an error. The default minimum accuracy is 1e-5.

Returns:

StructExpression – This method returns a structure with the value as well as information about the numerical integration.

  • value : Float64Expression. If converged is true, the value of the CDF evaluated at x. Otherwise, this is the last value the integration evaluated before aborting.

  • n_iterations : Int32Expression. The number of iterations before stopping.

  • converged : BooleanExpression. True if the min_accuracy was achieved and round off error is not likely significant.

  • fault : Int32Expression. If converged is true, fault is zero. If converged is false, fault is either one or two. One indicates that the requried accuracy was not achieved. Two indicates the round-off error is possibly significant.

hail.expr.functions.pnorm(x, mu=0, sigma=1, lower_tail=True, log_p=False)[source]

The cumulative probability function of a normal distribution with mean mu and standard deviation sigma. Returns cumulative probability of standard normal distribution by default.

Examples

>>> hl.eval(hl.pnorm(0))
0.5
>>> hl.eval(hl.pnorm(1, mu=2, sigma=2))
0.30853753872598694
>>> hl.eval(hl.pnorm(2, lower_tail=False))
0.022750131948179212
>>> hl.eval(hl.pnorm(2, log_p=True))
-0.023012909328963493

Notes

Returns the left-tail probability p = Prob(\(Z < x\)) with \(Z\) a normal random variable. Defaults to a standard normal random variable.

Parameters:
Returns:

Expression of type tfloat64

hail.expr.functions.pT(x, n, lower_tail=True, log_p=False)[source]

The cumulative probability function of a t-distribution with n degrees of freedom.

Examples

>>> hl.eval(hl.pT(0, 10))
0.5
>>> hl.eval(hl.pT(1, 10))
0.82955343384897
>>> hl.eval(hl.pT(1, 10, lower_tail=False))
0.17044656615103004
>>> hl.eval(hl.pT(1, 10, log_p=True))
-0.186867754489647

Notes

If lower_tail is true, returns Prob(\(X \leq\) x) where \(X\) is a t-distributed random variable with n degrees of freedom. If lower_tail is false, returns Prob(\(X\) > x).

Parameters:
Returns:

Expression of type tfloat64

hail.expr.functions.pF(x, df1, df2, lower_tail=True, log_p=False)[source]

The cumulative probability function of a F-distribution with parameters df1 and df2.

Examples

>>> hl.eval(hl.pF(0, 3, 10))
0.0
>>> hl.eval(hl.pF(1, 3, 10))
0.5676627969783028
>>> hl.eval(hl.pF(1, 3, 10, lower_tail=False))
0.4323372030216972
>>> hl.eval(hl.pF(1, 3, 10, log_p=True))
-0.566227703842908

Notes

If lower_tail is true, returns Prob(\(X \leq\) x) where \(X\) is a random variable with distribution \(F`(df1, df2). If `lower_tail\) is false, returns Prob(\(X\) > x).

Parameters:
Returns:

Expression of type tfloat64

hail.expr.functions.ppois(x, lamb, lower_tail=True, log_p=False)[source]

The cumulative probability function of a Poisson distribution.

Examples

>>> hl.eval(hl.ppois(2, 1))
0.9196986029286058

Notes

If lower_tail is true, returns Prob(\(X \leq\) x) where \(X\) is a Poisson random variable with rate parameter lamb. If lower_tail is false, returns Prob(\(X\) > x).

Parameters:
Returns:

Expression of type tfloat64

hail.expr.functions.qchisqtail(p, df, ncp=None, lower_tail=False, log_p=False)[source]

The quantile function of a chi-squared distribution with df degrees of freedom, inverts pchisqtail().

Examples

>>> hl.eval(hl.qchisqtail(0.05, 2))
5.991464547107979
>>> hl.eval(hl.qchisqtail(0.05, 2, ncp=2))
10.838131614372958
>>> hl.eval(hl.qchisqtail(0.05, 2, lower_tail=True))
0.10258658877510107
>>> hl.eval(hl.qchisqtail(hl.log(0.05), 2, log_p=True))
5.991464547107979

Notes

Returns right-quantile x for which p = Prob(\(Z^2\) > x) with \(Z^2\) a chi-squared random variable with degrees of freedom specified by df. The probability p must satisfy 0 < p < 1.

Parameters:
Returns:

Expression of type tfloat64

hail.expr.functions.qnorm(p, mu=0, sigma=1, lower_tail=True, log_p=False)[source]

The quantile function of a normal distribution with mean mu and standard deviation sigma, inverts pnorm(). Returns quantile of standard normal distribution by default.

Examples

>>> hl.eval(hl.qnorm(0.90))
1.2815515655446008
>>> hl.eval(hl.qnorm(0.90, mu=1, sigma=2))
3.5631031310892016
>>> hl.eval(hl.qnorm(0.90, lower_tail=False))
-1.2815515655446008
>>> hl.eval(hl.qnorm(hl.log(0.90), log_p=True))
1.2815515655446008

Notes

Returns left-quantile x for which p = Prob(\(Z\) < x) with \(Z\) a normal random variable with mean mu and standard deviation sigma. Defaults to a standard normal random variable, and the probability p must satisfy 0 < p < 1.

Parameters:
Returns:

Expression of type tfloat64

hail.expr.functions.qpois(p, lamb, lower_tail=True, log_p=False)[source]

The quantile function of a Poisson distribution with rate parameter lamb, inverts ppois().

Examples

>>> hl.eval(hl.qpois(0.99, 1))
4

Notes

Returns the smallest integer \(x\) such that Prob(\(X \leq x\)) \(\geq\) p where \(X\) is a Poisson random variable with rate parameter lambda.

Parameters:
Returns:

Expression of type tfloat64