Source code for hail.methods.relatedness.king

import hail as hl
from hail.expr.expressions import expr_call, matrix_table_source
from hail.typecheck import nullable, typecheck
from hail.utils import deduplicate
from hail.utils.java import Env


[docs]@typecheck(call_expr=expr_call, block_size=nullable(int)) def king(call_expr, *, block_size=None): r"""Compute relatedness estimates between individuals using a KING variant. .. include:: ../_templates/req_diploid_gt.rst Examples -------- Estimate the kinship coefficient for every pair of samples. >>> kinship = hl.king(dataset.GT) Notes ----- The following presentation summarizes the methods section of `Manichaikul, et. al. <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025716/>`__, but adopts a more consistent notation for matrices. Let - :math:`i` and :math:`j` be two individuals in the dataset - :math:`N^{Aa}_{i}` be the number of heterozygote genotypes for individual :math:`i`. - :math:`N^{Aa,Aa}_{i,j}` be the number of variants at which a pair of individuals both have heterozygote genotypes. - :math:`N^{AA,aa}_{i,j}` be the number of variants at which a pair of individuals have opposing homozygote genotypes. - :math:`S_{i,j}` be the set of single-nucleotide variants for which both individuals :math:`i` and :math:`j` have a non-missing genotype. - :math:`X_{i,s}` be the genotype score matrix. Each entry corresponds to the genotype of individual :math:`i` at variant :math:`s`. Homozygous-reference genotypes are represented as 0, heterozygous genotypes are represented as 1, and homozygous-alternate genotypes are represented as 2. :math:`X_{i,s}` is calculated by invoking :meth:`~.CallExpression.n_alt_alleles` on the `call_expr`. The three counts above, :math:`N^{Aa}`, :math:`N^{Aa,Aa}`, and :math:`N^{AA,aa}`, exclude variants where one or both individuals have missing genotypes. In terms of the symbols above, we can define :math:`d`, the genetic distance between two samples. We can interpret :math:`d` as an unnormalized measurement of the genetic material not shared identically-by-descent: .. math:: d_{i,j} = \sum_{s \in S_{i,j}}\left(X_{i,s} - X_{j,s}\right)^2 In the supplement to Manichaikul, et. al, the authors show how to re-express the genetic distance above in terms of the three counts of hetero- and homozygosity by considering the nine possible configurations of a pair of genotypes: +-------------------------------+----------+----------+----------+ |:math:`(X_{i,s} - X_{j,s})^2` |homref |het |homalt | +-------------------------------+----------+----------+----------+ |homref |0 |1 |4 | +-------------------------------+----------+----------+----------+ |het |1 |0 |1 | +-------------------------------+----------+----------+----------+ |homalt |4 |1 |0 | +-------------------------------+----------+----------+----------+ which leads to this expression for genetic distance: .. math:: d_{i,j} = 4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j} - 2 N^{Aa,Aa}_{i,j} The first term, :math:`4 N^{AA,aa}_{i,j}`, accounts for all pairs of genotypes with opposing homozygous genotypes. The second and third terms account for the four cases of one heteroyzgous genotype and one non-heterozygous genotype. Unfortunately, the second and third term also contribute to the case of a pair of heteroyzgous genotypes. We offset this with the fourth and final term. The genetic distance, :math:`d_{i,j}`, ranges between zero and four times the number of variants in the dataset. In the supplement to Manichaikul, et. al, the authors demonstrate that the kinship coefficient, :math:`\phi_{i,j}`, between two individuals from the same population is related to the expected genetic distance at any *one* variant by way of the allele frequency: .. math:: \mathop{\mathbb{E}}_{i,j} (X_{i,s} - X_{j,s})^2 = 4 p_s (1 - p_s) (1 - 2\phi_{i,j}) This identity reveals that the quotient of the expected genetic distance and the four-trial binomial variance in the allele frequency represents, roughly, the "fraction of genetic material *not* shared identically-by-descent": .. math:: 1 - 2 \phi_{i,j} = \frac{4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j} - 2 N^{Aa,Aa}_{i,j}} {\sum_{s \in S_{i,j}} 4 p_s (1 - p_s)} Note that the "coefficient of relationship", (by definition: the fraction of genetic material shared identically-by-descent) is equal to twice the kinship coefficient: :math:`\phi_{i,j} = 2 r_{i,j}`. Manichaikul, et. al, assuming one homogeneous population, demonstrate in Section 2.3 that the sum of the variance of the allele frequencies, .. math:: \sum_{s \in S_{i, j}} 2 p_s (1 - p_s) is, in expectation, proportional to the count of heterozygous genotypes of either individual: .. math:: N^{Aa}_{i} For individuals from distinct populations, the authors propose replacing the count of heteroyzgous genotypes with the average of the two individuals: .. math:: \frac{N^{Aa}_{i} + N^{Aa}_{j}}{2} Using the aforementioned equality, we define a normalized genetic distance, :math:`\widetilde{d_{i,j}}`, for a pair of individuals from distinct populations: .. math:: \begin{aligned} \widetilde{d_{i,j}} &= \frac{4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j} - 2 N^{Aa,Aa}_{i,j}} {N^{Aa}_{i} + N^{Aa}_{j}} \\ &= 1 + \frac{4 N^{AA,aa}_{i,j} - 2 N^{Aa,Aa}_{i,j}} {N^{Aa}_{i} + N^{Aa}_{j}} \end{aligned} As mentioned before, the complement of the normalized genetic distance is the coefficient of relationship which is also equal to twice the kinship coefficient: .. math:: 2 \phi_{i,j} = r_{i,j} = 1 - \widetilde{d_{i,j}} We now present the KING "within-family" estimator of the kinship coefficient as one-half of the coefficient of relationship: .. math:: \begin{aligned} \widehat{\phi_{i,j}^{\mathrm{within}}} &= \frac{1}{2} r_{i,j} \\ &= \frac{1}{2} \left( 1 - \widetilde{d_{i,j}} \right) \\ &= \frac{N^{Aa,Aa}_{i,j} - 2 N^{AA,aa}_{i,j}} {N^{Aa}_{i} + N^{Aa}_{j}} \end{aligned} This "within-family" estimator over-estimates the kinship coefficient under certain circumstances detailed in Section 2.3 of Manichaikul, et. al. The authors recommend an alternative estimator when individuals are known to be from different families. The estimator replaces the average count of heteroyzgous genotypes with the minimum count of heterozygous genotypes: .. math:: \frac{N^{Aa}_{i} + N^{Aa}_{j}}{2} \rightsquigarrow \mathrm{min}(N^{Aa}_{i}, N^{Aa}_{j}) This transforms the "within-family" estimator into the "between-family" estimator, defined by Equation 11 of Manichaikul, et. al.: .. math:: \begin{aligned} \widetilde{d_{i,j}^{\mathrm{between}}} &= \frac{4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j} - 2 N^{Aa,Aa}_{i,j}} {2 \mathrm{min}(N^{Aa}_{i}, N^{Aa}_{j})} \\ \widehat{\phi_{i,j}^{\mathrm{between}}} &= \frac{1}{2} + \frac{2 N^{Aa,Aa}_{i,j} - 4 N^{AA,aa}_{i,j} - N^{Aa}_{i} - N^{Aa}_{j}} {4 \cdot \mathrm{min}(N^{Aa}_{i}, N^{Aa}_{j})} \end{aligned} This function, :func:`.king`, only implements the "between-family" estimator, :math:`\widehat{\phi_{i,j}^{\mathrm{between}}}`. Parameters ---------- call_expr : :class:`.CallExpression` Entry-indexed call expression. block_size : :obj:`int`, optional Block size of block matrices used in the algorithm. Default given by :meth:`.BlockMatrix.default_block_size`. Returns ------- :class:`.MatrixTable` A :class:`.MatrixTable` whose rows and columns are keys are taken from `call-expr`'s column keys. It has one entry field, `phi`. """ mt = matrix_table_source('king/call_expr', call_expr) call = Env.get_uid() mt = mt.annotate_entries(**{call: call_expr}) is_hom_ref = Env.get_uid() is_het = Env.get_uid() is_hom_var = Env.get_uid() is_defined = Env.get_uid() mt = mt.unfilter_entries() mt = mt.select_entries(**{ is_hom_ref: hl.float(hl.or_else(mt[call].is_hom_ref(), 0)), is_het: hl.float(hl.or_else(mt[call].is_het(), 0)), is_hom_var: hl.float(hl.or_else(mt[call].is_hom_var(), 0)), is_defined: hl.float(hl.is_defined(mt[call])), }) ref = hl.linalg.BlockMatrix.from_entry_expr(mt[is_hom_ref], block_size=block_size) het = hl.linalg.BlockMatrix.from_entry_expr(mt[is_het], block_size=block_size) var = hl.linalg.BlockMatrix.from_entry_expr(mt[is_hom_var], block_size=block_size) defined = hl.linalg.BlockMatrix.from_entry_expr(mt[is_defined], block_size=block_size) ref_var = (ref.T @ var).checkpoint(hl.utils.new_temp_file()) # We need the count of times the pair is AA,aa and aa,AA. ref_var is only # AA,aa. Transposing ref_var gives var_ref, i.e. aa,AA. # # n.b. (REF.T @ VAR).T == (VAR.T @ REF) by laws of matrix multiply N_AA_aa = ref_var + ref_var.T N_Aa_Aa = (het.T @ het).checkpoint(hl.utils.new_temp_file()) # We count the times the row individual has a heterozygous genotype and the # column individual has any defined genotype at all. N_Aa_defined = (het.T @ defined).checkpoint(hl.utils.new_temp_file()) het_hom_balance = N_Aa_Aa - (2 * N_AA_aa) het_hom_balance = het_hom_balance.to_matrix_table_row_major() n_hets_for_rows = N_Aa_defined.to_matrix_table_row_major() n_hets_for_cols = N_Aa_defined.T.to_matrix_table_row_major() kinship_between = het_hom_balance.rename({'element': 'het_hom_balance'}) kinship_between = kinship_between.annotate_entries( n_hets_row=n_hets_for_rows[kinship_between.row_key, kinship_between.col_key].element, n_hets_col=n_hets_for_cols[kinship_between.row_key, kinship_between.col_key].element, ) col_index_field = Env.get_uid() col_key = mt.col_key cols = mt.add_col_index(col_index_field).key_cols_by(col_index_field).cols() kinship_between = kinship_between.key_cols_by(**cols[kinship_between.col_idx].select(*col_key)) renaming, _ = deduplicate(list(col_key), already_used=set(col_key)) assert len(renaming) == len(col_key) kinship_between = kinship_between.key_rows_by( **cols[kinship_between.row_idx].select(*col_key).rename(dict(renaming)) ) kinship_between = kinship_between.annotate_entries( min_n_hets=hl.min(kinship_between.n_hets_row, kinship_between.n_hets_col) ) return ( kinship_between.select_entries( phi=(0.5) + ( (2 * kinship_between.het_hom_balance + -kinship_between.n_hets_row - kinship_between.n_hets_col) / (4 * kinship_between.min_n_hets) ) ) .select_rows() .select_cols() .select_globals() )