Relatedness¶
The relatedness of two individuals characterizes their biological relationship. For example, two individuals might be siblings or parentandchild. All notions of relatedness implemented in Hail are rooted in the idea of alleles “inherited identically by descent”. Two alleles in two distinct individuals are inherited identically by descent if both alleles were inherited by the same “recent,” common ancestor. The term “recent” distinguishes alleles shared IBD from family members from alleles shared IBD from “distant” ancestors. Distant ancestors are thought of contributing to population structure rather than relatedness.
Relatedness is usually quantified by two quantities: kinship coefficient
(\(\phi\) or PI_HAT
) and probabilityofidentitybydescentzero
(\(\pi_0\) or Z0
). The kinship coefficient is the probability that any
two alleles selected randomly from the same locus are identical by
descent. Twice the kinship coefficient is the coefficient of relationship which
is the percent of genetic material shared identically by descent.
Probabilityofidentitybydescentzero is the probability that none of the
alleles at a randomly chosen locus were inherited identically by descent.
Hail provides three methods for the inference of relatedness: PLINKstyle identity by descent 1, KING 2, and PCRelate 3.
identity_by_descent()
is appropriate for datasets containing one homogeneous population.king()
is appropriate for datasets containing multiple homogeneous populations and no admixture. It is also used to prune close relatives before usingpc_relate()
.pc_relate()
is appropriate for datasets containing multiple homogeneous populations and admixture.

Compute matrix of identitybydescent estimates. 

Compute relatedness estimates between individuals using a KING variant. 

Compute relatedness estimates between individuals using a variant of the PCRelate method. 

Simulate random diploid mating to produce new individuals. 

hail.methods.
identity_by_descent
(dataset, maf=None, bounded=True, min=None, max=None)[source]¶ Compute matrix of identitybydescent estimates.
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Examples
To calculate a full IBD matrix, using minor allele frequencies computed from the dataset itself:
>>> hl.identity_by_descent(dataset)
To calculate an IBD matrix containing only pairs of samples with
PI_HAT
in \([0.2, 0.9]\), using minor allele frequencies stored in the row field panel_maf:>>> hl.identity_by_descent(dataset, maf=dataset['panel_maf'], min=0.2, max=0.9)
Notes
The dataset must have a column field named s which is a
StringExpression
and which uniquely identifies a column.The implementation is based on the IBD algorithm described in the PLINK paper.
identity_by_descent()
requires the dataset to be biallelic and does not perform LD pruning. Linkage disequilibrium may bias the result so consider filtering variants first.The resulting
Table
entries have the type: { i: String, j: String, ibd: { Z0: Double, Z1: Double, Z2: Double, PI_HAT: Double }, ibs0: Long, ibs1: Long, ibs2: Long }. The key list is: *i: String, j: String*.Conceptually, the output is a symmetric, samplebysample matrix. The output table has the following form
i j ibd.Z0 ibd.Z1 ibd.Z2 ibd.PI_HAT ibs0 ibs1 ibs2 sample1 sample2 1.0000 0.0000 0.0000 0.0000 ... sample1 sample3 1.0000 0.0000 0.0000 0.0000 ... sample1 sample4 0.6807 0.0000 0.3193 0.3193 ... sample1 sample5 0.1966 0.0000 0.8034 0.8034 ...
 Parameters
dataset (
MatrixTable
) – Variantkeyed and samplekeyedMatrixTable
containing genotype information.maf (
Float64Expression
, optional) – Rowindexed expression for the minor allele frequency.bounded (
bool
) – Forces the estimations forZ0
,Z1
,Z2
, andPI_HAT
to take on biologically meaningful values (in the range \([0,1]\)).min (
float
orNone
) – Sample pairs with aPI_HAT
below this value will not be included in the output. Must be in \([0,1]\).max (
float
orNone
) – Sample pairs with aPI_HAT
above this value will not be included in the output. Must be in \([0,1]\).
 Returns

hail.methods.
king
(call_expr, *, block_size=None)[source]¶ Compute relatedness estimates between individuals using a KING variant.
Note
Requires the dataset to contain only diploid genotype calls.
Examples
Estimate the kinship coefficient for every pair of samples.
>>> kinship = hl.king(dataset.GT)
Notes
The following presentation summarizes the methods section of Manichaikul, et. al., but adopts a more consistent notation for matrices.
Let
\(i\) and \(j\) be two individuals in the dataset
\(N^{Aa}_{i}\) be the number of heterozygote genotypes for individual \(i\).
\(N^{Aa,Aa}_{i,j}\) be the number of variants at which a pair of individuals both have heterozygote genotypes.
\(N^{AA,aa}_{i,j}\) be the number of variants at which a pair of individuals have opposing homozygote genotypes.
\(S_{i,j}\) be the set of singlenucleotide variants for which both individuals \(i\) and \(j\) have a nonmissing genotype.
\(X_{i,s}\) be the genotype score matrix. Each entry corresponds to the genotype of individual \(i\) at variant \(s\). Homozygousreference genotypes are represented as 0, heterozygous genotypes are represented as 1, and homozygousalternate genotypes are represented as 2. \(X_{i,s}\) is calculated by invoking
n_alt_alleles()
on the call_expr.
The three counts above, \(N^{Aa}\), \(N^{Aa,Aa}\), and \(N^{AA,aa}\), exclude variants where one or both individuals have missing genotypes.
In terms of the symbols above, we can define \(d\), the genetic distance between two samples. We can interpret \(d\) as an unnormalized measurement of the genetic material not shared identicallybydescent:
\[d_{i,j} = \sum_{s \in S_{i,j}}\left(X_{i,s}  X_{j,s}\right)^2\]In the supplement to Manichaikul, et. al, the authors show how to reexpress the genetic distance above in terms of the three counts of hetero and homozygosity by considering the nine possible configurations of a pair of genotypes:
\((X_{i,s}  X_{j,s})^2\)
homref
het
homalt
homref
0
1
4
het
1
0
1
homalt
4
1
0
which leads to this expression for genetic distance:
\[d_{i,j} = 4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j}  2 N^{Aa,Aa}_{i,j}\]The first term, \(4 N^{AA,aa}_{i,j}\), accounts for all pairs of genotypes with opposing homozygous genotypes. The second and third terms account for the four cases of one heteroyzgous genotype and one nonheterozygous genotype. Unfortunately, the second and third term also contribute to the case of a pair of heteroyzgous genotypes. We offset this with the fourth and final term.
The genetic distance, \(d_{i,j}\), ranges between zero and four times the number of variants in the dataset. In the supplement to Manichaikul, et. al, the authors demonstrate that the kinship coefficient, \(\phi_{i,j}\), between two individuals from the same population is related to the expected genetic distance at any one variant by way of the allele frequency:
\[\mathop{\mathbb{E}}_{i,j} (X_{i,s}  X_{j,s})^2 = 4 p_s (1  p_s) (1  2\phi_{i,j})\]This identity reveals that the quotient of the expected genetic distance and the fourtrial binomial variance in the allele frequency represents, roughly, the “fraction of genetic material not shared identicallybydescent”:
\[1  2 \phi_{i,j} = \frac{4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j}  2 N^{Aa,Aa}_{i,j}} {\sum_{s \in S_{i,j}} 4 p_s (1  p_s)}\]Note that the “coefficient of relationship”, (by definition: the fraction of genetic material shared identicallybydescent) is equal to twice the kinship coefficient: \(\phi_{i,j} = 2 r_{i,j}\).
Manichaikul, et. al, assuming one homogeneous population, demonstrate in Section 2.3 that the sum of the variance of the allele frequencies,
\[\sum_{s \in S_{i, j}} 2 p_s (1  p_s)\]is, in expectation, proportional to the count of heterozygous genotypes of either individual:
\[N^{Aa}_{i}\]For individuals from distinct populations, the authors propose replacing the count of heteroyzgous genotypes with the average of the two individuals:
\[\frac{N^{Aa}_{i} + N^{Aa}_{j}}{2}\]Using the aforementioned equality, we define a normalized genetic distance, \(\widetilde{d_{i,j}}\), for a pair of individuals from distinct populations:
\[\begin{aligned} \widetilde{d_{i,j}} &= \frac{4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j}  2 N^{Aa,Aa}_{i,j}} {N^{Aa}_{i} + N^{Aa}_{j}} \\ &= 1 + \frac{4 N^{AA,aa}_{i,j}  2 N^{Aa,Aa}_{i,j}} {N^{Aa}_{i} + N^{Aa}_{j}} \end{aligned}\]As mentioned before, the complement of the normalized genetic distance is the coefficient of relationship which is also equal to twice the kinship coefficient:
\[2 \phi_{i,j} = r_{i,j} = 1  \widetilde{d_{i,j}}\]We now present the KING “withinfamily” estimator of the kinship coefficient as onehalf of the coefficient of relationship:
\[\begin{aligned} \widehat{\phi_{i,j}^{\mathrm{within}}} &= \frac{1}{2} r_{i,j} \\ &= \frac{1}{2} \left( 1  \widetilde{d_{i,j}} \right) \\ &= \frac{N^{Aa,Aa}_{i,j}  2 N^{AA,aa}_{i,j}} {N^{Aa}_{i} + N^{Aa}_{j}} \end{aligned}\]This “withinfamily” estimator overestimates the kinship coefficient under certain circumstances detailed in Section 2.3 of Manichaikul, et. al. The authors recommend an alternative estimator when individuals are known to be from different families. The estimator replaces the average count of heteroyzgous genotypes with the minimum count of heterozygous genotypes:
\[\frac{N^{Aa}_{i} + N^{Aa}_{j}}{2} \rightsquigarrow \mathrm{min}(N^{Aa}_{i}, N^{Aa}_{j})\]This transforms the “withinfamily” estimator into the “betweenfamily” estimator, defined by Equation 11 of Manichaikul, et. al.:
\[\begin{aligned} \widetilde{d_{i,j}^{\mathrm{between}}} &= \frac{4 N^{AA,aa}_{i,j} + N^{Aa}_{i} + N^{Aa}_{j}  2 N^{Aa,Aa}_{i,j}} {2 \mathrm{min}(N^{Aa}_{i}, N^{Aa}_{j})} \\ \widehat{\phi_{i,j}^{\mathrm{between}}} &= \frac{1}{2} + \frac{2 N^{Aa,Aa}_{i,j}  4 N^{AA,aa}_{i,j}  N^{Aa}_{i}  N^{Aa}_{j}} {4 \cdot \mathrm{min}(N^{Aa}_{i}, N^{Aa}_{j})} \end{aligned}\]This function,
king()
, only implements the “betweenfamily” estimator, \(\widehat{\phi_{i,j}^{\mathrm{between}}}\). Parameters
call_expr (
CallExpression
) – Entryindexed call expression.block_size (
int
, optional) – Block size of block matrices used in the algorithm. Default given byBlockMatrix.default_block_size()
.
 Returns
MatrixTable
– AMatrixTable
whose rows and columns are keys are taken from callexpr’s column keys. It has one entry field, phi.

hail.methods.
pc_relate
(call_expr, min_individual_maf, *, k=None, scores_expr=None, min_kinship=None, statistics='all', block_size=None, include_self_kinship=False)[source]¶ Compute relatedness estimates between individuals using a variant of the PCRelate method.
Note
Requires the dataset to contain only diploid genotype calls.
Examples
Estimate kinship, identitybydescent two, identitybydescent one, and identitybydescent zero for every pair of samples, using a minimum minor allele frequency filter of 0.01 and 10 principal components to control for population structure.
>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10)
Only compute the kinship statistic. This is more efficient than computing all statistics.
>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10, statistics='kin')
Compute all statistics, excluding samplepairs with kinship less than 0.1. This is more efficient than producing the full table and then filtering using
Table.filter()
.>>> rel = hl.pc_relate(dataset.GT, 0.01, k=10, min_kinship=0.1)
One can also pass in precomputed principal component scores. To produce the same results as in the previous example:
>>> _, scores_table, _ = hl.hwe_normalized_pca(dataset.GT, ... k=10, ... compute_loadings=False) >>> rel = hl.pc_relate(dataset.GT, ... 0.01, ... scores_expr=scores_table[dataset.col_key].scores, ... min_kinship=0.1)
Notes
The traditional estimator for kinship between a pair of individuals \(i\) and \(j\), sharing the set \(S_{ij}\) of singlenucleotide variants, from a population with estimated allele frequencies \(\widehat{p}_{s}\) at SNP \(s\), is given by:
\[\widehat{\psi}_{ij} \coloneqq \frac{1}{\left\mathcal{S}_{ij}\right} \sum_{s \in \mathcal{S}_{ij}} \frac{\left(g_{is}  2\hat{p}_{s}\right)\left(g_{js}  2\widehat{p}_{s}\right)} {4 \widehat{p}_{s}\left(1\widehat{p}_{s}\right)}\]This estimator is true under the model that the sharing of common (relative to the population) alleles is not very informative to relatedness (because they’re common) and the sharing of rare alleles suggests a recent common ancestor from which the allele was inherited by descent.
When multiple ancestry groups are mixed in a sample, this model breaks down. Alleles that are rare in all but one ancestry group are treated as very informative to relatedness. However, these alleles are simply markers of the ancestry group. The PCRelate method corrects for this situation and the related situation of admixed individuals.
PCRelate slightly modifies the usual estimator for relatedness: occurrences of population allele frequency are replaced with an “individualspecific allele frequency”. This modification allows the method to correctly weight an allele according to an individual’s unique ancestry profile.
The “individualspecific allele frequency” at a given genetic locus is modeled by PCRelate as a linear function of a sample’s first
k
principal component coordinates. As such, the efficacy of this method rests on two assumptions:an individual’s first
k
principal component coordinates fully describe their allelefrequencyrelevant ancestry, andthe relationship between ancestry (as described by principal component coordinates) and population allele frequency is linear
The estimators for kinship, and identitybydescent zero, one, and two follow. Let:
\(S_{ij}\) be the set of genetic loci at which both individuals \(i\) and \(j\) have a defined genotype
\(g_{is} \in {0, 1, 2}\) be the number of alternate alleles that individual \(i\) has at genetic locus \(s\)
\(\widehat{\mu_{is}} \in [0, 1]\) be the individualspecific allele frequency for individual \(i\) at genetic locus \(s\)
\({\widehat{\sigma^2_{is}}} \coloneqq \widehat{\mu_{is}} (1  \widehat{\mu_{is}})\), the binomial variance of \(\widehat{\mu_{is}}\)
\(\widehat{\sigma_{is}} \coloneqq \sqrt{\widehat{\sigma^2_{is}}}\), the binomial standard deviation of \(\widehat{\mu_{is}}\)
\(\text{IBS}^{(0)}_{ij} \coloneqq \sum_{s \in S_{ij}} \mathbb{1}_{g_{is}  g_{js} = 2}\), the number of genetic loci at which individuals \(i\) and \(j\) share no alleles
\(\widehat{f_i} \coloneqq 2 \widehat{\phi_{ii}}  1\), the inbreeding coefficient for individual \(i\)
\(g^D_{is}\) be a dominance encoding of the genotype matrix, and \(X_{is}\) be a normalized dominancecoded genotype matrix
\[g^D_{is} \coloneqq \begin{cases} \widehat{\mu_{is}} & g_{is} = 0 \\ 0 & g_{is} = 1 \\ 1  \widehat{\mu_{is}} & g_{is} = 2 \end{cases} \qquad X_{is} \coloneqq g^D_{is}  \widehat{\sigma^2_{is}} (1 + \widehat{f_i})\]The estimator for kinship is given by:
\[\widehat{\phi_{ij}} \coloneqq \frac{\sum_{s \in S_{ij}}(g  2 \mu)_{is} (g  2 \mu)_{js}} {4 * \sum_{s \in S_{ij}} \widehat{\sigma_{is}} \widehat{\sigma_{js}}}\]The estimator for identitybydescent two is given by:
\[\widehat{k^{(2)}_{ij}} \coloneqq \frac{\sum_{s \in S_{ij}}X_{is} X_{js}}{\sum_{s \in S_{ij}} \widehat{\sigma^2_{is}} \widehat{\sigma^2_{js}}}\]The estimator for identitybydescent zero is given by:
\[\widehat{k^{(0)}_{ij}} \coloneqq \begin{cases} \frac{\text{IBS}^{(0)}_{ij}} {\sum_{s \in S_{ij}} \widehat{\mu_{is}}^2(1  \widehat{\mu_{js}})^2 + (1  \widehat{\mu_{is}})^2\widehat{\mu_{js}}^2} & \widehat{\phi_{ij}} > 2^{5/2} \\ 1  4 \widehat{\phi_{ij}} + k^{(2)}_{ij} & \widehat{\phi_{ij}} \le 2^{5/2} \end{cases}\]The estimator for identitybydescent one is given by:
\[\widehat{k^{(1)}_{ij}} \coloneqq 1  \widehat{k^{(2)}_{ij}}  \widehat{k^{(0)}_{ij}}\]Note that, even if present, phase information is ignored by this method.
The PCRelate method is described in “Modelfree Estimation of Recent Genetic Relatedness”. Conomos MP, Reiner AP, Weir BS, Thornton TA. in American Journal of Human Genetics. 2016 Jan 7. The reference implementation is available in the GENESIS Bioconductor package .
pc_relate()
differs from the reference implementation in a few ways:if
k
is supplied, samples scores are computed via PCA on all samples, not a specified subset of genetically unrelated samples. The latter can be achieved by filtering samples, computing PCA variant loadings, and using these loadings to compute and pass in scores for all samples.the estimators do not perform small sample correction
the algorithm does not provide an option to use populationwide allele frequency estimates
the algorithm does not provide an option to not use “overall standardization” (see R
pcrelate
documentation)
Under the PCRelate model, kinship, \(\phi_{ij}\), ranges from 0 to 0.5, and is precisely half of the fractionofgeneticmaterialshared. Listed below are the statistics for a few pairings:
Monozygotic twins share all their genetic material so their kinship statistic is 0.5 in expection.
Parentchild and sibling pairs both have kinship 0.25 in expectation and are separated by the identitybydescentzero, \(k^{(2)}_{ij}\), statistic which is zero for parentchild pairs and 0.25 for sibling pairs.
Avuncular pairs and grandparent/child pairs both have kinship 0.125 in expectation and both have identitybydescentzero 0.5 in expectation
“Third degree relatives” are those pairs sharing \(2^{3} = 12.5 %\) of their genetic material, the results of PCRelate are often too noisy to reliably distinguish these pairs from higherdegreerelativepairs or unrelated pairs.
Note that \(g_{is}\) is the number of alternate alleles. Hence, for multiallelic variants, a value of 2 may indicate two distinct alternative alleles rather than a homozygous variant genotype. To enforce the latter, either filter or split multiallelic variants first.
The resulting table has the first 3, 4, 5, or 6 fields below, depending on the statistics parameter:
i (
col_key.dtype
) – First sample. (key field)j (
col_key.dtype
) – Second sample. (key field)kin (
tfloat64
) – Kinship estimate, \(\widehat{\phi_{ij}}\).ibd2 (
tfloat64
) – IBD2 estimate, \(\widehat{k^{(2)}_{ij}}\).ibd0 (
tfloat64
) – IBD0 estimate, \(\widehat{k^{(0)}_{ij}}\).ibd1 (
tfloat64
) – IBD1 estimate, \(\widehat{k^{(1)}_{ij}}\).
Here
col_key
refers to the column key of the source matrix table, andcol_key.dtype
is a struct containing the column key fields.There is one row for each pair of distinct samples (columns), where i corresponds to the column of smaller column index. In particular, if the same column key value exists for \(n\) columns, then the resulting table will have \(\binom{n1}{2}\) rows with both key fields equal to that column key value. This may result in unexpected behavior in downstream processing.
 Parameters
call_expr (
CallExpression
) – Entryindexed call expression.min_individual_maf (
float
) – The minimum individualspecific minor allele frequency. If either individualspecific minor allele frequency for a pair of individuals is below this threshold, then the variant will not be used to estimate relatedness for the pair.k (
int
, optional) – If set, k principal component scores are computed and used. Exactly one of k and scores_expr must be specified.scores_expr (
ArrayNumericExpression
, optional) – Columnindexed expression of principal component scores, with the same source as call_expr. All array values must have the same positive length, corresponding to the number of principal components, and all scores must be nonmissing. Exactly one of k and scores_expr must be specified.min_kinship (
float
, optional) – If set, pairs of samples with kinship lower than min_kinship are excluded from the results.statistics (
str
) – Set of statistics to compute. If'kin'
, only estimate the kinship statistic. If'kin2'
, estimate the above and IBD2. If'kin20'
, estimate the above and IBD0. If'all'
, estimate the above and IBD1.block_size (
int
, optional) – Block size of block matrices used in the algorithm. Default given byBlockMatrix.default_block_size()
.include_self_kinship (
bool
) – IfTrue
, include entries for an individual’s estimated kinship with themselves. Defaults toFalse
.
 Returns
Table
– ATable
mapping pairs of samples to their pairwise statistics.

hail.methods.
simulate_random_mating
(mt, n_rounds=1, generation_size_multiplier=1.0, keep_founders=True)[source]¶ Simulate random diploid mating to produce new individuals.
 Parameters
mt
n_rounds (
int
) – Number of rounds of mating.generation_size_multiplier (
float
) – Ratio of number of offspring to current population for each round of mating.keep_founders :obj:`bool` – If true, keep all founders and intermediate generations in the final sample list. If false, keep only offspring in the last generation.
 Returns
 1
Purcell, Shaun et al. “PLINK: a tool set for wholegenome association and populationbased linkage analyses.” American journal of human genetics vol. 81,3 (2007): 55975. doi:10.1086/519795. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950838/
 2
Manichaikul, Ani et al. “Robust relationship inference in genomewide association studies.” Bioinformatics (Oxford, England) vol. 26,22 (2010): 286773. doi:10.1093/bioinformatics/btq559. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025716/
 3
Conomos, Matthew P et al. “Modelfree Estimation of Recent Genetic Relatedness.” American journal of human genetics vol. 98,1 (2016): 12748. doi:10.1016/j.ajhg.2015.11.022. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/