simfam: simulate and model family pedigrees with structured founders

Alejandro Ochoa

2023-01-09

Introduction

The simfam package is for constructing large random families—with population studies in mind—with realistic constraints including avoidance of close relative pairings but otherwise biased for closest pairs in a 1-dimensional geography. There is also code to draw genotypes across the pedigree starting from genotypes for the founders. Our model allows for the founders to be related or structured—which arises in practice when there are different ancestries among founders—and given known parameters for these founders (including known kinship matrices and ancestry proportions) we provide code to calculate the true kinship and expected admixture proportions of all descendants.

This vignette focuses on two examples. One is small, necessary to actually visualize the pedigree. The second is larger, where the pedigree cannot be fully visualized but whose population parameters are more realistic. This vignette draws from other general packages to visualize the data (popkin and kinship2), as well as to construct/simulate structured founders (bnpsd, which is based on the admixture model).

Small example

Construct random pedigree

Let’s start with a 3-generation pedigree with 15 individuals per generation. (The code can also handle variable numbers of individuals per generation, which is not demonstrated in this vignette for simplicity).

library(simfam)

# data dimensions
n <- 15 # number of individuals per generation
G <- 3 # number of generations

# the result changes in every run unless we set the seed
set.seed(1)

# draw the random pedigree with those properties
data <- sim_pedigree( n, G )
# save these objects separately
fam <- data$fam
ids <- data$ids
kinship_local_G <- data$kinship_local

The function returns a list with two objects. The first element in the list is the most important: fam is a table that describes the pedigree in a standard notation in genetics research: the plink FAM format. Every row corresponds to one individual, and columns name individuals (id), their parents (pat and mat, founders have parents set to NA), and their sex (1=male, 2=female). The FAM table format also requires a family ID (fam) and a phenotype (pheno) column, both of which take on dummy values in the output of sim_pedigree:

# the top of the table
head( fam )
#> # A tibble: 6 × 6
#>   fam   id    pat   mat     sex pheno
#>   <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 fam1  1-1   <NA>  <NA>      1     0
#> 2 fam1  1-2   <NA>  <NA>      2     0
#> 3 fam1  1-3   <NA>  <NA>      1     0
#> 4 fam1  1-4   <NA>  <NA>      1     0
#> 5 fam1  1-5   <NA>  <NA>      2     0
#> 6 fam1  1-6   <NA>  <NA>      1     0

# and the bottom
tail( fam )
#> # A tibble: 6 × 6
#>   fam   id    pat   mat     sex pheno
#>   <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 fam1  3-10  2-7   2-10      2     0
#> 2 fam1  3-11  2-7   2-10      1     0
#> 3 fam1  3-12  2-7   2-10      2     0
#> 4 fam1  3-13  2-14  2-11      2     0
#> 5 fam1  3-14  2-14  2-11      2     0
#> 6 fam1  3-15  2-14  2-11      2     0

The IDs (columns id, pat, mat) are formatted as g-i, where the first number is the generation and the second number is an index within the generation (1 to n). Every individual belongs to a generation in the simulation (founders are the first generation), and only individuals in the same generation may be paired. Sex is drawn at random for each individual prior to pairing. Every individual has a unique index within the generation, that places them in a 1D geography—an important detail since pairings are strongly biased to be between closest individuals. Founders are ordered as they are constructed, and children are ordered according to the mean index of their parents. Pairings occur randomly and iteratively: while there are available male-female pairs, a male is drawn at random from the available males and he is paired with the closest unpaired female that is not a close relative (by default they must be less related than second cousins, i.e. have a pedigree kinship coefficient of less than 1/4^3).

The second element in the return list of sim_pedigree is ids, the list of IDs separated by generation. This is very handy as often we want to copy the names of the founders (ids[[ 1 ]]) to other objects, or want to filter data to contain the last generation only (ids[[ G ]]).

The third element in the return list of sim_pedigree is the local (i.e., pedigree-based) kinship matrix of the individuals. This data can be recalculated from the fam table alone, so it is redundant, but it is calculated because the pairing algorithm requires it to avoid close relatives, so it is returned mostly because it’s there already. By default, only individuals in the last generation are included in the matrix that is returned, so this is a n * n (here 15 x 15) matrix:

kinship_local_G
#>         3-1    3-2    3-3    3-4    3-5    3-6    3-7    3-8    3-9   3-10
#> 3-1  0.5000 0.2500 0.2500 0.1250 0.1250 0.1250 0.0625 0.0625 0.0625 0.0625
#> 3-2  0.2500 0.5000 0.2500 0.1250 0.1250 0.1250 0.0625 0.0625 0.0625 0.0625
#> 3-3  0.2500 0.2500 0.5000 0.1250 0.1250 0.1250 0.0625 0.0625 0.0625 0.0625
#> 3-4  0.1250 0.1250 0.1250 0.5000 0.2500 0.2500 0.0625 0.0625 0.0625 0.0625
#> 3-5  0.1250 0.1250 0.1250 0.2500 0.5000 0.2500 0.0625 0.0625 0.0625 0.0625
#> 3-6  0.1250 0.1250 0.1250 0.2500 0.2500 0.5000 0.0625 0.0625 0.0625 0.0625
#> 3-7  0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.5000 0.2500 0.1250 0.1250
#> 3-8  0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.2500 0.5000 0.1250 0.1250
#> 3-9  0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.5000 0.2500
#> 3-10 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.2500 0.5000
#> 3-11 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.2500 0.2500
#> 3-12 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.2500 0.2500
#> 3-13 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0625 0.0625 0.0625
#> 3-14 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0625 0.0625 0.0625
#> 3-15 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0625 0.0625 0.0625
#>        3-11   3-12   3-13   3-14   3-15
#> 3-1  0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-2  0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-3  0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-4  0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-5  0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-6  0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-7  0.1250 0.1250 0.0625 0.0625 0.0625
#> 3-8  0.1250 0.1250 0.0625 0.0625 0.0625
#> 3-9  0.2500 0.2500 0.0625 0.0625 0.0625
#> 3-10 0.2500 0.2500 0.0625 0.0625 0.0625
#> 3-11 0.5000 0.2500 0.0625 0.0625 0.0625
#> 3-12 0.2500 0.5000 0.0625 0.0625 0.0625
#> 3-13 0.0625 0.0625 0.5000 0.2500 0.2500
#> 3-14 0.0625 0.0625 0.2500 0.5000 0.2500
#> 3-15 0.0625 0.0625 0.2500 0.2500 0.5000

This kinship matrix is later visualized more conveniently as a heatmap using the popkin package. If you’re interested, a list of local kinship matrices (one per generation) is returned in place of this if sim_pedigree is called with the full = TRUE option.

Visualizing the pedigree

We use the excellent kinship2 package to visualize our randomly-constructed pedigree:

# load the package
library(kinship2)
#> Loading required package: Matrix
#> Loading required package: quadprog

# Convert the `fam` table into a `pedigree` object required by the `kinship2` package.
obj <- pedigree(
    id = fam$id,
    dadid = fam$pat,
    momid = fam$mat,
    sex = fam$sex,
    affected = fam$pheno
)

# plot the pedigree
plot( obj, cex = 0.7 )

#> Did not plot the following people: 1-3 1-4 1-6 1-7 1-11 1-13 1-15

In these kinds of pedigree plots, node shapes represent sex (box=male, circle=female), and the dashed curved lines (if any) represent individuals copied in two places (once next to its parents, the second time next to its mate) to simplify the plot (the two nodes connected by a dashed line have the same individual ID). It is immediate clear that only individuals with opposite sex were paired, and no close relatives were paired (i.e. no siblings or first or second cousins).

You will notice several things from this random pedigree. One is that some individuals do not get paired in each generation, which happens because there are too many constraints: the number of males and females is random so they may be unequal, and close-relative avoidance further limits pairings. However, this problem is exacerbated in this toy example because the population is small; in a large population most people are paired (see further below). Here, founders that weren’t paired are not plotted (see message below figure). Pairings are always between people of the same generation (first number in ID), and usually but not always between people with close indexes (second number in ID).

When there are pairings, sim_pedigree enforces a minimum number of children per family (default 1). Each generation has exactly the desired number of total individuals (15 per generation in this example), which is achieved by randomly drawing family sizes from a modified Poisson distribution (the number of children in excess of the minimum is drawn from Poisson with target mean per family minus minimum size) followed by smaller random adjustments in the direction of the discrepancy until the target population size is met.

Pruning individuals without descendants

If we are only interested in the last generation, then individuals from previous generations that were unpaired are irrelevant, since for example they don’t determine the genotypes of the last generation. So for simplicity or efficiency, we may want to remove those individuals without descendants. We do this with the function prune_fam!

# replace original with pruned FAM table
# containing only ancestors of last generation
fam <- prune_fam( fam, ids[[ G ]] )
# inspect
fam
#> # A tibble: 33 × 6
#>    fam   id    pat   mat     sex pheno
#>    <chr> <chr> <chr> <chr> <int> <dbl>
#>  1 fam1  1-1   <NA>  <NA>      1     0
#>  2 fam1  1-2   <NA>  <NA>      2     0
#>  3 fam1  1-5   <NA>  <NA>      2     0
#>  4 fam1  1-8   <NA>  <NA>      1     0
#>  5 fam1  1-9   <NA>  <NA>      2     0
#>  6 fam1  1-10  <NA>  <NA>      2     0
#>  7 fam1  1-12  <NA>  <NA>      1     0
#>  8 fam1  1-14  <NA>  <NA>      1     0
#>  9 fam1  2-2   1-1   1-2       2     0
#> 10 fam1  2-3   1-1   1-2       2     0
#> # … with 23 more rows
# and plot pedigree
obj <- pedigree(
    id = fam$id,
    dadid = fam$pat,
    momid = fam$mat,
    sex = fam$sex,
    affected = fam$pheno
)
plot( obj, cex = 0.7 )

Inspection of the top of the table shows that founders that were originally unpaired are now removed (the plot also doesn’t report people not plotted). The plot further shows that many individuals in the second generation (childless uncles and aunts of the last generation) were also removed, but nobody in the third generation was removed (as desired).

Drawing genotype data: unstructured founders

We will draw genotypes in two ways to reflect the traditional (unrelated founders) and the intended use of simfam: structured founders. Let’s do unrelated founders first. Note that, to keep this vignette runtime low, the number of loci is much lower than any real dataset, so data will look artificially noisier here.

# number of loci to draw
m <- 5000
# draw uniform ancestral allele frequencies
p_anc <- runif( m )
# draw independent (unstructured) genotypes for founders (includes founders without descendants)
X_1 <- matrix(
    rbinom( n * m, 2, p_anc ),
    nrow = m,
    ncol = n
)
# `simfam` requires names for the founders on `X_1`, to validate identities and align
colnames( X_1 ) <- ids[[ 1 ]]

Before moving on, it’s important to understand that these founders have a kinship matrix, albeit a trivial one, with zero kinship (unrelated) between individuals and a self-kinship of 1/2 (expected for outbred individuals). We can verify this directly using a kinship estimator applied to the genotypes we just drew. We use popkin, which is the only unbiased kinship estimator. popkin also provides heatmap visualization of kinship matrices via plot_popkin.

library(popkin)
# estimate kinship
kinship_est_1 <- popkin( X_1 )
# true/expected kinship
kinship_1 <- diag( n ) / 2
# assign same IDs as `X_1` and `fam`; `simfam` generally requires these to validate/align
colnames( kinship_1 ) <- colnames( X_1 )
rownames( kinship_1 ) <- colnames( X_1 )
# plot them together for comparison
plot_popkin(
    list( kinship_1, kinship_est_1 ),
    titles = c('Expected', 'Estimated'),
    names = TRUE,
    leg_width = 0.2,
    mar = c(3, 2)
)

Now we draw genotypes across the pedigree using geno_fam! Since we have genotypes of the founders, we can simulate the random inheritance of alleles to every descendant, which produces the correct pedigree-based kinship/covariance structure at every locus. However, this code draws loci independently from each other (there’s no linkage disequilibrium). We also construct the total expected kinship across this pedigree with a similar function, kinship_fam.

X <- geno_fam( X_1, fam )
# expected kinship of entire pruned pedigree, starting from true kinship of founders.
# Called "local" kinship because it ignores the ancestral population structure (will be reused)
# NOTE: agrees with `kinship2::kinship( fam )`, but only because founders are unstructured
kinship_local <- kinship_fam( kinship_1, fam )

We can view and validate the resulting data in several ways. The more complete validation is to estimate kinship for all X (all 3 generations) and see that it agrees with kinship_fam:

# estimate kinship
kinship_local_est <- popkin( X )
# plot them together for comparison
plot_popkin(
    list( kinship_local, kinship_local_est ),
    titles = c('Expected', 'Estimated'),
    names = TRUE,
    names_cex = 0.8,
    leg_width = 0.2,
    mar = c(3, 2)
)

Above you can see that only founders in the pruned FAM table were kept in the output. However, in a real dataset previous generations might not be available (especially if G were much larger than 3), so we might only have the genotype data of a subset of more recent relatives. Here we subset to keep the last generation, and compare not only to kinship_fam, but also to the original kinship_local_G that was calculated by sim_pedigree to avoid pairing close relatives.

# indexes (really logicals) marking individuals in last generation,
# to subset data (which is aligned with `fam`)
indexes_G <- fam$id %in% ids[[ G ]]
# estimate kinship
kinship_local_est <- popkin( X )
# plot them together for comparison
plot_popkin(
    list(
        kinship_local[ indexes_G, indexes_G ],
        kinship_local_est[ indexes_G, indexes_G ],
        kinship_local_G
    ),
    titles = c('Expected', 'Estimated', 'Local'),
    names = TRUE,
    mar = c(3, 2)
)

Drawing genotype data: admixed founders

Now we simulate data from the much more interesting case, where founders have differing mixed ancestries! This is much more like real human data. For this we use another external package, bnpsd, to construct the admixture parameters of the founders, which also includes calculating their true population kinship matrix.

library(bnpsd)
# additional admixture parameters:
# number of ancestries
K <- 3
# define population structure
# FST values for k=3 subpopulations
inbr_subpops <- c(0.2, 0.4, 0.6)
# admixture proportions from 1D geography (founders)
admix_proportions_1 <- admix_prop_1d_linear( n, K, sigma = 1 )
# add founder names to `admix_proportions_1` (will propagate to other vars of interest)
rownames( admix_proportions_1 ) <- ids[[ 1 ]]
# also name subpopulations simply as S1, S2, S3
colnames( admix_proportions_1 ) <- paste0( 'S', 1:K )

# calculate true kinship matrix of the admixed founders
# NOTE: overwrites `kinship` for unstructured founders
kinship_1 <- coanc_to_kinship( coanc_admix( admix_proportions_1, inbr_subpops ) )

# draw genotypes from admixture model
# NOTE: overwrites `X_1` for unstructured founders
X_1 <- draw_all_admix( admix_proportions_1, inbr_subpops, m )$X

We can again verify the population kinship of the founders, which is no longer full of zeroes but rather takes on continuous values that depend on the admixture proportions of each pair of founders:

# estimate kinship (NOTE: overwrites `kinship_est_1` for unstructured founders)
kinship_est_1 <- popkin( X_1 )
# plot them together for comparison
plot_popkin(
    list( kinship_1, kinship_est_1 ),
    titles = c('Expected', 'Estimated'),
    names = TRUE,
    leg_width = 0.2,
    mar = c(3, 2)
)

Note that, unlike the corresponding demonstration in the bnpsd package, here the diagonal plots self-kinship values instead of inbreeding coefficients, as kinship matrices are the central focus of this simfam package.

We draw genotypes across the pedigree the same way as before:

X <- geno_fam( X_1, fam )
# expected kinship of entire pruned pedigree, starting from true kinship of founders
# NOTE: DOESN'T agree with `kinship2::kinship( fam )` anymore because founders are structured!
kinship_total <- kinship_fam( kinship_1, fam )

We again compare the expected kinship matrix to the estimate from genotypes:

# estimate kinship
kinship_total_est <- popkin( X )
# plot them together for comparison
plot_popkin(
    list( kinship_total, kinship_total_est ),
    titles = c('Expected', 'Estimated'),
    names = TRUE,
    names_cex = 0.8,
    leg_width = 0.2,
    mar = c(3, 2)
)

Lastly, we again subset individuals in the last generation only and compare their total to their local kinship, which are related but different things when a population is structured. Although undoubtedly the total kinship (panels A-B below) capture much of the same correlation patterns as the local kinship (panel C), the total kinship captures additional covariance due to the structure of the founders that the local kinship ignores (the local kinship uses only the pedigree information and ignores the ancestry differences of the founders).

plot_popkin(
    list(
        kinship_total[ indexes_G, indexes_G ],
        kinship_total_est[ indexes_G, indexes_G ],
        kinship_local_G
    ),
    titles = c('Total Expected', 'Total Estimated', 'Local'),
    names = TRUE,
    mar = c(3, 2)
)

Admixture proportions in the pedigree

In this admixture model the structure comes from fractional shared ancestry due to admixture as well as the relatedness between ancestries. We can look at these relatedness components in isolation from the pedigree and see that their contribution to the total kinship matrix is nearly additive (will explain more precisely shortly)! First propagate the admixture proportions across the pedigree, under the simple model that the ancestry of a child is the average of the ancestry of its parents:

# admixture proportions of pruned family
admix_proportions <- admix_fam( admix_proportions_1, fam )

# visualize as bar plot

# for nice colors
library(RColorBrewer)
# colors for independent subpopulations
col_subpops <- brewer.pal( K, "Paired" )
# shrink default margins
par_orig <- par(mar = c(4, 4, 0.5, 0) + 0.2)
barplot(
    t( admix_proportions ),
    col = col_subpops,
    legend.text = TRUE,
    args.legend = list( cex = 0.7, bty = 'n' ),
    border = NA,
    space = 0,
    ylab = 'Admixture prop.',
    las = 2
)
mtext('Individuals', side = 1, line = 3)

par( par_orig ) # reset `par`

The above figure shows that admixture proportions in the first generation vary a lot: for example, 1-1 has a large proportion of S1 ancestry, while 1-14 has mostly S3 instead. In contrast, in the third generation ancestry is much more averaged out, with fewer extremes. This effect will be weaker in larger populations, an example we go into shortly.

This code calculates the kinship expected from the admixture structure only, which will be visualized shortly.

# this is the admixture-only relatedness
kinship_admix <- coanc_to_kinship( coanc_admix( admix_proportions, inbr_subpops ) )

Total kinship partitioned into structural and family components

We also demonstrate a good approximation of the total kinship that treats the structural (here admixture) and pedigree relatedness (in this case treating founders as unstructured) as independent effects, which roughly says that the kinship components satisfy (pseudocode):

# These quantities are independent
(1 - total) = (1 - struct) * (1 - family).
# solved for total, reveals near additivity:
total = struct + family - struct * family.

So when both the structural and family kinship values are much smaller than 1, their sum is a good match of the total kinship. Only when these values approach 1 does it become necessary to subtract their product, which is necessary since all of these values are probabilities and the total kinship cannot exceed 1. The same applies to inbreeding coefficient, but if we start from kinship matrices we need to transform the diagonal values back and forth (since, in pseudocode, kinship[i,i] = ( 1 + inbreeding[i] ) / 2). We plot all of the matrices and verify that the approximation is very good, so the conceptual partitioning into structural and family effects is justified:

# this is the approximation of the total kinship:
# NOTE: a bit complicated because diagonal has to be transformed.
# Approximation operates on inbreeding coefficients, not self-kinship,
# so `popkin::inbr_diag` converts self-kinship to inbreeding,
# `bnpsd::coanc_to_kinship` converts inbreeding back to self-kinship.
# Off-diagonal values are unmodified by both functions.
kinship_total_approx <- coanc_to_kinship(
    inbr_diag( kinship_local ) +
    inbr_diag( kinship_admix ) -
    inbr_diag( kinship_admix ) * inbr_diag( kinship_local )
)

# plot all model kinship matrices (no estimates from genotypes this time)
plot_popkin(
    list( kinship_admix, kinship_local, kinship_total, kinship_total_approx ),
    titles = c('Admixture only', 'Family only', 'Total', 'Total approx.'),
    names = TRUE,
    names_cex = 0.8,
    leg_width = 0.2,
    layout_rows = 2,
    mar = c(3, 2)
)

Large example

Here we will simulate a much larger population size and a pedigree with a larger number of generations, to produce data that may better resemble real human data. We will make no attempt of visualize pedigrees or show individual labels. The whole code, redundant with our earlier analysis, is presented with minor explanations. Note, however, use of the newly-introduced functions geno_last_gen, kinship_last_gen, and admix_last_gen to directly construct data for the last generation only, which also saves memory compared to the more general geno_fam, kinship_fam, and admix_fam shown earlier.

library(simfam)

# data dimensions
n <- 500 # number of individuals per generation
G <- 10 # number of generations
K <- 3 # number of ancestries
m <- 5000 # number of loci

# draw the random pedigree with those properties.
data <- sim_pedigree( n, G )
fam <- data$fam
ids <- data$ids
# use this local kinship calculation in plot
kinship_local_G <- data$kinship_local

# prune pedigree to speed up simulations/etc
fam <- prune_fam( fam, ids[[ G ]] )

### admixture model

# define population structure
# FST values for k=3 subpopulations
inbr_subpops <- c(0.2, 0.4, 0.6)
# admixture proportions from 1D geography (founders)
admix_proportions_1 <- admix_prop_1d_linear( n, K, sigma = 1 )
# add founder names to `admix_proportions_1` (will propagate to other vars of interest)
rownames( admix_proportions_1 ) <- ids[[ 1 ]]
# also name subpopulations simply as S1, S2, S3
colnames( admix_proportions_1 ) <- paste0( 'S', 1:K )

# draw genotypes from admixture model
# admixed founders
X_1 <- draw_all_admix( admix_proportions_1, inbr_subpops, m )$X
# draw genotypes, returning last generation only
X_G <- geno_last_gen( X_1, fam, ids )
# total kinship estimated from genotypes
kinship_total_G_est <- popkin( X_G )

# calculate true total kinship matrix
kinship_total_1 <- coanc_to_kinship( coanc_admix( admix_proportions_1, inbr_subpops ) )
kinship_total_G <- kinship_last_gen( kinship_total_1, fam, ids )

# kinship due to admixture only
admix_proportions_G <- admix_last_gen( admix_proportions_1, fam, ids )
kinship_admix_G <- coanc_to_kinship( coanc_admix( admix_proportions_G, inbr_subpops ) )

# total kinship approximation from components
kinship_total_approx_G <- coanc_to_kinship(
    inbr_diag( kinship_local_G ) +
    inbr_diag( kinship_admix_G ) -
    inbr_diag( kinship_admix_G ) * inbr_diag( kinship_local_G )
)

The first important validation is that the total structure (estimated from genotypes) agrees with the theoretical calculations based on the pedigree and the known founder total kinship. For simplicity we visualize the last generation only:

plot_popkin(
    list( kinship_total_G, kinship_total_G_est ),
    titles = c('Expected', 'Estimated'),
    leg_width = 0.2,
    mar = c(3, 2)
)

The next demonstration is that the component partitioning approximation works again (limited to last generation):

# plot all model kinship matrices (no estimates from genotypes this time)
plot_popkin(
    list( kinship_admix_G, kinship_local_G, kinship_total_G, kinship_total_approx_G ),
    titles = c('Admixture only', 'Family only', 'Total', 'Total approx.'),
    leg_width = 0.2,
    layout_rows = 2,
    mar = c(3, 2)
)

Lastly, let’s determine how many individuals in each generation were ancestors of individuals in the last generation. The bar plot below shows that, in this large simulation, most individuals in the first generation (founders), and in every subsequent generation, had descendants in the last generation.

# the IDs in `fam` (the pruned FAM table) contain the individuals of interest
frac_per_gen <- vector('numeric', G)
for ( g in 1 : G ) {
    # this is the desired fraction
    frac_per_gen[g] <- sum( fam$id %in% ids[[ g ]] ) / n
}
# visualize as a barplot
barplot(
    frac_per_gen,
    names.arg = 1 : G,
    xlab = 'Generation',
    ylab = 'Frac. individuals with descendants'
)
# add a line marking the maximum fraction of individuals per generation
abline( h = 1, lty = 2, col = 'red' )

Example with recombination

All previous examples had independent loci. Here we use new functions to simulate recombination breaks and therefore a realistic linkage disequilibrium (LD) structure consistent with the pedigree.

For comparison, let’s start from the same pedigree used in the last “large” example. Here we shall simulate the first two human autosomes (coordinates in genome version hg38) with a real recombination map (slightly simplified) provided with this package. The steps are lengthy primarily due to the non-trivial construction of ancestral haplotypes.

# real human recombination map is also used to define chromosome numbers and lengths
# however, to keep example brief we use only first two chromosomes
map <- recomb_map_hg38[ 1L:2L ]

# tricky part is need to define ancestral haplotypes, following the desired genetic structure
# will simulate each chromosome separately
haplos_anc <- vector( 'list', length( map ) )
for ( chr in 1L : length( map ) ) {
    # Positions matter for genetic map, so for this example choose random positions with a
    # desired density, here 100 per Kb on average, keep it sparse for small example
    pos_max <- max( map[[ chr ]]$pos )
    m_loci_chr <- round( pos_max / 100000L )
    pos_chr <- sample.int( pos_max, m_loci_chr )
    # Draw haplotypes.  Here we use admixture code to get genotypes
    # with same structure as before for first generation
    X_chr <- draw_all_admix( admix_proportions_1, inbr_subpops, m_loci_chr )$X
    # and randomly split genotypes into haplotypes
    # (there was no LD/phase so this works well enough for our purposes)
    H1_chr <- matrix( rbinom( X_chr, 1L, X_chr/2L ), nrow = m_loci_chr )
    # second haplotype is complement of first given genotypes
    H2_chr <- X_chr - H1_chr
    # column names are required for code later, identifying ancestors (need _pat/mat suffixes)
    colnames( H1_chr ) <- paste0( colnames( X_chr ), '_pat' )
    colnames( H2_chr ) <- paste0( colnames( X_chr ), '_mat' )
    # final data all together
    X_chr <- cbind( H1_chr, H2_chr )
    # add to structure, in a list
    haplos_anc[[ chr ]] <- list( X = X_chr, pos = pos_chr )
}

# initialize trivial founders chromosomes (unrecombined chromosomes with unique labels)
# must use IDs of first generation from pedigree as input
founders <- recomb_init_founders( ids[[ 1 ]], map )
# draw recombination breaks along pedigree, with coordinates in genetic distance (centiMorgans)
inds <- recomb_last_gen( founders, fam, ids )
# map recombination break coordinates to base pairs
inds <- recomb_map_inds( inds, map )
# determine haplotypes of descendants given ancestral haplotypes
haplos <- recomb_haplo_inds( inds, haplos_anc )
# and reduce haplotype data to genotypes, same standard matrix format as previous examples
X_G_recomb <- recomb_geno_inds( haplos )
# total kinship estimated from genotypes (simulated with recombination)
kinship_total_G_recomb_est <- popkin( X_G_recomb )

First we validate that the kinship structure of these genotypes with recombination matches the earlier estimate from the simulation with independent loci as well as the expected kinship from the pedigree. Below it can be seen that the coarse structure is the same, although the data with recombination appears noisier, as expected because LD increases estimation variance. Simulating more chromosomes would reduce the variance considerably.

plot_popkin(
    list( kinship_total_G, kinship_total_G_est, kinship_total_G_recomb_est ),
    titles = c('Expected', 'Estimated (indep loci)', 'Estimated (recomb)'),
    mar = c(3, 2)
)

Now let’s confirm the presence of LD. The below plot shows that genotype correlations are noisy overall (because the number of individuals is small in our toy example) but nevertheless, correlations within the two chromosomes in the simulation with recombination (two diagonal blocks in panel B below) are noticeably larger than correlations between chromosomes, as well as the correlations in the earlier simulation with independent loci (panel A).

# Notice that the number of loci is similar for both simulations
# (rows of the genotype matrices)
dim( X_G )
#> [1] 5000  500
dim( X_G_recomb )
#> [1] 4912  500

# Calculate correlation matrices between loci (hence transposition).
# Exclude fixed loci to prevent warnings about zero standard deviations.
ld_G <- abs( cor( t( X_G[ !fixed_loci( X_G ), ] ) ) )
ld_G_recomb <- abs( cor( t( X_G_recomb[ !fixed_loci( X_G_recomb ), ] ) ) )
# cap extreme values, to be able to see the relatively weak LD signal
ld_max <- 0.3
ld_G[ ld_G > ld_max ] <- ld_max
ld_G_recomb[ ld_G_recomb > ld_max ] <- ld_max
# plot!
plot_popkin(
    list( ld_G, ld_G_recomb ),
    titles = c('Indep loci', 'Recomb'),
    leg_width = 0.2,
    mar = c(3, 2),
    leg_title = 'Absolute Correlation',
    ylab = 'Loci'
)