The package diemr
incorporates the diagnostic index
expectation maximisation algorithm used to estimate which genomic
alleles belong to which side of a barrier to geneflow. To start using
diemr
, load the package or install it from CRAN if it is
not yet available:
# Attempt to load a package, if the package was not available, install and load
if(!require("diemr", character.only = TRUE)){
install.packages("diemr", dependencies = TRUE)
library("diemr", character.only = TRUE)
}
Next, assemble paths to all files containing the data to be used by
diemr
. Here, we will use a tiny example dataset that is
included in the package for illustrating the analysis workflow. A good
practice is to check that all files contain data in the correct format
for all individuals and markers. Additionally, the analysis will need a
list with ploidies for all genomic compartments and individuals, and a
vector with indices of samples that will be included in the
analysis.
<- system.file("extdata", "data6x3.txt", package = "diemr")
filepaths # Assuming diploid markers of all individuals
<- list(rep(2, 6))
ploidies # Analysing all individuals
<- 1:6
samples CheckDiemFormat(files = filepaths,
ChosenInds = samples,
ploidy = ploidies)
# File check passed: TRUE
# Ploidy check passed: TRUE
If the CheckDiemFormat()
function fails, work through
the error messages and fix the stored input files accordingly. The
algorithm repeatedly accesses data from the harddisk, so seeing the
passed file and variable check prior to the analysis is critical.
To estimate the marker polarities, their diagnostic indices and
support, run the function diem()
with default settings.
Here, we have only one file with data, so paralelisation is unnecessary,
and we set nCores = 1
. This setting must always be
used on Windows. Other operating systems can process multiple
genomic compartments (e.g. chromosomes) in parallel, and the analysis
of different genomic compartment files can run on multiple
processors.
<- diem(files = filepaths,
res ploidy = ploidies,
markerPolarity = list(c(FALSE, FALSE, TRUE)),
ChosenInds = samples,
nCores = 1)
The result is a list, where the element res$DI
contains
a table with marker polarities, their diagnostic indices and
support.
$DI
res# newPolarity DI Support Marker
# 1 FALSE -4.872256 15.930181 1
# 2 TRUE -3.520647 18.633399 2
# 3 TRUE -13.274822 6.130625 3
The column newPolarity
means that marker 1 should be
imported for subsequent analyses as is, and markers 2 and 3 should be
imported with 0 replaced with 2 and 2 replaced with 0. The marker 3 has
the lowest diagnostic index and low support, indicating that the
genotypes scored at this marker are poorly related to the barrier to
geneflow arbitrated by the data.
With the marker polarities optimised to detect a barrier to geneflow, a plot of the polarised genome will show how genomic regions cross the barrier. First, the genotypes need to be imported with optimal marker polarities. Second, individual hybrid indices need to be calculated from the polarised genotypes. And last, the data will be plotted as a raster image.
<- importPolarized(file = filepaths,
genotypes changePolarity = res$markerPolarity,
ChosenInds = samples)
<- apply(X = res$I4,
h MARGIN = 1,
FUN = pHetErrOnStateCount)[1, ]
plotPolarized(genotypes = genotypes,
HI = h)
The diemr
package uses a consise genome representation.
Let’s have a small dataset of three markers genotyped for six
individuals.
S001122
S121000 S02221U
The genotypes encoded as 0
represent homozygots for an
allele attributed to barrier side A, 1
are heterozygous
genotypes, 2
are homozygots for another allele, attributed
to barrier side B, and U
(symbol “_” is also allowed)
represents an unknown state or a third (fourth) allele. The power of
diem
lies in the safety that the user does not need to
determine the true assignment to the barrier sides A and B before the
analysis and the specific genotypes encoded as 0
and
2
respectively can be arbitrary.
The leading S
on each line of the input file is needed
to ensure that the marker genotypes are read in as a string on all
operating systems. The S
is dropped during import of the
genotypes, and the dataset is imported as a character matrix of all
markers.
Some genomic compartments differ between individuals in their ploidy. For example, markers located on chromosome X in mammals will be diploid in females, but haploid in males. Ploidy differences between individuals influence calculation of the hybrid index, which in turn has an effect on the diem analysis.
To set up the diem analysis with multiple compartments, the
markers with different individual ploidies must be stored in separate
files. The file analysed in the Quick start chapter
could contain markers from autosomes and an additional file will contain
markers from an X chromosome, with individuals 2 and 6 being males. The
respective ploidies for the second genomic compartment will be
c(2, 1, 2, 2, 2, 1)
.
Arguments files
and ploidy
will need to
reflect the information, taking care that the order of filenames
corresponds to the order of elements in the list of ploidies.
diem
cannot check that the order of the elements is
correct, only that the information is in the correct format.
<- c(system.file("extdata", "data6x3.txt", package = "diemr"),
filepaths2 system.file("extdata", "data7x10.txt", package = "diemr"))
<- list(rep(2, 6),
ploidies2 c(2, 1, 2, 2, 2, 1))
CheckDiemFormat(files = filepaths2,
ChosenInds = samples,
ploidy = ploidies2)
# File check passed: TRUE
# Ploidy check passed: TRUE
# Set random seed for repeatibility of null polarities (optional)
set.seed(39583782)
# Run diem with verbose = TRUE to store hybrid indices with ploidy-aware allele counts
<- diem(files = filepaths2,
res2 ploidy = ploidies2,
markerPolarity = FALSE,
ChosenInds = samples,
nCores = 1,
verbose = TRUE)
Plotting polarised genomes from multiple compartments requires
separate import of the compartment data. The polarities in the
res2$markerPolarity
element are combined across all
compartments, and extracting them requires knowledge of the number of
markers in each compartment. Alternatively, the marker polarities from
each compartment can be extracted from the list in the
res2$PolarityChanges
element.
# Import each compartment into a list
<- list(importPolarized(file = filepaths2[1],
genotypes2 changePolarity = res2$markerPolarity[1:3],
ChosenInds = samples),
importPolarized(file = filepaths2[2],
changePolarity = res2$markerPolarity[4:13],
ChosenInds = samples))
# Bind compartment genotypes into one matrix
<- Reduce(cbind, genotypes2)
genotypes2
# Load individual hybrid indices from a stored file
<- unlist(read.table("diagnostics/HIwithOptimalPolarities.txt"))
h2
# Plot the polarised genotypes
plotPolarized(genotypes = genotypes2,
HI = h2)
diemr
from the
source?install.packages(package = "diemr_1.1.tar.gz",
repos = NULL, type = "source")
Make sure to set the R
working directory to the folder,
where the package tarball is stored, or include a full path to the file
within the quotes. Update the version number to the specific file.
There are two options. First, use diem
with argument
verbose = TRUE
and the hybrid indices will be stored in a
text file in the diagnostics folder in the working directory. The stored
values will be ploidy-aware. However, these hybrid indices could include
the bias introduced by the small data correction if the data warranted
it. The user might prefer to calculate the hybrid indices without the
small data correction. For this, use the I4
matrix in the
diem
output.
apply(res$I4, MARGIN = 1, FUN = pHetErrOnStateCount)
# [,1] [,2] [,3] [,4] [,5] [,6]
# p 0.5000000 0 0.3333333 0.5000000 0.8333333 1.0000000
# Het 0.3333333 0 0.6666667 0.3333333 0.3333333 0.0000000
# Err 0.0000000 0 0.0000000 0.0000000 0.0000000 0.3333333
diemr
?Yes. Multiple barriers to geneflow between multiple groups of samples
can be identified iteratively with help from the argument
ChosenInds
. For example, let’s assume that the individual 2
was identified as belonging to one side of the barrier and being
separated from other by the steepest change in the hybrid index. In the
next diem
run, we exclude the individual 2, adjusting the
ploidies to reflect the omission.
<- list(c(2, 2, 2, 2, 2))
ploidies2 <- c(1, 3:6)
samples2 CheckDiemFormat(files = filepaths,
ChosenInds = samples2,
ploidy = ploidies2)
# File check passed: TRUE
# Ploidy check passed: TRUE
<- diem(files = filepaths,
res2 ChosenInds = samples2,
ploidy = ploidies2,
nCores = 1,
markerPolarity = list(c(FALSE, FALSE, TRUE)))
# calculate hybrid indices from updated I4
<- apply(res2$I4,
h.res2 MARGIN = 1,
FUN = pHetErrOnStateCount)[1, ]
# set names for the hybrid index values
<- setNames(h.res2, nm = samples2)
h.res2 # 1 3 4 5 6
# 0.50 0.33 0.50 0.83 1.00
# calculate the center of the maximum hybrid index change
<- data.frame(rollmean = zoo::rollmean(sort(h.res2), k = 2),
diffs diff = diff(sort(h.res2), lag = 1))
<- diffs$rollmean[which.max(diffs$diff)]
h.res2.c # [1] 0.6666667
Since the center of the barrier is at 0.67 now, diem
separated individuals 1, 3, and 4 from a group that includes 5, 6.
Combined with the result from the first diem
run, we have
identified three groups in the dataset: (2), (1, 3, 4), and (5, 6).
The input data probably contains invariant sites, which weight the
hybrid indices to be similar between individuals. Make sure to use only
variable sites. Alternatively, filter sites with lowest diagnostic index
after the diem
analysis and recalculate the hybrid indices
with the filtered data.
diemr
?To use diemr
in a publication, please cite:
Baird S. J. E., Petruzela J., Jaron I., Skrabanek P., Martinkova N. 2023. Genome polarisation for detecting barriers to geneflow. Methods in Ecology and Evolution, 14: 512–528. doi: 10.1111/2041-210X.14010.