distantia: an R package to compute the dissimilarity between multivariate timeseries ================
The package distantia allows to measure the dissimilarity between multivariate ecological timeseries (METS hereafter). The package assumes that the target sequences are ordered along a given dimension, being depth and time the most common ones, but others such as latitude or elevation are also possible. Furthermore, the target METS can be regular or irregular, and have their samples aligned (same age/time/depth) or unaligned (different age/time/depth). The only requirement is that the sequences must have at least two (but ideally more) columns with the same name and units representing different variables relevant to the dynamics of an ecological system.
In this document I explain the logics behind the method, show how to use it, and demonstrate how the distantia package introduces useful tools to compare multivariate timeseries. The topics covered in this document are:
You can install the released version of distantia (currently v1.0.0) from CRAN with:
And the development version (currently v1.0.1) from GitHub with:
Loading the library, plus other helper libraries:
library(distantia)
library(ggplot2)
library(viridis)
library(kableExtra)
library(qgraph)
library(tidyr)
In this section I will use two example datasets based on the Abernethy pollen core (Birks and Mathewes, 1978) to fully explain the logical backbone of the dissimilarity analyses implemented in distantia.
#loading sequences
data(sequenceA)
data(sequenceB)
#showing first rows
kable(sequenceA[1:15, ], caption = "Sequence A")
betula 
pinus 
corylu 
junipe 
empetr 
gramin 
cypera 
artemi 
rumex 

79 
271 
36 
0 
4 
7 
25 
0 
0 
113 
320 
42 
0 
4 
3 
11 
0 
0 
51 
420 
39 
0 
2 
1 
12 
0 
0 
130 
470 
6 
0 
0 
2 
4 
0 
0 
31 
450 
6 
0 
3 
2 
3 
0 
0 
59 
425 
12 
0 
0 
2 
3 
0 
0 
78 
386 
29 
2 
0 
0 
2 
0 
0 
71 
397 
52 
2 
0 
6 
3 
0 
0 
140 
310 
50 
2 
0 
4 
3 
0 
0 
150 
323 
34 
2 
0 
11 
2 
0 
0 
175 
317 
37 
2 
0 
11 
3 
0 
0 
181 
345 
28 
3 
0 
7 
3 
0 
0 
153 
285 
36 
2 
0 
8 
3 
0 
1 
214 
315 
54 
2 
1 
13 
5 
0 
0 
200 
210 
41 
6 
0 
10 
4 
0 
0 
betula 
pinus 
corylu 
junipe 
gramin 
cypera 
artemi 
rumex 

19 
175 
NA 
2 
34 
39 
1 
0 
18 
119 
28 
1 
36 
44 
0 
4 
30 
99 
37 
0 
2 
20 
0 
1 
26 
101 
29 
0 
0 
18 
0 
0 
31 
99 
30 
0 
1 
10 
0 
0 
24 
97 
28 
0 
2 
9 
0 
0 
23 
105 
34 
0 
1 
6 
0 
0 
48 
112 
46 
0 
0 
12 
0 
0 
29 
108 
16 
0 
6 
3 
0 
0 
23 
110 
21 
0 
2 
11 
0 
1 
5 
119 
19 
0 
1 
1 
0 
0 
30 
105 
NA 
0 
9 
7 
0 
0 
22 
116 
17 
0 
1 
7 
0 
0 
24 
115 
20 
0 
2 
4 
0 
0 
26 
119 
23 
0 
4 
0 
0 
0 
Notice that sequenceB has a few NA values (that were introduced to serve as an example). The function prepareSequences gets them ready for analysis by matching colum names and handling empty data. It allows to merge two or more METS into a single dataframe ready for further analyses. Note that, since the data represents pollen abundances, a Hellinger transformation (square root of the relative proportions of each taxa, it balances the relative abundances of rare and dominant taxa) is applied. This transformation balances the relative importance of very abundant versus rare taxa. The function prepareSequences will generally be the starting point of any analysis performed with the distantia package.
#checking the function helpfile.
help(prepareSequences)
#preparing sequences
AB.sequences < prepareSequences(
sequence.A = sequenceA,
sequence.A.name = "A",
sequence.B = sequenceB,
sequence.B.name = "B",
merge.mode = "complete",
if.empty.cases = "zero",
transformation = "hellinger"
)
#showing first rows of the transformed data
kable(AB.sequences[1:15, ], digits = 4, caption = "Sequences A and B ready for analysis.")
id 
betula 
pinus 
corylu 
junipe 
empetr 
gramin 
cypera 
artemi 
rumex 

A 
0.4327 
0.8014 
0.2921 
0.0002 
0.0974 
0.1288 
0.2434 
2e04 
0.0002 
A 
0.4788 
0.8057 
0.2919 
0.0001 
0.0901 
0.0780 
0.1494 
1e04 
0.0001 
A 
0.3117 
0.8944 
0.2726 
0.0001 
0.0617 
0.0436 
0.1512 
1e04 
0.0001 
A 
0.4609 
0.8763 
0.0990 
0.0001 
0.0001 
0.0572 
0.0808 
1e04 
0.0001 
A 
0.2503 
0.9535 
0.1101 
0.0001 
0.0778 
0.0636 
0.0778 
1e04 
0.0001 
A 
0.3432 
0.9210 
0.1548 
0.0001 
0.0001 
0.0632 
0.0774 
1e04 
0.0001 
A 
0.3962 
0.8813 
0.2416 
0.0634 
0.0001 
0.0001 
0.0634 
1e04 
0.0001 
A 
0.3657 
0.8647 
0.3129 
0.0614 
0.0001 
0.1063 
0.0752 
1e04 
0.0001 
A 
0.5245 
0.7804 
0.3134 
0.0627 
0.0001 
0.0886 
0.0768 
1e04 
0.0001 
A 
0.5361 
0.7866 
0.2552 
0.0619 
0.0001 
0.1452 
0.0619 
1e04 
0.0001 
A 
0.5667 
0.7627 
0.2606 
0.0606 
0.0001 
0.1421 
0.0742 
1e04 
0.0001 
A 
0.5650 
0.7800 
0.2222 
0.0727 
0.0001 
0.1111 
0.0727 
1e04 
0.0001 
A 
0.5599 
0.7642 
0.2716 
0.0640 
0.0001 
0.1280 
0.0784 
1e04 
0.0453 
A 
0.5952 
0.7222 
0.2990 
0.0575 
0.0407 
0.1467 
0.0910 
1e04 
0.0001 
A 
0.6516 
0.6677 
0.2950 
0.1129 
0.0001 
0.1457 
0.0922 
1e04 
0.0001 
The computation of dissimilarity between the datasets A and B requires several steps.
It is computed by the distanceMatrix function, which allows the user to select a distance metric (so far the ones implemented are manhattan, euclidean, chi, and hellinger). The function plotMatrix allows an easy visualization of the resulting distance matrix.
#computing distance matrix
AB.distance.matrix < distanceMatrix(
sequences = AB.sequences,
method = "euclidean"
)
#plotting distance matrix
plotMatrix(
distance.matrix = AB.distance.matrix,
color.palette = "viridis",
margins = rep(4,4))
This step uses a dynamic programming algorithm to find the leastcost path that connnects the cell 1,1 of the matrix (lower left in the image above) and the last cell of the matrix (opposite corner). This can be done via in two different ways.
Equation 1
Equation 2
[AB_{between} = 2 \times (D(A_{1}, B_{1}) + \sum_{i=1}^{{m}\sum_{j=1}}{n} min\left(\begin{array}{c}D(A_{i}, B_{j+1}), \\ D(A_{i+1}, B_{j} \\ D(A_{i+1}, B_{j+1}) \end{array}\right))](https://latex.codecogs.com/png.latex?AB_%7Bbetween%7D%20%3D%202%20%5Ctimes%20%28D%28A_%7B1%7D%2C%20B_%7B1%7D%29%20%2B%20%5Csum_%7Bi%3D1%7D%5E%7Bm%7D%5Csum_%7Bj%3D1%7D%5E%7Bn%7D%20min%5Cleft%28%5Cbegin%7Barray%7D%7Bc%7DD%28A_%7Bi%7D%2C%20B_%7Bj%2B1%7D%29%2C%20%5C%5C%20D%28A_%7Bi%2B1%7D%2C%20B_%7Bj%7D%20%5C%5C%20D%28A_%7Bi%2B1%7D%2C%20B_%7Bj%2B1%7D%29%20%5Cend%7Barray%7D%5Cright%29%29 “AB_{between} = 2 \times (D(A_{1}, B_{1}) + \sum_{i=1}^{{m}\sum_{j=1}}{n} min\left(\begin{array}{c}D(A_{i}, B_{j+1}), \\ D(A_{i+1}, B_{j} \\ D(A_{i+1}, B_{j+1}) \end{array}\right))”)
Where:
The equation returns
The code below performs these steps according to both equations
#ORTHOGONAL SEARCH
#computing leastcost matrix
AB.least.cost.matrix < leastCostMatrix(
distance.matrix = AB.distance.matrix,
diagonal = FALSE
)
#extracting leastcost path
AB.least.cost.path < leastCostPath(
distance.matrix = AB.distance.matrix,
least.cost.matrix = AB.least.cost.matrix,
diagonal = FALSE
)
#DIAGONAL SEARCH
#computing leastcost matrix
AB.least.cost.matrix.diag < leastCostMatrix(
distance.matrix = AB.distance.matrix,
diagonal = TRUE
)
#extracting leastcost path
AB.least.cost.path.diag < leastCostPath(
distance.matrix = AB.distance.matrix,
least.cost.matrix = AB.least.cost.matrix.diag,
diagonal = TRUE
)
#plotting solutions
plotMatrix(
distance.matrix = list(
'AB' = AB.least.cost.matrix[[1]],
'AB' = AB.least.cost.matrix.diag[[1]]
),
least.cost.path = list(
'AB' = AB.least.cost.path[[1]],
'AB' = AB.least.cost.path.diag[[1]]
),
color.palette = "viridis",
margin = rep(4,4),
plot.rows = 1,
plot.columns = 2
)
Computing
#orthogonal solution
AB.between < leastCost(
least.cost.path = AB.least.cost.path
)
#diagonal solution
AB.between.diag < leastCost(
least.cost.path = AB.least.cost.path.diag
)
Which returns a value for
Notice the straight vertical and horizontal lines that show up in some regions of the least cost paths shown in the figure above. These are blocks, and happen in dissimilar sections of the compared sequences. Also, an unbalanced number of rows in the compared sequences can generate long blocks. Blocks inflate the value of
This package includes an algorithm to remove blocks from the least cost path, which offers more realistic values for
#ORTHOGONAL SOLUTION
#removing blocks from least cost path
AB.least.cost.path.nb < leastCostPathNoBlocks(
least.cost.path = AB.least.cost.path
)
#computing AB.between again
AB.between.nb < leastCost(
least.cost.path = AB.least.cost.path.nb
)
#DIAGONAL SOLUTION
#removing blocks
AB.least.cost.path.diag.nb < leastCostPathNoBlocks(
least.cost.path = AB.least.cost.path.diag
)
#diagonal solution without blocks
AB.between.diag.nb < leastCost(
least.cost.path = AB.least.cost.path.diag.nb
)
Which now yields 11.2975 for the orthogonal solution, and 16.8667 for the diagonal one. Notice how now the diagonal solution has a higher value, because by default, the diagonal method generates less blocks. That is why each measure of dissimilarity (orthogonal, diagonal, orthogonal noblocks, and diagonal noblocks) lies within a different comparative framework, and therefore, outputs from different methods should not be compared.
Hereafter only the diagonal noblocks option will be considered in the example cases, since it is the most general and safe solution of the four mentioned above.
#changing names of the selected solutions
AB.least.cost.path < AB.least.cost.path.diag.nb
AB.between < AB.between.diag.nb
#removing unneeded objects
rm(AB.between.diag, AB.between.diag.nb, AB.between.nb, AB.distance.matrix, AB.least.cost.matrix, AB.least.cost.matrix.diag, AB.least.cost.path.diag, AB.least.cost.path.diag.nb, AB.least.cost.path.nb, sequenceA, sequenceB)
This step requires to compute the distances between adjacent samples in each sequence and sum them, as shown in Equation 3.
Equation 3
This operation is performed by the autoSum function shown below.
AB.within < autoSum(
sequences = AB.sequences,
least.cost.path = AB.least.cost.path,
method = "euclidean"
)
AB.within
#> $`AB`
#> [1] 19.69168
The dissimilarity measure
Equation 4a
This equation has a particularity. Imagine two identical sequences A and B, with three samples each. In this case,
Since the samples of each sequence with the same index are identical, this can be reduced to
which in turn equals
This equality does not work in the same way when the leastcost path searchmethod includes diagonals. When the sequenes are identical, diagonal methods yield an
Equation 4b
In any case, the psi function only requires the leastcost, and the autosum of both sequences to compute
Which yields a psi equal to 1.7131. The output of psi is a list, that can be transformed to a dataframe or a matrix by using the formatPsi function.
#to dataframe
AB.psi.dataframe < formatPsi(
psi.values = AB.psi,
to = "dataframe")
kable(AB.psi.dataframe, digits = 4)
A 
B 
psi 

A 
B 
1.7131 
All the steps required to compute psi, including the format options provided by formatPsi are wrapped together in the function workflowPsi. It includes options to switch to a diagonal method, and to ignore blocks, as shown below.
#checking the help file
help(workflowPsi)
#computing psi for A and B
AB.psi < workflowPsi(
sequences = AB.sequences,
grouping.column = "id",
method = "euclidean",
format = "list",
diagonal = TRUE,
ignore.blocks = TRUE
)
AB.psi
#> $`AB`
#> [1] 1.713075
The function allows to exclude particular columns from the analysis (argument exclude.columns), select different distance metrics (argument method), use diagonals to find the leastcost path (argument diagonal), or measure psi by ignoring blocks in the leastcost path (argument ignore.blocks). Since we have observed several blocks in the leastcost path, below we compute psi by ignoring them.
The package can work seamlessly with any given number of sequences, as long as there is memory enough available (but check the new function workflowPsiHP, it can work with up to 40k sequences, if you have a cluster at hand, and a few years to waste). To do so, almost every function uses the packages “doParallel” and “foreach”, that together allow to parallelize the execution of the distantia functions by using all the processors in your machine but one.
The example dataset sequencesMIS contains 12 sections of the same sequence belonging to different marine isotopic stages identified by a column named “MIS”. MIS stages with odd numbers are generally interpreted as warm periods (interglacials), while the odd ones are interpreted as cold periods (glacials). In any case, this interpretation is not important to illustrate this capability of the library.
data(sequencesMIS)
kable(head(sequencesMIS, n=15), digits = 4, caption = "Header of the sequencesMIS dataset.")
MIS 
Quercus 
Betula 
Pinus 
Alnus 
Tilia 
Carpinus 

MIS1 
55 
1 
5 
3 
4 
5 
MIS1 
86 
21 
35 
8 
0 
10 
MIS1 
120 
15 
8 
1 
0 
1 
MIS1 
138 
16 
12 
6 
1 
3 
MIS1 
130 
12 
17 
2 
1 
1 
MIS1 
128 
0 
6 
4 
2 
2 
MIS1 
140 
0 
19 
9 
4 
0 
MIS1 
113 
0 
15 
12 
2 
5 
MIS1 
98 
0 
27 
2 
2 
0 
MIS1 
92 
1 
16 
7 
3 
0 
MIS1 
73 
3 
22 
3 
0 
0 
MIS1 
91 
1 
21 
3 
7 
0 
MIS1 
148 
1 
22 
1 
4 
0 
MIS1 
148 
0 
1 
7 
13 
0 
MIS1 
149 
1 
2 
5 
4 
0 
unique(sequencesMIS$MIS)
#> [1] "MIS1" "MIS2" "MIS3" "MIS4" "MIS5" "MIS6" "MIS7"
#> [8] "MIS8" "MIS9" "MIS10" "MIS11" "MIS12"
The dataset is checked and prepared with prepareSequences.
MIS.sequences < prepareSequences(
sequences = sequencesMIS,
grouping.column = "MIS",
if.empty.cases = "zero",
transformation = "hellinger"
)
The dissimilarity measure psi can be computed for every combination of sequences through the function workflowPsi shown below.
MIS.psi < workflowPsi(
sequences = MIS.sequences,
grouping.column = "MIS",
method = "euclidean",
diagonal = TRUE,
ignore.blocks = TRUE
)
#there is also a "highperformance" (HP) version of this function with a much lower memory footprint. It uses the options method = "euclidean", diagonal = TRUE, and ignore.blocks = TRUE by default
MIS.psi < workflowPsiHP(
sequences = MIS.sequences,
grouping.column = "MIS"
)
#ordered with lower psi on top
kable(MIS.psi[order(MIS.psi$psi), ], digits = 4, caption = "Psi values between pairs of MIS periods.")
A 
B 
psi 


24 
MIS3 
MIS6 
0.8631 
59 
MIS8 
MIS11 
0.8643 
65 
MIS10 
MIS12 
0.9099 
30 
MIS3 
MIS12 
0.9359 
61 
MIS9 
MIS10 
0.9422 
66 
MIS11 
MIS12 
0.9816 
40 
MIS5 
MIS7 
0.9866 
64 
MIS10 
MIS11 
0.9931 
62 
MIS9 
MIS11 
1.0009 
60 
MIS8 
MIS12 
1.0079 
43 
MIS5 
MIS10 
1.0174 
45 
MIS5 
MIS12 
1.0242 
28 
MIS3 
MIS10 
1.0340 
42 
MIS5 
MIS9 
1.0392 
56 
MIS7 
MIS12 
1.0423 
63 
MIS9 
MIS12 
1.0531 
53 
MIS7 
MIS9 
1.0598 
51 
MIS6 
MIS12 
1.0736 
26 
MIS3 
MIS8 
1.0846 
52 
MIS7 
MIS8 
1.0970 
21 
MIS2 
MIS12 
1.1002 
32 
MIS4 
MIS6 
1.1020 
58 
MIS8 
MIS10 
1.1186 
54 
MIS7 
MIS10 
1.1209 
13 
MIS2 
MIS4 
1.1214 
15 
MIS2 
MIS6 
1.1273 
12 
MIS2 
MIS3 
1.1365 
49 
MIS6 
MIS10 
1.1670 
27 
MIS3 
MIS9 
1.1725 
47 
MIS6 
MIS8 
1.1929 
57 
MIS8 
MIS9 
1.1982 
23 
MIS3 
MIS5 
1.2468 
22 
MIS3 
MIS4 
1.2541 
44 
MIS5 
MIS11 
1.2656 
29 
MIS3 
MIS11 
1.2770 
41 
MIS5 
MIS8 
1.2890 
19 
MIS2 
MIS10 
1.2976 
55 
MIS7 
MIS11 
1.3568 
48 
MIS6 
MIS9 
1.3699 
50 
MIS6 
MIS11 
1.4274 
25 
MIS3 
MIS7 
1.4692 
39 
MIS5 
MIS6 
1.4884 
38 
MIS4 
MIS12 
1.5184 
17 
MIS2 
MIS8 
1.5455 
34 
MIS4 
MIS8 
1.5710 
14 
MIS2 
MIS5 
1.6316 
46 
MIS6 
MIS7 
1.6411 
36 
MIS4 
MIS10 
1.6487 
18 
MIS2 
MIS9 
1.6933 
4 
MIS1 
MIS5 
1.7265 
6 
MIS1 
MIS7 
1.7712 
3 
MIS1 
MIS4 
1.8732 
20 
MIS2 
MIS11 
1.9079 
33 
MIS4 
MIS7 
1.9633 
11 
MIS1 
MIS12 
2.0587 
16 
MIS2 
MIS7 
2.1501 
8 
MIS1 
MIS9 
2.2155 
7 
MIS1 
MIS8 
2.2918 
1 
MIS1 
MIS2 
2.2944 
2 
MIS1 
MIS3 
2.3167 
5 
MIS1 
MIS6 
2.3369 
10 
MIS1 
MIS11 
2.3639 
37 
MIS4 
MIS11 
2.3919 
35 
MIS4 
MIS9 
2.4482 
31 
MIS4 
MIS5 
2.4810 
9 
MIS1 
MIS10 
2.6995 
A dataframe like this can be transformed into a matrix to be plotted as an adjacency network with the qgraph package.
#psi values to matrix
MIS.psi.matrix < formatPsi(
psi.values = MIS.psi,
to = "matrix"
)
#dissimilariy to distance
MIS.distance < 1/MIS.psi.matrix**4
#plotting network
qgraph::qgraph(
MIS.distance,
layout='spring',
vsize=5,
labels = colnames(MIS.distance),
colors = viridis::viridis(2, begin = 0.3, end = 0.8, alpha = 0.5, direction = 1)
)
Or as a matrix with ggplot2.
#ordering factors to get a triangular matrix
MIS.psi$A < factor(MIS.psi$A, levels=unique(sequencesMIS$MIS))
MIS.psi$B < factor(MIS.psi$B, levels=unique(sequencesMIS$MIS))
#plotting matrix
ggplot(data=na.omit(MIS.psi), aes(x=A, y=B, size=psi, color=psi)) +
geom_point() +
viridis::scale_color_viridis(direction = 1) +
guides(size = FALSE)
The dataframe of dissimilarities between pairs of sequences can be also used to analyze the drivers of dissimilarity. To do so, attributes such as differences in time (when sequences represent different times) or distance (when sequences represent different sites) between sequences, or differences between physical/climatic attributes between sequences such as topography or climate can be added to the table, so models such as
The package distantia is also useful to compare synchronic sequences that have the same number of samples. In this particular case, distances to obtain
Here we test these ideas with the climate dataset included in the library. It represents simulated palaeoclimate over 200 ky. at four sites identified by the column sequenceId. Note that this time the transformation applied is “scaled”, which uses the scale function of R base to center and scale the data.
#loading sample data
data(climate)
#preparing sequences
climate < prepareSequences(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
paired.samples = TRUE,
transformation = "scale"
)
In this case, the argument paired.samples of workflowPsi must be set to TRUE. Additionally, if the argument same.time is set to TRUE, the time/age of the samples is checked, and samples without the same time/age are removed from the analysis.
#computing psi
climate.psi < workflowPsi(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
method = "euclidean",
paired.samples = TRUE, #this bit is important
same.time = TRUE, #removes samples with unequal time
format = "dataframe"
)
#ordered with lower psi on top
kable(climate.psi[order(climate.psi$psi), ], digits = 4, row.names = FALSE, caption = "Psi values between pairs of sequences in the 'climate' dataset.")
A 
B 
psi 

2 
4 
3.4092 
4 
2 
3.4092 
1 
3 
3.5702 
3 
1 
3.5702 
3 
4 
4.1139 
4 
3 
4.1139 
1 
2 
4.2467 
2 
1 
4.2467 
2 
3 
4.6040 
3 
2 
4.6040 
1 
4 
4.8791 
4 
1 
4.8791 
One question that may arise when comparing time series is “to what extent are dissimilarity values a result of chance?”. Answering this question requires to compare a given dissimilarity value with a distribution of dissimilarity values resulting from chance. However… how do we simulate chance in a multivariate timeseries? The natural answer is “permutation”. Since samples in a multivariate timeseries are ordered, randomly reshuffling samples is out of the question, because that would destroy the structure of the data. A more gentler alternative is to randomly switch single datapoints (a case of a variable) independently by variable. This kind of permutation is named “restricted permutation”, and preserves global trends within the data, but changes local structure.
A restricted permutation test on psi values requires the following steps:
Such a proportion represents the probability of obtaining a value lower than real psi by chance.
Since the restricted permutation only happens at a local scale within each column of each sequence, the probability values returned are very conservative and shouldn’t be interpreted in the same way pvalues are interpreted.
The process described above has been implemented in the workflowNullPsi function. We will apply it to three groups of the sequencesMIS dataset.
#getting example data
data(sequencesMIS)
#working with 3 groups (to make this fast)
sequencesMIS < sequencesMIS[sequencesMIS$MIS %in% c("MIS4", "MIS5", "MIS6"),]
#preparing sequences
sequencesMIS < prepareSequences(
sequences = sequencesMIS,
grouping.column = "MIS",
transformation = "hellinger"
)
The computation of the null psi values goes as follows:
random.psi < workflowNullPsi(
sequences = sequencesMIS,
grouping.column = "MIS",
method = "euclidean",
diagonal = TRUE,
ignore.blocks = TRUE,
repetitions = 99, #recommended value: 999
parallel.execution = TRUE
)
#there is also a highperformance version of this function with fewer options (diagonal = TRUE, ignore.blocks = TRUE, and method = "euclidean" are used by default)
random.psi < workflowNullPsiHP(
sequences = sequencesMIS,
grouping.column = "MIS",
repetitions = 99, #recommended value: 999
parallel.execution = TRUE
)
Note that the number of repetitions has been set to 9 in order to speedup execution. The actual number would ideally be 999.
The output is a list with two dataframes, psi and p.
The dataframe psi contains the real and random psi values. The column psi contains the dissimilarity between the sequences in the columns A and B. The columns r1 to r9 contain the psi values obtained from permutations of the sequences.
kable(random.psi$psi, digits = 4, caption = "True and null psi values generated by workflowNullPsi.")
A 
B 
psi 
r1 
r2 
r3 
r4 
r5 
r6 
r7 
r8 
r9 
r10 
r11 
r12 
r13 
r14 
r15 
r16 
r17 
r18 
r19 
r20 
r21 
r22 
r23 
r24 
r25 
r26 
r27 
r28 
r29 
r30 
r31 
r32 
r33 
r34 
r35 
r36 
r37 
r38 
r39 
r40 
r41 
r42 
r43 
r44 
r45 
r46 
r47 
r48 
r49 
r50 
r51 
r52 
r53 
r54 
r55 
r56 
r57 
r58 
r59 
r60 
r61 
r62 
r63 
r64 
r65 
r66 
r67 
r68 
r69 
r70 
r71 
r72 
r73 
r74 
r75 
r76 
r77 
r78 
r79 
r80 
r81 
r82 
r83 
r84 
r85 
r86 
r87 
r88 
r89 
r90 
r91 
r92 
r93 
r94 
r95 
r96 
r97 
r98 
r99 
r100 
r101 
r102 
r103 
r104 
r105 
r106 
r107 
r108 
r109 
r110 
r111 
r112 
r113 
r114 
r115 
r116 
r117 
r118 
r119 
r120 
r121 
r122 
r123 
r124 
r125 
r126 
r127 
r128 
r129 
r130 
r131 
r132 
r133 
r134 
r135 
r136 
r137 
r138 
r139 
r140 
r141 
r142 
r143 
r144 
r145 
r146 
r147 
r148 
r149 
r150 
r151 
r152 
r153 
r154 
r155 
r156 
r157 
r158 
r159 
r160 
r161 
r162 
r163 
r164 
r165 
r166 
r167 
r168 
r169 
r170 
r171 
r172 
r173 
r174 
r175 
r176 
r177 
r178 
r179 
r180 
r181 
r182 
r183 
r184 
r185 
r186 
r187 
r188 
r189 
r190 
r191 
r192 
r193 
r194 
r195 
r196 
r197 
r198 
r199 
r200 
r201 
r202 
r203 
r204 
r205 
r206 
r207 
r208 
r209 
r210 
r211 
r212 
r213 
r214 
r215 
r216 
r217 
r218 
r219 
r220 
r221 
r222 
r223 
r224 
r225 
r226 
r227 
r228 
r229 
r230 
r231 
r232 
r233 
r234 
r235 
r236 
r237 
r238 
r239 
r240 
r241 
r242 
r243 
r244 
r245 
r246 
r247 
r248 
r249 
r250 
r251 
r252 
r253 
r254 
r255 
r256 
r257 
r258 
r259 
r260 
r261 
r262 
r263 
r264 
r265 
r266 
r267 
r268 
r269 
r270 
r271 
r272 
r273 
r274 
r275 
r276 
r277 
r278 
r279 
r280 
r281 
r282 
r283 
r284 
r285 
r286 
r287 
r288 
r289 
r290 
r291 
r292 
r293 
r294 
r295 
r296 
r297 

MIS4 
MIS5 
2.4810 
0.0000 
3.7536 
1.3317 
0.0000 
3.8819 
1.4707 
0.0000 
3.6452 
1.2600 
0.0000 
3.5013 
1.3834 
0.0000 
4.0277 
1.3484 
0.0000 
2.9678 
1.1679 
0.0000 
3.7815 
1.3421 
0.0000 
3.8574 
1.2646 
0.0000 
3.7788 
1.2697 
0.0000 
3.7480 
1.2656 
0.0000 
3.3589 
1.3758 
0.0000 
2.6402 
1.4237 
0.0000 
3.7716 
1.1127 
0.0000 
3.7987 
1.4647 
0.0000 
3.6555 
1.4283 
0.000 
3.3450 
1.1930 
0.0000 
2.8487 
1.2650 
0.0000 
3.2751 
0.9897 
0.0000 
3.7681 
1.2368 
0.0000 
3.0029 
1.1084 
0.0000 
2.9344 
1.1265 
0.0000 
3.0970 
1.3298 
0.0000 
3.2655 
1.0583 
0.0000 
3.7342 
1.2722 
0.0000 
3.9553 
1.1021 
0.0000 
3.6368 
1.1076 
0.0000 
3.5591 
1.4114 
0.0000 
3.2966 
1.3217 
0.0000 
3.5813 
1.1319 
0.0000 
3.3199 
0.983 
0.0000 
3.8344 
1.1349 
0.0000 
3.9911 
1.2949 
0.0000 
3.3264 
1.2411 
0.0000 
3.8077 
1.3534 
0.0000 
3.7525 
1.2932 
0.0000 
3.3755 
1.4088 
0.0000 
3.7383 
1.3163 
0.0000 
3.5603 
1.3440 
0.0000 
4.5097 
1.6253 
0.0000 
3.0503 
1.1434 
0.0000 
3.5231 
1.3495 
0.0000 
3.5837 
1.2140 
0.0000 
4.2384 
1.2953 
0.0000 
3.4604 
1.2267 
0.0000 
3.5569 
1.3008 
0.0000 
3.8854 
1.5219 
0.0000 
4.0438 
1.1978 
0.0000 
3.6232 
1.1538 
0.0000 
3.7029 
1.7062 
0.0000 
4.0271 
1.3001 
0.0000 
2.8608 
1.3602 
0.0000 
3.8412 
1.0693 
0.0000 
3.3452 
1.3526 
0.0000 
3.7804 
1.1964 
0.0000 
3.8922 
1.2574 
0.0000 
3.4213 
1.1074 
0.0000 
3.5955 
1.2909 
0.0000 
3.1859 
1.2943 
0.0000 
3.6091 
1.2801 
0.0000 
2.9185 
1.3562 
0.0000 
4.1324 
1.1933 
0.0000 
3.8081 
1.4057 
0.0000 
3.5450 
1.4605 
0.0000 
3.7457 
1.2695 
0.0000 
3.6136 
1.3570 
0.0000 
3.8297 
1.3051 
0.0000 
3.5123 
1.0305 
0.0000 
3.4788 
1.0452 
0.0000 
3.3500 
1.5083 
0.0000 
3.4277 
1.0918 
0.0000 
3.5416 
1.1457 
0.0000 
3.8708 
1.4005 
0.0000 
3.6071 
1.2262 
0.0000 
3.5871 
1.3329 
0.0000 
4.0342 
1.4382 
0.0000 
3.6460 
1.2434 
0.0000 
3.5562 
1.2390 
0.0000 
3.5911 
1.124 
0.0000 
3.7326 
1.5393 
0.0000 
3.9109 
1.4098 
0.0000 
3.4930 
1.1832 
0.0000 
2.7677 
1.2942 
0.0000 
3.6112 
1.2785 
0.0000 
3.7562 
1.1567 
0.0000 
3.4105 
1.2620 
0.0000 
3.0288 
1.0794 
0.0000 
4.3534 
1.5268 
0.0000 
3.3925 
1.1479 
0.0000 
3.4932 
1.0770 
0.0000 
3.5992 
1.4058 
0.0000 
3.7622 
1.2329 
0.0000 
3.7002 
1.2894 
0.0000 
3.5179 
1.1918 
0.0000 
4.2240 
1.3402 
0.0000 
3.5807 
1.1993 
0.0000 
3.8250 
1.2416 
0.0000 
3.7522 
1.4727 
0.0000 
3.6046 
1.2223 
0.0000 
3.8396 
1.0294 
MIS4 
MIS6 
1.1020 
3.7536 
0.0000 
1.7778 
3.8819 
0.0000 
1.7145 
3.6452 
0.0000 
1.4906 
3.5013 
0.0000 
1.5997 
4.0277 
0.0000 
1.9284 
2.9678 
0.0000 
1.7680 
3.7815 
0.0000 
1.7752 
3.8574 
0.0000 
1.7149 
3.7788 
0.0000 
1.6472 
3.7480 
0.0000 
1.7108 
3.3589 
0.0000 
1.7109 
2.6402 
0.0000 
1.5961 
3.7716 
0.0000 
1.6565 
3.7987 
0.0000 
1.7153 
3.6555 
0.0000 
1.5970 
3.345 
0.0000 
1.6136 
2.8487 
0.0000 
1.6701 
3.2751 
0.0000 
1.4932 
3.7681 
0.0000 
1.6126 
3.0029 
0.0000 
1.9839 
2.9344 
0.0000 
1.8604 
3.0970 
0.0000 
1.5921 
3.2655 
0.0000 
1.6640 
3.7342 
0.0000 
1.5258 
3.9553 
0.0000 
1.6979 
3.6368 
0.0000 
1.5611 
3.5591 
0.0000 
1.6552 
3.2966 
0.0000 
1.6900 
3.5813 
0.0000 
1.8737 
3.3199 
0.0000 
1.717 
3.8344 
0.0000 
1.5950 
3.9911 
0.0000 
1.6099 
3.3264 
0.0000 
1.5220 
3.8077 
0.0000 
1.3983 
3.7525 
0.0000 
1.3374 
3.3755 
0.0000 
1.8604 
3.7383 
0.0000 
1.8833 
3.5603 
0.0000 
1.7699 
4.5097 
0.0000 
1.6200 
3.0503 
0.0000 
1.6997 
3.5231 
0.0000 
1.7551 
3.5837 
0.0000 
1.6986 
4.2384 
0.0000 
1.5901 
3.4604 
0.0000 
1.7288 
3.5569 
0.0000 
1.5894 
3.8854 
0.0000 
1.5422 
4.0438 
0.0000 
1.7261 
3.6232 
0.0000 
1.3016 
3.7029 
0.0000 
1.6888 
4.0271 
0.0000 
1.8111 
2.8608 
0.0000 
1.9147 
3.8412 
0.0000 
1.7930 
3.3452 
0.0000 
1.8602 
3.7804 
0.0000 
1.9937 
3.8922 
0.0000 
1.7063 
3.4213 
0.0000 
1.4280 
3.5955 
0.0000 
1.7968 
3.1859 
0.0000 
1.5233 
3.6091 
0.0000 
1.3548 
2.9185 
0.0000 
1.7555 
4.1324 
0.0000 
1.6919 
3.8081 
0.0000 
1.6254 
3.5450 
0.0000 
1.6068 
3.7457 
0.0000 
1.9194 
3.6136 
0.0000 
1.2503 
3.8297 
0.0000 
1.5950 
3.5123 
0.0000 
1.3527 
3.4788 
0.0000 
1.4569 
3.3500 
0.0000 
1.9489 
3.4277 
0.0000 
1.6999 
3.5416 
0.0000 
1.5389 
3.8708 
0.0000 
1.4900 
3.6071 
0.0000 
1.8822 
3.5871 
0.0000 
1.6469 
4.0342 
0.0000 
1.8238 
3.6460 
0.0000 
1.4199 
3.5562 
0.0000 
1.9143 
3.5911 
0.0000 
1.530 
3.7326 
0.0000 
1.6699 
3.9109 
0.0000 
1.8029 
3.4930 
0.0000 
1.6022 
2.7677 
0.0000 
1.6816 
3.6112 
0.0000 
1.7398 
3.7562 
0.0000 
1.3929 
3.4105 
0.0000 
1.6506 
3.0288 
0.0000 
1.5035 
4.3534 
0.0000 
1.9079 
3.3925 
0.0000 
1.7209 
3.4932 
0.0000 
1.5058 
3.5992 
0.0000 
1.7595 
3.7622 
0.0000 
1.7663 
3.7002 
0.0000 
1.7471 
3.5179 
0.0000 
1.6484 
4.2240 
0.0000 
1.6049 
3.5807 
0.0000 
1.6514 
3.8250 
0.0000 
1.8138 
3.7522 
0.0000 
1.7080 
3.6046 
0.0000 
1.2971 
3.8396 
0.0000 
1.5657 
MIS5 
MIS6 
1.4884 
1.3317 
1.7778 
0.0000 
1.4707 
1.7145 
0.0000 
1.2600 
1.4906 
0.0000 
1.3834 
1.5997 
0.0000 
1.3484 
1.9284 
0.0000 
1.1679 
1.7680 
0.0000 
1.3421 
1.7752 
0.0000 
1.2646 
1.7149 
0.0000 
1.2697 
1.6472 
0.0000 
1.2656 
1.7108 
0.0000 
1.3758 
1.7109 
0.0000 
1.4237 
1.5961 
0.0000 
1.1127 
1.6565 
0.0000 
1.4647 
1.7153 
0.0000 
1.4283 
1.5970 
0.0000 
1.193 
1.6136 
0.0000 
1.2650 
1.6701 
0.0000 
0.9897 
1.4932 
0.0000 
1.2368 
1.6126 
0.0000 
1.1084 
1.9839 
0.0000 
1.1265 
1.8604 
0.0000 
1.3298 
1.5921 
0.0000 
1.0583 
1.6640 
0.0000 
1.2722 
1.5258 
0.0000 
1.1021 
1.6979 
0.0000 
1.1076 
1.5611 
0.0000 
1.4114 
1.6552 
0.0000 
1.3217 
1.6900 
0.0000 
1.1319 
1.8737 
0.0000 
0.9830 
1.7170 
0.000 
1.1349 
1.5950 
0.0000 
1.2949 
1.6099 
0.0000 
1.2411 
1.5220 
0.0000 
1.3534 
1.3983 
0.0000 
1.2932 
1.3374 
0.0000 
1.4088 
1.8604 
0.0000 
1.3163 
1.8833 
0.0000 
1.3440 
1.7699 
0.0000 
1.6253 
1.6200 
0.0000 
1.1434 
1.6997 
0.0000 
1.3495 
1.7551 
0.0000 
1.2140 
1.6986 
0.0000 
1.2953 
1.5901 
0.0000 
1.2267 
1.7288 
0.0000 
1.3008 
1.5894 
0.0000 
1.5219 
1.5422 
0.0000 
1.1978 
1.7261 
0.0000 
1.1538 
1.3016 
0.0000 
1.7062 
1.6888 
0.0000 
1.3001 
1.8111 
0.0000 
1.3602 
1.9147 
0.0000 
1.0693 
1.7930 
0.0000 
1.3526 
1.8602 
0.0000 
1.1964 
1.9937 
0.0000 
1.2574 
1.7063 
0.0000 
1.1074 
1.4280 
0.0000 
1.2909 
1.7968 
0.0000 
1.2943 
1.5233 
0.0000 
1.2801 
1.3548 
0.0000 
1.3562 
1.7555 
0.0000 
1.1933 
1.6919 
0.0000 
1.4057 
1.6254 
0.0000 
1.4605 
1.6068 
0.0000 
1.2695 
1.9194 
0.0000 
1.3570 
1.2503 
0.0000 
1.3051 
1.5950 
0.0000 
1.0305 
1.3527 
0.0000 
1.0452 
1.4569 
0.0000 
1.5083 
1.9489 
0.0000 
1.0918 
1.6999 
0.0000 
1.1457 
1.5389 
0.0000 
1.4005 
1.4900 
0.0000 
1.2262 
1.8822 
0.0000 
1.3329 
1.6469 
0.0000 
1.4382 
1.8238 
0.0000 
1.2434 
1.4199 
0.0000 
1.2390 
1.9143 
0.0000 
1.1240 
1.5300 
0.000 
1.5393 
1.6699 
0.0000 
1.4098 
1.8029 
0.0000 
1.1832 
1.6022 
0.0000 
1.2942 
1.6816 
0.0000 
1.2785 
1.7398 
0.0000 
1.1567 
1.3929 
0.0000 
1.2620 
1.6506 
0.0000 
1.0794 
1.5035 
0.0000 
1.5268 
1.9079 
0.0000 
1.1479 
1.7209 
0.0000 
1.0770 
1.5058 
0.0000 
1.4058 
1.7595 
0.0000 
1.2329 
1.7663 
0.0000 
1.2894 
1.7471 
0.0000 
1.1918 
1.6484 
0.0000 
1.3402 
1.6049 
0.0000 
1.1993 
1.6514 
0.0000 
1.2416 
1.8138 
0.0000 
1.4727 
1.7080 
0.0000 
1.2223 
1.2971 
0.0000 
1.0294 
1.5657 
0.0000 
The dataframe p contains the probability of obtaining the real psi value by chance for each combination of sequences.
A 
B 
p 

MIS4 
MIS5 
0.6677852 
MIS4 
MIS6 
0.3355705 
MIS5 
MIS6 
0.6845638 
What variables are more important in explaining the dissimilarity between two sequences?, or in other words, what variables contribute the most to the dissimilarity between two sequences? One reasonable answer is: the one that reduces dissimilarity the most when removed from the data. This section explains how to use the function workflowImportance follows such a principle to evaluate the importance of given variables in explaining differences between sequences.
First, we prepare the data. It is again sequencesMIS, but with only three groups selected (MIS 4 to 6) to simplify the analysis.
#getting example data
data(sequencesMIS)
#getting three groups only to simplify
sequencesMIS < sequencesMIS[sequencesMIS$MIS %in% c("MIS4", "MIS5", "MIS6"),]
#preparing sequences
sequences < prepareSequences(
sequences = sequencesMIS,
grouping.column = "MIS",
merge.mode = "complete"
)
The workflow function is pretty similar to the ones explained above. However, unlike the other functions in the package, that parallelize across the comparison of pairs of sequences, this one parallelizes the computation of psi on combinations of columns, removing one column each time.
WARNING: the argument ‘exclude.columns’ of ‘workflowImportance’ does not work in version 1.0.0 (available in CRAN), but the bug is fixed in version 1.0.1 (available in GitHub). If you are using 1.0.0, I recommend you to subset ‘sequences’ so only the grouping column and the numeric columns to be compared are available for the function.
psi.importance < workflowImportance(
sequences = sequencesMIS,
grouping.column = "MIS",
method = "euclidean",
diagonal = TRUE,
ignore.blocks = TRUE
)
#there is also a high performance version of this function, but with fewer options (it uses euclidean, diagonal, and ignores blocks by default)
psi.importance < workflowImportanceHP(
sequences = sequencesMIS,
grouping.column = "MIS"
)
The output is a list with two slots named psi and psi.drop.
The dataframe psi contains psi values for each combination of variables (named in the coluns A and B) computed for all columns in the column All variables, and one column per variable named Without variable_name containing the psi value when that variable is removed from the compared sequences.
kable(psi.importance$psi, digits = 4, caption = "Psi values with all variables (column 'All variables'), and without one variable at a time.")
A 
B 
All variables 
Without Carpinus 
Without Tilia 
Without Alnus 
Without Pinus 
Without Betula 
Without Quercus 

MIS4 
MIS5 
4.3929 
4.4018 
4.4070 
4.4087 
5.9604 
4.4240 
0.9878 
MIS4 
MIS6 
1.0376 
1.0374 
1.0372 
1.0303 
1.6222 
1.0311 
0.9646 
MIS5 
MIS6 
1.7680 
1.7684 
1.7685 
1.7695 
1.1980 
1.7873 
1.0706 
This table can be plotted as a bar plot as follows:
#extracting object
psi.df < psi.importance$psi
#to long format
psi.df.long < tidyr::gather(psi.df, variable, psi, 3:ncol(psi.df))
#creating column with names of the sequences
psi.df.long$name < paste(psi.df.long$A, psi.df.long$B, sep="  ")
#plot
ggplot(data=psi.df.long, aes(x=variable, y=psi, fill=psi)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap("name") +
scale_fill_viridis(direction = 1) +
ggtitle("Contribution of separated variables to dissimilarity.") +
labs(fill = "Psi")
The second table, named psi.drop describes the drop in psi values, in percentage, when the given variable is removed from the analysis. Large positive numbers indicate that dissimilarity drops (increase in similarity) when the given variable is removed, confirming that the variable is important to explain the dissimilarity between both sequences. Negative values indicate an increase in dissimilarity between the sequences when the variable is dropped.
In summary:
kable(psi.importance$psi.drop, caption = "Drop in psi, as percentage of the psi value obtained when using all variables. Positive values indicate that the sequences become more similar when the given variable is removed (contribution to dissimilarity), while negative values indicate that the sequences become more dissimilar when the variable is removed (contribution to similarity).")
A 
B 
Carpinus 
Tilia 
Alnus 
Pinus 
Betula 
Quercus 

MIS4 
MIS5 
0.20 
0.32 
0.36 
35.68 
0.71 
77.51 
MIS4 
MIS6 
0.01 
0.03 
0.70 
56.35 
0.62 
7.03 
MIS5 
MIS6 
0.02 
0.03 
0.09 
32.24 
1.09 
39.44 
#extracting object
psi.drop.df < psi.importance$psi.drop
#to long format
psi.drop.df.long < tidyr::gather(psi.drop.df, variable, psi, 3:ncol(psi.drop.df))
#creating column with names of the sequences
psi.drop.df.long$name < paste(psi.drop.df.long$A, psi.drop.df.long$B, sep="  ")
#plot
ggplot(data=psi.drop.df.long, aes(x=variable, y=psi, fill=psi)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_hline(yintercept = 0, size = 0.3) +
facet_wrap("name") +
scale_fill_viridis(direction = 1) +
ggtitle("Drop in dissimilarity when variables are removed.") +
ylab("Drop in dissimilarity (%)") +
labs(fill = "Psi drop (%)")
In this scenario the user has one short and one long sequence, and the goal is to find the section in the long sequence that better matches the short one. To recreate this scenario we use the dataset sequencesMIS. The first 10 samples will serve as short sequence, and the first 40 samples as long sequence. These small subsets are selected to speedup the execution time of this example.
#loading the data
data(sequencesMIS)
#removing grouping column
sequencesMIS$MIS < NULL
#subsetting to get the short sequence
MIS.short < sequencesMIS[1:10, ]
#subsetting to get the long sequence
MIS.long < sequencesMIS[1:40, ]
The sequences have to be prepared and transformed. For simplicity, the sequences are named short and long, and the grouping column is named id, but the user can name them at will. Since the data represents community composition, a Hellinger transformation is applied.
MIS.short.long < prepareSequences(
sequence.A = MIS.short,
sequence.A.name = "short",
sequence.B = MIS.long,
sequence.B.name = "long",
grouping.column = "id",
transformation = "hellinger"
)
The function workflowPartialMatch shown below is going to subset the long sequence in sizes between min.length and max.length. In the example below this search space has the same size as MIS.short to speedup the execution of this example, but wider windows are possible. If left empty, the length of the segment in the long sequence to be matched will have the same number of samples as the short sequence. In the example below we look for segments of the same length, two samples shorter, and two samples longer than the shorter sequence.
MIS.psi < workflowPartialMatch(
sequences = MIS.short.long,
grouping.column = "id",
method = "euclidean",
diagonal = TRUE,
ignore.blocks = TRUE,
min.length = nrow(MIS.short),
max.length = nrow(MIS.short)
)
The function returns a dataframe with three columns: first.row (first row of the matched segment of the long sequence), last.row (last row of the matched segment of the long sequence), and psi (ordered from lower to higher). In this case, since the long sequence contains the short sequence, the first row shows a perfect match.
kable(MIS.psi[1:15, ], digits = 4, caption = "First and last row of a section of the long sequence along with the psi value obtained during the partial matching.")
first.row 
last.row 
psi 

1 
10 
0.0000 
3 
12 
0.2707 
4 
13 
0.2815 
2 
11 
0.3059 
6 
15 
0.4343 
5 
14 
0.5509 
9 
18 
1.3055 
10 
19 
1.3263 
8 
17 
1.3844 
7 
16 
1.3949 
15 
24 
1.4428 
12 
21 
1.4711 
17 
26 
1.4959 
11 
20 
1.5101 
18 
27 
1.5902 
Subsetting the long sequence to obtain the segment best matching with the short sequence goes as follows.
#indices of the best matching segment
best.match.indices < MIS.psi[1, "first.row"]:MIS.psi[1, "last.row"]
#subsetting by these indices
best.match < MIS.long[best.match.indices, ]
Under this scenario, the objective is to combine two sequences into a single composite sequence. The basic assumption followed by the algorithm building the composite sequence is most similar samples should go together, but respecting the original ordering of the sequences. Therefore, the output will contain the samples in both sequences ordered in a way that minimizes the multivariate distance between consecutive samples. This scenario assumes that at least one of the sequences do not have a time/age/depth column, or that the values in such a column are uncertain. In any case, time/age/depth is not considered as a factor in the generation of the composite sequence.
The example below uses the pollenGP dataset, which contains 200 samples, with 40 pollen types each. To create a smalle case study, the code below separates the first 20 samples of the sequence into two different sequences with 10 randomly selected samples each. Even though this scenario assumes that these sequences do not have depth or age, these columns will be kept so the result can be assessed. That is why these columns are added to the exclude.columns argument. Also, note that the argument transformation is set to “none”, so the output is not transformed, and the outcome can be easily interpreted. This will give more weight to the most abundant taxa, which will in fact guide the slotting.
#loading the data
data(pollenGP)
#getting first 20 samples
pollenGP < pollenGP[1:20, ]
#sampling indices
set.seed(10) #to get same result every time
sampling.indices < sort(sample(1:20, 10))
#subsetting the sequence
A < pollenGP[sampling.indices, ]
B < pollenGP[sampling.indices, ]
#preparing the sequences
AB < prepareSequences(
sequence.A = A,
sequence.A.name = "A",
sequence.B = B,
sequence.B.name = "B",
grouping.column = "id",
exclude.columns = c("depth", "age"),
transformation = "none"
)
Once the sequences are prepared, the function workflowSlotting will allow to combine (slot) them. The function computes a distance matrix between the samples in both sequences according to the method argument, computes the leastcost matrix, and generates the leastcost path. Note that it only uses an orthogonal method considering blocks, since this is the only option really suitable for this task.
AB.combined < workflowSlotting(
sequences = AB,
grouping.column = "id",
time.column = "age",
exclude.columns = "depth",
method = "euclidean",
plot = TRUE
)
The function reads the leastcost path in order to find the combination of samples of both sequences that minimizes dissimilarity, constrained by the order of the samples on each sequence. The output dataframe has a column named original.index, which has the index of each sample in the original datasets.
id 
original.index 
depth 
age 
Abies 
Juniperus 
Hedera 
Plantago 
Boraginaceae 
Crassulaceae 


1 
A 
1 
3 
3.97 
11243 
0 
5 
12 
21 
95 
11 
B 
1 
1 
3.92 
11108 
0 
7 
0 
5 
20 
12 
B 
2 
2 
3.95 
11189 
0 
3 
3 
15 
47 
13 
B 
3 
4 
4.00 
11324 
0 
8 
43 
60 
65 
2 
A 
2 
6 
4.05 
11459 
0 
20 
73 
94 
1 
14 
B 
4 
5 
4.02 
11378 
0 
44 
76 
110 
0 
3 
A 
3 
7 
4.07 
11514 
0 
20 
80 
100 
0 
4 
A 
4 
8 
4.10 
11595 
0 
34 
80 
155 
0 
5 
A 
5 
9 
4.17 
11784 
0 
22 
44 
131 
0 
15 
B 
5 
13 
4.27 
12054 
0 
37 
30 
150 
0 
6 
A 
6 
10 
4.20 
11865 
0 
35 
30 
112 
0 
7 
A 
7 
11 
4.22 
11919 
0 
30 
45 
150 
0 
8 
A 
8 
12 
4.25 
12000 
0 
44 
35 
150 
0 
9 
A 
9 
15 
4.32 
12189 
0 
43 
17 
120 
0 
16 
B 
6 
14 
4.30 
12135 
0 
50 
10 
120 
0 
Note that several samples show inverted ages with respect to the previous samples. This is to be expected, since the slotting algorithm only takes into account distance/dissimilarity between adjacent samples to generate the ordering.
This scenario assumes that the user has two METS, one of them with a given attribute (age/time) that needs to be transferred to the other sequence by using similarity/dissimilarity (constrained by sample order) as a transfer criterion. This case is relatively common in palaeoecology, when a given dataset is dated, and another taken at a close location is not.
The code below prepares the data for the example. The sequence pollenGP is the reference sequence, and contains the column age. The sequence pollenX is the target sequence, without an age column. We generate it by taking 40 random samples between the samples 50 and 100 of pollenGP. The sequences are prepared with prepareSequences, as usual, with the identificators “GP” and “X”
#loading sample dataset
data(pollenGP)
#subset pollenGP to make a shorter dataset
pollenGP < pollenGP[1:50, ]
#generating a subset of pollenGP
set.seed(10)
pollenX < pollenGP[sort(sample(1:50, 40)), ]
#we separate the age column
pollenX.age < pollenX$age
#and remove the age values from pollenX
pollenX$age < NULL
pollenX$depth < NULL
#removing some samples from pollenGP
#so pollenX is not a perfect subset of pollenGP
pollenGP < pollenGP[sample(1:50, 10), ]
#prepare sequences
GP.X < prepareSequences(
sequence.A = pollenGP,
sequence.A.name = "GP",
sequence.B = pollenX,
sequence.B.name = "X",
grouping.column = "id",
time.column = "age",
exclude.columns = "depth",
transformation = "none"
)
The transfer of “age” values from GP to X can be done in two ways, both constrained by sample order:
A direct transfer of an attribute from the samples of one sequence to the samples of another requires to compute a distance matrix between samples, the leastcost matrix and its leastcost path (both with the option diagonal activated), and to parse the leastcost path file to assign attribute values. This is done by the function workflowTransfer with the option
#parameters
X.new < workflowTransfer(
sequences = GP.X,
grouping.column = "id",
time.column = "age",
method = "euclidean",
transfer.what = "age",
transfer.from = "GP",
transfer.to = "X",
mode = "direct"
)
kable(X.new[1:15, ], digits = 4)
id 
depth 
age 
Abies 
Juniperus 
Hedera 
Plantago 
Boraginaceae 
Crassulaceae 
Pinus 
Ranunculaceae 
Rhamnus 
Caryophyllaceae 
Dipsacaceae 
Betula 
Acer 
Armeria 
Tilia 
Hippophae 
Salix 
Labiatae 
Valeriana 
Nymphaea 
Umbelliferae 
Sanguisorba_minor 
Plantago.lanceolata 
Campanulaceae 
Asteroideae 
Gentiana 
Fraxinus 
Cichorioideae 
Taxus 
Rumex 
Cedrus 
Ranunculus.subgen..Batrachium 
Cyperaceae 
Corylus 
Myriophyllum 
Filipendula 
Vitis 
Rubiaceae 
Polypodium 


41 
X 
0 
3.92 
11108 
0 
7 
0 
5 
20 
0 
13 
0 
2 
1 
0 
2 
41 
0 
0 
0 
0 
0 
0 
0 
1 
0 
8 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
60 
2 
0 
0 
42 
X 
0 
4.00 
11324 
0 
8 
43 
60 
65 
0 
10 
0 
0 
2 
4 
0 
0 
0 
0 
0 
0 
2 
0 
4 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
1 
11 
0 
2 
0 
43 
X 
0 
4.02 
11378 
0 
44 
76 
110 
0 
0 
0 
0 
0 
2 
11 
0 
0 
0 
0 
3 
1 
3 
0 
0 
5 
0 
0 
1 
0 
6 
0 
0 
1 
0 
1 
0 
0 
0 
0 
1 
1 
1 
44 
X 
0 
4.05 
11459 
0 
20 
73 
94 
1 
0 
0 
0 
0 
1 
10 
0 
2 
0 
0 
3 
1 
3 
0 
0 
3 
0 
0 
0 
0 
3 
0 
0 
2 
0 
0 
0 
0 
0 
0 
1 
0 
0 
45 
X 
0 
4.07 
11514 
0 
20 
80 
100 
0 
0 
0 
0 
0 
10 
4 
0 
0 
1 
0 
0 
0 
3 
0 
0 
1 
0 
0 
0 
0 
8 
1 
0 
0 
0 
0 
0 
1 
2 
1 
0 
1 
0 
46 
X 
0 
4.07 
11595 
0 
34 
80 
155 
0 
0 
0 
2 
0 
2 
13 
0 
0 
0 
0 
1 
0 
2 
0 
0 
2 
0 
0 
0 
0 
6 
0 
0 
0 
0 
0 
0 
1 
1 
0 
0 
1 
0 
47 
X 
0 
4.17 
11784 
0 
22 
44 
131 
0 
0 
0 
0 
0 
1 
13 
0 
0 
0 
0 
5 
0 
5 
0 
0 
3 
0 
0 
0 
0 
6 
0 
0 
0 
0 
2 
0 
1 
0 
0 
2 
0 
0 
48 
X 
0 
4.20 
11865 
0 
35 
30 
112 
0 
0 
0 
0 
0 
4 
8 
0 
0 
0 
2 
2 
1 
2 
0 
0 
7 
0 
1 
0 
0 
10 
1 
0 
0 
0 
1 
0 
1 
1 
0 
2 
1 
0 
49 
X 
0 
4.22 
11919 
0 
30 
45 
150 
0 
0 
0 
0 
0 
4 
11 
0 
0 
0 
0 
2 
3 
1 
0 
0 
1 
0 
0 
0 
0 
13 
1 
0 
0 
0 
0 
0 
0 
0 
0 
2 
1 
0 
50 
X 
0 
4.25 
12000 
0 
44 
35 
150 
0 
0 
0 
0 
0 
2 
8 
0 
0 
0 
0 
3 
1 
0 
0 
0 
5 
0 
0 
0 
0 
6 
3 
0 
0 
0 
0 
0 
0 
2 
0 
1 
0 
0 
51 
X 
0 
4.25 
12054 
0 
37 
30 
150 
0 
0 
1 
0 
0 
6 
10 
0 
0 
0 
0 
7 
2 
2 
0 
0 
6 
0 
0 
0 
0 
7 
1 
0 
0 
0 
0 
0 
0 
3 
0 
4 
0 
0 
52 
X 
0 
4.32 
12135 
0 
50 
10 
120 
0 
0 
0 
0 
0 
1 
7 
0 
0 
0 
0 
1 
0 
1 
0 
0 
8 
0 
0 
0 
0 
8 
2 
0 
2 
0 
0 
0 
0 
0 
0 
1 
0 
0 
53 
X 
0 
4.32 
12189 
0 
43 
17 
120 
0 
0 
0 
0 
0 
2 
15 
0 
0 
1 
1 
2 
0 
2 
0 
0 
5 
1 
0 
0 
0 
6 
2 
0 
2 
0 
1 
0 
2 
1 
0 
0 
0 
0 
54 
X 
0 
4.40 
12324 
0 
50 
11 
86 
0 
0 
0 
0 
0 
2 
15 
0 
0 
2 
1 
4 
5 
3 
0 
0 
6 
0 
0 
0 
0 
5 
1 
0 
2 
0 
0 
1 
1 
0 
0 
3 
1 
0 
55 
X 
0 
4.40 
12405 
0 
51 
6 
70 
0 
0 
0 
0 
0 
1 
16 
0 
0 
4 
1 
2 
4 
2 
0 
0 
5 
2 
0 
0 
0 
1 
2 
0 
1 
0 
0 
0 
0 
0 
0 
0 
3 
1 
The algorithm finds the most similar samples, and transfers attribute values directly between them. This can result in duplicated attribute values, as highlighted in the table above. The Pearson correlation between the original ages (stored in pollenX.age) and the assigned ones is 0.9996, so it can be concluded that in spite of its simplicity, this algorithm yields accurate results.
If we consider:
The unknwon value
The code below exemplifies the operation, using the samples 1 and 4 of the dataset pollenGP as
#loading data
data(pollenGP)
#samples in A
Ai < pollenGP[1, 3:ncol(pollenGP)]
Aj < pollenGP[4, 3:ncol(pollenGP)]
#ages of the samples in A
Ati < pollenGP[1, "age"]
Atj < pollenGP[4, "age"]
#sample in B
Bk < pollenGP[2, 3:ncol(pollenGP)]
#computing distances between Bk, Ai, and Aj
DBkAi < distance(Bk, Ai)
DBkAj < distance(Bk, Aj)
#normalizing the distances to 1
wi < DBkAi / (DBkAi + DBkAj)
wj < DBkAj / (DBkAi + DBkAj)
#computing Btk
Btk < wi * Ati + wj * Atj
The table below shows the observed versus the predicted values for
temp.df < data.frame(Observed = pollenGP[3, "age"], Predicted = Btk)
kable(t(temp.df), digits = 4, caption = "Observed versus predicted value in the interpolation of an age based on similarity between samples.")
Observed 
3.9700 
Predicted 
3.9735 
Below we create some example data, where a subset of pollenGP will be the donor of age values, and another subset of it, named pollenX will be the receiver of the age values.
#loading sample dataset
data(pollenGP)
#subset pollenGP to make a shorter dataset
pollenGP < pollenGP[1:50, ]
#generating a subset of pollenGP
set.seed(10)
pollenX < pollenGP[sort(sample(1:50, 40)), ]
#we separate the age column
pollenX.age < pollenX$age
#and remove the age values from pollenX
pollenX$age < NULL
pollenX$depth < NULL
#removing some samples from pollenGP
#so pollenX is not a perfect subset of pollenGP
pollenGP < pollenGP[sample(1:50, 10), ]
#prepare sequences
GP.X < prepareSequences(
sequence.A = pollenGP,
sequence.A.name = "GP",
sequence.B = pollenX,
sequence.B.name = "X",
grouping.column = "id",
time.column = "age",
exclude.columns = "depth",
transformation = "none"
)
To transfer attributes from GP to X we use the workflowTransfer function with the option mode = “interpolate”.
#parameters
X.new < workflowTransfer(
sequences = GP.X,
grouping.column = "id",
time.column = "age",
method = "euclidean",
transfer.what = "age",
transfer.from = "GP",
transfer.to = "X",
mode = "interpolated"
)
kable(X.new[1:15, ], digits = 4, caption = "Result of the transference of an age attribute from one sequence to another. NA values are expected when predicted ages for a given sample yield a higher number than the age of the previous sample.") %>%
row_spec(c(8, 13), bold = T)
id 
depth 
age 
Abies 
Juniperus 
Hedera 
Plantago 
Boraginaceae 
Crassulaceae 
Pinus 
Ranunculaceae 
Rhamnus 
Caryophyllaceae 
Dipsacaceae 
Betula 
Acer 
Armeria 
Tilia 
Hippophae 
Salix 
Labiatae 
Valeriana 
Nymphaea 
Umbelliferae 
Sanguisorba_minor 
Plantago.lanceolata 
Campanulaceae 
Asteroideae 
Gentiana 
Fraxinus 
Cichorioideae 
Taxus 
Rumex 
Cedrus 
Ranunculus.subgen..Batrachium 
Cyperaceae 
Corylus 
Myriophyllum 
Filipendula 
Vitis 
Rubiaceae 
Polypodium 


41 
X 
0 
3.9497 
11108 
0 
7 
0 
5 
20 
0 
13 
0 
2 
1 
0 
2 
41 
0 
0 
0 
0 
0 
0 
0 
1 
0 
8 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
60 
2 
0 
0 
42 
X 
0 
3.9711 
11324 
0 
8 
43 
60 
65 
0 
10 
0 
0 
2 
4 
0 
0 
0 
0 
0 
0 
2 
0 
4 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
1 
11 
0 
2 
0 
43 
X 
0 
4.0009 
11378 
0 
44 
76 
110 
0 
0 
0 
0 
0 
2 
11 
0 
0 
0 
0 
3 
1 
3 
0 
0 
5 
0 
0 
1 
0 
6 
0 
0 
1 
0 
1 
0 
0 
0 
0 
1 
1 
1 
44 
X 
0 
4.0219 
11459 
0 
20 
73 
94 
1 
0 
0 
0 
0 
1 
10 
0 
2 
0 
0 
3 
1 
3 
0 
0 
3 
0 
0 
0 
0 
3 
0 
0 
2 
0 
0 
0 
0 
0 
0 
1 
0 
0 
45 
X 
0 
4.0522 
11514 
0 
20 
80 
100 
0 
0 
0 
0 
0 
10 
4 
0 
0 
1 
0 
0 
0 
3 
0 
0 
1 
0 
0 
0 
0 
8 
1 
0 
0 
0 
0 
0 
1 
2 
1 
0 
1 
0 
46 
X 
0 
4.1361 
11595 
0 
34 
80 
155 
0 
0 
0 
2 
0 
2 
13 
0 
0 
0 
0 
1 
0 
2 
0 
0 
2 
0 
0 
0 
0 
6 
0 
0 
0 
0 
0 
0 
1 
1 
0 
0 
1 
0 
47 
X 
0 
4.1972 
11784 
0 
22 
44 
131 
0 
0 
0 
0 
0 
1 
13 
0 
0 
0 
0 
5 
0 
5 
0 
0 
3 
0 
0 
0 
0 
6 
0 
0 
0 
0 
2 
0 
1 
0 
0 
2 
0 
0 
48 
X 
0 
NA 
11865 
0 
35 
30 
112 
0 
0 
0 
0 
0 
4 
8 
0 
0 
0 
2 
2 
1 
2 
0 
0 
7 
0 
1 
0 
0 
10 
1 
0 
0 
0 
1 
0 
1 
1 
0 
2 
1 
0 
49 
X 
0 
4.2027 
11919 
0 
30 
45 
150 
0 
0 
0 
0 
0 
4 
11 
0 
0 
0 
0 
2 
3 
1 
0 
0 
1 
0 
0 
0 
0 
13 
1 
0 
0 
0 
0 
0 
0 
0 
0 
2 
1 
0 
50 
X 
0 
4.2237 
12000 
0 
44 
35 
150 
0 
0 
0 
0 
0 
2 
8 
0 
0 
0 
0 
3 
1 
0 
0 
0 
5 
0 
0 
0 
0 
6 
3 
0 
0 
0 
0 
0 
0 
2 
0 
1 
0 
0 
51 
X 
0 
4.2999 
12054 
0 
37 
30 
150 
0 
0 
1 
0 
0 
6 
10 
0 
0 
0 
0 
7 
2 
2 
0 
0 
6 
0 
0 
0 
0 
7 
1 
0 
0 
0 
0 
0 
0 
3 
0 
4 
0 
0 
52 
X 
0 
NA 
12135 
0 
50 
10 
120 
0 
0 
0 
0 
0 
1 
7 
0 
0 
0 
0 
1 
0 
1 
0 
0 
8 
0 
0 
0 
0 
8 
2 
0 
2 
0 
0 
0 
0 
0 
0 
1 
0 
0 
53 
X 
0 
NA 
12189 
0 
43 
17 
120 
0 
0 
0 
0 
0 
2 
15 
0 
0 
1 
1 
2 
0 
2 
0 
0 
5 
1 
0 
0 
0 
6 
2 
0 
2 
0 
1 
0 
2 
1 
0 
0 
0 
0 
54 
X 
0 
4.3501 
12324 
0 
50 
11 
86 
0 
0 
0 
0 
0 
2 
15 
0 
0 
2 
1 
4 
5 
3 
0 
0 
6 
0 
0 
0 
0 
5 
1 
0 
2 
0 
0 
1 
1 
0 
0 
3 
1 
0 
55 
X 
0 
4.4155 
12405 
0 
51 
6 
70 
0 
0 
0 
0 
0 
1 
16 
0 
0 
4 
1 
2 
4 
2 
0 
0 
5 
2 
0 
0 
0 
1 
2 
0 
1 
0 
0 
0 
0 
0 
0 
0 
3 
1 
When interpolated values of the age column (transferred attribute via interpolation) show the value NA, it means that the interpolation yielded an age lower than the previous one. This happens when the same
Without taking into account these NA values, the Pearson correlation of the interpolated ages with the real ones is 0.9985.
IMPORTANT: the interpretation of the interpolated ages requires a careful consideration. Please, don’t do it blindly, because this algorithm has its limitations. For example, significant hiatuses in the data can introduce wild variations in interpolated ages.