distantia: an R package to compute the dissimilarity between multivariate time-series ================

Summary

The package distantia allows to measure the dissimilarity between multivariate ecological time-series (METS hereafter). The package assumes that the target sequences are ordered along a given dimension, being depth and time the most common ones, but others such as latitude or elevation are also possible. Furthermore, the target METS can be regular or irregular, and have their samples aligned (same age/time/depth) or unaligned (different age/time/depth). The only requirement is that the sequences must have at least two (but ideally more) columns with the same name and units representing different variables relevant to the dynamics of an ecological system.

In this document I explain the logics behind the method, show how to use it, and demonstrate how the distantia package introduces useful tools to compare multivariate time-series. The topics covered in this document are:

Installation of the package.
Comparing two irregular METS.
Comparing multiple irregular METS.
Comparing regular and aligned METS.
Restricted permutation test to assess the significance of dissimilarity scores
Assessing the contribution of every variable to the dissimilarity between two METS.
Partial matching: finding the section in a long METS more similar to a given shorter one.
Sequence slotting: combining samples of two METS into a single composite sequence.
Tranferring an attribute from one METS to another: direct and interpolated modes.

Installation

You can install the released version of distantia (currently v1.0.0) from CRAN with:

install.packages("distantia")

And the development version (currently v1.0.1) from GitHub with:

library(devtools)
devtools::install_github("BlasBenito/distantia")

Loading the library, plus other helper libraries:

library(distantia)
library(ggplot2)
library(viridis)
library(kableExtra)
library(qgraph)
library(tidyr)

Comparing two irregular METS

In this section I will use two example datasets based on the Abernethy pollen core (Birks and Mathewes, 1978) to fully explain the logical backbone of the dissimilarity analyses implemented in distantia.

#loading sequences
data(sequenceA)
data(sequenceB)

#showing first rows
kable(sequenceA[1:15, ], caption = "Sequence A")

Sequence A
betula	pinus	corylu	junipe	empetr	gramin	cypera	rumex
79	271	36	0	4	7	25	0
113	320	42	0	4	3	11	0
51	420	39	0	2	1	12	0
130	470	6	0	0	2	4	0
31	450	6	0	3	2	3	0
59	425	12	0	0	2	3	0
78	386	29	2	0	0	2	0
71	397	52	2	0	6	3	0
140	310	50	2	0	4	3	0
150	323	34	2	0	11	2	0
175	317	37	2	0	11	3	0
181	345	28	3	0	7	3	0
153	285	36	2	0	8	3	1
214	315	54	2	1	13	5	0
200	210	41	6	0	10	4	0

kable(sequenceB[1:15, ], caption = "Sequence B")

Sequence B
betula	pinus	corylu	junipe	gramin	cypera	artemi	rumex
19	175	NA	2	34	39	1	0
18	119	28	1	36	44	0	4
30	99	37	0	2	20	0	1
26	101	29	0	0	18	0	0
31	99	30	0	1	10	0	0
24	97	28	0	2	9	0	0
23	105	34	0	1	6	0	0
48	112	46	0	0	12	0	0
29	108	16	0	6	3	0	0
23	110	21	0	2	11	0	1
5	119	19	0	1	1	0	0
30	105	NA	0	9	7	0	0
22	116	17	0	1	7	0	0
24	115	20	0	2	4	0	0
26	119	23	0	4	0	0	0

Data preparation

Notice that sequenceB has a few NA values (that were introduced to serve as an example). The function prepareSequences gets them ready for analysis by matching colum names and handling empty data. It allows to merge two or more METS into a single dataframe ready for further analyses. Note that, since the data represents pollen abundances, a Hellinger transformation (square root of the relative proportions of each taxa, it balances the relative abundances of rare and dominant taxa) is applied. This transformation balances the relative importance of very abundant versus rare taxa. The function prepareSequences will generally be the starting point of any analysis performed with the distantia package.

#checking the function help-file.
help(prepareSequences)

#preparing sequences
AB.sequences <- prepareSequences(
  sequence.A = sequenceA,
  sequence.A.name = "A",
  sequence.B = sequenceB,
  sequence.B.name = "B",
  merge.mode = "complete",
  if.empty.cases = "zero",
  transformation = "hellinger"
)

#showing first rows of the transformed data
kable(AB.sequences[1:15, ], digits = 4, caption = "Sequences A and B ready for analysis.")

Sequences A and B ready for analysis.
id	betula	pinus	corylu	junipe	empetr	gramin	cypera	artemi	rumex
A	0.4327	0.8014	0.2921	0.0002	0.0974	0.1288	0.2434	2e-04	0.0002
A	0.4788	0.8057	0.2919	0.0001	0.0901	0.0780	0.1494	1e-04	0.0001
A	0.3117	0.8944	0.2726	0.0001	0.0617	0.0436	0.1512	1e-04	0.0001
A	0.4609	0.8763	0.0990	0.0001	0.0001	0.0572	0.0808	1e-04	0.0001
A	0.2503	0.9535	0.1101	0.0001	0.0778	0.0636	0.0778	1e-04	0.0001
A	0.3432	0.9210	0.1548	0.0001	0.0001	0.0632	0.0774	1e-04	0.0001
A	0.3962	0.8813	0.2416	0.0634	0.0001	0.0001	0.0634	1e-04	0.0001
A	0.3657	0.8647	0.3129	0.0614	0.0001	0.1063	0.0752	1e-04	0.0001
A	0.5245	0.7804	0.3134	0.0627	0.0001	0.0886	0.0768	1e-04	0.0001
A	0.5361	0.7866	0.2552	0.0619	0.0001	0.1452	0.0619	1e-04	0.0001
A	0.5667	0.7627	0.2606	0.0606	0.0001	0.1421	0.0742	1e-04	0.0001
A	0.5650	0.7800	0.2222	0.0727	0.0001	0.1111	0.0727	1e-04	0.0001
A	0.5599	0.7642	0.2716	0.0640	0.0001	0.1280	0.0784	1e-04	0.0453
A	0.5952	0.7222	0.2990	0.0575	0.0407	0.1467	0.0910	1e-04	0.0001
A	0.6516	0.6677	0.2950	0.1129	0.0001	0.1457	0.0922	1e-04	0.0001

The computation of dissimilarity between the datasets A and B requires several steps.

1. Computation of a distance matrix among the samples of both sequences.

It is computed by the distanceMatrix function, which allows the user to select a distance metric (so far the ones implemented are manhattan, euclidean, chi, and hellinger). The function plotMatrix allows an easy visualization of the resulting distance matrix.

#computing distance matrix
AB.distance.matrix <- distanceMatrix(
  sequences = AB.sequences,
  method = "euclidean"
)

#plotting distance matrix
plotMatrix(
  distance.matrix = AB.distance.matrix,
  color.palette = "viridis",
  margins = rep(4,4))

Distance matrix between the samples of two irregular multivariate sequences. Darker colors indicate higher distance between samples.

2. Computation of the least-cost path within the distance matrix.

This step uses a dynamic programming algorithm to find the least-cost path that connnects the cell 1,1 of the matrix (lower left in the image above) and the last cell of the matrix (opposite corner). This can be done via in two different ways.

an orthogonal search by moving either one step on the x axis or one step on the y axis at a time (see Equation 1).

Equation 1
$AB_{between} = 2 \times (D(A_{1}, B_{1}) + \sum_{i=1}{m}\sum_{j=1}{n} min\left(\begin{array}{c}D(A_{i}, B_{j+1}), \\ D(A_{i+1}, B_{j}) \end{array}\right))$

an orthogonal and diagonal search (a.k.a diagonal) which includes the above, plus a diagonal search.

Equation 2
[AB_{between} = 2 \times (D(A_{1}, B_{1}) + \sum_{i=1}^{{m}\sum_{j=1}}{n} min\left(\begin{array}{c}D(A_{i}, B_{j+1}), \\ D(A_{i+1}, B_{j} \\ D(A_{i+1}, B_{j+1}) \end{array}\right))](https://latex.codecogs.com/png.latex?AB_%7Bbetween%7D%20%3D%202%20%5Ctimes%20%28D%28A_%7B1%7D%2C%20B_%7B1%7D%29%20%2B%20%5Csum_%7Bi%3D1%7D%5E%7Bm%7D%5Csum_%7Bj%3D1%7D%5E%7Bn%7D%20min%5Cleft%28%5Cbegin%7Barray%7D%7Bc%7DD%28A_%7Bi%7D%2C%20B_%7Bj%2B1%7D%29%2C%20%5C%5C%20D%28A_%7Bi%2B1%7D%2C%20B_%7Bj%7D%20%5C%5C%20D%28A_%7Bi%2B1%7D%2C%20B_%7Bj%2B1%7D%29%20%5Cend%7Barray%7D%5Cright%29%29 “AB_{between} = 2 \times (D(A_{1}, B_{1}) + \sum_{i=1}^{{m}\sum_{j=1}}{n} min\left(\begin{array}{c}D(A_{i}, B_{j+1}), \\ D(A_{i+1}, B_{j} \\ D(A_{i+1}, B_{j+1}) \end{array}\right))”)

Where:

and are the number of samples of the multivariate time-series and .
and are the indices of the samples in and being considered on each step of the recursive algorithm.
is a function that returns the multivariate distance (i.e. Manhattan) between any given pair of samples of and .
is a function returning the minimum distance of a subset of distances defined in the neigborhood of the given indices and

The equation returns , which is the double of the sum of distances that lie within the least-cost path, and represent the distance between the samples of A and B. The value of is computed by using the functions leastCostMatrix, which computes the partial solutions to the least-cost problem, leastCostPath, which returns the best global solution, and leastCost function, which sums the distances of the least-cost path and multiplies them by 2.

The code below performs these steps according to both equations


#ORTHOGONAL SEARCH
#computing least-cost matrix
AB.least.cost.matrix <- leastCostMatrix(
  distance.matrix = AB.distance.matrix,
  diagonal = FALSE
)

#extracting least-cost path
AB.least.cost.path <- leastCostPath(
  distance.matrix = AB.distance.matrix,
  least.cost.matrix = AB.least.cost.matrix,
  diagonal = FALSE
  )


#DIAGONAL SEARCH
#computing least-cost matrix
AB.least.cost.matrix.diag <- leastCostMatrix(
  distance.matrix = AB.distance.matrix,
  diagonal = TRUE
)

#extracting least-cost path
AB.least.cost.path.diag <- leastCostPath(
  distance.matrix = AB.distance.matrix,
  least.cost.matrix = AB.least.cost.matrix.diag,
  diagonal = TRUE
  )

#plotting solutions
plotMatrix(
  distance.matrix = list(
    'A|B' = AB.least.cost.matrix[[1]], 
    'A|B' = AB.least.cost.matrix.diag[[1]]
    ),
  least.cost.path = list(
    'A|B' = AB.least.cost.path[[1]], 
    'A|B' = AB.least.cost.path.diag[[1]]
    ),
  color.palette = "viridis",
  margin = rep(4,4),
  plot.rows = 1,
  plot.columns = 2
  )

Least-cost path plotted on the least-cost matrix. Left solution is computed with an orthogonal search path, while the right one includes diagonals.

Computing from these solutions is straightforward with the function leastCost

#orthogonal solution
AB.between <- leastCost(
  least.cost.path = AB.least.cost.path
  )

#diagonal solution
AB.between.diag <- leastCost(
  least.cost.path = AB.least.cost.path.diag
  )

Which returns a value for of 33.7206 for the orthogonal solution, and 22.7596 for the diagonal one. Diagonal solutions always yield lower values for than orthogonal ones.

Notice the straight vertical and horizontal lines that show up in some regions of the least cost paths shown in the figure above. These are blocks, and happen in dissimilar sections of the compared sequences. Also, an unbalanced number of rows in the compared sequences can generate long blocks. Blocks inflate the value of because the distance to a given sample is counted several times per block. This problem often leads to false negatives, that is, to the conclusion that two sequences are statistically different when actually they are not.

This package includes an algorithm to remove blocks from the least cost path, which offers more realistic values for . The function leastCostPathNoBlocks reads a least cost path, and removes all blocks as follows.

#ORTHOGONAL SOLUTION
#removing blocks from least cost path
AB.least.cost.path.nb <- leastCostPathNoBlocks(
    least.cost.path = AB.least.cost.path
    )

#computing AB.between again
AB.between.nb <- leastCost(
  least.cost.path = AB.least.cost.path.nb
  )


#DIAGONAL SOLUTION
#removing blocks
AB.least.cost.path.diag.nb <- leastCostPathNoBlocks(
    least.cost.path = AB.least.cost.path.diag
    )

#diagonal solution without blocks
AB.between.diag.nb <- leastCost(
  least.cost.path = AB.least.cost.path.diag.nb
    )

Which now yields 11.2975 for the orthogonal solution, and 16.8667 for the diagonal one. Notice how now the diagonal solution has a higher value, because by default, the diagonal method generates less blocks. That is why each measure of dissimilarity (orthogonal, diagonal, orthogonal no-blocks, and diagonal no-blocks) lies within a different comparative framework, and therefore, outputs from different methods should not be compared.

Hereafter only the diagonal no-blocks option will be considered in the example cases, since it is the most general and safe solution of the four mentioned above.

#changing names of the selected solutions
AB.least.cost.path <- AB.least.cost.path.diag.nb
AB.between <- AB.between.diag.nb

#removing unneeded objects
rm(AB.between.diag, AB.between.diag.nb, AB.between.nb, AB.distance.matrix, AB.least.cost.matrix, AB.least.cost.matrix.diag, AB.least.cost.path.diag, AB.least.cost.path.diag.nb, AB.least.cost.path.nb, sequenceA, sequenceB)

3. Autosum, or sum of the distances among adjacent samples on each sequence.

This step requires to compute the distances between adjacent samples in each sequence and sum them, as shown in Equation 3.

Equation 3
$AB_{within} = \sum_{i=1}^{m} D(A_{i }, A_{i + 1}) + \sum_{i=1}^{n} D(B_{i }, B_{i + 1})$

This operation is performed by the autoSum function shown below.

AB.within <- autoSum(
  sequences = AB.sequences,
  least.cost.path = AB.least.cost.path,
  method = "euclidean"
  )
AB.within
#> $`A|B`
#> [1] 19.69168

4. Compute dissimilarity score .

The dissimilarity measure was first described in the book “Numerical methods in Quaternary pollen analysis” (Birks and Gordon, 1985). Psi is computed as shown in Equation 4a:

Equation 4a
$\psi = \frac{AB_{between} - AB_{within}}{AB_{within}}$

This equation has a particularity. Imagine two identical sequences A and B, with three samples each. In this case, is computed as

AB_{between} = 2 \times (D(A_{1}, B_{1}) + D(A_{1}, B_{2}) + D(A_{2}, B_{2}) + D(A_{2}, B_{3}) + D(A_{3}, B_{3}))

Since the samples of each sequence with the same index are identical, this can be reduced to

AB_{between} = 2 \times (D(A_{1}, B_{2}) + D(A_{2}, B_{3})) = AB_{within}

which in turn equals as shown in Equation 4, yielding a value of 0.

This equality does not work in the same way when the least-cost path search-method includes diagonals. When the sequenes are identical, diagonal methods yield an of 0, leading to a equal to -1. To fix this shift, this package uses Equation 4b instead when is selected, which adds 1 to the final solution.

Equation 4b
$\psi = \frac{AB_{between} - AB_{within}}{AB_{within}} + 1$

In any case, the psi function only requires the least-cost, and the autosum of both sequences to compute . Since we are working with a diagonal search, 1 has to be added to the final solution.

AB.psi <- psi(
  least.cost = AB.between,
  autosum = AB.within
  )
AB.psi[[1]] <- AB.psi[[1]] + 1

Which yields a psi equal to 1.7131. The output of psi is a list, that can be transformed to a dataframe or a matrix by using the formatPsi function.

#to dataframe
AB.psi.dataframe <- formatPsi(
  psi.values = AB.psi,
  to = "dataframe")
kable(AB.psi.dataframe, digits = 4)

A	B	psi
A	B	1.7131

workflowPsi: doing it all at once

All the steps required to compute psi, including the format options provided by formatPsi are wrapped together in the function workflowPsi. It includes options to switch to a diagonal method, and to ignore blocks, as shown below.

#checking the help file
help(workflowPsi)

#computing psi for A and B
AB.psi <- workflowPsi(
  sequences = AB.sequences,
  grouping.column = "id",
  method = "euclidean",
  format = "list",
  diagonal = TRUE,
  ignore.blocks = TRUE
)
AB.psi
#> $`A|B`
#> [1] 1.713075

The function allows to exclude particular columns from the analysis (argument exclude.columns), select different distance metrics (argument method), use diagonals to find the least-cost path (argument diagonal), or measure psi by ignoring blocks in the least-cost path (argument ignore.blocks). Since we have observed several blocks in the least-cost path, below we compute psi by ignoring them.

#cleaning workspace
rm(list = ls())

Comparing multiple irregular METS

The package can work seamlessly with any given number of sequences, as long as there is memory enough available (but check the new function workflowPsiHP, it can work with up to 40k sequences, if you have a cluster at hand, and a few years to waste). To do so, almost every function uses the packages “doParallel” and “foreach”, that together allow to parallelize the execution of the distantia functions by using all the processors in your machine but one.

The example dataset sequencesMIS contains 12 sections of the same sequence belonging to different marine isotopic stages identified by a column named “MIS”. MIS stages with odd numbers are generally interpreted as warm periods (interglacials), while the odd ones are interpreted as cold periods (glacials). In any case, this interpretation is not important to illustrate this capability of the library.

data(sequencesMIS)
kable(head(sequencesMIS, n=15), digits = 4, caption = "Header of the sequencesMIS dataset.")

Header of the sequencesMIS dataset.
MIS	Quercus	Betula	Pinus	Alnus	Tilia	Carpinus
MIS-1	55	1	5	3	4	5
MIS-1	86	21	35	8	0	10
MIS-1	120	15	8	1	0	1
MIS-1	138	16	12	6	1	3
MIS-1	130	12	17	2	1	1
MIS-1	128	0	6	4	2	2
MIS-1	140	0	19	9	4	0
MIS-1	113	0	15	12	2	5
MIS-1	98	0	27	2	2	0
MIS-1	92	1	16	7	3	0
MIS-1	73	3	22	3	0	0
MIS-1	91	1	21	3	7	0
MIS-1	148	1	22	1	4	0
MIS-1	148	0	1	7	13	0
MIS-1	149	1	2	5	4	0

unique(sequencesMIS$MIS)
#>  [1] "MIS-1"  "MIS-2"  "MIS-3"  "MIS-4"  "MIS-5"  "MIS-6"  "MIS-7" 
#>  [8] "MIS-8"  "MIS-9"  "MIS-10" "MIS-11" "MIS-12"

The dataset is checked and prepared with prepareSequences.

MIS.sequences <- prepareSequences(
  sequences = sequencesMIS,
  grouping.column = "MIS",
  if.empty.cases = "zero",
  transformation = "hellinger"
)

The dissimilarity measure psi can be computed for every combination of sequences through the function workflowPsi shown below.

MIS.psi <- workflowPsi(
  sequences = MIS.sequences,
  grouping.column = "MIS",
  method = "euclidean",
  diagonal = TRUE,
  ignore.blocks = TRUE
)

#there is also a "high-performance" (HP) version of this function with a much lower memory footprint. It uses the options method = "euclidean", diagonal = TRUE, and ignore.blocks = TRUE by default 
MIS.psi <- workflowPsiHP(
  sequences = MIS.sequences,
  grouping.column = "MIS"
)

#ordered with lower psi on top
kable(MIS.psi[order(MIS.psi$psi), ], digits = 4, caption = "Psi values between pairs of MIS periods.")

Psi values between pairs of MIS periods.
	A	B	psi
24	MIS-3	MIS-6	0.8631
59	MIS-8	MIS-11	0.8643
65	MIS-10	MIS-12	0.9099
30	MIS-3	MIS-12	0.9359
61	MIS-9	MIS-10	0.9422
66	MIS-11	MIS-12	0.9816
40	MIS-5	MIS-7	0.9866
64	MIS-10	MIS-11	0.9931
62	MIS-9	MIS-11	1.0009
60	MIS-8	MIS-12	1.0079
43	MIS-5	MIS-10	1.0174
45	MIS-5	MIS-12	1.0242
28	MIS-3	MIS-10	1.0340
42	MIS-5	MIS-9	1.0392
56	MIS-7	MIS-12	1.0423
63	MIS-9	MIS-12	1.0531
53	MIS-7	MIS-9	1.0598
51	MIS-6	MIS-12	1.0736
26	MIS-3	MIS-8	1.0846
52	MIS-7	MIS-8	1.0970
21	MIS-2	MIS-12	1.1002
32	MIS-4	MIS-6	1.1020
58	MIS-8	MIS-10	1.1186
54	MIS-7	MIS-10	1.1209
13	MIS-2	MIS-4	1.1214
15	MIS-2	MIS-6	1.1273
12	MIS-2	MIS-3	1.1365
49	MIS-6	MIS-10	1.1670
27	MIS-3	MIS-9	1.1725
47	MIS-6	MIS-8	1.1929
57	MIS-8	MIS-9	1.1982
23	MIS-3	MIS-5	1.2468
22	MIS-3	MIS-4	1.2541
44	MIS-5	MIS-11	1.2656
29	MIS-3	MIS-11	1.2770
41	MIS-5	MIS-8	1.2890
19	MIS-2	MIS-10	1.2976
55	MIS-7	MIS-11	1.3568
48	MIS-6	MIS-9	1.3699
50	MIS-6	MIS-11	1.4274
25	MIS-3	MIS-7	1.4692
39	MIS-5	MIS-6	1.4884
38	MIS-4	MIS-12	1.5184
17	MIS-2	MIS-8	1.5455
34	MIS-4	MIS-8	1.5710
14	MIS-2	MIS-5	1.6316
46	MIS-6	MIS-7	1.6411
36	MIS-4	MIS-10	1.6487
18	MIS-2	MIS-9	1.6933
4	MIS-1	MIS-5	1.7265
6	MIS-1	MIS-7	1.7712
3	MIS-1	MIS-4	1.8732
20	MIS-2	MIS-11	1.9079
33	MIS-4	MIS-7	1.9633
11	MIS-1	MIS-12	2.0587
16	MIS-2	MIS-7	2.1501
8	MIS-1	MIS-9	2.2155
7	MIS-1	MIS-8	2.2918
1	MIS-1	MIS-2	2.2944
2	MIS-1	MIS-3	2.3167
5	MIS-1	MIS-6	2.3369
10	MIS-1	MIS-11	2.3639
37	MIS-4	MIS-11	2.3919
35	MIS-4	MIS-9	2.4482
31	MIS-4	MIS-5	2.4810
9	MIS-1	MIS-10	2.6995

A dataframe like this can be transformed into a matrix to be plotted as an adjacency network with the qgraph package.

#psi values to matrix
MIS.psi.matrix <- formatPsi(
  psi.values = MIS.psi,
  to = "matrix"
)

#dissimilariy to distance
MIS.distance <- 1/MIS.psi.matrix**4

#plotting network
qgraph::qgraph(
  MIS.distance,
  layout='spring',
  vsize=5,
  labels = colnames(MIS.distance),
  colors = viridis::viridis(2, begin = 0.3, end = 0.8, alpha = 0.5, direction = -1)
  )

Similarity between MIS sequences represented as a network. More similar sites are closer, and linked by a wider edge. Note that glacials are colored in blue and interglacials in green

Or as a matrix with ggplot2.

#ordering factors to get a triangular matrix
MIS.psi$A <- factor(MIS.psi$A, levels=unique(sequencesMIS$MIS))
MIS.psi$B <- factor(MIS.psi$B, levels=unique(sequencesMIS$MIS))

#plotting matrix
ggplot(data=na.omit(MIS.psi), aes(x=A, y=B, size=psi, color=psi)) + 
  geom_point() +
  viridis::scale_color_viridis(direction = -1) +
  guides(size = FALSE)

Dissimilarity between MIS sequences. Darker colors indicate a higher dissimilarity.

The dataframe of dissimilarities between pairs of sequences can be also used to analyze the drivers of dissimilarity. To do so, attributes such as differences in time (when sequences represent different times) or distance (when sequences represent different sites) between sequences, or differences between physical/climatic attributes between sequences such as topography or climate can be added to the table, so models such as (were A, B, and C are these attributes) can be fitted.

#cleaning workspace
rm(list = ls())

Comparing regular aligned METS

The package distantia is also useful to compare synchronic sequences that have the same number of samples. In this particular case, distances to obtain are computed only between samples with the same time/depth/order, and no distance matrix (nor least-cost analysis) is required. When the argument paired.samples in prepareSequences is set to TRUE, the function checks if the sequences have the same number of rows, and, if time.column is provided, it selects the samples that have valid time/depth columns for every sequence in the dataset.

Here we test these ideas with the climate dataset included in the library. It represents simulated palaeoclimate over 200 ky. at four sites identified by the column sequenceId. Note that this time the transformation applied is “scaled”, which uses the scale function of R base to center and scale the data.

#loading sample data
data(climate)

#preparing sequences
climate <- prepareSequences(
  sequences = climate,
  grouping.column = "sequenceId",
  time.column = "time",
  paired.samples = TRUE,
  transformation = "scale"
  )

In this case, the argument paired.samples of workflowPsi must be set to TRUE. Additionally, if the argument same.time is set to TRUE, the time/age of the samples is checked, and samples without the same time/age are removed from the analysis.

#computing psi
climate.psi <- workflowPsi(
  sequences = climate,
  grouping.column = "sequenceId",
  time.column = "time",
  method = "euclidean",
  paired.samples = TRUE, #this bit is important
  same.time = TRUE, #removes samples with unequal time
  format = "dataframe"
)

#ordered with lower psi on top
kable(climate.psi[order(climate.psi$psi), ], digits = 4, row.names = FALSE, caption = "Psi values between pairs of sequences in the 'climate' dataset.")

Psi values between pairs of sequences in the ‘climate’ dataset.
A	B	psi
2	4	3.4092
4	2	3.4092
1	3	3.5702
3	1	3.5702
3	4	4.1139
4	3	4.1139
1	2	4.2467
2	1	4.2467
2	3	4.6040
3	2	4.6040
1	4	4.8791
4	1	4.8791

#cleaning workspace
rm(list = ls())

Restricted permutation test to assess the significance of dissimilarity values

One question that may arise when comparing time series is “to what extent are dissimilarity values a result of chance?”. Answering this question requires to compare a given dissimilarity value with a distribution of dissimilarity values resulting from chance. However… how do we simulate chance in a multivariate time-series? The natural answer is “permutation”. Since samples in a multivariate time-series are ordered, randomly re-shuffling samples is out of the question, because that would destroy the structure of the data. A more gentler alternative is to randomly switch single data-points (a case of a variable) independently by variable. This kind of permutation is named “restricted permutation”, and preserves global trends within the data, but changes local structure.

A restricted permutation test on psi values requires the following steps:

Compute the real psi on two given sequences A and B.
Repeat the following steps several times (99 to 999):
- For each case of each column of A and B, randomly apply one of these actions:
  - Leave it as is.
  - Replace it with the previous case.
  - Replace it with the next case.
- Compute randomized psi between A and B and store the value.
Add real psi to the pool of randomized psi.
Compute the proportion of randomized psi that is equal or lower than real psi.

Such a proportion represents the probability of obtaining a value lower than real psi by chance.

Since the restricted permutation only happens at a local scale within each column of each sequence, the probability values returned are very conservative and shouldn’t be interpreted in the same way p-values are interpreted.

The process described above has been implemented in the workflowNullPsi function. We will apply it to three groups of the sequencesMIS dataset.

#getting example data
data(sequencesMIS)

#working with 3 groups (to make this fast)
sequencesMIS <- sequencesMIS[sequencesMIS$MIS %in% c("MIS-4", "MIS-5", "MIS-6"),]

#preparing sequences
sequencesMIS <- prepareSequences(
  sequences = sequencesMIS,
  grouping.column = "MIS",
  transformation = "hellinger"
)

The computation of the null psi values goes as follows:

random.psi <- workflowNullPsi(
  sequences = sequencesMIS,
  grouping.column = "MIS",
  method = "euclidean",
  diagonal = TRUE,
  ignore.blocks = TRUE,
  repetitions = 99, #recommended value: 999
  parallel.execution = TRUE
)

#there is also a high-performance version of this function with fewer options (diagonal = TRUE, ignore.blocks = TRUE, and method = "euclidean" are used by default)
random.psi <- workflowNullPsiHP(
  sequences = sequencesMIS,
  grouping.column = "MIS",
  repetitions = 99, #recommended value: 999
  parallel.execution = TRUE
)

Note that the number of repetitions has been set to 9 in order to speed-up execution. The actual number would ideally be 999.

The output is a list with two dataframes, psi and p.

The dataframe psi contains the real and random psi values. The column psi contains the dissimilarity between the sequences in the columns A and B. The columns r1 to r9 contain the psi values obtained from permutations of the sequences.

kable(random.psi$psi, digits = 4, caption = "True and null psi values generated by workflowNullPsi.")

True and null psi values generated by workflowNullPsi.
A	B	psi	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10	r11	r12	r13	r14	r15	r16	r17	r18	r19	r20	r21	r22	r23	r24	r25	r26	r27	r28	r29	r30	r31	r32	r33	r34	r35	r36	r37	r38	r39	r40	r41	r42	r43	r44	r45	r46	r47	r48	r49	r50	r51	r52	r53	r54	r55	r56	r57	r58	r59	r60	r61	r62	r63	r64	r65	r66	r67	r68	r69	r70	r71	r72	r73	r74	r75	r76	r77	r78	r79	r80	r81	r82	r83	r84	r85	r86	r87	r88	r89	r90	r91	r92	r93	r94	r95	r96	r97	r98	r99	r100	r101	r102	r103	r104	r105	r106	r107	r108	r109	r110	r111	r112	r113	r114	r115	r116	r117	r118	r119	r120	r121	r122	r123	r124	r125	r126	r127	r128	r129	r130	r131	r132	r133	r134	r135	r136	r137	r138	r139	r140	r141	r142	r143	r144	r145	r146	r147	r148	r149	r150	r151	r152	r153	r154	r155	r156	r157	r158	r159	r160	r161	r162	r163	r164	r165	r166	r167	r168	r169	r170	r171	r172	r173	r174	r175	r176	r177	r178	r179	r180	r181	r182	r183	r184	r185	r186	r187	r188	r189	r190	r191	r192	r193	r194	r195	r196	r197	r198	r199	r200	r201	r202	r203	r204	r205	r206	r207	r208	r209	r210	r211	r212	r213	r214	r215	r216	r217	r218	r219	r220	r221	r222	r223	r224	r225	r226	r227	r228	r229	r230	r231	r232	r233	r234	r235	r236	r237	r238	r239	r240	r241	r242	r243	r244	r245	r246	r247	r248	r249	r250	r251	r252	r253	r254	r255	r256	r257	r258	r259	r260	r261	r262	r263	r264	r265	r266	r267	r268	r269	r270	r271	r272	r273	r274	r275	r276	r277	r278	r279	r280	r281	r282	r283	r284	r285	r286	r287	r288	r289	r290	r291	r292	r293	r294	r295	r296	r297
MIS-4	MIS-5	2.4810	0.0000	3.7536	1.3317	0.0000	3.8819	1.4707	0.0000	3.6452	1.2600	0.0000	3.5013	1.3834	0.0000	4.0277	1.3484	0.0000	2.9678	1.1679	0.0000	3.7815	1.3421	0.0000	3.8574	1.2646	0.0000	3.7788	1.2697	0.0000	3.7480	1.2656	0.0000	3.3589	1.3758	0.0000	2.6402	1.4237	0.0000	3.7716	1.1127	0.0000	3.7987	1.4647	0.0000	3.6555	1.4283	0.000	3.3450	1.1930	0.0000	2.8487	1.2650	0.0000	3.2751	0.9897	0.0000	3.7681	1.2368	0.0000	3.0029	1.1084	0.0000	2.9344	1.1265	0.0000	3.0970	1.3298	0.0000	3.2655	1.0583	0.0000	3.7342	1.2722	0.0000	3.9553	1.1021	0.0000	3.6368	1.1076	0.0000	3.5591	1.4114	0.0000	3.2966	1.3217	0.0000	3.5813	1.1319	0.0000	3.3199	0.983	0.0000	3.8344	1.1349	0.0000	3.9911	1.2949	0.0000	3.3264	1.2411	0.0000	3.8077	1.3534	0.0000	3.7525	1.2932	0.0000	3.3755	1.4088	0.0000	3.7383	1.3163	0.0000	3.5603	1.3440	0.0000	4.5097	1.6253	0.0000	3.0503	1.1434	0.0000	3.5231	1.3495	0.0000	3.5837	1.2140	0.0000	4.2384	1.2953	0.0000	3.4604	1.2267	0.0000	3.5569	1.3008	0.0000	3.8854	1.5219	0.0000	4.0438	1.1978	0.0000	3.6232	1.1538	0.0000	3.7029	1.7062	0.0000	4.0271	1.3001	0.0000	2.8608	1.3602	0.0000	3.8412	1.0693	0.0000	3.3452	1.3526	0.0000	3.7804	1.1964	0.0000	3.8922	1.2574	0.0000	3.4213	1.1074	0.0000	3.5955	1.2909	0.0000	3.1859	1.2943	0.0000	3.6091	1.2801	0.0000	2.9185	1.3562	0.0000	4.1324	1.1933	0.0000	3.8081	1.4057	0.0000	3.5450	1.4605	0.0000	3.7457	1.2695	0.0000	3.6136	1.3570	0.0000	3.8297	1.3051	0.0000	3.5123	1.0305	0.0000	3.4788	1.0452	0.0000	3.3500	1.5083	0.0000	3.4277	1.0918	0.0000	3.5416	1.1457	0.0000	3.8708	1.4005	0.0000	3.6071	1.2262	0.0000	3.5871	1.3329	0.0000	4.0342	1.4382	0.0000	3.6460	1.2434	0.0000	3.5562	1.2390	0.0000	3.5911	1.124	0.0000	3.7326	1.5393	0.0000	3.9109	1.4098	0.0000	3.4930	1.1832	0.0000	2.7677	1.2942	0.0000	3.6112	1.2785	0.0000	3.7562	1.1567	0.0000	3.4105	1.2620	0.0000	3.0288	1.0794	0.0000	4.3534	1.5268	0.0000	3.3925	1.1479	0.0000	3.4932	1.0770	0.0000	3.5992	1.4058	0.0000	3.7622	1.2329	0.0000	3.7002	1.2894	0.0000	3.5179	1.1918	0.0000	4.2240	1.3402	0.0000	3.5807	1.1993	0.0000	3.8250	1.2416	0.0000	3.7522	1.4727	0.0000	3.6046	1.2223	0.0000	3.8396	1.0294
MIS-4	MIS-6	1.1020	3.7536	0.0000	1.7778	3.8819	0.0000	1.7145	3.6452	0.0000	1.4906	3.5013	0.0000	1.5997	4.0277	0.0000	1.9284	2.9678	0.0000	1.7680	3.7815	0.0000	1.7752	3.8574	0.0000	1.7149	3.7788	0.0000	1.6472	3.7480	0.0000	1.7108	3.3589	0.0000	1.7109	2.6402	0.0000	1.5961	3.7716	0.0000	1.6565	3.7987	0.0000	1.7153	3.6555	0.0000	1.5970	3.345	0.0000	1.6136	2.8487	0.0000	1.6701	3.2751	0.0000	1.4932	3.7681	0.0000	1.6126	3.0029	0.0000	1.9839	2.9344	0.0000	1.8604	3.0970	0.0000	1.5921	3.2655	0.0000	1.6640	3.7342	0.0000	1.5258	3.9553	0.0000	1.6979	3.6368	0.0000	1.5611	3.5591	0.0000	1.6552	3.2966	0.0000	1.6900	3.5813	0.0000	1.8737	3.3199	0.0000	1.717	3.8344	0.0000	1.5950	3.9911	0.0000	1.6099	3.3264	0.0000	1.5220	3.8077	0.0000	1.3983	3.7525	0.0000	1.3374	3.3755	0.0000	1.8604	3.7383	0.0000	1.8833	3.5603	0.0000	1.7699	4.5097	0.0000	1.6200	3.0503	0.0000	1.6997	3.5231	0.0000	1.7551	3.5837	0.0000	1.6986	4.2384	0.0000	1.5901	3.4604	0.0000	1.7288	3.5569	0.0000	1.5894	3.8854	0.0000	1.5422	4.0438	0.0000	1.7261	3.6232	0.0000	1.3016	3.7029	0.0000	1.6888	4.0271	0.0000	1.8111	2.8608	0.0000	1.9147	3.8412	0.0000	1.7930	3.3452	0.0000	1.8602	3.7804	0.0000	1.9937	3.8922	0.0000	1.7063	3.4213	0.0000	1.4280	3.5955	0.0000	1.7968	3.1859	0.0000	1.5233	3.6091	0.0000	1.3548	2.9185	0.0000	1.7555	4.1324	0.0000	1.6919	3.8081	0.0000	1.6254	3.5450	0.0000	1.6068	3.7457	0.0000	1.9194	3.6136	0.0000	1.2503	3.8297	0.0000	1.5950	3.5123	0.0000	1.3527	3.4788	0.0000	1.4569	3.3500	0.0000	1.9489	3.4277	0.0000	1.6999	3.5416	0.0000	1.5389	3.8708	0.0000	1.4900	3.6071	0.0000	1.8822	3.5871	0.0000	1.6469	4.0342	0.0000	1.8238	3.6460	0.0000	1.4199	3.5562	0.0000	1.9143	3.5911	0.0000	1.530	3.7326	0.0000	1.6699	3.9109	0.0000	1.8029	3.4930	0.0000	1.6022	2.7677	0.0000	1.6816	3.6112	0.0000	1.7398	3.7562	0.0000	1.3929	3.4105	0.0000	1.6506	3.0288	0.0000	1.5035	4.3534	0.0000	1.9079	3.3925	0.0000	1.7209	3.4932	0.0000	1.5058	3.5992	0.0000	1.7595	3.7622	0.0000	1.7663	3.7002	0.0000	1.7471	3.5179	0.0000	1.6484	4.2240	0.0000	1.6049	3.5807	0.0000	1.6514	3.8250	0.0000	1.8138	3.7522	0.0000	1.7080	3.6046	0.0000	1.2971	3.8396	0.0000	1.5657
MIS-5	MIS-6	1.4884	1.3317	1.7778	0.0000	1.4707	1.7145	0.0000	1.2600	1.4906	0.0000	1.3834	1.5997	0.0000	1.3484	1.9284	0.0000	1.1679	1.7680	0.0000	1.3421	1.7752	0.0000	1.2646	1.7149	0.0000	1.2697	1.6472	0.0000	1.2656	1.7108	0.0000	1.3758	1.7109	0.0000	1.4237	1.5961	0.0000	1.1127	1.6565	0.0000	1.4647	1.7153	0.0000	1.4283	1.5970	0.0000	1.193	1.6136	0.0000	1.2650	1.6701	0.0000	0.9897	1.4932	0.0000	1.2368	1.6126	0.0000	1.1084	1.9839	0.0000	1.1265	1.8604	0.0000	1.3298	1.5921	0.0000	1.0583	1.6640	0.0000	1.2722	1.5258	0.0000	1.1021	1.6979	0.0000	1.1076	1.5611	0.0000	1.4114	1.6552	0.0000	1.3217	1.6900	0.0000	1.1319	1.8737	0.0000	0.9830	1.7170	0.000	1.1349	1.5950	0.0000	1.2949	1.6099	0.0000	1.2411	1.5220	0.0000	1.3534	1.3983	0.0000	1.2932	1.3374	0.0000	1.4088	1.8604	0.0000	1.3163	1.8833	0.0000	1.3440	1.7699	0.0000	1.6253	1.6200	0.0000	1.1434	1.6997	0.0000	1.3495	1.7551	0.0000	1.2140	1.6986	0.0000	1.2953	1.5901	0.0000	1.2267	1.7288	0.0000	1.3008	1.5894	0.0000	1.5219	1.5422	0.0000	1.1978	1.7261	0.0000	1.1538	1.3016	0.0000	1.7062	1.6888	0.0000	1.3001	1.8111	0.0000	1.3602	1.9147	0.0000	1.0693	1.7930	0.0000	1.3526	1.8602	0.0000	1.1964	1.9937	0.0000	1.2574	1.7063	0.0000	1.1074	1.4280	0.0000	1.2909	1.7968	0.0000	1.2943	1.5233	0.0000	1.2801	1.3548	0.0000	1.3562	1.7555	0.0000	1.1933	1.6919	0.0000	1.4057	1.6254	0.0000	1.4605	1.6068	0.0000	1.2695	1.9194	0.0000	1.3570	1.2503	0.0000	1.3051	1.5950	0.0000	1.0305	1.3527	0.0000	1.0452	1.4569	0.0000	1.5083	1.9489	0.0000	1.0918	1.6999	0.0000	1.1457	1.5389	0.0000	1.4005	1.4900	0.0000	1.2262	1.8822	0.0000	1.3329	1.6469	0.0000	1.4382	1.8238	0.0000	1.2434	1.4199	0.0000	1.2390	1.9143	0.0000	1.1240	1.5300	0.000	1.5393	1.6699	0.0000	1.4098	1.8029	0.0000	1.1832	1.6022	0.0000	1.2942	1.6816	0.0000	1.2785	1.7398	0.0000	1.1567	1.3929	0.0000	1.2620	1.6506	0.0000	1.0794	1.5035	0.0000	1.5268	1.9079	0.0000	1.1479	1.7209	0.0000	1.0770	1.5058	0.0000	1.4058	1.7595	0.0000	1.2329	1.7663	0.0000	1.2894	1.7471	0.0000	1.1918	1.6484	0.0000	1.3402	1.6049	0.0000	1.1993	1.6514	0.0000	1.2416	1.8138	0.0000	1.4727	1.7080	0.0000	1.2223	1.2971	0.0000	1.0294	1.5657	0.0000

The dataframe p contains the probability of obtaining the real psi value by chance for each combination of sequences.

kable(random.psi$p, caption = "Probability of obtaining a given set of psi values by chance.")

Probability of obtaining a given set of psi values by chance.
A	B	p
MIS-4	MIS-5	0.6677852
MIS-4	MIS-6	0.3355705
MIS-5	MIS-6	0.6845638

#cleaning workspace
rm(list = ls())

Assessing the contribution of a variable to the dissimilarity between two sequences

What variables are more important in explaining the dissimilarity between two sequences?, or in other words, what variables contribute the most to the dissimilarity between two sequences? One reasonable answer is: the one that reduces dissimilarity the most when removed from the data. This section explains how to use the function workflowImportance follows such a principle to evaluate the importance of given variables in explaining differences between sequences.

First, we prepare the data. It is again sequencesMIS, but with only three groups selected (MIS 4 to 6) to simplify the analysis.

#getting example data
data(sequencesMIS)

#getting three groups only to simplify
sequencesMIS <- sequencesMIS[sequencesMIS$MIS %in% c("MIS-4", "MIS-5", "MIS-6"),]

#preparing sequences
sequences <- prepareSequences(
  sequences = sequencesMIS,
  grouping.column = "MIS",
  merge.mode = "complete"
)

The workflow function is pretty similar to the ones explained above. However, unlike the other functions in the package, that parallelize across the comparison of pairs of sequences, this one parallelizes the computation of psi on combinations of columns, removing one column each time.

WARNING: the argument ‘exclude.columns’ of ‘workflowImportance’ does not work in version 1.0.0 (available in CRAN), but the bug is fixed in version 1.0.1 (available in GitHub). If you are using 1.0.0, I recommend you to subset ‘sequences’ so only the grouping column and the numeric columns to be compared are available for the function.

psi.importance <- workflowImportance(
  sequences = sequencesMIS,
  grouping.column = "MIS",
  method = "euclidean",
  diagonal = TRUE,
  ignore.blocks = TRUE
  )

#there is also a high performance version of this function, but with fewer options (it uses euclidean, diagonal, and ignores blocks by default)
psi.importance <- workflowImportanceHP(
  sequences = sequencesMIS,
  grouping.column = "MIS"
  )

The output is a list with two slots named psi and psi.drop.

The dataframe psi contains psi values for each combination of variables (named in the coluns A and B) computed for all columns in the column All variables, and one column per variable named Without variable_name containing the psi value when that variable is removed from the compared sequences.

kable(psi.importance$psi, digits = 4, caption = "Psi values with all variables (column 'All variables'), and without one variable at a time.")

Psi values with all variables (column ‘All variables’), and without one variable at a time.
A	B	All variables	Without Carpinus	Without Tilia	Without Alnus	Without Pinus	Without Betula	Without Quercus
MIS-4	MIS-5	4.3929	4.4018	4.4070	4.4087	5.9604	4.4240	0.9878
MIS-4	MIS-6	1.0376	1.0374	1.0372	1.0303	1.6222	1.0311	0.9646
MIS-5	MIS-6	1.7680	1.7684	1.7685	1.7695	1.1980	1.7873	1.0706

This table can be plotted as a bar plot as follows:

#extracting object
psi.df <- psi.importance$psi

#to long format
psi.df.long <- tidyr::gather(psi.df, variable, psi, 3:ncol(psi.df))

#creating column with names of the sequences
psi.df.long$name <- paste(psi.df.long$A, psi.df.long$B, sep=" - ")

#plot
ggplot(data=psi.df.long, aes(x=variable, y=psi, fill=psi)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  facet_wrap("name") +
  scale_fill_viridis(direction = -1) +
  ggtitle("Contribution of separated variables to dissimilarity.") +
  labs(fill = "Psi")

Variable importance analysis of three combinations of sequences. The plot suggest that MIS-4 and MIS-6 are more similar (both are glacial periods), and that the column Quercus is the one with a higher contribution to dissimilarity between sequences.

The second table, named psi.drop describes the drop in psi values, in percentage, when the given variable is removed from the analysis. Large positive numbers indicate that dissimilarity drops (increase in similarity) when the given variable is removed, confirming that the variable is important to explain the dissimilarity between both sequences. Negative values indicate an increase in dissimilarity between the sequences when the variable is dropped.

In summary:

High psi-drop value: variable contributes to dissimilarity.
Low or negative psi-drop value: variable contributes to similarity.

kable(psi.importance$psi.drop, caption = "Drop in psi, as percentage of the psi value obtained when using all variables. Positive values indicate that the sequences become more similar when the given variable is removed (contribution to dissimilarity), while negative values indicate that the sequences become more dissimilar when the variable is removed (contribution to similarity).")

Drop in psi, as percentage of the psi value obtained when using all variables. Positive values indicate that the sequences become more similar when the given variable is removed (contribution to dissimilarity), while negative values indicate that the sequences become more dissimilar when the variable is removed (contribution to similarity).
A	B	Carpinus	Tilia	Alnus	Pinus	Betula	Quercus
MIS-4	MIS-5	-0.20	-0.32	-0.36	-35.68	-0.71	77.51
MIS-4	MIS-6	0.01	0.03	0.70	-56.35	0.62	7.03
MIS-5	MIS-6	-0.02	-0.03	-0.09	32.24	-1.09	39.44

#extracting object
psi.drop.df <- psi.importance$psi.drop

#to long format
psi.drop.df.long <- tidyr::gather(psi.drop.df, variable, psi, 3:ncol(psi.drop.df))

#creating column with names of the sequences
psi.drop.df.long$name <- paste(psi.drop.df.long$A, psi.drop.df.long$B, sep=" - ")

#plot
ggplot(data=psi.drop.df.long, aes(x=variable, y=psi, fill=psi)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  geom_hline(yintercept = 0, size = 0.3) + 
  facet_wrap("name") +
  scale_fill_viridis(direction = -1) +
  ggtitle("Drop in dissimilarity when variables are removed.") +
  ylab("Drop in dissimilarity (%)") +
  labs(fill = "Psi drop (%)")

Drop in psi values, represented as percentage, when a variable is removed from the analysis. Negative values indicate a contribution to similarity, while positive values indicate a contribution to dissimilarity. The plot suggest that Quercus is the variable with a higher contribution to dissimilarity, while Pinus has the higher contribution to similarity.

#cleaning workspace
rm(list = ls())

Partial matching: finding the section in a long sequence more similar to a given short sequence

In this scenario the user has one short and one long sequence, and the goal is to find the section in the long sequence that better matches the short one. To recreate this scenario we use the dataset sequencesMIS. The first 10 samples will serve as short sequence, and the first 40 samples as long sequence. These small subsets are selected to speed-up the execution time of this example.

#loading the data
data(sequencesMIS)

#removing grouping column
sequencesMIS$MIS <- NULL

#subsetting to get the short sequence
MIS.short <- sequencesMIS[1:10, ]

#subsetting to get the long sequence
MIS.long <- sequencesMIS[1:40, ]

The sequences have to be prepared and transformed. For simplicity, the sequences are named short and long, and the grouping column is named id, but the user can name them at will. Since the data represents community composition, a Hellinger transformation is applied.

MIS.short.long <- prepareSequences(
  sequence.A = MIS.short,
  sequence.A.name = "short",
  sequence.B = MIS.long,
  sequence.B.name = "long",
  grouping.column = "id",
  transformation = "hellinger"
)

The function workflowPartialMatch shown below is going to subset the long sequence in sizes between min.length and max.length. In the example below this search space has the same size as MIS.short to speed-up the execution of this example, but wider windows are possible. If left empty, the length of the segment in the long sequence to be matched will have the same number of samples as the short sequence. In the example below we look for segments of the same length, two samples shorter, and two samples longer than the shorter sequence.

MIS.psi <- workflowPartialMatch(
  sequences = MIS.short.long,
  grouping.column = "id",
  method = "euclidean",
  diagonal = TRUE,
  ignore.blocks = TRUE,
  min.length = nrow(MIS.short),
  max.length = nrow(MIS.short)
)

The function returns a dataframe with three columns: first.row (first row of the matched segment of the long sequence), last.row (last row of the matched segment of the long sequence), and psi (ordered from lower to higher). In this case, since the long sequence contains the short sequence, the first row shows a perfect match.

kable(MIS.psi[1:15, ], digits = 4, caption = "First and last row of a section of the long sequence along with the psi value obtained during the partial matching.")

First and last row of a section of the long sequence along with the psi value obtained during the partial matching.
first.row	last.row	psi
1	10	0.0000
3	12	0.2707
4	13	0.2815
2	11	0.3059
6	15	0.4343
5	14	0.5509
9	18	1.3055
10	19	1.3263
8	17	1.3844
7	16	1.3949
15	24	1.4428
12	21	1.4711
17	26	1.4959
11	20	1.5101
18	27	1.5902

Subsetting the long sequence to obtain the segment best matching with the short sequence goes as follows.

#indices of the best matching segment
best.match.indices <- MIS.psi[1, "first.row"]:MIS.psi[1, "last.row"]

#subsetting by these indices
best.match <- MIS.long[best.match.indices, ]

#cleaning workspace
rm(list = ls())

Sequence slotting: combining samples of two sequences into a single composite sequence

Under this scenario, the objective is to combine two sequences into a single composite sequence. The basic assumption followed by the algorithm building the composite sequence is most similar samples should go together, but respecting the original ordering of the sequences. Therefore, the output will contain the samples in both sequences ordered in a way that minimizes the multivariate distance between consecutive samples. This scenario assumes that at least one of the sequences do not have a time/age/depth column, or that the values in such a column are uncertain. In any case, time/age/depth is not considered as a factor in the generation of the composite sequence.

The example below uses the pollenGP dataset, which contains 200 samples, with 40 pollen types each. To create a smalle case study, the code below separates the first 20 samples of the sequence into two different sequences with 10 randomly selected samples each. Even though this scenario assumes that these sequences do not have depth or age, these columns will be kept so the result can be assessed. That is why these columns are added to the exclude.columns argument. Also, note that the argument transformation is set to “none”, so the output is not transformed, and the outcome can be easily interpreted. This will give more weight to the most abundant taxa, which will in fact guide the slotting.

#loading the data
data(pollenGP)

#getting first 20 samples
pollenGP <- pollenGP[1:20, ]

#sampling indices
set.seed(10) #to get same result every time
sampling.indices <- sort(sample(1:20, 10))

#subsetting the sequence
A <- pollenGP[sampling.indices, ]
B <- pollenGP[-sampling.indices, ]

#preparing the sequences
AB <- prepareSequences(
  sequence.A = A,
  sequence.A.name = "A",
  sequence.B = B,
  sequence.B.name = "B",
  grouping.column = "id",
  exclude.columns = c("depth", "age"),
  transformation = "none"
)

Once the sequences are prepared, the function workflowSlotting will allow to combine (slot) them. The function computes a distance matrix between the samples in both sequences according to the method argument, computes the least-cost matrix, and generates the least-cost path. Note that it only uses an orthogonal method considering blocks, since this is the only option really suitable for this task.

AB.combined <- workflowSlotting(
  sequences = AB,
  grouping.column = "id",
  time.column = "age", 
  exclude.columns = "depth",
  method = "euclidean",
  plot = TRUE
)

Distance matrix and least-cost path of the example sequences 'A' and 'B'..

The function reads the least-cost path in order to find the combination of samples of both sequences that minimizes dissimilarity, constrained by the order of the samples on each sequence. The output dataframe has a column named original.index, which has the index of each sample in the original datasets.

kable(AB.combined[1:15,1:10], digits = 4, caption = "Combination of sequences A and B.")

Combination of sequences A and B.
	id	original.index	depth	age	Abies	Hedera	Plantago	Boraginaceae	Crassulaceae
1	A	1	3	3.97	11243	5	12	21	95
11	B	1	1	3.92	11108	7	0	5	20
12	B	2	2	3.95	11189	3	3	15	47
13	B	3	4	4.00	11324	8	43	60	65
2	A	2	6	4.05	11459	20	73	94	1
14	B	4	5	4.02	11378	44	76	110	0
3	A	3	7	4.07	11514	20	80	100	0
4	A	4	8	4.10	11595	34	80	155	0
5	A	5	9	4.17	11784	22	44	131	0
15	B	5	13	4.27	12054	37	30	150	0
6	A	6	10	4.20	11865	35	30	112	0
7	A	7	11	4.22	11919	30	45	150	0
8	A	8	12	4.25	12000	44	35	150	0
9	A	9	15	4.32	12189	43	17	120	0
16	B	6	14	4.30	12135	50	10	120	0

Note that several samples show inverted ages with respect to the previous samples. This is to be expected, since the slotting algorithm only takes into account distance/dissimilarity between adjacent samples to generate the ordering.

Sequences A (green) and B (blue) with their ordered samples (upper panel), and the composite sequence resulting from them (lower panel) after applying the sequence slotting algorithm. Notice that the slotting takes into account all columns in both datasets, and therefore, a single column, as shown in the plot, might not be totally representative of the slotting solution.

#cleaning workspace
rm(list = ls())

Transferring an attribute from one sequence to another

This scenario assumes that the user has two METS, one of them with a given attribute (age/time) that needs to be transferred to the other sequence by using similarity/dissimilarity (constrained by sample order) as a transfer criterion. This case is relatively common in palaeoecology, when a given dataset is dated, and another taken at a close location is not.

The code below prepares the data for the example. The sequence pollenGP is the reference sequence, and contains the column age. The sequence pollenX is the target sequence, without an age column. We generate it by taking 40 random samples between the samples 50 and 100 of pollenGP. The sequences are prepared with prepareSequences, as usual, with the identificators “GP” and “X”

#loading sample dataset
data(pollenGP)

#subset pollenGP to make a shorter dataset
pollenGP <- pollenGP[1:50, ]

#generating a subset of pollenGP 
set.seed(10)
pollenX <- pollenGP[sort(sample(1:50, 40)), ]

#we separate the age column
pollenX.age <- pollenX$age

#and remove the age values from pollenX
pollenX$age <- NULL
pollenX$depth <- NULL

#removing some samples from pollenGP
#so pollenX is not a perfect subset of pollenGP
pollenGP <- pollenGP[-sample(1:50, 10), ]

#prepare sequences
GP.X <- prepareSequences(
  sequence.A = pollenGP,
  sequence.A.name = "GP",
  sequence.B = pollenX,
  sequence.B.name = "X",
  grouping.column = "id",
  time.column = "age",
  exclude.columns = "depth",
  transformation = "none"
)

The transfer of “age” values from GP to X can be done in two ways, both constrained by sample order:

Direct: each sample in X gets the age of its most similar sample in GP.
Interpolated: each sample in X gets an age interpolated from the ages of the two most similar samples in GP. The interpolation is weighted by the similarity between the samples.

Direct transfer

A direct transfer of an attribute from the samples of one sequence to the samples of another requires to compute a distance matrix between samples, the least-cost matrix and its least-cost path (both with the option diagonal activated), and to parse the least-cost path file to assign attribute values. This is done by the function workflowTransfer with the option .

#parameters
X.new <- workflowTransfer(
  sequences = GP.X,
  grouping.column = "id",
  time.column = "age",
  method = "euclidean",
  transfer.what = "age",
  transfer.from = "GP",
  transfer.to = "X",
  mode = "direct"
  )

kable(X.new[1:15, ], digits = 4)

	id	age	Abies	Hedera	Plantago	Boraginaceae	Crassulaceae	Ranunculaceae	Rhamnus	Caryophyllaceae	Dipsacaceae	Betula	Acer	Armeria	Tilia	Hippophae	Salix	Labiatae	Valeriana	Umbelliferae	Sanguisorba_minor	Plantago.lanceolata	Campanulaceae	Asteroideae	Fraxinus	Cichorioideae	Rumex	Ranunculus.subgen..Batrachium	Cyperaceae	Corylus	Myriophyllum	Filipendula	Vitis	Rubiaceae	Polypodium
41	X	3.92	11108	7	0	5	20	13	0	2	1	0	2	41	0	0	0	0	0	0	1	0	8	0	0	0	0	0	0	0	0	60	2	0	0
42	X	4.00	11324	8	43	60	65	10	0	0	2	4	0	0	0	0	0	0	2	4	0	0	0	0	0	0	0	0	0	0	1	11	0	2	0
43	X	4.02	11378	44	76	110	0	0	0	0	2	11	0	0	0	0	3	1	3	0	5	0	0	1	6	0	1	1	0	0	0	0	1	1	1
44	X	4.05	11459	20	73	94	1	0	0	0	1	10	0	2	0	0	3	1	3	0	3	0	0	0	3	0	2	0	0	0	0	0	1	0	0
45	X	4.07	11514	20	80	100	0	0	0	0	10	4	0	0	1	0	0	0	3	0	1	0	0	0	8	1	0	0	0	1	2	1	0	1	0
46	X	4.07	11595	34	80	155	0	0	2	0	2	13	0	0	0	0	1	0	2	0	2	0	0	0	6	0	0	0	0	1	1	0	0	1	0
47	X	4.17	11784	22	44	131	0	0	0	0	1	13	0	0	0	0	5	0	5	0	3	0	0	0	6	0	0	2	0	1	0	0	2	0	0
48	X	4.20	11865	35	30	112	0	0	0	0	4	8	0	0	0	2	2	1	2	0	7	0	1	0	10	1	0	1	0	1	1	0	2	1	0
49	X	4.22	11919	30	45	150	0	0	0	0	4	11	0	0	0	0	2	3	1	0	1	0	0	0	13	1	0	0	0	0	0	0	2	1	0
50	X	4.25	12000	44	35	150	0	0	0	0	2	8	0	0	0	0	3	1	0	0	5	0	0	0	6	3	0	0	0	0	2	0	1	0	0
51	X	4.25	12054	37	30	150	0	1	0	0	6	10	0	0	0	0	7	2	2	0	6	0	0	0	7	1	0	0	0	0	3	0	4	0	0
52	X	4.32	12135	50	10	120	0	0	0	0	1	7	0	0	0	0	1	0	1	0	8	0	0	0	8	2	2	0	0	0	0	0	1	0	0
53	X	4.32	12189	43	17	120	0	0	0	0	2	15	0	0	1	1	2	0	2	0	5	1	0	0	6	2	2	1	0	2	1	0	0	0	0
54	X	4.40	12324	50	11	86	0	0	0	0	2	15	0	0	2	1	4	5	3	0	6	0	0	0	5	1	2	0	1	1	0	0	3	1	0
55	X	4.40	12405	51	6	70	0	0	0	0	1	16	0	0	4	1	2	4	2	0	5	2	0	0	1	2	1	0	0	0	0	0	0	3	1

The algorithm finds the most similar samples, and transfers attribute values directly between them. This can result in duplicated attribute values, as highlighted in the table above. The Pearson correlation between the original ages (stored in pollenX.age) and the assigned ones is 0.9996, so it can be concluded that in spite of its simplicity, this algorithm yields accurate results.

Interpolated transfer

If we consider:

Two samples and of the sequence .
Each one with a value the attribute , and .
One sample of the sequence .
With an unknown attribute .
The multivariate distance between the samples and .
The multivariate distance between the samples and .
The weight , computed as
The weight , computed as

The unknwon value is computed as:

Bt_{k} = w_{i} \times At_{i} + w_{j} \times At_{j}

The code below exemplifies the operation, using the samples 1 and 4 of the dataset pollenGP as and , and the sample 3 as .

#loading data
data(pollenGP)

#samples in A
Ai <- pollenGP[1, 3:ncol(pollenGP)]
Aj <- pollenGP[4, 3:ncol(pollenGP)]

#ages of the samples in A
Ati <- pollenGP[1, "age"]
Atj <- pollenGP[4, "age"]

#sample in B
Bk <- pollenGP[2, 3:ncol(pollenGP)]

#computing distances between Bk, Ai, and Aj
DBkAi <- distance(Bk, Ai)
DBkAj <- distance(Bk, Aj)

#normalizing the distances to 1
wi <- DBkAi / (DBkAi + DBkAj)
wj <- DBkAj / (DBkAi + DBkAj)

#computing  Btk
Btk <- wi * Ati + wj * Atj

The table below shows the observed versus the predicted values for .

temp.df <- data.frame(Observed = pollenGP[3, "age"], Predicted = Btk)
kable(t(temp.df), digits = 4, caption = "Observed versus predicted value in the interpolation of an age based on similarity between samples.")

Observed versus predicted value in the interpolation of an age based on similarity between samples.
Observed	3.9700
Predicted	3.9735

Below we create some example data, where a subset of pollenGP will be the donor of age values, and another subset of it, named pollenX will be the receiver of the age values.

#loading sample dataset
data(pollenGP)

#subset pollenGP to make a shorter dataset
pollenGP <- pollenGP[1:50, ]

#generating a subset of pollenGP 
set.seed(10)
pollenX <- pollenGP[sort(sample(1:50, 40)), ]

#we separate the age column
pollenX.age <- pollenX$age

#and remove the age values from pollenX
pollenX$age <- NULL
pollenX$depth <- NULL

#removing some samples from pollenGP
#so pollenX is not a perfect subset of pollenGP
pollenGP <- pollenGP[-sample(1:50, 10), ]

#prepare sequences
GP.X <- prepareSequences(
  sequence.A = pollenGP,
  sequence.A.name = "GP",
  sequence.B = pollenX,
  sequence.B.name = "X",
  grouping.column = "id",
  time.column = "age",
  exclude.columns = "depth",
  transformation = "none"
)

To transfer attributes from GP to X we use the workflowTransfer function with the option mode = “interpolate”.

#parameters
X.new <- workflowTransfer(
  sequences = GP.X,
  grouping.column = "id",
  time.column = "age",
  method = "euclidean",
  transfer.what = "age",
  transfer.from = "GP",
  transfer.to = "X",
  mode = "interpolated"
  )

kable(X.new[1:15, ], digits = 4, caption = "Result of the transference of an age attribute from one sequence to another. NA values are expected when predicted ages for a given sample yield a higher number than the age of the previous sample.")  %>% 
  row_spec(c(8, 13), bold = T)

Result of the transference of an age attribute from one sequence to another. NA values are expected when predicted ages for a given sample yield a higher number than the age of the previous sample.
	id	age	Abies	Hedera	Plantago	Boraginaceae	Crassulaceae	Ranunculaceae	Rhamnus	Caryophyllaceae	Dipsacaceae	Betula	Acer	Armeria	Tilia	Hippophae	Salix	Labiatae	Valeriana	Umbelliferae	Sanguisorba_minor	Plantago.lanceolata	Campanulaceae	Asteroideae	Fraxinus	Cichorioideae	Rumex	Ranunculus.subgen..Batrachium	Cyperaceae	Corylus	Myriophyllum	Filipendula	Vitis	Rubiaceae	Polypodium
41	X	3.9497	11108	7	0	5	20	13	0	2	1	0	2	41	0	0	0	0	0	0	1	0	8	0	0	0	0	0	0	0	0	60	2	0	0
42	X	3.9711	11324	8	43	60	65	10	0	0	2	4	0	0	0	0	0	0	2	4	0	0	0	0	0	0	0	0	0	0	1	11	0	2	0
43	X	4.0009	11378	44	76	110	0	0	0	0	2	11	0	0	0	0	3	1	3	0	5	0	0	1	6	0	1	1	0	0	0	0	1	1	1
44	X	4.0219	11459	20	73	94	1	0	0	0	1	10	0	2	0	0	3	1	3	0	3	0	0	0	3	0	2	0	0	0	0	0	1	0	0
45	X	4.0522	11514	20	80	100	0	0	0	0	10	4	0	0	1	0	0	0	3	0	1	0	0	0	8	1	0	0	0	1	2	1	0	1	0
46	X	4.1361	11595	34	80	155	0	0	2	0	2	13	0	0	0	0	1	0	2	0	2	0	0	0	6	0	0	0	0	1	1	0	0	1	0
47	X	4.1972	11784	22	44	131	0	0	0	0	1	13	0	0	0	0	5	0	5	0	3	0	0	0	6	0	0	2	0	1	0	0	2	0	0
48	X	NA	11865	35	30	112	0	0	0	0	4	8	0	0	0	2	2	1	2	0	7	0	1	0	10	1	0	1	0	1	1	0	2	1	0
49	X	4.2027	11919	30	45	150	0	0	0	0	4	11	0	0	0	0	2	3	1	0	1	0	0	0	13	1	0	0	0	0	0	0	2	1	0
50	X	4.2237	12000	44	35	150	0	0	0	0	2	8	0	0	0	0	3	1	0	0	5	0	0	0	6	3	0	0	0	0	2	0	1	0	0
51	X	4.2999	12054	37	30	150	0	1	0	0	6	10	0	0	0	0	7	2	2	0	6	0	0	0	7	1	0	0	0	0	3	0	4	0	0
52	X	NA	12135	50	10	120	0	0	0	0	1	7	0	0	0	0	1	0	1	0	8	0	0	0	8	2	2	0	0	0	0	0	1	0	0
53	X	NA	12189	43	17	120	0	0	0	0	2	15	0	0	1	1	2	0	2	0	5	1	0	0	6	2	2	1	0	2	1	0	0	0	0
54	X	4.3501	12324	50	11	86	0	0	0	0	2	15	0	0	2	1	4	5	3	0	6	0	0	0	5	1	2	0	1	1	0	0	3	1	0
55	X	4.4155	12405	51	6	70	0	0	0	0	1	16	0	0	4	1	2	4	2	0	5	2	0	0	1	2	1	0	0	0	0	0	0	3	1

When interpolated values of the age column (transferred attribute via interpolation) show the value NA, it means that the interpolation yielded an age lower than the previous one. This happens when the same and are used to evaluate two or more different samples , and the second is more similar to than the first one. These NA values can be removed with na.omit(), or interpolated with the functions imputeTS::na.interpolation or zoo::na.approx.

Without taking into account these NA values, the Pearson correlation of the interpolated ages with the real ones is 0.9985.

IMPORTANT: the interpretation of the interpolated ages requires a careful consideration. Please, don’t do it blindly, because this algorithm has its limitations. For example, significant hiatuses in the data can introduce wild variations in interpolated ages.

#cleaning workspace
rm(list = ls())

betula	pinus	corylu	junipe	empetr	gramin	cypera	rumex
79	271	36	0	4	7	25	0
113	320	42	0	4	3	11	0
51	420	39	0	2	1	12	0
130	470	6	0	0	2	4	0
31	450	6	0	3	2	3	0
59	425	12	0	0	2	3	0
78	386	29	2	0	0	2	0
71	397	52	2	0	6	3	0
140	310	50	2	0	4	3	0
150	323	34	2	0	11	2	0
175	317	37	2	0	11	3	0
181	345	28	3	0	7	3	0
153	285	36	2	0	8	3	1
214	315	54	2	1	13	5	0
200	210	41	6	0	10	4	0

betula	pinus	corylu	junipe	gramin	cypera	artemi	rumex
19	175	NA	2	34	39	1	0
18	119	28	1	36	44	0	4
30	99	37	0	2	20	0	1
26	101	29	0	0	18	0	0
31	99	30	0	1	10	0	0
24	97	28	0	2	9	0	0
23	105	34	0	1	6	0	0
48	112	46	0	0	12	0	0
29	108	16	0	6	3	0	0
23	110	21	0	2	11	0	1
5	119	19	0	1	1	0	0
30	105	NA	0	9	7	0	0
22	116	17	0	1	7	0	0
24	115	20	0	2	4	0	0
26	119	23	0	4	0	0	0

betula	pinus	corylu	junipe	empetr	gramin	cypera	rumex
79	271	36	0	4	7	25	0
113	320	42	0	4	3	11	0
51	420	39	0	2	1	12	0
130	470	6	0	0	2	4	0
31	450	6	0	3	2	3	0
59	425	12	0	0	2	3	0
78	386	29	2	0	0	2	0
71	397	52	2	0	6	3	0
140	310	50	2	0	4	3	0
150	323	34	2	0	11	2	0
175	317	37	2	0	11	3	0
181	345	28	3	0	7	3	0
153	285	36	2	0	8	3	1
214	315	54	2	1	13	5	0
200	210	41	6	0	10	4	0

betula	pinus	corylu	junipe	gramin	cypera	artemi	rumex
19	175	NA	2	34	39	1	0
18	119	28	1	36	44	0	4
30	99	37	0	2	20	0	1
26	101	29	0	0	18	0	0
31	99	30	0	1	10	0	0
24	97	28	0	2	9	0	0
23	105	34	0	1	6	0	0
48	112	46	0	0	12	0	0
29	108	16	0	6	3	0	0
23	110	21	0	2	11	0	1
5	119	19	0	1	1	0	0
30	105	NA	0	9	7	0	0
22	116	17	0	1	7	0	0
24	115	20	0	2	4	0	0
26	119	23	0	4	0	0	0

betula	pinus	corylu	junipe	empetr	gramin	cypera	rumex
79	271	36	0	4	7	25	0
113	320	42	0	4	3	11	0
51	420	39	0	2	1	12	0
130	470	6	0	0	2	4	0
31	450	6	0	3	2	3	0
59	425	12	0	0	2	3	0
78	386	29	2	0	0	2	0
71	397	52	2	0	6	3	0
140	310	50	2	0	4	3	0
150	323	34	2	0	11	2	0
175	317	37	2	0	11	3	0
181	345	28	3	0	7	3	0
153	285	36	2	0	8	3	1
214	315	54	2	1	13	5	0
200	210	41	6	0	10	4	0

betula	pinus	corylu	junipe	gramin	cypera	artemi	rumex
19	175	NA	2	34	39	1	0
18	119	28	1	36	44	0	4
30	99	37	0	2	20	0	1
26	101	29	0	0	18	0	0
31	99	30	0	1	10	0	0
24	97	28	0	2	9	0	0
23	105	34	0	1	6	0	0
48	112	46	0	0	12	0	0
29	108	16	0	6	3	0	0
23	110	21	0	2	11	0	1
5	119	19	0	1	1	0	0
30	105	NA	0	9	7	0	0
22	116	17	0	1	7	0	0
24	115	20	0	2	4	0	0
26	119	23	0	4	0	0	0