Introduction

Topological data analysis is a relatively new area of data science which can compare and contrast data sets via their non-linear global structure. The main tool of topological data analysis, persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005), builds on techniques from the field of algebraic topology to describe shape features present in a data set (stored in a “persistence diagram”). Persistent homology has been used in a number of applications across many domains.

For a broad introduction to the mathematical background and main tools of topological data analysis with applications, see (Chazal and Michel 2017; G. Carlsson and Vejdemo-Johansson 2021).

Traditional data science pipelines in academia and industry focus on machine learning and statistical inference for structured (tabular) data.

While persistence diagrams have been found to be a useful summary of datasets in many domains, they are not structured data and therefore require special analysis methods. Some papers (for example Robinson and Turner 2017; Le and Yamada 2018) have described post-processing pipelines for analyzing persistence diagrams built on distance (Kerber, Morozov, and Nigmetov 2017) and kernel (Le and Yamada 2018) calculations; however, these papers lack publicly available implementations in R (and Python), and many more data science methods are possible using such calculations (Murphy 2012; Scholkopf, Smola, and Muller 1998; Gretton et al. 2007; Cox and Cox 2008; Dhillon, Guan, and Kulis 2004).

TDApplied is the first R package which provides applied analysis implementations of published methods for analyzing persistence diagrams using machine learning and statistical inference. Its functions contain highly optimized and scalable code (see the package vignette “Benchmarking and Speedups”) and have been tested and validated (see the package vignette “Comparing Distance Calculations”). TDApplied can interface with other data science packages to perform powerful and flexible analyses (see the package vignette “Personalized Analyses with TDApplied”), and an example usage of TDApplied on real data has been demonstrated (see the package vignette “Human Connectome Project Analysis”).

This vignette documents the background of TDApplied functions and their usage on simulated data by considering a typical workflow for topological data analysis:

  1. Computing and comparing persistence diagrams.
  2. Visualizing and interpreting persistence diagrams.
  3. Analyzing statistical properties of groups of persistence diagrams.
  4. Finding latent structure in groups of persistence diagrams.
  5. Predicting labels from persistence diagram structure.

To start we must load the TDApplied package:

library("TDApplied")

Let’s get started!

Computing and Comparing Persistence Diagrams

Computing Diagrams and TDApplied’s PyH Function

The main tool of topological data analysis is called persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005). Persistent homology starts with either data points and a distance function, or a distance matrix storing distance values between data points. It assumes that these points arise from a dataset with some kind of “shape”. This “shape” has certain features that exist at various scales, but sampling induces noise in these features. Persistent homology aims to describe certain mathematical features of this underlying shape, by forming approximations to the shape at various distance scales. The mathematical features which are tracked include clusters (connected components), loops (ellipses) and voids (spheres), which are examples of cycles (i.e. different types of holes). The homological dimensions of these features are 0, 1 and 2, respectively. What is interesting about these particular mathematical features is that they can tell us where our data is not, which is important information that other data analysis methods cannot provide.

The persistent homology algorithm proceeds in the following manner: first, if the input is a dataset and a distance metric, then the distance matrix, storing the distance value for each pair of points in the dataset, is computed. Next, a parameter \(\epsilon \geq 0\) is grown starting at 0, and at each \(\epsilon\) value we compute a shape approximation of the dataset \(C_{\epsilon}\), called a simplicial complex or, in this case, a Rips complex. We construct \(C_{\epsilon}\) by connecting all pairs of points whose distance is at most \(\epsilon\). To encode higher-dimensional structure in these approximations, we also add a triangle between any triple of points which are all pairwise connected, a tetrahedron between any quadruple of points which are all pairwise connected, etc. Note that this process of forming a sequence of skeletal approximations is called a Vietoris-Rips filtration, and other methods exist for forming the approximations.
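As a concrete illustration of the Rips rule at a single scale, the following toy sketch (illustrative only, not TDApplied code; X and eps are hypothetical) finds all pairs of points joined by an edge at a given \(\epsilon\):

# a hypothetical point cloud and a single connectivity scale
X <- cbind(x = runif(20), y = runif(20))
eps <- 0.5

# pairwise distance matrix of the points
D <- as.matrix(dist(X))

# pairs of points joined by an edge in the Rips complex at scale eps
edges <- which(D <= eps & upper.tri(D), arr.ind = TRUE)
head(edges)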

At any given \(\epsilon\) value, some topological features will exist in \(C_{\epsilon}\). As \(\epsilon\) grows, the \(C_{\epsilon}\)’s will contain each other, i.e. if \(\epsilon_{1} < \epsilon_{2}\), then every edge (triangle, tetrahedron, etc.) in \(C_{\epsilon_1}\) will also be present in \(C_{\epsilon_2}\). Each topological feature of interest will be “born” at some \(\epsilon_{birth}\) value and “die” at some \(\epsilon_{death}\) value – certainly each feature will die once the whole dataset is connected and has trivial shape structure. Consider the example of a loop – a loop is “born” when the last edge around its circumference is added (at the \(\epsilon\) value which is the largest distance between consecutive points around the loop), and “dies” when enough connections across the loop fill in its hole. Since the topological features are tracked across multiple scales, we can estimate their (static) location in the data, i.e. find the points lying on these structures, by calculating what are called representative cycles.

The output of persistent homology, a persistence diagram, has one 2D point for each topological feature found in the filtration process in each desired homological dimension, where the \(x\)-value of the point is the birth \(\epsilon\)-value and the \(y\)-value is the death \(\epsilon\)-value. Hence every point in a persistence diagram lies above the diagonal – features die after they are born! The difference of a point’s \(y\)- and \(x\)-values, \(y - x\), is called the “persistence” of the corresponding topological feature. Points with large persistence likely represent real topological features of the dataset, whereas points with small persistence likely represent topological noise.
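Since TDApplied stores persistence diagrams as data frames with dimension, birth and death columns, persistence values can be computed directly. A small sketch, assuming diag is such a data frame (the format returned by PyH later in this vignette):

# persistence of each topological feature
persistence <- diag$death - diag$birth

# features sorted from most to least persistent
diag[order(persistence, decreasing = TRUE), ]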

For a more practical and scalable computation of persistence diagrams, a method called persistent cohomology has been introduced (Silva and Vejdemo-Johansson 2011b, 2011a) which calculates exactly the same output as persistent homology (with “representative co-cycles” analogous to persistent homology’s representative cycles), just much faster. Persistent cohomology is implemented in the C++ library ripser (Bauer 2015), which is wrapped in R in the TDAstats package (Wadhwa et al. 2019) and in Python in the ripser module. However, it was observed in simulations that the Python implementation of ripser seemed faster, even when called from R via the reticulate package (Ushey, Allaire, and Tang 2022) (see the package vignette “Benchmarking and Speedups”). Therefore, the TDApplied PyH function has been implemented as a wrapper of the Python ripser module for fast calculation of persistence diagrams.

There are three prerequisites that must be satisfied in order to use the PyH function:

  1. The reticulate package must be installed.
  2. Python must be installed and configured to work with reticulate.
  3. The ripser Python module must be installed, via reticulate::py_install("ripser"). Some Windows machines have had issues with recent versions of the ripser module, but version 0.6.1 has been tested and does work on Windows. So, Windows users may try reticulate::py_install("ripser==0.6.1") (a minimal setup check is sketched after this list).
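The following sketch (illustrative only, not part of TDApplied) checks these three prerequisites in order:

# verify reticulate is installed and Python is available before installing ripser
if (requireNamespace("reticulate", quietly = TRUE)) {
  if (reticulate::py_available(initialize = TRUE)) {
    # Windows users may need reticulate::py_install("ripser==0.6.1") instead
    reticulate::py_install("ripser")
  }
}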

For a sample use of PyH we will use the pre-loaded dataset called “circ” (which is stored as a data frame in this vignette).
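A comparable dataset, a hypothetical stand-in rather than the vignette’s exact construction, would be 50 points sampled from the unit circle (the output below would differ slightly for a re-sampled dataset):

# a hypothetical stand-in for the pre-loaded circ dataset
set.seed(42)
theta <- runif(50, min = 0, max = 2 * pi)
circ <- data.frame(x = cos(theta), y = sin(theta))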

We would then calculate the persistence diagram as follows:

# import the ripser module
ripser <- import_ripser()

# calculate the persistence diagram
diag <- PyH(X = circ, maxdim = 1, thresh = 2, ripser = ripser)

# view last five rows of the diagram
diag[47:51,]
#>    dimension     birth     death
#> 47         0 0.0000000 0.2545522
#> 48         0 0.0000000 0.2813237
#> 49         0 0.0000000 0.2887432
#> 50         0 0.0000000 2.0000000
#> 51         1 0.5579783 1.7385925
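Since the returned diagram is a plain data frame, standard subsetting applies; for example, we can extract only the dimension-1 features (here, the single loop in circ):

# extract the loop (dimension-1 feature) from the diagram
diag[diag$dimension == 1, ]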

In the package vignette “Benchmarking and Speedups”, simulations are used to demonstrate the practical advantages of using PyH to calculate persistence diagrams compared to other alternatives.

Note that the installation status of Python for PyH is checked using the function reticulate::py_available(), which according to several online forums does not always behave as expected. If error messages regarding Python not being installed occur when using TDApplied functions, then we recommend consulting online resources to ensure that the py_available function returns TRUE on your system. Due to the complicated dependencies required to use the PyH function, it is only an optional function in the TDApplied package, and therefore reticulate is only a suggested (not imported) dependency of TDApplied.

Determining the Radius Threshold with TDApplied’s enclosing_radius Function

One of the most important parameters in a persistent (co)homology calculation is the maximum connectivity radius of the filtration (the thresh parameter in the PyH function). A small radius threshold may prematurely stop the filtration before we can identify real large-scale features of our data, while an overly large threshold could quickly cause the calculations to become intractable. While there is no method that can determine an optimal balance between these two extremes, a radius threshold exists, called the “enclosing radius”, beyond which the topology of the dataset is trivial (i.e. there are no topological features except for one connected component). This radius is obtained by calculating, for each data point, the maximum distance from that point to all other points, and taking the minimum of these values.
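The computation is simple enough to sketch by hand; assuming X is a data frame of point coordinates, the following illustrative code (not TDApplied’s internal implementation) reproduces the definition:

# pairwise distance matrix of the points
D <- as.matrix(dist(X))

# furthest distance from each point to any other point
row_maxes <- apply(D, 1, max)

# the enclosing radius is the smallest of these maxima
min(row_maxes)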

The enclosing radius of a dataset can be computed using TDApplied’s enclosing_radius function. Consider the topology of a dataset, new_data, sampled from a circle together with its center point.
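new_data is pre-loaded in this vignette; a hypothetical construction of such a dataset might look like:

# hypothetical stand-in: 50 points on the unit circle, plus the center point
theta <- seq(0, 2 * pi, length.out = 51)[-51]
new_data <- data.frame(x = c(cos(theta), 0), y = c(sin(theta), 0))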

Without knowing anything about the structure of our new dataset we would not know which distance threshold to use for persistent homology calculations. We would compute the enclosing radius as follows:

enc_rad <- enclosing_radius(new_data, distance_mat = FALSE)
print(enc_rad)
#> [1] 1

This radius is the distance from the center point to all the points on the circumference; at this threshold the whole graph becomes connected and the loop is filled in.
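The enclosing radius can then be supplied as the thresh parameter of a persistence calculation, for example (reusing the ripser object imported earlier):

# calculate the persistence diagram of new_data up to the enclosing radius
diag_new <- PyH(X = new_data, maxdim = 1, thresh = enc_rad, ripser = ripser)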