Author: Ville-Petteri Mäkinen
Co-authors: Song Gao, Stefan Mutter, Aaron E Casey
Version: 1.8.2

Abstract: In textbook examples, multivariable datasets are clustered into distinct subgroups that can be clearly identified by a set of optimal mathematical criteria. However, many real-world datasets arise from synergistic consequences of multiple effects, noisy and partly redundant measurements, and may represent a continuous spectrum of the different phases of a phenomenon. In medicine, complex diseases associated with ageing are typical examples. We postulate that population-based biomedical datasets (and many other real-world examples) do not contain an intrinsic clustered structure that would give rise to mathematically well-defined subgroups. From a modeling point of view, the lack of intrinsic structure means that the data points inhabit a contiguous cloud in high-dimensional space without abrupt changes in density to indicate subgroup boundaries, hence a mathematical criteria cannot segment the cloud reliably by its internal structure. Yet we need data-driven classification and subgrouping to aid decision-making and to facilitate the development of testable hypotheses. For this reason, we developed the Numero package, a more flexible and transparent process that allows human observers to create usable multivariable subgroups even when conventional clustering frameworks struggle.

Citation: Gao S, Mutter S, Casey AE, Mäkinen V-P (2018) Numero: a statistical framework to define multivariable subgroups in complex population-based datasets, Int J Epidemiology,

image/svg+xml Map coloring Highregionalaverage(red) Lowregionalaverage(blue)   Data point Highvalue(red) Lowvalue(blue) Scatter plot Structured 2D layout

1 Getting started

1.1 Installation

You can install Numero using the standard procedure:

# Install the package from a remote repository.

The functions come in two flavors: the ones starting with numero provide high level pipeline components that will serve most users, whereas the nro functions perform specific tasks and provide a more granular interface to the package. In this introductory document, we will only use the numero group of functions.

To activate and list the library functions, type

# Activate the library.
## [1] '1.8.2'
##  [1] "nroAggregate"    "nroColorize"     "nroDestratify"   "nroKmeans"      
##  [5] "nroKohonen"      "nroLabel"        "nroMatch"        "nroPermute"     
##  [9] "nroPlot"         ""    "nroPostprocess"  "nroPreprocess"  
## [13] "nroRcppMatrix"   "nroRcppVector"   "nroSummary"      "nroTrain"       
## [17] "numero.clean"    "numero.create"   "numero.evaluate" "numero.plot"    
## [21] "numero.prepare"  "numero.quality"  "numero.subgroup" "numero.summary"

Each Numero function comes with a help page that contains a code example, such as

# Access function documentation (not shown in vignette).
? numero.create

You can run all the code examples in one go by typing

# Run all code examples (not shown in vignette).
fn <- system.file("extcode", "examples.R", package = "Numero")

Please use the tabs at the top to navigate to the next page.

1.2 Dataset

In the examples, we use data on diabetic kidney disease from a previous publication (see the readme file below). While simplified, the data contain enough information to replicate some of the findings from the original study. The main clinical characteristics are summarized in Table 1, and you can print the readme file on the screen by typing

# Show readme file on screen (not shown in vignette).
fn <- system.file("extdata", "finndiane.readme.txt", package = "Numero")
cat(readChar(fn, 1e5))
Table: Summary of the FinnDiane dataset. The mean and standard deviation are reported for each continuous variable. Abbreviations: urinary albumin excretion rate (uALB), triglycerides (TG), high density lipoprotein subclass 2 (HDL2). P-values were estimated by the t-test for continuous variables and by Fisher’s test for binary traits.
Trait No kidney disease Diabetic kidney disease P-value
Men / Women 192 / 196 119 / 106 0.45
Age (years) 38.8 ± 12.2 41.7 ± 9.7 0.0012
Type 1 diabetes duration (years) 25.3 ± 10.3 28.6 ± 7.8 <0.001
Log10 uALB (mg/24h) 1.20 ± 0.51 2.72 ± 0.59 <0.001
Log10 TG (mmol/L) 0.034 ± 0.201 0.159 ± 0.212 <0.001
Total cholesterol (mmol/L) 4.89 ± 0.77 5.35 ± 0.96 <0.001
HDL2 cholesterol (mmol/L) 0.54 ± 0.16 0.51 ± 0.18 0.027
Log10 serum creatinine (µmol/L) 1.94 ± 0.09 2.14 ± 0.24 <0.001
Metabolic syndrome 90 (23.2%) 114 (50.7%) <0.001
Macrovascular disease 16 (4.1%) 38 (16.9%) <0.001
Diabetic retinopathy 133 (34.4%) 178 (79.1%) <0.001
Died during follow-up 13 (3.4%) 43 (19.1%) <0.001

The data are included in the package as a tab-delimited file:

# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname, stringsAsFactors = FALSE)
## [1] 613
##  [1] "INDEX"       "AGE"         "T1D_DURAT"   "MALE"        "MACROVASC"  
##  [6] "METAB_SYNDR" "DIAB_KIDNEY" "DIAB_RETINO" "uALB"        "TG"         
## [11] "CHOL"        "HDL2C"       "CREAT"       "DECEASED"

Please use the tabs at the top to navigate to the next page.

1.3 Data integrity

Most datasets contain missing entries, duplicated rows and other unusable features, therefore it is important to carefully inspect all data points and their identification. In this case, the biochemical and clinical information about individual participants is organized as rows and each row is identified by the variable INDEX. To create a data frame where the row names are set to INDEX, type

# Manage unusable entries and data identification.
dataset <- numero.clean(data = dataset, identity = "INDEX")
## *** numero.clean ***
## Tue Jul 20 21:12:09 2021
## Identities:
## 1 identity columns consumed
## 613 unique row names
## Values, scan 1:
## 613 / 613 rows selected
## 6 binary columns
## 0 factor columns
## 13 numeric columns
## 0 other columns
## Summary:
## 613 usable rows
## 13 usable columns

Almost all functions within Numero work only on numeric values, so if the dataset contains factors or other categorical data types, please convert them to integer labels before analyses.

2 Basic pipeline

2.1 Prepare

Before any analysis, we recommend using numero.create to check that the data are in usable format (the previous section on data integrity).

A typical usage case of the Numero framework involves a set of training variables that are considered to explain medical or other outcomes. Here, we use the biochemical data as the training variables under the general hypothesis that the metabolic profile (as measured by biochemistry) predicts adverse clinical events. The kidney disease dataset includes five biochemical measures:

# Select training variables.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")

Centering and scaling are basic machine learning techniues to ensure that the variables with large values due to the choice of the measurement unit do not bias the results. To create a standardized training set, type

# Center and scale the training data.
trdat.basic <- scale.default(dataset[,trvars])

All the training variables have now unit variance and we can expect their impact on the SOM to be dependent only on the information content and statistical associations:

# Calculate standard deviations.
apply(trdat.basic, 2, sd, na.rm = TRUE)
##     1     1     1     1     1

Please use the tabs at the top to navigate to the next page.

2.2 Create

Conceptually, the SOM algorithm is analogous to a human observer who wants to make sense of a set of objects. Suppose there is a pile of portrait photos of ordinary people on a round table, and your task is to organize them on the table in such a way that similar photos are placed next to each other and, by consequence, dissimilar photos end up on the opposite sides of the table. During the organization process, your brain interprets the multi-variable visual data (age, gender, ethnicity, hair style, head shape, etc.) and projects it onto two dimensions (vertical and horizontal positions).

The SOM algorithm, also known as Kohonen map, mimics the organization process and was originally inspired by the plasticity of neuronal networks. As this is a self-organizing process, the starting point may have an impact on the outcomes. We have added an extra initialization step to ensure repeatable results. The function numero.create applies K-means clustering to create an initial map model, and then refines the model using the SOM algorithm:

# Create a new self-organizing map based on the training set.
modl.basic <- numero.create(data = trdat.basic)
## *** numero.create ***
## Tue Jul 20 21:12:09 2021
## Modeling setup:
## automatic radius set to 3
## automatic smoothness set to 1.21
## 613 / 613 rows included
## 5 / 5 columns included
## K-means clustering:
## 543 subsamples
## 27 training cycles
## Self-organizing map:
## 582 subsamples
## 39 training cycles
## Color amplitude:
## 20000 permutations
## reference Z-score 9.79
##        Length Class      Mode     
## stamp     1   -none-     character
## kmeans    4   -none-     list     
## map       5   -none-     list     
## layout    4   data.frame list     
## data   3065   -none-     numeric  
## zbase     1   -none-     numeric

The resulting data structure contains the SOM itself and additional information that is needed for further statistical analyses, please see the manual pages for details.

The output from numero.create contains the positions of the data points on a virtual round “table”; we use the term map instead of table, and refer to the positions as the layout of a dataset on the SOM. The results also contain an internal model of the training dataset; the model comprises a set of district centroids that summarize the salient features of the training dataset, please see the section on terminology for further descriptions.

Please use the tabs at the top to navigate to the next page.

2.3 Quality

After the training is complete, it is prudent to examine the SOM for potential problems with the data. The Numero package provides three different quality tools:

Histogram: The distribution of data points on the map can be expressed as a histogram that shows the counts of data points in the districts. If any areas of the SOM are devoid, or disproportionally populated by data points, it usually reflects unusual or discontinuous value distributions within the data (it can also be caused by poorly chosen preprocessing methods). The histogram is rarely flat and up to two-fold differences between the least and most populated districts are common, but if the pattern is more extreme, careful examination of the distributions of the input data in relation to the preprocessing procedures is warranted.

Coverage: The SOM is instrinsically robust against missing values if the missingness pattern is close to random. We define data coverage as the mean ratio of usable values per a multi-variable data point. If the coverage drops substantially in certain districts compared to others, it is possible that the data point layout has been biased by a non-random pattern of missing values.

Model fit: The location of each data point on the map depends on how similar it is to district centroids; the district with the most similar centroid is chosen as the optimal location. The similarity of a centroid and a data point can be quantified using mathematical formulas, such as the Euclidean distance. The Numero package provides two measures. The first is the residual distance \(d\) between a data point and a centroid. The second version is indpependent of scale and is calculated as \(z = (d - d_{mean}) / d_{sd}\), where \(d_{mean}\) and \(d_{sd}\) are the mean residual and standard deviation of residuals in the training set.

# Calculate map quality measures for the training data.
qc.basic <- numero.quality(model = modl.basic)
## *** numero.quality ***
## Tue Jul 20 21:12:10 2021
## Quality statistics:
## 4 quality measures
## 2082 permutations
## Map statistics:
## 20000 permutations
## reference Z-score 9.79

Outliers can be detected by plotting the distribution of quality measures:

# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.basic$layout$RESIDUAL, breaks = 50,
     main = NULL, xlab = "RESIDUAL", ylab = "Number of data points",
     col = "#FFEFA0", cex = 0.8)
hist(x = qc.basic$layout$RESIDUAL.z, breaks = 50,
     main = NULL, xlab = "RESIDUAL.z", ylab = "Number of data points",
     col = "#FFEFA0", cex = 0.8)
Figure: Distribution of model residuals.

Figure: Distribution of model residuals.

In this example, the distribution of residuals is heavily skewed and there are a few data points that are several standard deviations from the mean (RESIDUAL.z). The skewness is often due to skewed distributions in the data; urinary albumin and serum creatinine are highly skewed towards larger values. The Numero framework includes automatic procedures based on log-transforms or rank-based preprocessing to mitigate skewness. We will explore these in separate examples later in the document.

The quality measures can be visualized on the map to reveal subgroups of problematic data points:

# Plot map quality measures.
numero.plot(results = qc.basic, subplot = c(1,4))