Analysis of Network Community Structures Using the CommKern Package

library(CommKern)
## 
## Attaching package: 'CommKern'
## The following object is masked from 'package:stats':
## 
##     kernel

1 Introduction

In the past decade, network analysis has become a popular approach across a variety of fields, including genomics, metabolomics, ecology, social networks, and neuroimaging. Brain networks are a complex system of structural-functional couplings within a hierarchy of mesoscopic interactions. Because of the unique aspects of brain connectivity, many popular graph theoretic measures fail to capture its complex dynamics. The CommKern package was created to address these limitations in the analysis of network community structures through two main components. The first is a hierarchical, multimodal spinglass algorithm that is computationally efficient and robustly identifies communities using multiple sources of information. The flexibility of its multimodal and hierarchical aspects sets this algorithm apart from existing methods, which may allow only one source of information or lack a hierarchical option. The second is a semiparametric kernel test of association, which uses a kernel to model a variety of cluster evaluation metrics without making any assumption about the parametric form of their association with the outcome. In addition, because the model is semiparametric, potential confounders can be controlled for within a parametric framework, allowing for ease of interpretation of parameter estimates, should they be desired.

This vignette is designed to provide an overview of CommKern’s functionality using the simulated datasets provided within the package. While focusing on applications to neuroimaging, the methodologies within this package are flexible enough to be applied to any type of network data, which will be pointed out in the vignette when applicable.

2 Provided Data Set(s)

Several simulated datasets have been provided within the package.

2.1 SBM_net

SBM_net is a dataset containing multimodal network information simulated to emulate functional and structural connectivity within a nested hierarchical structure.

matrix_plot(SBM_net)

The SBM_net object is the result of calling matrix_to_df.

net <- matrix_to_df(func_mat = SBM_net$func_mat, str_mat = SBM_net$str_mat)
identical(net, SBM_net)
## [1] TRUE

Network data used within the package's community detection algorithm takes a specific form. Using functional and structural adjacency matrices as inputs, the matrix_to_df function creates a list of five components: dataframes of functional and structural edge weights, a dataframe of vertex information, and the functional and structural adjacency matrices:

str(SBM_net)
## List of 5
##  $ func_edges :'data.frame': 1233 obs. of  3 variables:
##   ..$ func_start_node: int [1:1233] 2 3 4 5 6 7 8 9 10 11 ...
##   ..$ func_end_node  : int [1:1233] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ func_weight    : num [1:1233] 0.715 0.734 0.739 0.719 0.769 ...
##  $ str_edges  :'data.frame': 453 obs. of  3 variables:
##   ..$ str_start_node: int [1:453] 2 3 5 6 7 8 9 10 13 15 ...
##   ..$ str_end_node  : int [1:453] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ str_weight    : num [1:453] 0.611 0.598 0.601 0.617 0.594 ...
##  $ vertexes   :'data.frame': 80 obs. of  5 variables:
##   ..$ node_id    : int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ node_label : logi [1:80] NA NA NA NA NA NA ...
##   ..$ func_degree: num [1:80] 11.2 11.3 11 11.5 11.1 ...
##   ..$ str_degree : num [1:80] 5.81 5.65 6.27 4.03 3.37 ...
##   ..$ community  : logi [1:80] NA NA NA NA NA NA ...
##  $ func_matrix: num [1:80, 1:80] 0 0.715 0.734 0.739 0.719 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##  $ str_matrix : num [1:80, 1:80] 0 0.611 0.598 0 0.601 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##  - attr(*, "class")= chr "spinglass_net"

2.2 simasd_covars

This dataframe is a simulated dataset of demographics based on summary statistics for a random subset of the ABIDE pre-processed database.

str(simasd_covars)
## 'data.frame':    49 obs. of  8 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dx_group    : num  1 1 1 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 1 0 1 1 0 0 0 1 ...
##  $ age         : num  25.2 18.9 16.1 27.6 37.6 20.6 7.7 21.6 19 25.5 ...
##  $ handedness  : num  2 0 0 0 1 0 1 0 0 0 ...
##  $ fullscale_IQ: num  107 123 100 109 114 130 118 120 101 106 ...
##  $ verbal_IQ   : num  105 125 96 107 118 136 122 124 103 103 ...
##  $ nonverbal_IQ: num  112 120 109 113 113 124 116 117 104 112 ...

2.3 simasd_comm_df

This is a dataset of partitions of nodes to communities from simulated group-level networks with community structures. This dataset is complementary to the simasd_covars dataset and is derived from the simasd_array dataset.

str(simasd_comm_df, max.level = 0)
## 'data.frame':    80 obs. of  49 variables:

2.4 simasd_hamil_df

This is a dataset of Hamiltonian values from simulated group-level networks with community structure. This dataset is complementary to the simasd_covars dataset and is derived from the simasd_array dataset.

str(simasd_hamil_df)
## 'data.frame':    49 obs. of  2 variables:
##  $ id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hamil: num  -342 -347 -327 -361 -363 ...

2.5 simasd_array

This is a dataset containing an array of simulated adjacency matrices. The dimensions of each matrix are 80 x 80, for a total of 49 simulated networks. This simulated array is the basis of the simasd_hamil_df and simasd_comm_df datasets and is complementary to the simasd_covars dataframe.

str(simasd_array)
##  num [1:49, 1:80, 1:80] 0 0 0 0 0 0 0 0 0 0 ...

3 Hierarchical Multimodal Spinglass (HMS) Algorithm

The hierarchical multimodal spinglass (HMS) algorithm was created as a way to address the unique characteristics of brain connectivity, specifically focusing on the hierarchical nature of brain network segregation and integration and on the structure-function relationship.

3.1 Derivation of the Algorithm

First introduced by Reichardt and Bornholdt in 2006, the spinglass algorithm relies on the analogy between the Potts spinglass model (a popular statistical mechanics model) and community detection, where any community partitioning is likened to a corresponding spin configuration and the problem of community detection reduces to finding the ground state of the Potts spinglass model. Building on the idea of a quality function which, when optimized, represents the ideal partitioning of a network, they determined that a quality function should embody four components: it should (a) reward internal edges between nodes of the same group/spin state; (b) penalize missing edges (non-links) between nodes in the same group; (c) penalize existing edges between different groups/spin states; and (d) reward non-links between different groups. Using these requirements, they built the following function: \[\begin{equation} \begin{split} \mathcal{H}\left(\left\{\mathbf{C}\right\}\right) = & -\sum_{i \neq j}a_{ij}\underbrace{W_{ij} \delta \left(C_i,C_j\right)}_{\text{internal links}} + \sum_{i \neq j}b_{ij}\underbrace{\left(1-W_{ij}\right) \delta \left(C_i,C_j\right)}_{\text{internal non-links}} \\ & + \sum_{i \neq j}c_{ij}\underbrace{W_{ij} \left(1-\delta \left(C_i,C_j\right)\right)}_{\text{external links}} - \sum_{i \neq j}d_{ij}\underbrace{\left(1-W_{ij}\right) \left(1-\delta \left(C_i,C_j\right)\right)}_{\text{external non-links}}, \end{split} \end{equation}\] where \(W_{ij}\) denotes the weighted adjacency matrix, with \(W_{ij} > 0\) if an edge is present and zero otherwise, \(C_{i} \in \left\{1,2,...,q\right\}\) denotes the group index (or spin state) of node \(i\) in the graph, \(\delta\) is the Kronecker delta, equaling one if \(C_i\) and \(C_j\) belong to the same community and zero otherwise, and \(a_{ij}, b_{ij}, c_{ij}, d_{ij}\) denote the weights of the individual contributions, respectively. One of the appeals of the spinglass algorithm is that the number of groups \(q\) only determines the maximum number of groups allowed to exist and, in principle, can be as large as the total number of nodes in the network, \(N\). All group indices specified by \(q\) are available during the optimal assignment of nodes to communities, though some may be left unpopulated in the ground state.

Generally, links and non-links are weighted equally, \(a_{ij}=c_{ij}\) and \(b_{ij}=d_{ij}\), which simplifies the Hamiltonian so that only internal links and non-links need to be considered. A convenient choice for \(a_{ij}\) and \(b_{ij}\) is \(a_{ij} = 1-\gamma p_{ij}\) and \(b_{ij} = \gamma p_{ij}\), where \(p_{ij}\) is the probability that a non-zero edge exists between nodes \(i\) and \(j\), normalized such that \(\sum_{i \neq j}p_{ij} = 2M\), where \(M\) is the total weight (or sum of all edge weights) of the graph. Adopting these settings, we can simplify the above equation to

\[\begin{equation} \begin{split} \mathcal{H}\left(\left\{\mathbf{C}\right\}\right) = & -\sum_{i \neq j}\left(W_{ij} - \gamma p_{ij}\right) \delta \left(C_{i},C_{j}\right). \end{split} \end{equation}\]
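As a concrete illustration, the following sketch computes this simplified Hamiltonian directly from a weighted adjacency matrix and a candidate community assignment. The configuration-model choice \(p_{ij}=k_ik_j/2M\) is an assumption made here for illustration (it satisfies the normalization above), and the hamiltonian_sketch helper is hypothetical, not part of CommKern:

# A minimal sketch of the simplified Hamiltonian, assuming the
# configuration-model null p_ij = k_i * k_j / (2M)
hamiltonian_sketch <- function(W, comms, gamma = 1) {
  k <- rowSums(W)                      # weighted node degrees
  M <- sum(W) / 2                      # total edge weight of the graph
  P <- outer(k, k) / (2 * M)           # null-model edge expectation
  delta <- outer(comms, comms, "==")   # Kronecker delta of spin states
  diag(delta) <- FALSE                 # restrict the sum to i != j
  -sum((W - gamma * P) * delta)
}

# Example: an arbitrary bipartition of SBM_net's 80-node functional matrix
hamiltonian_sketch(SBM_net$func_matrix, rep(1:2, each = 40))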

Expanding on the original algorithm of Reichardt and Bornholdt, in 2012, Eaton and Mansbach introduced the concept of semi-supervised community detection to the spinglass model. When prior knowledge is available to inform the community search, either knowledge of a specific community of interest or partial knowledge of community memberships, incorporating it can help compensate for potential noise in the network. As the Hamiltonian captures the total energy of the Potts model, external knowledge can be incorporated into community detection by penalizing community structures that violate the guidance. Generally, let the disagreement of the communities with a given guidance be specified by a function \(U: C \rightarrow \mathbb{R}\), where \(U(C)\) is smaller when the communities \(C\) agree with the guidance and larger when they disagree, with the following form:

\[\begin{equation} U(C) = \sum_{i \neq j} \left( u_{ij} \left( 1-\delta\left(C_{i},C_{j}\right) \right) \pm \bar{u}_{ij} \delta \left(C_{i},C_{j}\right)\right), \end{equation}\] where \(u_{ij}\) is the penalty for violating guidance that \(v_{i}\) and \(v_{j}\) belong to the same community and \(\bar{u}_{ij}\) is the penalty for violating guidance that \(v_{i}\) and \(v_{j}\) belong to different communities. When incorporating the guidance into the Hamiltonian,

\[\begin{equation} \mathcal{H}'\left( C \right) = \mathcal{H}\left( C \right) + \mu \sum_{i \neq j} \left( u_{ij} - \left(u_{ij}-\bar{u}_{ij}\right) \delta \left(C_{i},C_{j}\right)\right), \end{equation}\] where \(\mu \geq 0\) controls the balance between the inherent community structure and the external guidance. Rewriting \(\mathcal{H}'\) to be more conducive to optimization, we can decompose it as

\[\begin{equation} \mathcal{H}'\left( C \right) = -\sum_{i \neq j}M_{ij}\delta\left(C_{i},C_{j}\right) - \mu \sum_{i \neq j}\Delta U_{ij} \delta\left(C_{i},C_{j}\right)+K, \end{equation}\] where \(M_{ij}=W_{ij}-\gamma p_{ij}\), \(\Delta U_{ij}=u_{ij} - \bar{u}_{ij}\), and all constant terms are contained within \(K = \mu \sum_{i \neq j}u_{ij}\), which can ultimately be dropped from the optimization problem.

Adapting Eaton and Mansbach’s method for multimodal community detection in the human brain, our modified Hamiltonian takes the following form,

\[\begin{equation} \mathcal{H}'\left( C \right) = - \sum_{i \neq j} M_{ij}\delta\left(C_{i},C_{j}\right) - \alpha \sum_{i \neq j} S_{ij} \delta\left(C_{i},C_{j}\right), \end{equation}\] where \(M_{ij}\) is the modularity matrix associated with functional connectivity, \(S_{ij}\) is the structural connectivity matrix, and \(\alpha \geq 0\) controls the balance between the community structure determined solely by functional connectivity and being informed by the structural connectivity via white matter fiber tract information. In contrast to Eaton and Mansbach’s approach, where penalties are applied when violating guidance for nodes belonging both to the same community or different communities, our method only decreases the total energy of the model when white matter fiber tracts exist between regions \(v_i\) and \(v_j\); otherwise, the guidance term is set to zero, with the algorithm deciding the community assignments for \(v_i\) and \(v_j\) based solely on their functional connectivity.

To then study the hierarchical community structure of these networks, we implement an iterative method. Specifically, we create a multilayer network, where communities from layer \(k\) each become subnetworks in layer \(k+1\) wherein the multimodal spinglass algorithm is applied. The number of iterations (or layers) is pre-specified in the algorithm to avoid the need for stopping criteria. The number of communities \(q\) is held constant across each layer of the network. An additional parameter, \(\tau\), is also specified a priori, controlling the acceptance rate of the algorithm.

3.2 Implementation

Before implementing the algorithm within the CommKern package, we start from a common brain atlas (e.g., Automated Anatomical Labeling (AAL), Harvard-Oxford (HO), Craddock 200 (CC200), etc.), where adjacency matrices representing functional and structural connectivity are derived from the correlation between resting state functional MRI (rs-fMRI) blood oxygen level dependent (BOLD) time series or the number of white matter fiber tracts computed from diffusion tensor MRI, respectively.

Once we have structural and functional adjacency matrices, we must create the network dataframe used within the hms algorithm function. This is done using the matrix_to_df function (note: the SBM_net dataset is already in this form for the end user). Both a functional and a structural matrix must be specified for matrix_to_df to work. If no structural (or guidance) matrix is available, the end user can simply supply a matrix of \(0\)s of the same dimensions; this will not affect the end results of the hms function.

net <- matrix_to_df(func_mat = SBM_net$func_mat, str_mat = SBM_net$str_mat)
identical(net, SBM_net)
## [1] TRUE
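For example, a minimal sketch of preparing a functional-only network in this way (zero_str and net_func_only are placeholder names):

# Sketch: when no structural/guidance matrix exists, a zero matrix of
# matching dimensions can stand in without affecting the hms results
zero_str <- matrix(0, nrow = nrow(SBM_net$func_matrix),
                   ncol = ncol(SBM_net$func_matrix))
net_func_only <- matrix_to_df(func_mat = SBM_net$func_matrix, str_mat = zero_str)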

From here, we can implement the hms function. There are several input parameters to discuss.

str(hms)
## function (input_net, spins, alpha, coolfact, tol, max_layers)

The spins parameter denotes the maximum number of spins, or communities, the algorithm can utilize during optimization. The integer value of spins must lie within \(\left[2,\texttt{ # of network nodes}\right]\). However, the higher the number of spins, the more complex the optimization process, leading to longer computation times. If the number of communities is unknown, we recommend an exploratory analysis with less restrictive parameters to see how many communities the algorithm tends to settle on, as sketched below.
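A minimal sketch of such an exploratory run (the spins value of 10 is an arbitrary, generous budget, not a package default):

# Exploratory sketch: allow up to 10 spins, then count how many distinct
# communities are actually populated in layer 1
explore <- hms(input_net = SBM_net, spins = 10, alpha = 0,
               coolfact = 0.99, tol = 0.01, max_layers = 1)
length(unique(explore$comm_layers_tree$layer_1))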

The alpha parameter is a numeric parameter \(\left(\geq 0\right)\) that balances the use of the guidance matrix in the algorithm’s optimization. When \(\alpha=0\), the algorithm ignores the guidance matrix completely and when \(\alpha=1\), the algorithm weights the functional and structural inputs equally.

The coolfact parameter indicates how quickly (or slowly) to cool the heatbath algorithm, which is used to find the optimal partitioning of nodes to communities. This parameter must be in \(\left(0,1\right)\), but is usually set to be \(0.95-0.99\).

The tol parameter indicates the tolerance level for accepting proposed changes within a temperature; at the end of each sweep, the number of proposed changes to the partition is assessed to see whether it exceeds a threshold determined as a function of tol and spins. tol is typically set to 0.01-0.05.

Finally, max_layers is the parameter controlling the number of layers of communities within the network. At least one layer must be specified in the algorithm. End users can specify the number of layers to be any positive whole number but several considerations should be made. First, as the number of layers increases, so does the computational time. Second, over-specification of the community layers can create singleton issues (a single node in a community), leading to the algorithm throwing an error. Finally, the number of layers specified should be data driven; therefore, we cannot provide a universally optimal number of layers to specify.

Using the SBM_net dataset as an example, below we have run the hms algorithm to find the first layer of bipartitioning, suppressing the guidance matrix in the optimization.

The output from the hms function call is a list with three components: the dataframe of community assignments by layer (comm_layers_tree), the network object (net), and the optimized Hamiltonian value (best_hamiltonian):

hms_object <-
  hms(
      input_net = SBM_net,
      spins = 2,
      alpha = 0,
      coolfact = 0.99,
      tol = 0.01,
      max_layers = 1)
str(hms_object)
## List of 3
##  $ comm_layers_tree:'data.frame':    80 obs. of  2 variables:
##   ..$ node_id: int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ layer_1: int [1:80] 2 2 2 2 2 2 2 2 2 2 ...
##  $ net             :List of 5
##   ..$ func_edges :'data.frame':  1233 obs. of  3 variables:
##   .. ..$ func_start_node: int [1:1233] 2 3 4 5 6 7 8 9 10 11 ...
##   .. ..$ func_end_node  : int [1:1233] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ func_weight    : num [1:1233] 0.715 0.734 0.739 0.719 0.769 ...
##   ..$ str_edges  :'data.frame':  453 obs. of  3 variables:
##   .. ..$ str_start_node: int [1:453] 2 3 5 6 7 8 9 10 13 15 ...
##   .. ..$ str_end_node  : int [1:453] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ str_weight    : num [1:453] 0.611 0.598 0.601 0.617 0.594 ...
##   ..$ vertexes   :'data.frame':  80 obs. of  5 variables:
##   .. ..$ node_id    : int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ node_label : logi [1:80] NA NA NA NA NA NA ...
##   .. ..$ func_degree: num [1:80] 11.2 11.3 11 11.5 11.1 ...
##   .. ..$ str_degree : num [1:80] 5.81 5.65 6.27 4.03 3.37 ...
##   .. ..$ community  : logi [1:80] NA NA NA NA NA NA ...
##   ..$ func_matrix: num [1:80, 1:80] 0 0.715 0.734 0.739 0.719 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   ..$ str_matrix : num [1:80, 1:80] 0 0.611 0.598 0 0.601 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   ..- attr(*, "class")= chr "spinglass_net"
##  $ best_hamiltonian: num -286
##  - attr(*, "class")= chr "spinglass_hms"

The HMS algorithm is flexible enough that either the multimodal or the hierarchical aspect can be suppressed. To suppress the multimodal aspect, set \(\alpha=0\); to suppress the hierarchical aspect, set \(\text{max_layers}=1\). If both are suppressed, the algorithm simplifies to the spinglass algorithm originally proposed by Reichardt and Bornholdt.
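For example, a sketch of a run with both aspects active (two layers, equal functional and structural weighting):

# Sketch: hierarchical (max_layers = 2) and multimodal (alpha = 1) run;
# comm_layers_tree is expected to gain one assignment column per layer
hms_two_layer <- hms(input_net = SBM_net, spins = 2, alpha = 1,
                     coolfact = 0.99, tol = 0.01, max_layers = 2)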

3.3 Plotting Results

Once the algorithm has finished running, you can visualize the results in a plot using community_plot. Using the output from the hms run of the SBM_net dataset, we can produce the following plot:

community_plot(hms_object)

3.4 Additional Visualization and Quantification

If the HMS algorithm is used on network data for which ground truth is unknown (like in the case of brain connectivity), the CommKern package has provided several functions to understand the robustness and stability of the partitioning of nodes to communities across multiple runs.

The first is the community_allegiance function. For node \(i\), the stability of its allegiance to community \(A\) is calculated as the number of runs in which node \(i\) belongs to community \(A\), divided by the total number of runs. This measure is bounded in \(\left[0,1\right]\), where higher values indicate that a node belongs to the same community across a greater number of runs; the function operates on a single layer of the hierarchy at a time. Higher proportions of nodes with high community allegiance indicate a robust community detection algorithm.

Below is a simple example of how community allegiance is calculated and visualized.

set.seed(7183)

x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
z <- sample(x = rep(1:3, 4), 12)

xyz_comms <- data.frame(id = seq_along(x), x_comm = x, y_comm = y, z_comm = z)
xyz_alleg <- community_allegiance(xyz_comms)
xyz_melt <- reshape2::melt(xyz_alleg)

print(xyz_comms)
##    id x_comm y_comm z_comm
## 1   1      3      3      2
## 2   2      3      3      2
## 3   3      2      2      3
## 4   4      3      2      2
## 5   5      1      2      3
## 6   6      2      1      1
## 7   7      3      1      3
## 8   8      2      3      3
## 9   9      1      3      1
## 10 10      2      1      2
## 11 11      1      2      1
## 12 12      1      1      1

ggplot2::ggplot(data = xyz_melt) +
  ggplot2::theme_minimal() +
  ggplot2::aes(x = as.factor(Var1), y = as.factor(Var2), fill = value) +
  ggplot2::geom_tile() +
  ggplot2::xlab('Node') + ggplot2::ylab('Node') +
  ggplot2::ggtitle('Community Allegiance Example') +
  ggplot2::scale_fill_gradient2(
    low  = 'navy',
    high = 'goldenrod1',
    mid  = 'darkturquoise',
    midpoint = 0.5,
    limit = c(0, 1),
    space = 'Lab',
    name='')

The average community allegiance value within this simple example is 0.3333333, meaning that, on average, two nodes appear in the same community 33% of the time.

If multiple runs of the HMS algorithm have been performed on the same dataset and a single, representative partition is desired, the consensus_similarity function can be used to identify the partition from the set that is most similar to all others. Similarity is taken to be the z-score of the adjusted Rand coefficient. This definition of consensus similarity was first proposed by Doron et al. in a 2012 PNAS publication. Pairwise z-scores of the adjusted Rand coefficient are calculated for all partitions, then the average pairwise similarity is calculated for each partition. The partition with the highest average pairwise similarity is chosen as the consensus partition.

Below is a simple example of how the consensus_similarity function works.

set.seed(7183)

x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
z <- sample(x = rep(1:3, 4), 12)

xyz_comms_mat <- matrix(c(x,y,z),nrow=length(x),ncol=3)
consensus_similarity(xyz_comms_mat)
##  [1] 3 3 2 3 1 2 3 2 1 2 1 1

4 Simulating Group-Level Networks with Community Structure

End users can simulate group-level networks with community structure using two main functions within the CommKern package: group_network_perturb and group_adj_perturb.

str(group_network_perturb)
## function (n_nodes, n_comm, n_nets, perturb_prop, wcr, bcr, bfcr = NA, fuzzy_comms = NA)

str(group_adj_perturb)
## function (group_network_list, n_nets, n_nodes)

First, group_network_perturb creates a list of simulated networks, each stored as a dataframe that records the community assignment of each node and simulates edge weights according to whether a node dyad is (a) within the same community; (b) between different communities; or (c) between different communities that are designated as “fuzzy” in their distinction from one another. As shown in the signature above, its inputs are the number of nodes (n_nodes), communities (n_comm), and networks (n_nets) to simulate; the proportion of nodes whose community assignments are perturbed across networks (perturb_prop); the parameters governing the within-community (wcr), between-community (bcr), and between-fuzzy-community (bfcr) edge weight distributions; and the pair of communities designated as fuzzy (fuzzy_comms).

Once the group_network_perturb function is run, its output is used as one of the inputs to the group_adj_perturb function, which produces an array of adjacency matrices. These adjacency matrices can then be used as input to any community detection algorithm (such as the HMS algorithm).

Below is an example of how to simulate a group of networks with similar community structure, with a visualization of one resulting simulated adjacency matrix.

sim_nofuzzy <-
   group_network_perturb(
     n_nodes = 50,
     n_comm = 4,
     n_nets = 3,
     perturb_prop = 0.1,
     wcr = c(8, 8),
     bcr = c(1.5, 8)
   )

nofuzzy_adj <-
  group_adj_perturb(sim_nofuzzy, n_nets = 3, n_nodes = 50)

if (require(pheatmap)) {
 pheatmap::pheatmap(
   nofuzzy_adj[1,,],
   treeheight_row = FALSE,
   treeheight_col = FALSE
 )
}
## Loading required package: pheatmap

Similarly, the following example simulates group-level networks with fuzzy community structure.

sim_fuzzy <-
   group_network_perturb(
     n_nodes = 50,
     n_comm = 4,
     n_nets = 3,
     perturb_prop = 0.1,
     wcr = c(8, 8),
     bcr = c(1.5, 8),
     bfcr = c(3.5, 8),
     fuzzy_comms = c('comm_b', 'comm_c')
   )

fuzzy_adj <-
  group_adj_perturb(sim_fuzzy, n_nets = 3, n_nodes = 50)

if (require(pheatmap)) {
 pheatmap::pheatmap(
   fuzzy_adj[1,,],
   treeheight_row = FALSE,
   treeheight_col = FALSE
 )
}

5 Cluster Evaluation Metrics

Within community detection, the methods for evaluating the performance of a particular algorithm can be classified as either (1) extrinsic, requiring ground truth labels, or (2) intrinsic, not requiring ground truth labels.

5.1 Extrinsic Metrics

All extrinsic clustering measures rely on an \(r \times k\) contingency matrix \(\mathbf{N}\) that is induced by a system-clustering \(\mathcal{C}\) and a ground truth partitioning \(\mathcal{T}\), where the rows represent the system-clusters \(\mathcal{C}_i \in \mathcal{C}\) for \(i=1,\dots,r\) and the columns represent the ground truth partitions \(\mathcal{T}_j \in \mathcal{T}\) for \(j=1,\dots,k\). A cell at the intersection of row \(i\) and column \(j\) contains the count \(n_{ij}\) of points that are common to cluster \(\mathcal{C}_i\) and ground truth partition \(\mathcal{T}_j\):

\[\begin{equation} N\left(i,j\right)=n_{ij}=\vert\mathcal{C}_i \cap \mathcal{T}_j \vert \end{equation}\] However, in many cases, such as brain connectivity, a ground truth labeling does not exist. In this case, a two-way matching can be performed: the matching is computed twice, each time treating one of the clusterings as the ground truth, and the two results are combined using a harmonic mean to produce a final matching measure.
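As a sketch of this two-way idea, an asymmetric measure such as purity (introduced below) can be evaluated in both directions and the results combined with a harmonic mean:

# Two-way matching sketch: purity is asymmetric, so evaluate it in both
# directions and combine the two values with a harmonic mean
set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
p_xy <- purity(x, y)
p_yx <- purity(y, x)
2 * p_xy * p_yx / (p_xy + p_yx)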

There are several options for extrinsic cluster evaluation metrics; among them, CommKern implements purity, normalized mutual information (NMI), and the adjusted Rand index (adj_RI), demonstrated below. For all, assume \(\Omega=\left\{\omega_1,\omega_2,\dots,\omega_k\right\}\) and \(\mathbb{C}=\left\{c_1,c_2,\dots,c_j\right\}\) are two sets of clusters.

set.seed(7183)

x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)

purity(x,y)
## [1] 0.5
set.seed(7183)

x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)

NMI(x,y)
## [1] 0.05360537
set.seed(7183)

x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)

adj_RI(x,y)
## [1] -0.1458333

To create a matrix of distances, the function ext_distance can be used. This function uses the community output values from any community detection algorithm, such as the HMS algorithm, as an input. Because extrinsic cluster evaluation metrics use the underlying idea of similarity, distance is calculated as \(\left(1-\text{similarity}\right)\). The use of distance ensures that the distance matrix will be positive semi-definite, a requirement for the semiparametric kernel methods detailed in the next section.

x <- c(2,2,3,1,3,1,3,3,2,2,1,1)
y <- c(3,3,2,1,1,1,1,2,2,3,2,3)
z <- c(1,1,2,3,2,3,2,1,1,2,3,3)

xyz_comms <- data.frame(x_comm = x, y_comm = y, z_comm = z)

ext_distance(xyz_comms, variant = 'NMI')
##              1         2         3
## [1,] 0.0000000 0.6963946 0.3412397
## [2,] 0.6963946 0.0000000 0.8412397
## [3,] 0.3412397 0.8412397 0.0000000
ext_distance(xyz_comms, variant = 'adj_RI')
##              1         2         3
## [1,] 0.0000000 0.9166667 0.4583333
## [2,] 0.9166667 0.0000000 1.0694444
## [3,] 0.4583333 1.0694444 0.0000000
ext_distance(xyz_comms, variant = 'purity')
##              1         2         3
## [1,] 0.0000000 0.4166667 0.1666667
## [2,] 0.4166667 0.0000000 0.5000000
## [3,] 0.1666667 0.5000000 0.0000000

5.2 Intrinsic Metrics

Rather than use extrinsic metrics, which rely either on a ground truth labeling or on a harmonic mean between two clusterings, intrinsic evaluation metrics do not require a ground truth. One of the most commonly used intrinsic metrics is the optimized value of the modularity function; this is based on the fact that networks with very similar community structure should have very similar modularity values. Within the HMS algorithm, the optimized value of the Hamiltonian can be used in the same manner as the modularity function. Because the optimized Hamiltonian values from the HMS algorithm are not bounded in the same manner as the extrinsic cluster evaluation metrics, a p-norm of the differences between the Hamiltonians of two clusterings is used, such that

\[\begin{equation} d\left(\mathcal{H}_{\Omega},\mathcal{H}_{\mathbb{C}}\right)=\Vert \mathcal{H}_{\Omega} - \mathcal{H}_{\mathbb{C}} \Vert_{p} \end{equation}\] for \(1 \leq p < \infty.\)

To create a matrix of distances, the function ham_distance can be used. This function uses the Hamiltonian output values from a community detection algorithm that implements a Hamiltonian value, such as the HMS algorithm. To ensure a positive semi-definite matrix (as required for the kernel machine methods), the absolute difference between Hamiltonian values is calculated.

hamil_df <- data.frame(id  = seq(1:8),
                       ham = c(-160.5375, -167.8426, -121.7128,
                               -155.7245, -113.9834, -112.5262,
                               -117.9724, -171.374))

ham_distance(hamil_df)
##            1       2       3       4       5       6       7       8
## [1,]  0.0000  7.3051 38.8247  4.8130 46.5541 48.0113 42.5651 10.8365
## [2,]  7.3051  0.0000 46.1298 12.1181 53.8592 55.3164 49.8702  3.5314
## [3,] 38.8247 46.1298  0.0000 34.0117  7.7294  9.1866  3.7404 49.6612
## [4,]  4.8130 12.1181 34.0117  0.0000 41.7411 43.1983 37.7521 15.6495
## [5,] 46.5541 53.8592  7.7294 41.7411  0.0000  1.4572  3.9890 57.3906
## [6,] 48.0113 55.3164  9.1866 43.1983  1.4572  0.0000  5.4462 58.8478
## [7,] 42.5651 49.8702  3.7404 37.7521  3.9890  5.4462  0.0000 53.4016
## [8,] 10.8365  3.5314 49.6612 15.6495 57.3906 58.8478 53.4016  0.0000
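As a quick sanity check, any off-diagonal entry can be reproduced by hand from the Hamiltonian values above:

# Entry [1,2] of the distance matrix is the absolute difference between
# subjects 1 and 2's Hamiltonian values
abs(hamil_df$ham[1] - hamil_df$ham[2])
## [1] 7.3051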

6 Kernel Machine Methods

6.1 Distance-Based Kernels

A kernel is a function that takes as its inputs vectors in some original space and returns the dot product of those vectors in a feature space. More formally, if we have \(\mathbf{x},\mathbf{z} \in \mathcal{X}\) and a map \(\phi: \mathcal{X} \rightarrow \mathbb{R}^N\), then \(k\left(\mathbf{x},\mathbf{z}\right)=\langle\phi\left(\mathbf{x}\right),\phi\left(\mathbf{z}\right)\rangle\) is a kernel function. An important concept relating to kernel methods is the reproducing kernel Hilbert space (RKHS). Because all kernels are positive definite, there exist one or more feature spaces for which the kernel defines the inner product, without the feature space having to be defined explicitly. By the Moore-Aronszajn theorem, for each kernel \(k\) on a set \(\mathcal{X}\) there is a unique Hilbert space of functions on \(\mathcal{X}\) for which \(k\) is a reproducing kernel.

Letting \(Z\) be a multidimensional array of variables and \(i,j\) be two subjects, \(k\left(Z_i,Z_j\right)\) can be used as a measure of similarity between the pair of subjects \(i,j\) since \(k\left(\cdot,\cdot\right)\) is positive definite. This similarity measure can then be incorporated into a statistical inference framework to test to what extent variation in \(Z\) between individuals can explain variation in some outcome of interest \(Y\). A range of kernel functions are used in statistics, where the choice of kernel determines the function space used to approximate the relationship between two variables. A distance-based kernel is denoted as

\[\begin{equation} K_d\left(x_1,x_2\right)=\exp\left\{\dfrac{-d^2(x_1,x_2)}{\rho}\right\}, \end{equation}\] where \(d\left(x_1,x_2\right)\) is a distance function and \(\rho\) is an unknown bandwidth or scaling parameter.
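As a sketch, a kernel matrix of this form can be built elementwise from any of the distance matrices above. The median heuristic used for \(\rho\) here is an illustrative assumption; in the package's score tests, \(\rho\) is instead handled via a grid search, as described in the next section:

# Sketch: build a distance-based kernel from a distance matrix
D <- ham_distance(hamil_df)        # distance matrix from the example above
rho <- median(D[upper.tri(D)]^2)   # illustrative bandwidth (median heuristic)
K <- exp(-D^2 / rho)               # K[i,j] = exp(-d^2(i,j) / rho)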

While this package provides several options for cluster evaluation, many more exist in the literature. End users can utilize their own custom metrics, so long as the resulting distance matrix is symmetric and positive semi-definite, as these are the requirements for incorporation into the kernel machine methods detailed in the next section.
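Continuing the sketch above, a custom distance matrix can be screened by checking symmetry and inspecting the eigenvalues of the induced kernel:

# Screen a custom metric: the distance matrix should be symmetric, and
# the induced kernel positive semi-definite (up to numerical tolerance)
stopifnot(isSymmetric(unname(D)))
eig_vals <- eigen(K, symmetric = TRUE, only.values = TRUE)$values
min(eig_vals) >= -1e-8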

6.2 Semiparametric Kernel Machine Methods

Suppose a dataset consists of \(n\) subjects, where for subject \(i=1,\dots,n\), \(y_i\) is an outcome variable (either binary or continuous), \(\mathbf{x}_i\) is a \(q \times 1\) vector of clinical covariates, and \(\mathbf{z}_i\) is a \(p \times 1\) vector of cluster evaluation metrics. The outcome \(y_i\) depends on \(\mathbf{x}_i\) and \(\mathbf{z}_i\) through the following semiparametric linear model

\[\begin{equation} y_i = \mathbf{x}_{i}^{T}\mathbf{\beta} + h\left(\mathbf{z}_i\right) + e_i, \end{equation}\] where \(\mathbf{\beta}\) is a \(q \times 1\) vector of regression coefficients, \(h\left(\mathbf{z}_i\right)\) is an unknown, centered and smooth function, and \(e_i\) is an independent and identically distributed error term following \(N\left(0,\sigma^2\right)\).

This model allows covariate effects to be modeled parametrically and the brain connectivity metric to be modeled either parametrically or nonparametrically. There are several special cases of this model: when \(h\left(\cdot\right)=0\), the model reduces to a standard linear regression model, and when \(\mathbf{x}_i=1\), the model reduces to a least squares kernel machine regression. A hypothesis test can be conducted to determine whether the multidimensional variable set \(\mathbf{z}_i\) is associated with \(y_i\), controlling for \(\mathbf{x}_i\), of the form

\[\begin{equation} H_0: h\left( \cdot \right) = 0\\ H_A: h\left( \cdot \right) \neq 0\\ \end{equation}\]

Details of the derivation of the score test are beyond the scope of this vignette; interested readers should refer to Jensen et al. (2019); Liu, Lin, and Ghosh (2007); and Liu, Ghosh, and Lin (2008).

6.3 Examples Using Available Datasets

Using the datasets provided in the CommKern package, we can implement the various kernel machine types. In all cases, the parameter grid_gran is preset to 5000; it controls the length of the grid search over the nuisance \(\rho\) parameter.

If the outcome is binary and no other covariates are of interest, then use the score_log_nonparam function:

str(score_log_nonparam)
## function (outcome, dist_mat, grid_gran = 5000)

Using the simasd_covars demographics dataset, with dx_group as the outcome of interest, we use simasd_hamil_df as the basis for our distance matrix.

simasd_ham_mat <- ham_distance(simasd_hamil_df)

score_log_nonparam(outcome=simasd_covars$dx_group,
                   dist_mat=simasd_ham_mat)
## [1] 0.5703748

If we wish to incorporate additional covariates of interest, like age, sex, or handedness, this can easily be done via the score_log_semiparam function.

str(score_log_semiparam)
## function (outcome, covars, dist_mat, grid_gran = 5000)

For our purposes, we will still use the Hamiltonian as the basis for our distance matrix. Because handedness is a categorical variable, we convert it to a factor before passing the covariates dataframe to the function.

simasd_ham_mat <- ham_distance(simasd_hamil_df)
simasd_confound <- simasd_covars[,3:5]
simasd_confound$handedness <- as.factor(simasd_confound$handedness)

score_log_semiparam(outcome=simasd_covars$dx_group,
                    covars=simasd_confound,
                    dist_mat=simasd_ham_mat)
## [1] 0.5368395

In a similar manner, if the outcome of interest is continuous and no other covariates are of interest, then use the score_cont_nonparam function:

str(score_cont_nonparam)
## function (outcome, dist_mat, grid_gran = 5000)

Still using the simasd_covars demographics dataset, we now use verbal_IQ as the outcome of interest and simasd_comm_df as the basis for our distance matrix, with NMI as the extrinsic cluster evaluation metric.

simasd_NMI_mat <- ext_distance(comm_df=simasd_comm_df,
                               variant=c("NMI"))

score_cont_nonparam(outcome=simasd_covars$verbal_IQ,
                    dist_mat=simasd_NMI_mat)
## [1] 0.5619891

If we wish to incorporate additional covariates of interest, like age, sex, or handedness, this can easily be done via the score_cont_semiparam function.

str(score_cont_semiparam)
## function (outcome, covars, dist_mat, grid_gran = 5000)

For our purposes, we will now use purity as the basis for our distance matrix. Because handedness is a categorical variable, we convert it to a factor before passing the covariates dataframe to the function.

simasd_pur_mat <- ext_distance(comm_df=simasd_comm_df,
                               variant=c("purity"))
simasd_confound <- simasd_covars[,3:5]
simasd_confound$handedness <- as.factor(simasd_confound$handedness)

score_cont_semiparam(outcome=simasd_covars$verbal_IQ,
                     covars=simasd_confound,
                     dist_mat=simasd_pur_mat)
## [1] 0.3368465

References

Jensen, Alexandria M, Jason R Tregellas, Brianne Sutton, Fuyong Xing, and Debashis Ghosh. 2019. “Kernel Machine Tests of Association Between Brain Networks and Phenotypes.” PLOS ONE 14 (3): e0199340. https://doi.org/10.1371/journal.pone.0199340.
Liu, Dawei, Debashis Ghosh, and Xihong Lin. 2008. “Estimation and Testing for the Effect of a Genetic Pathway on a Disease Outcome Using Logistic Kernel Machine Regression via Logistic Mixed Models.” BMC Bioinformatics 9 (1): 1–11. https://doi.org/10.1186/1471-2105-9-292.
Liu, Dawei, Xihong Lin, and Debashis Ghosh. 2007. “Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models.” Biometrics 63 (4): 1079–88. https://doi.org/10.1111/j.1541-0420.2007.00799.x.