The goal of pivmet is to propose some pivotal methods in order to:

- undo the label switching problem which naturally arises during MCMC sampling in Bayesian mixture models (*pivotal relabelling*; Egidi et al. 2018a);
- initialize the K-means algorithm so as to obtain a good clustering solution (*pivotal seeding*; Egidi et al. 2018b).
You can install pivmet from GitHub with:

```r
# install.packages("devtools")
devtools::install_github("leoegidi/pivmet")
```
First of all, we load the package and import the fish dataset from the bayesmix package:

```r
library(pivmet)
#> Loading required package: bayesmix
#> Loading required package: rjags
#> Loading required package: coda
#> Linked to JAGS 4.3.0
#> Loaded modules: basemod,bugs
#> Loading required package: mvtnorm
#> Loading required package: RcmdrMisc
#> Loading required package: car
#> Loading required package: carData
#> Loading required package: sandwich
#> Warning: replacing previous import 'rstan::plot' by 'graphics::plot' when
#> loading 'pivmet'
data(fish)
y   <- fish[,1]
N   <- length(y)  # sample size
k   <- 5          # fixed number of clusters
nMC <- 12000      # number of MCMC iterations
```
Then we fit a Bayesian Gaussian mixture model with k components using the piv_MCMC function:

```r
res <- piv_MCMC(y = y, k = k, nMC = nMC)
#> Compiling model graph
#>    Declaring variables
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 256
#>    Unobserved stochastic nodes: 268
#>    Total graph size: 1050
#>
#> Initializing model
```
Finally, we can apply pivotal relabelling and inspect the new posterior estimates with the functions piv_rel and piv_plot, respectively:

```r
rel <- piv_rel(mcmc = res)
piv_plot(y = y, mcmc = res, rel_est = rel, type = "chains")
#> Description: traceplots of the raw MCMC chains and the relabelled chains for all the model parameters: means, sds and weights. Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sampler is not able to distinguish between the components.
piv_plot(y = y, mcmc = res, rel_est = rel, type = "hist")
#> Description: histograms of the data along with the estimated posterior means (red points) from raw MCMC and relabelling algorithm. The blue line is the estimated density curve.
```
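Beyond the plots, the relabelled draws can also be summarized numerically. This sketch assumes the returned list exposes the relabelled chains as `rel_mean` and `rel_weight`; the actual field names may differ across package versions, so check `str(rel)` and `?piv_rel`:

```r
# hypothetical field names; inspect str(rel) / ?piv_rel for the actual structure
colMeans(rel$rel_mean)    # posterior means of the k component means
colMeans(rel$rel_weight)  # posterior means of the k mixture weights
```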
Sometimes the K-means algorithm does not provide an optimal clustering solution. Suppose we generate some clustered data and want to detect one pivotal unit for each group with the MUS (Maxima Units Search) algorithm:
```r
# generate some data
set.seed(123)
n       <- 620
centers <- 3
n1 <- 20
n2 <- 100
n3 <- 500
x  <- matrix(NA, n, 2)
truegroup <- c(rep(1, n1), rep(2, n2), rep(3, n3))

for (i in 1:n1){
  x[i,] <- rmvnorm(1, c(1,5), sigma = diag(2))}
for (i in 1:n2){
  x[n1+i,] <- rmvnorm(1, c(4,0), sigma = diag(2))}
for (i in 1:n3){
  x[n1+n2+i,] <- rmvnorm(1, c(6,6), sigma = diag(2))}

H <- 1000
a <- matrix(NA, H, n)

for (h in 1:H){
  a[h,] <- kmeans(x, centers)$cluster
}

# build the similarity matrix
sim_matr <- matrix(1, n, n)
for (i in 1:(n-1)){
  for (j in (i+1):n){
    sim_matr[i,j] <- sum(a[,i] == a[,j])/H
    sim_matr[j,i] <- sim_matr[i,j]
  }
}
```
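As a side note, the double loop above can also be written in vectorized base R, which is noticeably faster for large n. This is an equivalent sketch, not part of pivmet:

```r
# Equivalent vectorized construction of the similarity matrix:
# accumulate, over the H runs, the pairwise indicator of two units
# falling in the same cluster, then average over the runs.
sim_matr_v <- matrix(0, n, n)
for (h in 1:H){
  sim_matr_v <- sim_matr_v + outer(a[h,], a[h,], "==")
}
sim_matr_v <- sim_matr_v / H
```

The diagonal ends up equal to 1, matching the loop version, since every unit always falls in the same cluster as itself.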
```r
cl      <- KMeans(x, centers)$cluster
mus_alg <- MUS(C = sim_matr, clusters = cl, prec_par = 5)
```
Quite often, classical K-means fails to recognize the true groups:

```r
# launch classical KMeans
kmeans_res <- KMeans(x, centers)

# plots
par(mfrow = c(1,2))
colors_cluster <- c("grey", "darkolivegreen3", "coral")
colors_centers <- c("black", "darkgreen", "firebrick")

graphics::plot(x, col = colors_cluster[truegroup],
               bg = colors_cluster[truegroup], pch = 21,
               xlab = "y[,1]", ylab = "y[,2]", cex.lab = 1.5,
               main = "True data", cex.main = 1.5)
graphics::plot(x, col = colors_cluster[kmeans_res$cluster],
               bg = colors_cluster[kmeans_res$cluster], pch = 21,
               xlab = "y[,1]", ylab = "y[,2]", cex.lab = 1.5,
               main = "K-means", cex.main = 1.5)
points(kmeans_res$centers, col = colors_centers[1:centers],
       pch = 8, cex = 2)
```
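The mismatch can also be quantified numerically by cross-tabulating the true and the estimated labels (base R, using the objects defined above):

```r
# confusion table: rows = true groups, columns = K-means labels
table(truegroup, kmeans_res$cluster)
```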
In such situations, we may need a more robust version of classical K-means. The pivots may be used as initial seeds for the algorithm. The function piv_KMeans works like the classical kmeans function, with some additional optional arguments (in the figure below, the colored triangles represent the pivots).
```r
# launch piv_KMeans
piv_res <- piv_KMeans(x, centers)

# plots
par(mfrow = c(1,2), pty = "s")
colors_cluster <- c("grey", "darkolivegreen3", "coral")
colors_centers <- c("black", "darkgreen", "firebrick")

graphics::plot(x, col = colors_cluster[truegroup],
               bg = colors_cluster[truegroup], pch = 21,
               xlab = "x[,1]", ylab = "x[,2]", cex.lab = 1.5,
               main = "True data", cex.main = 1.5)
graphics::plot(x, col = colors_cluster[piv_res$cluster],
               bg = colors_cluster[piv_res$cluster], pch = 21,
               xlab = "x[,1]", ylab = "x[,2]", cex.lab = 1.5,
               main = "piv_KMeans", cex.main = 1.5)
points(x[piv_res$pivots[1], 1], x[piv_res$pivots[1], 2],
       pch = 24, col = colors_centers[1], bg = colors_centers[1], cex = 1.5)
points(x[piv_res$pivots[2], 1], x[piv_res$pivots[2], 2],
       pch = 24, col = colors_centers[2], bg = colors_centers[2], cex = 1.5)
points(x[piv_res$pivots[3], 1], x[piv_res$pivots[3], 2],
       pch = 24, col = colors_centers[3], bg = colors_centers[3], cex = 1.5)
points(piv_res$centers, col = colors_centers[1:centers],
       pch = 8, cex = 2)
```
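As before, cross-tabulating the true and the estimated labels (base R, using objects already defined) gives a quick numerical check of the improvement:

```r
# confusion table: rows = true groups, columns = piv_KMeans labels
table(truegroup, piv_res$cluster)
```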
Egidi, L., Pappadà, R., Pauli, F., and Torelli, N. (2018a). Relabelling in Bayesian mixture models by pivotal units. Statistics and Computing, 28(4), 957-969.

Egidi, L., Pappadà, R., Pauli, F., and Torelli, N. (2018b). K-means seeding via MUS algorithm. Conference paper, Book of Short Papers, SIS2018, ISBN: 9788891910233.