Discriminability Statistic

Eric Bridgeford

2018-04-13

In this notebook, we discuss the basics of the discriminability statistic in both a simulated and a real data context, and demonstrate some plots that users may find useful for visualizing the results.

library(mgc)
library(reshape2)
library(ggplot2)

plot_mtx <- function(Dx, main.title="Distance Matrix", xlab.title="Sample Sorted by Source", ylab.title="Sample Sorted by Source") {
  data <- melt(Dx)
  ggplot(data, aes(x=Var1, y=Var2, fill=value)) +
    geom_tile() +
    scale_fill_gradientn(name="dist(x, y)",
                         colours=c("#f2f0f7", "#cbc9e2", "#9e9ac8", "#6a51a3"),
                         limits=c(min(Dx), max(Dx))) +
    xlab(xlab.title) +
    ylab(ylab.title) +
    theme_bw() +
    ggtitle(main.title)
}

Simulated Data

Here, we assume that we have 5 independent sources of a measurement, and take 10 measurements from each source. Measurements from source i are sampled from N(i, I_d) where d=20; that is, each of the 20 dimensions has mean i and unit variance.

nsrc <- 5
nobs <- 10
d <- 20
set.seed(12345)
src_id <- array(1:nsrc)
labs <- sample(rep(src_id, nobs))
dat <- t(sapply(labs, function(lab) rnorm(d, mean=lab, sd=1)))
discr.stat(dat, labs)  # expect high discriminability since measurements taken at a source have the same mean and sd of only 1
## [1] 0.9224444
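
To build intuition for what this number measures, the following is a minimal sketch of a naive discriminability estimate in base R. It assumes the usual informal description of the statistic: the fraction of cases in which a within-source distance is smaller than a between-source distance. The helper `discr_sketch` is hypothetical and is not part of `mgc`; its handling of ties (counted as one half) is an assumption, so its value need not match `discr.stat` exactly.

```r
# Hypothetical helper (not part of mgc): naive discriminability estimate.
# D is a distance matrix (or dist object); labels gives the source of each row.
discr_sketch <- function(D, labels) {
  D <- as.matrix(D)
  scores <- c()
  for (i in seq_len(nrow(D))) {
    same <- setdiff(which(labels == labels[i]), i)  # other samples from the same source
    other <- which(labels != labels[i])             # samples from different sources
    for (j in same) {
      # fraction of between-source distances from i that exceed the
      # within-source distance d(i, j); ties count as one half (an assumption)
      bigger <- D[i, other] > D[i, j]
      tied <- D[i, other] == D[i, j]
      scores <- c(scores, mean(bigger + 0.5 * tied))
    }
  }
  mean(scores)
}

# Toy usage: two tightly clustered, well separated sources
toy <- matrix(c(0, 0.1, 10, 10.1), ncol = 1)
discr_sketch(dist(toy), c(1, 1, 2, 2))  # perfectly separated sources give 1
```

A value near 1 means same-source measurements are almost always closer to each other than to measurements from other sources, while a value near 0.5 means source membership carries essentially no information about distance.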

We may find it useful to view the distance matrix, ordered by source label, to show that objects from the same source are closer to one another than objects from different sources:

# sort rows by source label, then compute pairwise Euclidean distances
Dx <- as.matrix(dist(dat[sort(labs, index.return=TRUE)$ix,], method='euclidean'))
plot_mtx(Dx)
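
As a numeric complement to the plot, one might compare the average within-source distance against the average between-source distance. The sketch below regenerates the simulated data with the same seed so that it is self-contained; the variable names `same_source`, `within_mean`, and `between_mean` are illustrative choices, not part of `mgc`.

```r
# Regenerate the simulated data from above so this block stands alone
set.seed(12345)
nsrc <- 5; nobs <- 10; d <- 20
labs <- sample(rep(1:nsrc, nobs))
dat <- t(sapply(labs, function(lab) rnorm(d, mean = lab, sd = 1)))

D <- as.matrix(dist(dat))                # pairwise Euclidean distances
same_source <- outer(labs, labs, "==")   # TRUE where both samples share a source
off_diag <- row(D) != col(D)             # exclude zero self-distances

within_mean <- mean(D[same_source & off_diag])
between_mean <- mean(D[!same_source])
c(within = within_mean, between = between_mean)
```

If the sources are well separated, the within-source average should be noticeably smaller than the between-source average, which is the block structure visible along the diagonal of the heatmap.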

The ordering of the samples does not affect the statistic, and users can pass in the data as either an \([n, d]\) array or an \([n, n]\) distance matrix. Calling the function on the sorted distance matrix returns the same value as before:

discr.stat(Dx, sort(labs))
## [1] 0.9224444

Real Data

Below, we show how discriminability might be used on real data by applying it to the first \(4\) columns of the iris dataset (sepal length/width and petal length/width). The goal is to examine how well the flower species is reflected in the distances between samples computed from these four features:

Dx <- as.matrix(dist(iris[sort(as.vector(iris$Species), index=TRUE)$ix,c(1,2,3,4)]))

plot_mtx(Dx)

We expect high discriminability, since samples of the same species are clearly close to one another (the within-species distances are low):

discr.stat(iris[,c(1,2,3,4)], as.vector(iris$Species))
## [1] 0.9320476

This is reflected in the high discriminability score.
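
The block structure of the iris distance matrix can also be summarised numerically. The following sketch, using only base R, computes the mean distance for every pair of species; the name `block_means` is an illustrative choice and is not part of `mgc`.

```r
X <- as.matrix(iris[, 1:4])     # the four numeric features
species <- iris$Species
D <- as.matrix(dist(X))         # pairwise Euclidean distances

# Mean distance for each pair of species. Diagonal entries include the
# zero self-distances, which lowers them slightly.
block_means <- sapply(levels(species), function(a)
  sapply(levels(species), function(b) mean(D[species == b, species == a])))
round(block_means, 2)
```

Each diagonal entry (same species) should be the smallest value in its row, matching the dark diagonal blocks in the heatmap.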