GFM: A Simple Transcriptomics Data

Wei Liu

2023-08-10

Load real data

First, we load the ‘GFM’ package and the real data which can be downloaded here. This data is in the format of ‘.Rdata’ that inludes a gene expression matrix ‘X’ with 3460 rows (cells) and 2000 columns (genes), a vector ‘group’ specifying two groups of variable types (‘type’ variable) including ‘gaussian’ and ‘poisson’ and a vector ‘y’ meaning the clusters of cells annotated by experts. We compare the performance of ‘GFM’ and ‘LFM’ in downstream clustering analysis based on the benchchmarked clusters ‘y’.

githubURL <- "https://github.com/feiyoung/GFM/blob/main/vignettes_data/Brain76.Rdata?raw=true"
download.file(githubURL,"Brain76.Rdata",mode='wb')

Then load to R

load("Brain76.Rdata")
XList <- list(X[,group==1], X[,group==2])
types <- type
str(XList)

library("GFM")
#load("vignettes_data\\Brain76.Rdata")
#ls() # check the variables
set.seed(2023) # set a random seed for reproducibility.

Fit GFM model

We fit the GFM model using ‘gfm’ function.

q <- 15
system.time(
  gfm1 <- gfm(XList, types, q= q, verbose = TRUE)
)

Compare with LFM in downstream analysis

We conduct the clustering analysis based on the extracted factors by GFM and evaluate the adjusted rand index (ARI) value based on the annotated cluster labels by experts.

hH <- gfm1$hH
library(mclust)
set.seed(1)
gmm1 <- Mclust(hH, G=7)
ARI_gfm <- adjustedRandIndex(gmm1$classification, y)

We fit linear factor model using same number of factors.

fac <- Factorm(X, q=15)
hH_lfm <- fac$hH
set.seed(1)
gmm2 <- Mclust(hH_lfm, G=7)
ARI_lfm <- adjustedRandIndex(gmm2$classification, y)

Compare with the ARIs by visualization.

library(ggplot2)
df1 <- data.frame(ARI= c(ARI_gfm,ARI_lfm),
                    Method =factor(c('GFM', "LFM")))
ggplot(data=df1, aes(x=Method, y=ARI, fill=Method)) + geom_bar(position = "dodge", stat="identity",width = 0.5)