Fast Naive Bayes

Martin Skogholt

2020-05-04

Introduction

This is an extremely fast implementation of a Naive Bayes classifier. This package is currently the only R package that supports a Bernoulli distribution, a Multinomial distribution, and a Gaussian distribution, making it suitable for binary features, frequency counts, and numerical features alike. Another feature is the support of a mix of different event models. Only numerical variables are allowed; categorical variables can, however, be transformed into dummy variables and used with the Bernoulli distribution.

This implementation offers a huge performance gain compared to other implementations in R. Execution times were compared on a data set of tweets: this package was found to be around 283 to 34,841 times faster for the Bernoulli event model and 17 to 60 times faster for the Multinomial event model. For the Gaussian event model this package was found to be between 2.8 and 1,679 times faster. The implementation is largely based on the paper “A comparison of event models for Naive Bayes anti-spam e-mail filtering” by K.M. Schneider (2003).

Any issues can be submitted to: https://github.com/mskogholt/fastNaiveBayes/issues.

The purpose of this vignette is to explain some key aspects of this implementation in detail. Firstly, a short introduction to text classification is given as the context for the explanations that follow; it should be noted that Naive Bayes is a general classification algorithm and is not restricted to text classification, although that is its most common application. Secondly, the general framework of a Naive Bayes classifier is outlined before delving deeper into the different event models. Thirdly, a mathematical explanation is given as to why this particular implementation performs so well in terms of speed. In the fourth section the unique features that set this implementation apart from other implementations within the R community are described. Lastly, some code examples are included.

Text Classification

Text classification is the task of classifying documents by their content, that is, by the words of which they are composed. The documents are often represented as a bag of words. This means that only the occurrence or frequency of the words in the document is taken into account; any information about the syntactic structure of these words is discarded (Hu & Liu, 2012). Naive Bayes has been successfully applied in many research efforts regarding document classification (McCallum & Nigam, 1998). Furthermore, text classification will serve as the basis for further elaboration on the inner workings of the Naive Bayes classifier and the different event models.

Naive Bayes

Naive Bayes is a probabilistic classification method based on Bayes’ theorem with a strong and naive independence assumption: Naive Bayes assumes independence between all attributes. Despite this so-called “Naive Bayes assumption”, this technique has been proven to be very effective for text classification (McCallum & Nigam, 1998). In the context of text classification, Naive Bayes estimates the posterior probability that a document, consisting of several words, belongs to a certain class and classifies the document as the class with the highest posterior probability: \[P(C=k|D) = \frac{P(D|C=k)*P(C=k)}{P(D)}\] Where \(P(C=k|D)\) is the posterior probability that the class equals \(k\) given document, \(D\). Bayes’ theorem is applied to rewrite this probability into three components:

  1. \(P(D)\), the prior probability of document, \(D\)
  2. \(P(C=k)\), the prior probability of class, \(k\)
  3. \(P(D|C=k)\), the conditional probability of document, \(D\), given class, \(k\)

To classify a document, \(D\), the class, \(k\), with the highest posterior probability is chosen as the classification. Since \(P(D)\) is the same for all classes, the equation can be simplified by dropping the denominator. The focus is then solely on calculating the numerator, i.e. the last two components.

The prior

The prior probability of class, \(k\), i.e. \(P(C=k)\), is simply the proportion of documents in the training dataset that have class, \(k\). For example, suppose the training dataset consists of 100 emails labeled as either \(Ham\) or \(Spam\), of which 63 are labeled \(Ham\) and 37 are labeled \(Spam\). In this case, \(P(C=Spam)\) is the proportion of emails labeled as \(Spam\), i.e. \(\frac{37}{100}=0.37\). This prior probability estimation is the same regardless of which distribution is used within the Naive Bayes classifier.
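As a small illustration of this calculation (the labels vector below is a hypothetical toy example, not an object from the package), the prior is just a table of class proportions:

labels <- factor(c(rep("Ham", 63), rep("Spam", 37)))  # hypothetical 100 training labels
priors <- table(labels) / length(labels)              # P(C = k) per class
priors
#  Ham Spam
# 0.63 0.37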

Event models

Naive Bayes is a popular classification method; however, within the classification community there is some confusion about this classifier. There are three different generative models in common use: the Multinomial, the Bernoulli, and the Gaussian Naive Bayes. Most of the confusion surrounds the Multinomial and Bernoulli event models. Both are called Naive Bayes by their practitioners and both make use of the Naive Bayes assumption, but they make different assumptions about the distributions of the features that are used. These different assumptions lead to two distinct models, which are very often confused (McCallum & Nigam, 1998).

Bernoulli Distribution

The most commonly used Naive Bayes classifier uses a Bernoulli model. This is applicable for binary features that indicate the presence or absence of a feature (1 and 0, respectively). Each document, \(D\), consists of a set of words, \(w\). Let \(V\) be the vocabulary, i.e. the collection of unique words in the complete dataset. Using the Bernoulli distribution, \(P(D_i|C=k)\) becomes: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))}\] Where \(b_{i,t}=1\) if the document, \(D_i\), contains the word, \(w_t\), and \(0\) otherwise. Furthermore, \(|V|\) is the number of unique words in the dataset and \(P(w_{t}|C=k)\) is the conditional probability of word, \(w_t\), occurring in a document with class, \(k\). This is simply calculated as the proportion of documents of class, \(k\), in which word, \(t\), occurs, compared to the total number of documents of class, \(k\). In other words: \[P(w_{t}|C=k)=\frac{\sum_{i=1}^{N}{x_{i,t}*z_{i,k}}}{\sum_{i=1}^{N}{z_{i,k}}}\] Where \(x_{i,t}\) equals \(1\) if word, \(t\), occurs in document, \(i\), and \(0\) otherwise. Furthermore, \(z_{i,k}\) equals \(1\) if document, \(i\), is labeled as class, \(k\), and \(0\) otherwise.
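As a minimal sketch of this estimate (x, y, z, and probs are illustrative toy objects, not part of the package), the word probabilities can be obtained with a single cross product between a class indicator matrix and the binary document-term matrix:

# Toy binary document-term matrix: 4 documents, 3 words
x <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 0, 1,
              0, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("free", "money", "meeting")))
y <- factor(c("Spam", "Spam", "Ham", "Ham"))

# z[i, k] = 1 if document i has class k, i.e. the z_{i,k} from the formula
z <- stats::model.matrix(~ y - 1)

# P(w_t | C = k): number of documents of class k containing word t,
# divided by the total number of documents of class k
probs <- crossprod(z, x) / colSums(z)
probs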

Multinomial Distribution

The multinomial distribution is used to model features that represent the frequency with which events occurred; in other words, it uses the word counts in the documents instead of the binary representation. This means that the distribution used to calculate \(P(D_i|C=k)\) changes. This now becomes: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{P(w_t|C=k)^{x_{i,t}}}\] Where \(x_{i,t}\) is the frequency of word, \(t\), in document, \(i\). Here: \[P(w_t|C=k)=\frac{\sum_{i=1}^{N}{x_{i,t}*z_{i,k}}}{\sum_{s=1}^{|V|}{\sum_{i=1}^{N}{x_{i,s}z_{i,k}}}}\] Where \(x_{i,t}\) is the frequency of word, \(t\), in document, \(i\), and \(z_{i,k}\) equals \(1\) if document, \(i\), is labeled as class, \(k\), and \(0\) otherwise. Furthermore, \(|V|\) is the length of the vocabulary, i.e. the total number of unique words in the dataset.
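A comparable sketch for the multinomial estimate, again with illustrative toy objects and without Laplace smoothing (discussed further below):

# Toy document-term matrix with word counts: 4 documents, 3 words
x <- matrix(c(2, 0, 1,
              3, 1, 0,
              0, 0, 2,
              0, 1, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("free", "money", "meeting")))
y <- factor(c("Spam", "Spam", "Ham", "Ham"))
z <- stats::model.matrix(~ y - 1)   # class indicator matrix

# Numerator: total count of word t over all documents of class k
word_counts <- crossprod(z, x)

# Denominator: total number of word occurrences in documents of class k
probs <- word_counts / rowSums(word_counts)
rowSums(probs)   # each row sums to 1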

Gaussian Distribution

A Gaussian distribution can also be used to model numerical features. Quite simply, the conditional probabilities are now assumed to follow a normal distribution. In this case, \(P(D_i|C=k)\) becomes: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{P(w_t|C=k)}\] where \[P(w_t|C=k)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] and \(\mu\) and \(\sigma\) are estimated by their sample estimators from the training data.
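In R this estimate boils down to a call to dnorm() with per-class sample means and standard deviations. A minimal sketch with hypothetical data:

# Hypothetical numerical feature and class labels
x <- c(5.1, 4.9, 6.3, 6.7, 5.0, 6.4)
y <- factor(c("A", "A", "B", "B", "A", "B"))

# Estimate mean and standard deviation per class from the training data
mu    <- tapply(x, y, mean)
sigma <- tapply(x, y, sd)

# Density of a new observation under each class
x_new <- 5.2
dnorm(x_new, mean = mu, sd = sigma)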

Mixed Distributions

As was explained, all three event models are part of a general Naive Bayes framework; they merely prescribe different ways to estimate \(P(D_i|C=k)\). Furthermore, all three use the general Naive Bayes approach, which is to assume independence between the features and simply take the product of the individual probabilities, as follows: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{P(w_t|C=k)}\] A big benefit of this independence assumption is that different event models can be mixed, simply by using a different event model for each individual feature.
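To make this concrete, the sketch below combines hypothetical per-class log-likelihood contributions of a Bernoulli part and a Gaussian part for a single document; all values are made up for illustration, and in practice fastNaiveBayes.mixed() takes care of this (see the Examples section below).

# Hypothetical per-class log-likelihood contributions of two feature groups
logp_bern  <- c(Ham = -2.1, Spam = -3.4)   # binary features, Bernoulli model
logp_gauss <- c(Ham = -1.8, Spam = -0.9)   # numerical features, Gaussian model
log_prior  <- log(c(Ham = 0.63, Spam = 0.37))

# The independence assumption lets us simply add the pieces per class
scores <- log_prior + logp_bern + logp_gauss
names(which.max(scores))   # predicted class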

Laplace Smoothing

Another important aspect of Naive Bayes classifiers is so-called Laplace smoothing. Consider again the probability calculation: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))}\] If at any point \(P(w_t|C=k)=0\), then \(P(D_i|C=k)\) will also equal \(0\), since it is a product of the individual probabilities. The same holds for the Multinomial distribution. To overcome this, Laplace smoothing is used, which simply adds a small non-zero count to all the word counts, so that zero probabilities are never encountered. An important distinction must be made here. A common mistake is to assume that smoothing is also applied to features in the test set that were not encountered in the training set. This is not correct. Laplace smoothing is applied so that words that never occur together with a specific class do not yield zero probabilities; features in the test set that were not encountered in the training set are simply left out of the equation. This also makes sense: if a word was never encountered in the training set, then \(P(w_t|C=k)\) should be the same for every class, \(k\).
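For the multinomial estimate, for example, Laplace smoothing amounts to adding the smoothing constant to every word count before normalizing. A minimal sketch with a hypothetical class-by-word count matrix:

# Hypothetical class-by-word count matrix; "money" never occurs with Ham
word_counts <- matrix(c(0, 4, 6,
                        5, 1, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("Ham", "Spam"),
                                      c("money", "meeting", "report")))
laplace <- 1

# Add the smoothing constant to every count so no probability is zero
smoothed <- word_counts + laplace
probs <- smoothed / rowSums(smoothed)
probs["Ham", "money"]   # small, but no longer zero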

Why is it so fast?

As previously explained, when classifying a new document one needs to calculate \(P(C=k|D_i) = \frac{P(D_i|C=k)*P(C=k)}{P(D_i)}\) for each class, \(k\). However, since the class with the highest posterior probability is used as the classification and \(P(D_i)\) is constant across classes, the denominator can be ignored. This means that for prediction only \(P(D_i|C=k)*P(C=k)\) needs to be calculated. As shown above, in the Bernoulli case this probability can be rewritten as: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))}\] By taking the log transformation this becomes: \[log(\prod\limits_{t=1}^{|V|}{b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))}) = \sum_{t=1}^{|V|}{log(b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k)))}\] Furthermore, by rearranging some terms this becomes: \[\sum_{t=1}^{|V|}{b_{i,t}*log(P(w_{t}|C=k))} + \sum_{t=1}^{|V|}{(1-b_{i,t})*log((1-P(w_{t}|C=k)))} \] Now zoom in on the first part and keep in mind that the matrix of observations, \(x\), is a matrix in which each column represents a word, from \(1\) to \(|V|\), containing a \(1\) if the word was observed and \(0\) otherwise; in other words, its entries are exactly the \(b_{i,t}\) values. The log probabilities, \(log(P(w_t|C=k))\), form a vector of length \(|V|\). The two sums can therefore be computed with matrix multiplication: \(x * log(P(w_t|C=k))\) for the first part and \((1-x) * log(1-P(w_t|C=k))\) for the second part. After these two parts have been added up, one can simply raise \(e\) to the power of the outcome to transform it back to the original probabilities. This mathematical trick is what allows the use of matrix multiplication, which in turn is what makes this specific implementation so efficient.
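The sketch below illustrates the trick for the Bernoulli model with small random toy data (none of these objects are the package’s internals): the sums over the vocabulary collapse into two matrix multiplications, after which the log prior is added and the class with the highest score is picked.

set.seed(1)
# Toy binary document-term matrix (5 documents, 4 words) and
# fitted word probabilities P(w_t | C = k) for two classes
x <- matrix(rbinom(20, 1, 0.4), nrow = 5, ncol = 4)
p <- matrix(runif(8, 0.1, 0.9), nrow = 4, ncol = 2,
            dimnames = list(NULL, c("Ham", "Spam")))
prior <- c(Ham = 0.6, Spam = 0.4)

# The sums over the vocabulary become two matrix multiplications
loglik <- x %*% log(p) + (1 - x) %*% log(1 - p)

# Add the log prior and pick the class with the highest score per document
scores <- sweep(loglik, 2, log(prior), "+")
colnames(scores)[max.col(scores)]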

Unique Features

In this section, a brief overview is given of the unique features of this package. This implementation improves upon existing implementations on three points:

  1. Speed of execution: by using the matrix multiplication trick this package is orders of magnitude faster.
  2. Different event models can easily be mixed by using the mixed model.
  3. It is the only R package with Bernoulli, Multinomial, and Gaussian event models implemented.

In order to demonstrate the power of this package, the estimation and prediction execution times of this package were compared to those of other packages. The comparison was made on a dataset of 14,640 tweets, where all tweets were used to train the Naive Bayes classifier and all tweets were used to test it. After processing, a total of 2,214 features, i.e. words, were used. The table below shows the comparison between execution times. The reported figures are in seconds and represent the time to train and predict a single time on the tweets data.

Package                 Bernoulli  Multinomial  Gaussian
fastNaiveBayes              0.263        0.193     0.005
fastNaiveBayes_sparse       0.015        0.012     0.043
bnlearn                     5.976            -         -
e1071                     522.618            -     8.397
klaR                      421.323            -     8.040
naivebayes                  4.247            -     0.349
quanteda                    8.075        0.200         -
Rfast                           -        0.724     0.014

For a relative comparison, the figures are also given in the table below, with the shortest execution time per event model standardized to 1:

Package                 Bernoulli  Multinomial  Gaussian
fastNaiveBayes               17.5         16.8       1.0
fastNaiveBayes_sparse         1.0          1.0       8.6
bnlearn                     398.4            -         -
e1071                    34,841.2            -   1,679.4
klaR                     28,088.2            -   1,608.0
naivebayes                  283.1            -      69.8
quanteda                    538.3         16.7         -
Rfast                           -         60.3       2.8

As can be seen from the results, this package is orders of magnitude faster for all event models. For the Bernoulli event model, the smallest speed-up was relative to the ‘naivebayes’ package, where this package was found to be 283 times faster. The largest speed-up was relative to the ‘klaR’ and ‘e1071’ packages, where this package is around 28,088 and 34,841 times faster, respectively. These numbers may seem unbelievable, but it should be noted that the data set of tweets results in a very sparse matrix. This is why the sparse matrix combined with the matrix multiplication yields such a large increase in speed.

For the Multinomial event model, there are only two alternative implementations, from the ‘quanteda’ and ‘Rfast’ packages. This implementation was found to be 17 and 60 times faster, respectively.

Lastly, for the Gaussian event model the smallest speed-up was relative to the ‘Rfast’ package, at 2.8 times. Compared to the ‘naivebayes’ package a speed-up of 70 times was achieved, and compared to the ‘e1071’ and ‘klaR’ packages this package was found to be 1,679 and 1,608 times faster, respectively. Using a sparse matrix did not result in a faster execution time, which makes sense since the data used to test the Gaussian distribution is not sparse at all.

It should be noted that these results can vary a lot between data sets and depend on both hardware and software. The tweets data is very sparse when converted to a document-term matrix, and hence this is probably a best-case scenario. In order to make it easier to compare execution times, the tweets data used to establish these results is included in the package as ‘tweets’, the raw data, and ‘tweetsDTM’, a clean document-term matrix of the aforementioned ‘tweets’ data. The code used to convert the raw ‘tweets’ data can be found on GitHub in the ‘data-raw’ folder. Moreover, the code used to establish the results can be found below.

Code to compare Execution Times

rm(list=ls())
###################### LIBRARIES ###########################
library(tm)             # text mining utilities
library(Matrix)         # sparse matrix support
library(microbenchmark) # timing
library(e1071)          # includes a naive Bayes implementation
library(fastNaiveBayes)
library(quanteda)
library(naivebayes)
library(bnlearn)
library(klaR)
library(data.table)
# Rfast is called below via Rfast::, so it is not attached here

############################ Timing Script ################
results <- NULL

# Bernoulli Event Model
tweets <- fastNaiveBayes::tweetsDTM


y_var <- tweets$airline_sentiment
y_var <- as.factor(ifelse(y_var=='negative','negative','non-negative'))
tweets <- tweets[,2:ncol(tweets)]
tweets[tweets>1] <- 1

tweets <- tweets[,which(colSums(tweets)!=0)]
tweets <- tweets[,which(colSums(tweets)!=nrow(tweets))]

tweet_mat <- as.matrix(tweets)
sparse_tweets <- Matrix(as.matrix(tweet_mat), sparse = TRUE)

# klaR, e1071, naivebayes, and bnlearn expect factor columns
for(i in 1:ncol(tweets)){
  tweets[[i]] <- as.factor(tweets[[i]])
}

# BNLearn
bn_tweets <- cbind(y_var, tweets)
colnames(bn_tweets)[1] <- 'y_var'

# Quanteda
dfm <- as.dfm(tweet_mat)

res <- microbenchmark(
  klar = predict(klaR::NaiveBayes(x=tweets, grouping = y_var, fL=1), tweets),
  e1071 = predict(e1071::naiveBayes(tweets, y_var, laplace = 1), tweets),
  fastNaiveBayes = predict(fastNaiveBayes.bernoulli(tweet_mat, y_var, laplace = 1), tweet_mat),
  fastNaiveBayes_sparse = predict(fastNaiveBayes.bernoulli(sparse_tweets, y_var, laplace = 1), sparse_tweets),
  bnlearn = predict(bnlearn::naive.bayes(bn_tweets, 'y_var'), bn_tweets),
  quanteda = predict(quanteda::textmodel_nb(dfm, y_var, prior = "docfreq", distribution = "Bernoulli"),
                     newdata = dfm),
  naivebayes = predict(naivebayes::naive_bayes(tweets, y_var, laplace = 1), newdata = tweets),
  times = 3,
  unit = "ms"
)

res <- as.data.table(res)
res[,nrows:=nrow(tweet_mat)]
res[,ncols:=ncol(tweet_mat)]
res[,model:='Bernoulli']

results <- res

# Multinomial Event Model
tweets <- fastNaiveBayes::tweetsDTM

y_var <- tweets$airline_sentiment
y_var <- as.factor(ifelse(y_var=='negative','negative','non-negative'))
tweets <- tweets[,2:ncol(tweets)]

tweets <- tweets[,which(colSums(tweets)!=0)]

tweet_mat <- as.matrix(tweets)
sparse_tweets <- Matrix(as.matrix(tweet_mat), sparse = TRUE)

# Quanteda
dfm <- as.dfm(tweet_mat)

res <- microbenchmark(
  fastNaiveBayes = predict(fastNaiveBayes.multinomial(tweet_mat, y_var, laplace = 1), tweet_mat),
  fastNaiveBayes_sparse = predict(fastNaiveBayes.multinomial(sparse_tweets, y_var, laplace = 1), sparse_tweets),
  quanteda = predict(quanteda::textmodel_nb(dfm, y_var, prior = "docfreq", distribution = "multinomial"),
                     newdata = dfm),
  Rfast = Rfast::multinom.nb(tweet_mat, tweet_mat, y_var),
  times = 3,
  unit = "ms"
)

res <- as.data.table(res)
res[,nrows:=nrow(tweet_mat)]
res[,ncols:=ncol(tweet_mat)]
res[,model:='Multinomial']

results <- rbind(results, res)

# Gaussian Event Model
cars <- mtcars
# Duplicate the rows a few times to create a larger data set
for(i in 1:6){
  cars <- rbind(cars, cars)
}

y_var <- cars$mpg
y_var <- as.factor(ifelse(y_var>20,'negative','non-negative'))

cars <- cars[,3:7]
# Duplicate the columns a few times to create more features
for(i in 1:6){
  cars <- cbind(cars, cars)
}

cars_mat <- as.matrix(cars)
sparse_cars <- Matrix(as.matrix(cars_mat), sparse = TRUE)

res <- microbenchmark(
  klar = predict(klaR::NaiveBayes(x=cars_mat, grouping = y_var, fL=1), cars_mat),
  e1071 = predict(e1071::naiveBayes(cars_mat, y_var, laplace = 1), cars_mat),
  naivebayes = predict(naivebayes::naive_bayes(cars_mat, y_var, laplace = 1), newdata = cars_mat),
  fastNaiveBayes = predict(fastNaiveBayes.gaussian(cars_mat, y_var), cars_mat),
  fastNaiveBayes_sparse = predict(fastNaiveBayes.gaussian(sparse_cars, y_var), sparse_cars),
  Rfast = Rfast::gaussian.nb(cars_mat, cars_mat, y_var),
  times = 3,
  unit = "ms"
)

res <- as.data.table(res)
res[,nrows:=nrow(cars_mat)]
res[,ncols:=ncol(cars_mat)]
res[,model:='Gaussian']

results <- rbind(results, res)

print(results)
fwrite(results, file = "./package_timings.csv", row.names = FALSE)

Examples

rm(list=ls())
library(fastNaiveBayes)

cars <- mtcars
y <- as.factor(ifelse(cars$mpg>25,'High','Low'))
x <- cars[,2:ncol(cars)]

# Mixed event models
# Detect a suitable event model for each column
dist <- fastNaiveBayes::fastNaiveBayes.detect_distribution(x, nrows = nrow(x))
print(dist)
mod <- fastNaiveBayes.mixed(x,y,laplace = 1)
pred <- predict(mod, newdata = x)
mean(pred!=y)

# Bernoulli only
# Convert the selected columns to factors and expand them into 0/1 dummy
# variables so they can be used with the Bernoulli event model
vars <- c(dist$bernoulli, dist$multinomial)
newx <- x[,vars]
for(i in 1:ncol(newx)){
 newx[[i]] <- as.factor(newx[[i]])
}
new_mat <- model.matrix(y ~ . -1, cbind(y,newx))
mod <- fastNaiveBayes.bernoulli(new_mat, y, laplace = 1)
pred <- predict(mod, newdata = new_mat)
mean(pred!=y)

# Using a sparse Matrix:
mod <- fastNaiveBayes.bernoulli(new_mat, y, laplace = 1, sparse = TRUE)
pred <- predict(mod, newdata = new_mat)
mean(pred!=y)

# OR:
new_mat <- Matrix::Matrix(as.matrix(new_mat), sparse = TRUE)
mod <- fastNaiveBayes.bernoulli(new_mat, y, laplace = 1)
pred <- predict(mod, newdata = new_mat)
mean(pred!=y)

# Multinomial only
vars <- c(dist$bernoulli, dist$multinomial)
newx <- x[,vars]
mod <- fastNaiveBayes.multinomial(newx, y, laplace = 1)
pred <- predict(mod, newdata = newx)
mean(pred!=y)

# Gaussian only
vars <- c('hp', dist$gaussian)
newx <- x[,vars]
mod <- fastNaiveBayes.gaussian(newx, y)
pred <- predict(mod, newdata = newx)
mean(pred!=y)

References

Hu, X., & Liu, H. (2012). Text analytics in social media. In Mining text data (pp. 385-414). Springer, Boston, MA.

McCallum, A., & Nigam, K. (1998, July). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).

Schneider, K. M. (2003, April). A comparison of event models for Naive Bayes anti-spam e-mail filtering. In Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume 1 (pp. 307-314). Association for Computational Linguistics.