Feature hashing, also called as the hashing trick, is a method to transform features to vector. Without looking the indices up in an associative array, it applies a hash function to the features and uses their hash values as indices directly.

The package **FeatureHashing** implements the method in
Weinberger et al. (2009) to transform a
`data.frame`

to sparse matrix. The package provides a formula
interface similar to `model.matrix`

in `R`

and
`Matrix::sparse.model.matrix`

in the package
`Matrix`

. Splitting of concatenated data, check the help of
`test.tag`

for explanation of concatenated data, during the
construction of the model matrix.

To install the stable version from Cran, run this command:

For up-to-date version, please install from github. Windows user will need to install RTools first.

Feature hashing is useful when the user does not easy to know the dimension of the feature vector. For example, the bag-of-word representation in document classification problem requires scanning entire dataset to know how many words we have, i.e. the dimension of the feature vector.

In general, feature hashing is useful in the following environment:

- Streaming Environment
- Distirbuted Environment

Because it is expensive or impossible to know the real dimension of the feature vector.

The following scripts show how to use the
**FeatureHashing** to construct
`Matrix::dgCMatrix`

and train a model in other packages which
supports `Matrix::dgCMatrix`

as input.

The dataset is a sample from iPinYou dataset which is described in Zhang et al. (2014).

`glmnet`

```
# The following script assumes that the data.frame
# of the training dataset and testing dataset are
# assigned to variable `ipinyou.train` and `ipinyou.test`
# respectively
library(FeatureHashing)
# Checking version.
stopifnot(packageVersion("FeatureHashing") >= package_version("0.9"))
data(ipinyou)
f <- ~ IP + Region + City + AdExchange + Domain +
URL + AdSlotId + AdSlotWidth + AdSlotHeight +
AdSlotVisibility + AdSlotFormat + CreativeID +
Adid + split(UserTag, delim = ",")
# if the version of FeatureHashing is 0.8, please use the following command:
# m.train <- as(hashed.model.matrix(f, ipinyou.train, 2^16, transpose = FALSE), "dgCMatrix")
m.train <- hashed.model.matrix(f, ipinyou.train, 2^16)
m.test <- hashed.model.matrix(f, ipinyou.test, 2^16)
# logistic regression with glmnet
library(glmnet)
```

`## Loading required package: Matrix`

`## Loaded glmnet 4.1-8`

```
cv.g.lr <- cv.glmnet(m.train, ipinyou.train$IsClick,
family = "binomial")#, type.measure = "auc")
p.lr <- predict(cv.g.lr, m.test, s="lambda.min")
library(pROC)
```

`## Type 'citation("pROC")' for a citation.`

```
##
## Attaching package: 'pROC'
```

```
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
```

`## Setting levels: control = FALSE, case = TRUE`

`## Setting direction: controls < cases`

`## Area under the curve: 0.5187`

`xgboost`

Following the script above,

```
# GBDT with xgboost
if(require("xgboost")){
cv.g.gdbt <- xgboost(m.train, ipinyou.train$IsClick, max.depth=7, eta=0.1, subsample = 0.7, colsample_bytree = 0.7,
nround = 10, objective = "binary:logistic", verbose = ifelse(interactive(), 1, 0))
p.lm <- predict(cv.g.gdbt, m.test)
auc(ipinyou.test$IsClick, p.lm)
}
```

`## Loading required package: xgboost`

`## Setting levels: control = FALSE, case = TRUE`

`## Setting direction: controls < cases`

`## Area under the curve: 0.6412`

The following scripts use an implementation of the FTRL-Proximal for Logistic Regresion, which is published in McMahan et al. (2013), to predict the probability (1-step prediction) and update the model simultaneously.

```
source(system.file("ftprl.R", package = "FeatureHashing"))
m.train <- hashed.model.matrix(f, ipinyou.train, 2^16, transpose = TRUE)
ftprl <- initialize.ftprl(0.1, 1, 0.1, 0.1, 2^16)
ftprl <- update.ftprl(ftprl, m.train, ipinyou.train$IsClick, predict = TRUE)
auc(ipinyou.train$IsClick, attr(ftprl, "predict"))
```

`## Setting levels: control = FALSE, case = TRUE`

`## Setting direction: controls < cases`

`## Area under the curve: 0.5993`

If we use the same algorithm to predict the click through rate of the 3rd season of iPinYou, the overall AUC will be 0.77 which is comparable to the overall AUC of the 3rd season 0.76 reported in Zhang et al. (2014).

- character and factor
- numeric and integer
- array, i.e. concatenated strings such as
`c("a,b", "a,b,c", "a,c", "")`

McMahan, H. Brendan, Gary Holt, David Sculley, Michael Young, Dietmar
Ebner, Julian Grady, Lan Nie, et al. 2013. “Ad Click Prediction: A
View from the Trenches.” In *The 19th ACM
SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD 2013, Chicago, IL, USA, August 11-14,
2013*, edited by Inderjit S. Dhillon, Yehuda Koren, Rayid Ghani, Ted
E. Senator, Paul Bradley, Rajesh Parekh, Jingrui He, Robert L. Grossman,
and Ramasamy Uthurusamy, 1222–30. ACM. https://doi.org/10.1145/2487575.2488200.

Weinberger, Kilian Q., Anirban Dasgupta, John Langford, Alexander J.
Smola, and Josh Attenberg. 2009. “Feature Hashing for Large Scale
Multitask Learning.” In *Proceedings of the 26th Annual
International Conference on Machine Learning, ICML 2009,
Montreal, Quebec, Canada, June 14-18, 2009*, edited by Andrea
Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, 1113–20. https://doi.org/10.1145/1553374.1553516.

Zhang, Weinan, Shuai Yuan, Jun Wang, and Xuehua Shen. 2014.
“Real-Time Bidding Benchmarking with iPinYou Dataset.”
*arXiv Preprint arXiv:1407.7073*.