FeatureFinder is designed to produce comprehensive and accurate sets of features for use in modelling, either to build a new model or to enhance and diagnose an existing one. Both uses are available through a single function, findFeatures. The following sections give examples of each.

Identifying Features for a model target

If you have not yet built a model, findFeatures can be used to identify promising features. A typical modelling scenario involves a table consisting of a set of predictors \(\{x_i\}\) and a model target \(y\).

Identifying Features for a model residual

If you already have a model, and want to find features to further improve it, findFeatures can be (repeatedly) run using residuals. A typical modelling scenario involves a table consisting of a set of predictors \(\{x_i\}\) and a model residual \(r = y - p\), where \(p\) is a prediction from a previously-fit model and \(y\) is the model target.
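As a minimal sketch of the residual definition above, the following uses the built-in mtcars table as a stand-in for a modelling table. The dataset and the names first_model, actual, expected and residual are illustrative choices, not part of the package.

```r
# Illustrative only: mtcars stands in for the user's modelling table.
data(mtcars)

# A previously-fit model: fuel economy explained by weight alone.
first_model <- lm(mpg ~ wt, data = mtcars)

actual   <- mtcars$mpg                    # y, the model target
expected <- predict(first_model, mtcars)  # p, the prediction of the existing model
residual <- actual - expected             # r = y - p

# For least squares with an intercept the in-sample residuals average to zero,
# so any remaining structure has to be sought within subsets of the data.
print(mean(residual))
```

Because the overall mean residual is zero here, a useful feature shows up only as a region of the predictor space where the residual is consistently positive or negative.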

Scanning over partitions of the data

The function generates a decision tree for the entire table, as well as a decision tree for each subset (partition) of the table. Partitions are defined by the levels of factor-valued columns in the data. Factors can either be user-defined or already included as predictors. The more factors there are in the data, the more partitions will be tested, so it is helpful to create a comprehensive set of factors for this purpose. One easy way to create factors is to bin each predictor into 10 bands and create a factor for each. Factor labels should be prefixed with a character string so that they are interpreted as factors and not as numerics, for example “s1”, “s2” rather than “1”, “2”.
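The binning suggestion above can be sketched as follows. The helper bin_to_factor is hypothetical (it is not a featurefinder function), and quantile-based bands are one assumed way of choosing the 10 bands.

```r
# Hypothetical helper: bin a numeric predictor into (up to) `bands` quantile
# bands and label them "s1", "s2", ... so the column is read as a factor.
bin_to_factor <- function(x, bands = 10, prefix = "s") {
  # unique() guards against repeated quantiles when x has many ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bands + 1),
                            na.rm = TRUE))
  band <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  factor(paste0(prefix, band),
         levels = paste0(prefix, seq_len(length(breaks) - 1)))
}

data(mtcars)
mtcars$dispfactor <- bin_to_factor(mtcars$disp)
print(table(mtcars$dispfactor))
```

Applying the same helper to each numeric predictor gives one partitioning factor per predictor, each with string-prefixed labels.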

The feature-finding process

Where no model has been fitted yet, we simply define \(r=y\); where a model has already been fitted, we use the residual \(r = y - p\). We supply a single table consisting of all predictors together with the columns residual \(=r\), actual \(=y\) and expected \(=p\), and call the findFeatures function.
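A sketch of preparing such a table for the no-model case is given below. It assumes that setting expected to 0 is an acceptable way to express \(r=y\); the mtcars table and the file name are illustrative only. The exclusion string follows the quoted format used in the examples later in this document.

```r
data(mtcars)
tab <- mtcars

tab$actual   <- tab$mpg                     # y, the model target
tab$expected <- 0                           # assumption: p = 0 when no model exists
tab$residual <- tab$actual - tab$expected   # r = y

# Write the table to a CSV file, which is the input form used in the examples.
fcsv <- file.path(tempdir(), "example_table.csv")
write.csv(tab, file = fcsv, row.names = FALSE)

# Columns that must not be used as predictors, as a quoted, comma-separated string.
exclusionVars <- "\"residual\",\"expected\",\"actual\",\"mpg\""
```

The target column itself (here mpg) is excluded along with the three derived columns, so that the trees cannot split on the quantity being explained.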

Each decision tree is an rpart tree, as shown in the following example. Leaves are labelled with a residual and a leaf volume \(n\). Nodes are also labelled with their cut-rules, which are used to identify the leaves. The code scans the leaves for those with sufficiently high volume and a residual value that exceeds a user-specified threshold. These parameters are described in help(findFeatures). When a leaf meets the criteria, it is written to the txt file for the partition being scanned. These text files can then be parsed manually or automatically and the features included in models as required. Often a leaf is a clue rather than the final form of a feature, so manual inspection can help.
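The leaf scan described above can be approximated directly with rpart. This is a sketch of the idea rather than the package's internal code: the thresholds min_volume and min_residual are hypothetical names, and comparing the absolute residual with the threshold is an assumption.

```r
library(rpart)

data(mtcars)
first_model <- lm(mpg ~ wt, data = mtcars)
scan_data <- mtcars[, c("cyl", "disp", "hp", "qsec")]
scan_data$residual <- mtcars$mpg - predict(first_model, mtcars)

# Regression tree on the residual; minbucket bounds the smallest leaf volume.
tree <- rpart(residual ~ ., data = scan_data, method = "anova",
              control = rpart.control(minbucket = 5, cp = 0.001))

# Leaves are the rows of tree$frame whose split variable is "<leaf>";
# n is the leaf volume and yval the mean residual in the leaf.
frame  <- tree$frame
leaves <- frame[frame$var == "<leaf>", c("n", "yval")]

min_volume   <- 5     # hypothetical volume threshold
min_residual <- 0.5   # hypothetical residual threshold
hits <- leaves[leaves$n >= min_volume & abs(leaves$yval) >= min_residual, ]

# Recover the cut-rules that define each qualifying leaf.
if (nrow(hits) > 0) {
  rules <- path.rpart(tree, nodes = as.numeric(rownames(hits)), print.it = FALSE)
  for (node in names(rules)) {
    cat("node", node, ":", paste(rules[[node]][-1], collapse = " and "), "\n")
  }
}
```

Each printed rule is a conjunction of cut-rules, which is the same form as the leaf definitions reported in the scan output later in this document.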

Leaves that meet the user-specified criteria for residual value and leaf volume are summarised in txt files (for example treesAll.txt and allfactors\.[partitionvariable]factor.txt). Each file lists every significant term with its definition, volume and other parameters, in the form of the RESIDUAL lines shown in the example output below.
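For automatic parsing, a line in the console form printed by the examples in this document can be split into fields with a regular expression. This is only a sketch: parse_summary_line and its field names are hypothetical, and the layout of the txt files themselves may differ from the console form assumed here.

```r
# Hypothetical parser for one "RESIDUAL::" line as printed in the console output.
parse_summary_line <- function(line) {
  pattern <- paste0(
    "^RESIDUAL:: ([^:]+):(.*?) :: ([^(]+)\\(([0-9.]+)%: ",
    "([0-9]+) of ([0-9]+) in tree, E=([^,]+), A=([^,]+), ",
    "residual=([^)]+)\\) :: (.*)$"
  )
  m <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)   # line is not in the expected form
  list(
    variable = m[2],                 # partition variable, or "ALL"
    level    = m[3],                 # level of the partition variable
    value    = as.numeric(m[4]),     # leading statistic printed before the bracket
    percent  = as.numeric(m[5]),     # leaf volume as a percentage of the tree
    n_leaf   = as.integer(m[6]),
    n_tree   = as.integer(m[7]),
    expected = as.numeric(m[8]),
    actual   = as.numeric(m[9]),
    residual = as.numeric(m[10]),
    rule     = m[11]                 # cut-rules defining the leaf
  )
}

line <- "RESIDUAL:: fl:r :: 1.11(38.4%: 33 of 86 in tree, E=20.1, A=21.2, residual=1.11) :: hwy>=26.5"
parsed <- parse_summary_line(line)
str(parsed)
```

The rule field can then be turned into an indicator column by hand, or left to the package's own addFeatures function, which is used in the examples below.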

In the examples, partitioning enables significant leaves to be found for each partition, although the full dataset does not yield leaves in the fitted tree. This illustrates the benefits of the partitioning technique.
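The partitioning idea can be sketched by fitting one tree on the full table and one per factor level, then counting the leaves that pass the thresholds in each. The helper count_hits, the thresholds and the use of mtcars are illustrative assumptions, not the package's implementation.

```r
library(rpart)

data(mtcars)
scan_data <- mtcars[, c("disp", "hp", "qsec")]
scan_data$residual  <- resid(lm(mpg ~ wt, data = mtcars))
scan_data$cylfactor <- factor(paste0("cyl", mtcars$cyl))  # string-prefixed labels

# Hypothetical helper: number of leaves in one tree that pass both thresholds.
count_hits <- function(d, min_volume = 4, min_residual = 0.5) {
  tree <- rpart(residual ~ disp + hp + qsec, data = d, method = "anova",
                control = rpart.control(minbucket = min_volume, cp = 0.001))
  leaves <- tree$frame[tree$frame$var == "<leaf>", ]
  sum(leaves$n >= min_volume & abs(leaves$yval) >= min_residual)
}

# One scan over the whole table, then one scan per level of the factor.
hits_all      <- count_hits(scan_data)
hits_by_level <- sapply(split(scan_data, scan_data$cylfactor), count_hits)

print(hits_all)
print(hits_by_level)
```

A partition can pass the thresholds even when the full-table tree does not, because the mean residual within a partition need not be zero.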

Standard dataset

Once features are found, they can be customised and added to the model as shown:

library(featurefinder)
data(mpgdata)
data=mpgdata

# define some categorical factors here, for use in partition scanning. Define as many as desired.
data$transfactor=paste("trans",as.matrix(data$trans),sep="")
data$transfactor=as.factor(data$transfactor)

# define data dimensions
n=dim(data)[1] # total number of rows
nn=floor(dim(data)[1]/2) # split point for training and test
data=as.data.frame(data)
nm=names(data)
nm[8]='y' ## rename column 8 (cty, city mpg) to y, the target of the model
names(data)=nm

data0=data # retain full dataset
data=data[c("manufacturer","displ","year","transfactor","y") ] # select a subset for our first model

firstmodel=lm(formula=y ~ .,data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)
## 
## Call:
## lm(formula = y ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9945 -0.9118 -0.0586  0.8697 13.0827 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -132.89410   89.52660  -1.484 0.139214    
## manufacturerchevrolet         3.71471    0.99010   3.752 0.000228 ***
## manufacturerdodge             0.32889    0.82052   0.401 0.688960    
## manufacturerford              1.53629    0.87105   1.764 0.079247 .  
## manufacturerhonda             5.02595    0.94628   5.311 2.79e-07 ***
## manufacturerhyundai           0.80977    0.86008   0.942 0.347537    
## manufacturerjeep              1.45035    1.09554   1.324 0.187002    
## manufacturerland rover       -1.63853    1.31376  -1.247 0.213725    
## manufacturerlincoln           1.09111    1.59582   0.684 0.494907    
## manufacturermercury           0.76484    1.34193   0.570 0.569323    
## manufacturernissan            1.83283    0.86219   2.126 0.034700 *  
## manufacturerpontiac           3.04917    1.26452   2.411 0.016763 *  
## manufacturersubaru            1.23808    0.89394   1.385 0.167547    
## manufacturertoyota            2.01600    0.71210   2.831 0.005095 ** 
## manufacturervolkswagen        2.69745    0.69932   3.857 0.000153 ***
## displ                        -2.39932    0.20642 -11.623  < 2e-16 ***
## year                          0.07908    0.04459   1.773 0.077627 .  
## transfactortransauto(l3)     -0.32843    2.02987  -0.162 0.871620    
## transfactortransauto(l4)     -2.32296    1.20724  -1.924 0.055696 .  
## transfactortransauto(l5)     -2.82813    1.17031  -2.417 0.016531 *  
## transfactortransauto(l6)     -2.05385    1.47940  -1.388 0.166530    
## transfactortransauto(s4)     -0.84376    1.79665  -0.470 0.639112    
## transfactortransauto(s5)     -1.49049    1.67264  -0.891 0.373906    
## transfactortransauto(s6)     -1.86670    1.23773  -1.508 0.133029    
## transfactortransmanual(m5)   -1.41601    1.19345  -1.186 0.236786    
## transfactortransmanual(m6)   -1.91586    1.18808  -1.613 0.108353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.216 on 208 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7289 
## F-statistic: 26.06 on 25 and 208 DF,  p-value: < 2.2e-16
# drop terms that are not significant and refit model
data$manufacturerchevrolet=(data$manufacturer=='chevrolet')
data$manufacturerford=(data$manufacturer=='ford')
data$manufacturerhonda=(data$manufacturer=='honda')
data$manufacturernissan=(data$manufacturer=='nissan')
data$manufacturerpontiac=(data$manufacturer=='pontiac')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$manufacturervolkswagen=(data$manufacturer=='volkswagen')
#data$displ
data$transfactortransautol4=(data$transfactor=='transauto(l4)')
data$transfactortransautol5=(data$transfactor=='transauto(l5)')
firstmodel=lm(formula=y ~ manufacturerchevrolet+
                manufacturerford+
                manufacturerhonda+
                manufacturernissan+
                manufacturerpontiac+
                manufacturertoyota+
                manufacturervolkswagen+
                displ+
                year
                #transfactortransautol4
                #transfactortransautol5
               , data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)
## 
## Call:
## lm(formula = y ~ manufacturerchevrolet + manufacturerford + manufacturerhonda + 
##     manufacturernissan + manufacturerpontiac + manufacturertoyota + 
##     manufacturervolkswagen + displ + year, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4844 -1.0884 -0.0168  1.0811 13.4786 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -153.23187   66.81224  -2.293 0.022748 *  
## manufacturerchevroletTRUE     3.37717    0.60323   5.598 6.29e-08 ***
## manufacturerfordTRUE          1.19666    0.52566   2.276 0.023762 *  
## manufacturerhondaTRUE         4.28807    0.82778   5.180 4.94e-07 ***
## manufacturernissanTRUE        1.88155    0.66320   2.837 0.004971 ** 
## manufacturerpontiacTRUE       2.70460    1.03239   2.620 0.009401 ** 
## manufacturertoyotaTRUE        1.61596    0.45521   3.550 0.000469 ***
## manufacturervolkswagenTRUE    2.20946    0.52363   4.220 3.56e-05 ***
## displ                        -2.59037    0.14857 -17.435  < 2e-16 ***
## year                          0.08878    0.03337   2.660 0.008375 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.247 on 224 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7212 
## F-statistic: 67.98 on 9 and 224 DF,  p-value: < 2.2e-16
CSVPath=tempdir()
data1=cbind(data0,expected, actual, residual)
fcsv=paste(CSVPath,"/mpgdata.csv",sep="")
write.csv(data1[(nn+1):(length(data1$y)),],file=fcsv,row.names=FALSE)

exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=20,
         treeSummaryMinBucket=30,
         useSubDir=FALSE,
         tempDirFolderName="mpg")  
## [1] 20
## [1] 30
## [1] 0
## [1] 0
## [1] TRUE
## [1] 20
## [1] "Factor model converted to numeric as it has 21 levels."
## [1] "Doing level 1: ALL"
## [1] 20
## 
##  node number: 2 
##    root
##    model>=16.5
## 
##  node number: 12 
##    root
##    model< 16.5
##    hwy< 28.5
##    manufacturer=toyota,volkswagen
## 
##  node number: 13 
##    root
##    model< 16.5
##    hwy< 28.5
##    manufacturer=jeep,lincoln,mercury,nissan,pontiac,subaru
## 
##  node number: 7 
##    root
##    model< 16.5
##    hwy>=28.5
## [1] 3
## [1] "RESIDUAL:: ALL:ALL :: 0.575(34.2%: 40 of 117 in tree, E=16, A=16.6, residual=0.575) :: model< 16.5 and hwy< 28.5 and manufacturer=jeep,lincoln,mercury,nissan,pontiac,subaru"
## [1] "Variable 1: manufacturer has levels:"
##  [1] "hyundai"    "jeep"       "land rover" "lincoln"    "mercury"   
##  [6] "nissan"     "pontiac"    "subaru"     "toyota"     "volkswagen"
## [1] "Doing level 1: hyundai"
## [1] 20
## [1] "Doing level 2: jeep"
## [1] 20
## [1] "Doing level 3: land rover"
## [1] 20
## [1] "Doing level 4: lincoln"
## [1] 20
## [1] "Doing level 5: mercury"
## [1] 20
## [1] "Doing level 6: nissan"
## [1] 20
## [1] "Doing level 7: pontiac"
## [1] 20
## [1] "Doing level 8: subaru"
## [1] 20
## [1] "Doing level 9: toyota"
## [1] 20
## [1] "Doing level 10: volkswagen"
## [1] 20
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## [1] 1
## [1] "RESIDUAL:: manufacturer:toyota :: 7.14e-13(100%: 34 of 34 in tree, E=0, A=0, residual=0) :: NA and root"
## 
##  node number: 1 
##    root
## [1] "Variable 2: trans has levels:"
##  [1] "auto(av)"   "auto(l3)"   "auto(l4)"   "auto(l5)"   "auto(l6)"  
##  [6] "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"
## [1] "Doing level 1: auto(av)"
## [1] 20
## [1] "Doing level 2: auto(l3)"
## [1] 20
## [1] "Doing level 3: auto(l4)"
## [1] 20
## [1] "Doing level 4: auto(l5)"
## [1] 20
## [1] "Doing level 5: auto(l6)"
## [1] 20
## [1] "Doing level 6: auto(s4)"
## [1] 20
## [1] "Doing level 7: auto(s5)"
## [1] 20
## [1] "Doing level 8: auto(s6)"
## [1] 20
## [1] "Doing level 9: manual(m5)"
## [1] 20
## [1] "Doing level 10: manual(m6)"
## [1] 20
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## [1] 1
## [1] "RESIDUAL:: trans:manual(m5) :: 0.566(100%: 36 of 36 in tree, E=0, A=0, residual=0) :: NA and root"
## 
##  node number: 1 
##    root
## [1] "Variable 3: drv has levels:"
## [1] "4" "f" "r"
## [1] "Doing level 1: 4"
## [1] 20
## [1] "Doing level 2: f"
## [1] 20
## [1] "Doing level 3: r"
## [1] 20
## 
##  node number: 1 
##    root
## 
##  node number: 2 
##    root
##    hwy< 28.5
## 
##  node number: 3 
##    root
##    hwy>=28.5
## 
##  node number: 1 
##    root
## [1] "Variable 4: fl has levels:"
## [1] "d" "e" "p" "r"
## [1] "Doing level 1: d"
## [1] 20
## [1] "Doing level 2: e"
## [1] 20
## [1] "Doing level 3: p"
## [1] 20
## [1] "Doing level 4: r"
## [1] 20
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 2 
##    root
##    hwy< 26.5
## 
##  node number: 3 
##    root
##    hwy>=26.5
## [1] 2
## [1] "RESIDUAL:: fl:r :: 1.11(38.4%: 33 of 86 in tree, E=20.1, A=21.2, residual=1.11) :: hwy>=26.5"
## [1] "Variable 5: class has levels:"
## [1] "compact"    "midsize"    "pickup"     "subcompact" "suv"       
## [1] "Doing level 1: compact"
## [1] 20
## [1] "Doing level 2: midsize"
## [1] 20
## [1] "Doing level 3: pickup"
## [1] 20
## [1] "Doing level 4: subcompact"
## [1] 20
## [1] "Doing level 5: suv"
## [1] 20
## 
##  node number: 1 
##    root
## [1] 1
## [1] "RESIDUAL:: class:compact :: 0.816(100%: 32 of 32 in tree, E=0, A=0, residual=0) :: NA and root"
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## [1] "Variable 6: transfactor has levels:"
##  [1] "transauto(av)"   "transauto(l3)"   "transauto(l4)"  
##  [4] "transauto(l5)"   "transauto(l6)"   "transauto(s4)"  
##  [7] "transauto(s5)"   "transauto(s6)"   "transmanual(m5)"
## [10] "transmanual(m6)"
## [1] "Doing level 1: transauto(av)"
## [1] 20
## [1] "Doing level 2: transauto(l3)"
## [1] 20
## [1] "Doing level 3: transauto(l4)"
## [1] 20
## [1] "Doing level 4: transauto(l5)"
## [1] 20
## [1] "Doing level 5: transauto(l6)"
## [1] 20
## [1] "Doing level 6: transauto(s4)"
## [1] 20
## [1] "Doing level 7: transauto(s5)"
## [1] 20
## [1] "Doing level 8: transauto(s6)"
## [1] 20
## [1] "Doing level 9: transmanual(m5)"
## [1] 20
## [1] "Doing level 10: transmanual(m6)"
## [1] 20
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## 
##  node number: 1 
##    root
## [1] 1
## [1] "RESIDUAL:: transfactor:transmanual(m5) :: 0.566(100%: 36 of 36 in tree, E=0, A=0, residual=0) :: NA and root"
## 
##  node number: 1 
##    root
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
# potential terms identified in residual scan
# RESIDUAL: ALL,ALL,0.575,34.2,40,117,16,16.6,0.575,model< 16.5 and hwy< 28.5 and 
#                                                   manufacturer=jeep,lincoln,mercury,nissan,pontiac,subaru
# RESIDUAL: fl,r,1.11,38.4,33,86,20.1,21.2,1.11,hwy>=26.5
# RESIDUAL: class,compact,0.816,100,32,32,0,0,0,NA and root
# RESIDUAL: manufacturer,toyota,7.14e-13,100,34,34,0,0,0,NA and root
# RESIDUAL: trans,manual(m5),0.566,100,36,36,0,0,0,NA and root
# RESIDUAL: transfactor,transmanual(m5),0.566,100,36,36,0,0,0,NA and root

# add terms to dataset and refit
data$hwy=data0$hwy
data$fl=data0$fl
data$model=as.numeric(as.factor(data0$model))
data$model16hwy28manufacturer=(data$model< 16.5) & (data$hwy< 28.5) &
  (data$manufacturer %in% c("jeep","lincoln","mercury","nissan","pontiac","subaru"))
data$flr_hwy26=(data$fl=="r") & (data$hwy>=26.5)
data$transfactortransmanualm5=(data$transfactor=='transmanual(m5)')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$classcompact=(data0$class=='compact')
data$flr=(data$fl=='r')
secondmodel=lm(formula=y ~ manufacturerchevrolet+
                            manufacturerford+
                            manufacturerhonda+
                            manufacturernissan+
                            manufacturerpontiac+
                            manufacturertoyota+
                            manufacturervolkswagen+
                            displ+
                            year+
                            # new terms
                            #model16hwy28manufacturer+
                            flr_hwy26+
                            transfactortransmanualm5+
                            manufacturertoyota
                            #classcompact+
                            #flr
               , data=data)
expected=predict(secondmodel,data)

summary(firstmodel)
## 
## Call:
## lm(formula = y ~ manufacturerchevrolet + manufacturerford + manufacturerhonda + 
##     manufacturernissan + manufacturerpontiac + manufacturertoyota + 
##     manufacturervolkswagen + displ + year, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4844 -1.0884 -0.0168  1.0811 13.4786 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -153.23187   66.81224  -2.293 0.022748 *  
## manufacturerchevroletTRUE     3.37717    0.60323   5.598 6.29e-08 ***
## manufacturerfordTRUE          1.19666    0.52566   2.276 0.023762 *  
## manufacturerhondaTRUE         4.28807    0.82778   5.180 4.94e-07 ***
## manufacturernissanTRUE        1.88155    0.66320   2.837 0.004971 ** 
## manufacturerpontiacTRUE       2.70460    1.03239   2.620 0.009401 ** 
## manufacturertoyotaTRUE        1.61596    0.45521   3.550 0.000469 ***
## manufacturervolkswagenTRUE    2.20946    0.52363   4.220 3.56e-05 ***
## displ                        -2.59037    0.14857 -17.435  < 2e-16 ***
## year                          0.08878    0.03337   2.660 0.008375 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.247 on 224 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7212 
## F-statistic: 67.98 on 9 and 224 DF,  p-value: < 2.2e-16
summary(secondmodel)
## 
## Call:
## lm(formula = y ~ manufacturerchevrolet + manufacturerford + manufacturerhonda + 
##     manufacturernissan + manufacturerpontiac + manufacturertoyota + 
##     manufacturervolkswagen + displ + year + flr_hwy26 + transfactortransmanualm5 + 
##     manufacturertoyota, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0734 -1.2143 -0.0734  0.8966 13.5541 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -136.72051   67.18819  -2.035 0.043051 *  
## manufacturerchevroletTRUE       2.97267    0.59233   5.019 1.07e-06 ***
## manufacturerfordTRUE            0.99000    0.50859   1.947 0.052852 .  
## manufacturerhondaTRUE           3.75090    0.80683   4.649 5.72e-06 ***
## manufacturernissanTRUE          1.60842    0.64179   2.506 0.012923 *  
## manufacturerpontiacTRUE         2.22541    1.01005   2.203 0.028604 *  
## manufacturertoyotaTRUE          1.10726    0.45313   2.444 0.015322 *  
## manufacturervolkswagenTRUE      2.18570    0.50406   4.336 2.20e-05 ***
## displ                          -2.28853    0.15787 -14.497  < 2e-16 ***
## year                            0.07985    0.03356   2.379 0.018196 *  
## flr_hwy26TRUE                   1.60708    0.43107   3.728 0.000245 ***
## transfactortransmanualm5TRUE    0.70155    0.36450   1.925 0.055546 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.161 on 222 degrees of freedom
## Multiple R-squared:  0.7544, Adjusted R-squared:  0.7423 
## F-statistic:    62 on 11 and 222 DF,  p-value: < 2.2e-16
# Append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
## These 6 new feature(s) were added to your data ('df'):
## [1] "df$auto_feat1 <- as.numeric(df$model< 16.5 & df$hwy< 28.5 & df$manufacturer %in% c('jeep','lincoln','mercury','nissan','pontiac','subaru'))"
## df$auto_feat1 <- as.numeric(df$model< 16.5 & df$hwy< 28.5 & df$manufacturer %in% c('jeep','lincoln','mercury','nissan','pontiac','subaru')) 
## [1] "df$auto_feat2 <- as.numeric(df$class == 'compact')"
## df$auto_feat2 <- as.numeric(df$class == 'compact') 
## [1] "df$auto_feat3 <- as.numeric(df$fl == 'r' & df$hwy>=26.5)"
## df$auto_feat3 <- as.numeric(df$fl == 'r' & df$hwy>=26.5) 
## [1] "df$auto_feat4 <- as.numeric(df$manufacturer == 'toyota')"
## df$auto_feat4 <- as.numeric(df$manufacturer == 'toyota') 
## [1] "df$auto_feat5 <- as.numeric(df$transfactor == 'transmanual(m5)')"
## df$auto_feat5 <- as.numeric(df$transfactor == 'transmanual(m5)') 
## [1] "df$auto_feat6 <- as.numeric(df$trans == 'manual(m5)')"
## df$auto_feat6 <- as.numeric(df$trans == 'manual(m5)')
head(dataWithNewFeatures)
##   manufacturer model displ year cyl      trans drv  y hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f 18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f 21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f 20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f 21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f 16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f 18  26  p compact
##       transfactor auto_feat1 auto_feat2 auto_feat3 auto_feat4 auto_feat5
## 1   transauto(l5)          0          1          0          0          0
## 2 transmanual(m5)          0          1          0          0          1
## 3 transmanual(m6)          0          1          0          0          0
## 4   transauto(av)          0          1          0          0          0
## 5   transauto(l5)          0          1          0          0          0
## 6 transmanual(m5)          0          1          0          0          1
##   auto_feat6
## 1          0
## 2          1
## 3          0
## 4          0
## 5          0
## 6          1
# https://vincentarelbundock.github.io/Rdatasets/datasets.html
# http://www.public.iastate.edu/~hofmann/data_in_r_sortable.html

The adjusted R-squared has improved from 0.721 to 0.742 as a result of the newly found features.
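A comparison of this kind can be extracted programmatically rather than read off the printed summaries. The sketch below uses mtcars and a hypothetical leaf-style indicator heavy_v8, which stand in for the models above.

```r
data(mtcars)

base_model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical leaf-style feature: an indicator for a region of the data.
mtcars$heavy_v8 <- as.numeric(mtcars$cyl == 8 & mtcars$wt >= 3.5)
extended_model  <- lm(mpg ~ wt + heavy_v8, data = mtcars)

# Adjusted R-squared penalises the extra term, unlike plain R-squared.
adj <- c(base     = summary(base_model)$adj.r.squared,
         extended = summary(extended_model)$adj.r.squared)
print(round(adj, 4))

# F-test for the nested pair: does the new term explain additional variance?
print(anova(base_model, extended_model))
```

Adjusted R-squared is the appropriate figure to compare here, since plain R-squared can never decrease when a term is added.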

A more challenging dataset

A more challenging dataset is stock index data, where the aim is to predict the future relative movement of two indices, DAX and SMI. findFeatures is able to identify features which can be added to the model as shown:

library(featurefinder)
data=futuresdata
data$SMIfactor=paste("smi",as.matrix(data$SMIfactor),sep="")
n=length(data$DAX)
nn=floor(length(data$DAX)/2)

# Can we predict the relative movement of DAX and SMI?
data$y=data$DAX*0 # initialise the target to 0
data$y[1:(n-1)]=((data$DAX[2:n])-(data$DAX[1:(n-1)]))/
  (data$DAX[1:(n-1)])-(data$SMI[2:n]-(data$SMI[1:(n-1)]))/(data$SMI[1:(n-1)])

# Fit a simple model
firstmodel=lm(formula=y ~ DAX+SMI+
                            #CAC+
                            FTSE,
                            #SMIfactorsmi1
                            data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
data0=data
data=cbind(data,expected, actual, residual)

CSVPath=tempdir()
fcsv=paste(CSVPath,"/futuresdata.csv",sep="")
write.csv(data[(nn+1):(length(data$y)),],file=fcsv,row.names=FALSE)

exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=30,
         treeSummaryMinBucket=50,
         useSubDir=FALSE,
         tempDirFolderName="futures")  
## [1] 30
## [1] 50
## [1] 0
## [1] 0
## [1] TRUE
## [1] 20
## [1] "Doing level 1: ALL"
## [1] 30
## 
##  node number: 1 
##    root
## [1] "Variable 1: SMIfactor has levels:"
## [1] "smi0" "smi1"
## [1] "Doing level 1: smi0"
## [1] 30
## [1] "Doing level 2: smi1"
## [1] 30
## 
##  node number: 8 
##    root
##    CAC< 2253
##    CAC>=1998
##    CAC< 2027
## 
##  node number: 9 
##    root
##    CAC< 2253
##    CAC>=1998
##    CAC>=2027
## 
##  node number: 10 
##    root
##    CAC< 2253
##    CAC< 1998
##    CAC< 1882
## 
##  node number: 11 
##    root
##    CAC< 2253
##    CAC< 1998
##    CAC>=1882
## 
##  node number: 3 
##    root
##    CAC>=2253
## [1] 4
## [1] "RESIDUAL:: SMIfactor:smi0 :: 0.00124(34.2%: 177 of 518 in tree, E=-0.000469, A=0.000773, residual=0.00124) :: CAC< 2253 and CAC< 1998 and CAC>=1882"
## 
##  node number: 8 
##    root
##    FTSE< 5915
##    FTSE>=4904
##    CAC< 2941
## 
##  node number: 18 
##    root
##    FTSE< 5915
##    FTSE>=4904
##    CAC>=2941
##    CAC>=3002
## 
##  node number: 19 
##    root
##    FTSE< 5915
##    FTSE>=4904
##    CAC>=2941
##    CAC< 3002
## 
##  node number: 10 
##    root
##    FTSE< 5915
##    FTSE< 4904
##    DAX< 3894
## 
##  node number: 11 
##    root
##    FTSE< 5915
##    FTSE< 4904
##    DAX>=3894
## 
##  node number: 3 
##    root
##    FTSE>=5915
## [1] 6
## [1] "RESIDUAL:: SMIfactor:smi1 :: 0.00167(14.6%: 60 of 412 in tree, E=-0.000158, A=0.00151, residual=0.00167) :: FTSE>=5915"
## [1] 1
## [1] 2
newfeat1=((data$SMIfactor=="smi0") & (data$CAC < 2253) & (data$CAC< 1998) & (data$CAC>=1882)) * 1.0
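# note: this hand-written rule uses an SMI band, whereas the scan output above reports FTSE>=5915 for smi1
newfeat2=((data$SMIfactor=="smi1") & (data$SMI < 7837) & (data$SMI >= 7499)) * 1.0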
newfeatures=cbind(newfeat1, newfeat2) # create columns for the newly found features
datanew=cbind(data0,newfeatures)

secondmodel=lm(formula=y ~ DAX+SMI+
                           #CAC+
                           FTSE+
                           #SMIfactorsmi1+
                           newfeat1+newfeat2,
                data=datanew[,])
expectednew=predict(secondmodel,datanew)

library(Metrics) # provides rmse(); library() stops with an error if the package is missing
OriginalRMSE = rmse(data$y,expected)
NewRMSE = rmse(data$y,expectednew)

print(paste("OriginalRMSE = ",OriginalRMSE))
## [1] "OriginalRMSE =  0.00756638670538071"
print(paste("NewRMSE = ",NewRMSE))
## [1] "NewRMSE =  0.00752004575897434"
summary(firstmodel)
## 
## Call:
## lm(formula = y ~ DAX + SMI + FTSE, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.047277 -0.004239  0.000183  0.004355  0.031747 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.099e-03  2.372e-03   2.994 0.002794 ** 
## DAX         -4.765e-06  1.289e-06  -3.697 0.000224 ***
## SMI          5.354e-06  1.318e-06   4.064 5.03e-05 ***
## FTSE        -3.722e-06  1.345e-06  -2.768 0.005690 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007575 on 1856 degrees of freedom
## Multiple R-squared:  0.009349,   Adjusted R-squared:  0.007747 
## F-statistic: 5.838 on 3 and 1856 DF,  p-value: 0.0005745
summary(secondmodel)
## 
## Call:
## lm(formula = y ~ DAX + SMI + FTSE + newfeat1 + newfeat2, data = datanew[, 
##     ])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046980 -0.004204  0.000177  0.004306  0.032340 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.962e-03  2.359e-03   2.951  0.00320 ** 
## DAX         -5.724e-06  1.302e-06  -4.395 1.17e-05 ***
## SMI          5.711e-06  1.314e-06   4.346 1.46e-05 ***
## FTSE        -3.458e-06  1.347e-06  -2.567  0.01034 *  
## newfeat1     1.217e-03  4.266e-04   2.853  0.00438 ** 
## newfeat2     4.580e-03  1.220e-03   3.753  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007532 on 1854 degrees of freedom
## Multiple R-squared:  0.02145,    Adjusted R-squared:  0.01881 
## F-statistic: 8.126 on 5 and 1854 DF,  p-value: 1.345e-07
# Append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
## These 2 new feature(s) were added to your data ('df'):
## [1] "df$auto_feat1 <- as.numeric(df$SMIfactor == 'smi0' & df$CAC< 2253 & df$CAC< 1998 & df$CAC>=1882)"
## df$auto_feat1 <- as.numeric(df$SMIfactor == 'smi0' & df$CAC< 2253 & df$CAC< 1998 & df$CAC>=1882) 
## [1] "df$auto_feat2 <- as.numeric(df$SMIfactor == 'smi1' & df$FTSE>=5915)"
## df$auto_feat2 <- as.numeric(df$SMIfactor == 'smi1' & df$FTSE>=5915)
head(dataWithNewFeatures)
##       DAX    SMI    CAC   FTSE SMIfactor            y auto_feat1
## 1 1628.75 1678.1 1772.8 2443.6      smi0 -0.015480678          0
## 2 1613.63 1688.5 1750.5 2460.2      smi0  0.001450780          0
## 3 1606.51 1678.6 1718.0 2448.2      smi0  0.005767910          0
## 4 1621.04 1684.1 1708.1 2470.4      smi0 -0.003261110          0
## 5 1618.16 1686.6 1723.1 2484.7      smi0  0.004227839          0
## 6 1610.61 1671.6 1714.3 2466.8      smi0  0.005744589          0
##   auto_feat2
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

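Both newly discovered features are statistically significant (p = 0.0044 for newfeat1 and p = 0.00018 for newfeat2). Adjusted R-squared rises from 0.0077 to 0.0188, the residual standard error falls from 0.007575 to 0.007532, and the F-statistic improves from 5.838 (p = 0.00057) to 8.126 (p = 1.3e-07).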
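The RMSE comparison above relies on the Metrics package. Where that package is not available, the same quantity can be computed in base R; rmse_base is a hypothetical name used only for this sketch.

```r
# Root mean squared error: square root of the mean squared difference.
rmse_base <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Small worked case: errors are 0, 0 and -2, so the mean squared error is 4/3.
example_rmse <- rmse_base(c(1, 2, 3), c(1, 2, 5))
print(example_rmse)
```

In-sample RMSE cannot get worse when terms are added to a least-squares model, so a small reduction such as the one reported above is best read alongside the significance of the new terms.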