About This Demo

This file demonstrates a typical process of using R package “cleandata” to prepare data for machine learning.

R Package “cleandata”

A collection of functions that work with data frame to inspect, impute, encode, and partition data. The functions for imputation, encoding, and partitioning can produce log files to help you keep track of data manipulation process.

Available on CRAN: https://cran.r-project.org/package=cleandata

Source Codes on GitHub: https://github.com/sherrisherry/cleandata

  • New in V0.3.0: log files can be produced by an environment variable log_arg or by assigning a list to parameter log.

Dataset Used in This Demo

  • Name: Ames Housing dataset
  • Source: Kaggle contest House Prices

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa for predicting housing prices, this dataset is a typical example of what a business analyst encounters everyday.


Inspection

According to the description of this dataset, the “NA”s in some columns aren’t missing value. To prevent R from comfusing them with true missing values, read in the data files without converting any value to the NA in R.

The train set should have only one more column SalePrice than the test set.

# import 'cleandata' package.
library('cleandata')

# read in the training and test datasets without converting 'NA's to missing values.
train <- read.csv('data/train.csv', na.strings = "", strip.white = TRUE)
test <- read.csv('data/test.csv', na.strings = "", strip.white = TRUE)

# summarize the training set and test set
cat(paste('train: ', nrow(train), 'obs. ', ncol(train), 'cols\ncolumn names:\n', toString(colnames(train)), 
          '\n\ntest: ', nrow(test), 'obs. ', ncol(test), 'cols\ncolumn names:\n', toString(colnames(test)), '\n'))
## train:  1460 obs.  81 cols
## column names:
##  Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice 
## 
## test:  1459 obs.  80 cols
## column names:
##  Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition

To ensure consistency in the following imputation and encoding process across the train set and the test set, I appended the test set to the train set. The SalePrice values of the rows of the test set was set to NA to distinguish them from the rows of the train set. The resulting data frame was called df.

# filling the target columns of the test set with NA then combining test and training sets
test$SalePrice <- NA
df <- rbind(train, test)
rm(train, test)

Function inspect_na

inspect_na() counts the number of NAs in each column and sort them in descending order. In the following operation, inspect_na() returned the top 5 columns with missing values. If you want to see the number of missing values in every column, leave parameter top as default. As supposed, only SalePrice contained missing values, which equaled to the number of rows in the test set.

inspect_na(df, top = 5)
##   SalePrice          Id  MSSubClass    MSZoning LotFrontage 
##        1459           0           0           0           0

The NAs in the columns listed in NAisNoA were what was refered to as ‘none’-but-not-‘NA’ values. In these columns, NA had only one possible value - “not applicable”. I replaced these NAs with NoA to prevent imputing them later.

# in the 'NAisNoA' columns, NA means this attribute doesn't apply to them, not missing.
NAisNoA <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 
             'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 
             'PoolQC', 'Fence', 'MiscFeature')
for(i in NAisNoA){levels(df[, i])[levels(df[, i]) == "NA"] <- "NoA"}

At this stage, I reconstructed the data frame df to inspect the true missing values.

We can see that only LotFrontage had about 20% missing values. The other columns had few to no missing value.

# write the dataset into a csv file then read this file back to df to reconstruct df
write.csv(df, file = 'data/data.csv', row.names = FALSE)
df <- read.csv('data/data.csv', na.strings = "NA", strip.white = TRUE)

# see which predictors have most NAs
inspect_na(df[, -ncol(df)], top = 25)
##  LotFrontage  GarageYrBlt   MasVnrType   MasVnrArea     MSZoning 
##          486          159           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1 
##           Id   MSSubClass      LotArea       Street        Alley 
##            0            0            0            0            0

Function inspect_map

inspect_map() classifies the columns of a data frame. Before I further explain this function, I’d like to introduce ‘scheme’. In package “cleandata”, a scheme refers to a set of all the possible values of an enumerator. The factor objects in R are enumerators.

Function inspect_map returns a list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).

In the following codes, I specified that 2 factorial columns share the same scheme if their levels had more than 2 same values by setting the common parameter to 2. By default, the common parameter is 0, which means every level of 2 factorial columns should be the same for them to share the same scheme.

# create a map for imputation and encoding
data_map <- inspect_map(df[, -ncol(df)], common = 2)
## Id integer factors: 0 nums: 1 chars: 0 ordered: 0 others: 0 
## MSSubClass integer factors: 0 nums: 2 chars: 0 ordered: 0 others: 0 
## MSZoning factor factors: 1 nums: 2 chars: 0 ordered: 0 others: 0 
## LotFrontage integer factors: 1 nums: 3 chars: 0 ordered: 0 others: 0 
## LotArea integer factors: 1 nums: 4 chars: 0 ordered: 0 others: 0 
## Street factor factors: 2 nums: 4 chars: 0 ordered: 0 others: 0 
## Alley factor factors: 3 nums: 4 chars: 0 ordered: 0 others: 0 
## LotShape factor factors: 4 nums: 4 chars: 0 ordered: 0 others: 0 
## LandContour factor factors: 5 nums: 4 chars: 0 ordered: 0 others: 0 
## Utilities factor factors: 6 nums: 4 chars: 0 ordered: 0 others: 0 
## LotConfig factor factors: 7 nums: 4 chars: 0 ordered: 0 others: 0 
## LandSlope factor factors: 8 nums: 4 chars: 0 ordered: 0 others: 0 
## Neighborhood factor factors: 9 nums: 4 chars: 0 ordered: 0 others: 0 
## Condition1 factor factors: 10 nums: 4 chars: 0 ordered: 0 others: 0 
## Condition2 factor factors: 11 nums: 4 chars: 0 ordered: 0 others: 0 
## BldgType factor factors: 12 nums: 4 chars: 0 ordered: 0 others: 0 
## HouseStyle factor factors: 13 nums: 4 chars: 0 ordered: 0 others: 0 
## OverallQual integer factors: 13 nums: 5 chars: 0 ordered: 0 others: 0 
## OverallCond integer factors: 13 nums: 6 chars: 0 ordered: 0 others: 0 
## YearBuilt integer factors: 13 nums: 7 chars: 0 ordered: 0 others: 0 
## YearRemodAdd integer factors: 13 nums: 8 chars: 0 ordered: 0 others: 0 
## RoofStyle factor factors: 14 nums: 8 chars: 0 ordered: 0 others: 0 
## RoofMatl factor factors: 15 nums: 8 chars: 0 ordered: 0 others: 0 
## Exterior1st factor factors: 16 nums: 8 chars: 0 ordered: 0 others: 0 
## Exterior2nd factor factors: 17 nums: 8 chars: 0 ordered: 0 others: 0 
## MasVnrType factor factors: 18 nums: 8 chars: 0 ordered: 0 others: 0 
## MasVnrArea integer factors: 18 nums: 9 chars: 0 ordered: 0 others: 0 
## ExterQual factor factors: 19 nums: 9 chars: 0 ordered: 0 others: 0 
## ExterCond factor factors: 20 nums: 9 chars: 0 ordered: 0 others: 0 
## Foundation factor factors: 21 nums: 9 chars: 0 ordered: 0 others: 0 
## BsmtQual factor factors: 22 nums: 9 chars: 0 ordered: 0 others: 0 
## BsmtCond factor factors: 23 nums: 9 chars: 0 ordered: 0 others: 0 
## BsmtExposure factor factors: 24 nums: 9 chars: 0 ordered: 0 others: 0 
## BsmtFinType1 factor factors: 25 nums: 9 chars: 0 ordered: 0 others: 0 
## BsmtFinSF1 integer factors: 25 nums: 10 chars: 0 ordered: 0 others: 0 
## BsmtFinType2 factor factors: 26 nums: 10 chars: 0 ordered: 0 others: 0 
## BsmtFinSF2 integer factors: 26 nums: 11 chars: 0 ordered: 0 others: 0 
## BsmtUnfSF integer factors: 26 nums: 12 chars: 0 ordered: 0 others: 0 
## TotalBsmtSF integer factors: 26 nums: 13 chars: 0 ordered: 0 others: 0 
## Heating factor factors: 27 nums: 13 chars: 0 ordered: 0 others: 0 
## HeatingQC factor factors: 28 nums: 13 chars: 0 ordered: 0 others: 0 
## CentralAir factor factors: 29 nums: 13 chars: 0 ordered: 0 others: 0 
## Electrical factor factors: 30 nums: 13 chars: 0 ordered: 0 others: 0 
## X1stFlrSF integer factors: 30 nums: 14 chars: 0 ordered: 0 others: 0 
## X2ndFlrSF integer factors: 30 nums: 15 chars: 0 ordered: 0 others: 0 
## LowQualFinSF integer factors: 30 nums: 16 chars: 0 ordered: 0 others: 0 
## GrLivArea integer factors: 30 nums: 17 chars: 0 ordered: 0 others: 0 
## BsmtFullBath integer factors: 30 nums: 18 chars: 0 ordered: 0 others: 0 
## BsmtHalfBath integer factors: 30 nums: 19 chars: 0 ordered: 0 others: 0 
## FullBath integer factors: 30 nums: 20 chars: 0 ordered: 0 others: 0 
## HalfBath integer factors: 30 nums: 21 chars: 0 ordered: 0 others: 0 
## BedroomAbvGr integer factors: 30 nums: 22 chars: 0 ordered: 0 others: 0 
## KitchenAbvGr integer factors: 30 nums: 23 chars: 0 ordered: 0 others: 0 
## KitchenQual factor factors: 31 nums: 23 chars: 0 ordered: 0 others: 0 
## TotRmsAbvGrd integer factors: 31 nums: 24 chars: 0 ordered: 0 others: 0 
## Functional factor factors: 32 nums: 24 chars: 0 ordered: 0 others: 0 
## Fireplaces integer factors: 32 nums: 25 chars: 0 ordered: 0 others: 0 
## FireplaceQu factor factors: 33 nums: 25 chars: 0 ordered: 0 others: 0 
## GarageType factor factors: 34 nums: 25 chars: 0 ordered: 0 others: 0 
## GarageYrBlt integer factors: 34 nums: 26 chars: 0 ordered: 0 others: 0 
## GarageFinish factor factors: 35 nums: 26 chars: 0 ordered: 0 others: 0 
## GarageCars integer factors: 35 nums: 27 chars: 0 ordered: 0 others: 0 
## GarageArea integer factors: 35 nums: 28 chars: 0 ordered: 0 others: 0 
## GarageQual factor factors: 36 nums: 28 chars: 0 ordered: 0 others: 0 
## GarageCond factor factors: 37 nums: 28 chars: 0 ordered: 0 others: 0 
## PavedDrive factor factors: 38 nums: 28 chars: 0 ordered: 0 others: 0 
## WoodDeckSF integer factors: 38 nums: 29 chars: 0 ordered: 0 others: 0 
## OpenPorchSF integer factors: 38 nums: 30 chars: 0 ordered: 0 others: 0 
## EnclosedPorch integer factors: 38 nums: 31 chars: 0 ordered: 0 others: 0 
## X3SsnPorch integer factors: 38 nums: 32 chars: 0 ordered: 0 others: 0 
## ScreenPorch integer factors: 38 nums: 33 chars: 0 ordered: 0 others: 0 
## PoolArea integer factors: 38 nums: 34 chars: 0 ordered: 0 others: 0 
## PoolQC factor factors: 39 nums: 34 chars: 0 ordered: 0 others: 0 
## Fence factor factors: 40 nums: 34 chars: 0 ordered: 0 others: 0 
## MiscFeature factor factors: 41 nums: 34 chars: 0 ordered: 0 others: 0 
## MiscVal integer factors: 41 nums: 35 chars: 0 ordered: 0 others: 0 
## MoSold integer factors: 41 nums: 36 chars: 0 ordered: 0 others: 0 
## YrSold integer factors: 41 nums: 37 chars: 0 ordered: 0 others: 0 
## SaleType factor factors: 42 nums: 37 chars: 0 ordered: 0 others: 0 
## SaleCondition factor factors: 43 nums: 37 chars: 0 ordered: 0 others: 0
summary(data_map)
##               Length Class  Mode     
## factor_cols   31     -none- list     
## factor_levels 31     -none- list     
## num_cols      37     -none- character
## char_cols      0     -none- NULL     
## ordered_cols   0     -none- NULL     
## other_cols     0     -none- NULL

This dataset only had factorial and numeric columns. I unpacked data_map before heading to imputation and encoding.

factor_cols <- data_map$factor_cols
factor_levels <- data_map$factor_levels
num_cols <- data_map$num_cols
rm(data_map)

Imputation and Encoding

The functions for imputation and encoding keep track of your process by producing log files. This feature is by default disabled. To enable log files, I created an environment variable log_arg to storep the list of arguments for sink(). In old versions, log_arg should be assigned to log parameter in every imputation or encoding function, which is still supported by this version.

# create a list of arguments for producing a log file
log_arg <- list(file = 'log.txt', append = TRUE, split = FALSE)

log_arg can be a list of any arguments for sink(). In this example, the log file was named “log.txt”, new information was appended to the file, and the contents to the log file weren’t printed to the standard output.

In this version, parameter log by default searches a list called log_arg in the dynamic scope parent environment and takes the value of log_arg. If log is assigned a list, it takes the assigned value. If no a list log_arg in the parent and no list is assigned to log, no log file.

Imputation Functions

To prevent leakage, I instructed the imputation functions to use only rows of the train set to calculate the imputation values by passing an index to parameter idx.

Function impute_mode, impute_median, impute_mean

impute_mode() works with both numerical, string, and factorial columns. It impute NAs by the modes of their corresponding columns.

impute_median() and impute_mean() only work with numerical columns. They impute NAs by medians and means respectively.

# impute NAs in factorial columns by the mode of corresponding columns
lst <- unlist(factor_cols)
df <- impute_mode(df, cols = lst, idx = !is.na(df$SalePrice))

# impute NAs in numerical columns by the median of corresponding columns
lst <- num_cols
df <- impute_median(df, cols = lst, idx = !is.na(df$SalePrice))

# check the result
inspect_na(df[, -ncol(df)], top = 5)
##          Id  MSSubClass    MSZoning LotFrontage     LotArea 
##           0           0           0           0           0

Encoding Functions

Every encoding function prints summary of the columns before and after encoding by default. The output of encode_ordinal() and encode_binary() is by default factorial. If you want numerical output, set parameter out.int to TRUE after making sure no missing value in the input.

In this demo, I kept the encoded columns factorial because I intended to save the dataset into a csv file, which doesn’t distinguish between factorial and numerical columns.

Encoding Ordinal Columns

In business datasets, we can often find ratings, which are ordinal and use similar schemes. Based on my experience, if many columns share the same scheme, they are likely to be ratings.

summary(factor_cols)
##               Length Class  Mode     
## MSZoning       1     -none- character
## Street         1     -none- character
## Alley          1     -none- character
## LotShape       1     -none- character
## LandContour    1     -none- character
## Utilities      1     -none- character
## LotConfig      1     -none- character
## LandSlope      1     -none- character
## Neighborhood   1     -none- character
## Condition1     2     -none- character
## BldgType       1     -none- character
## HouseStyle     1     -none- character
## RoofStyle      1     -none- character
## RoofMatl       1     -none- character
## Exterior1st    2     -none- character
## MasVnrType     1     -none- character
## ExterQual     10     -none- character
## Foundation     1     -none- character
## BsmtExposure   1     -none- character
## BsmtFinType1   2     -none- character
## Heating        1     -none- character
## CentralAir     1     -none- character
## Electrical     1     -none- character
## Functional     1     -none- character
## GarageType     1     -none- character
## GarageFinish   1     -none- character
## PavedDrive     1     -none- character
## Fence          1     -none- character
## MiscFeature    1     -none- character
## SaleType       1     -none- character
## SaleCondition  1     -none- character

In our dataset ExterQual and other 9 columns share the same scheme. After I checked their scheme and the description file, I was sure that they were ordinal.

factor_levels$ExterQual
## [1] "Ex"  "Fa"  "Gd"  "NoA" "Po"  "TA"

“Po”: poor, “Fa”: fair, “TA”: typical/average, “Gd”: good, “Ex”: excellent

Function encode_ordinal

encode_ordinal() encodes ordinal data into sequential integers by a given order. The argument passed to none is always encoded to 0. The 1st member of the vector passed to order is encoded to 1.

# encoding ordinal columns
i <- 'ExterQual'; lst <- c('Po', 'Fa', 'TA', 'Gd', 'Ex')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]]], order = lst, none = 'NoA')
##  ExterQual ExterCond BsmtQual   BsmtCond   HeatingQC KitchenQual
##  Ex: 107   Ex:  12   Ex : 258   Fa : 104   Ex:1493   Ex: 205    
##  Fa:  35   Fa:  67   Fa :  88   Gd : 122   Fa:  92   Fa:  70    
##  Gd: 979   Gd: 299   Gd :1209   NoA:  82   Gd: 474   Gd:1151    
##  TA:1798   Po:   3   NoA:  81   Po :   5   Po:   3   TA:1493    
##            TA:2538   TA :1283   TA :2606   TA: 857              
##                                                                 
##  FireplaceQu GarageQual GarageCond PoolQC    
##  Ex :  43    Ex :   3   Ex :   3   Ex :   4  
##  Fa :  74    Fa : 124   Fa :  74   Fa :   2  
##  Gd : 744    Gd :  24   Gd :  15   Gd :   4  
##  NoA:1420    NoA: 159   NoA: 159   NoA:2909  
##  Po :  46    Po :   5   Po :  14             
##  TA : 592    TA :2604   TA :2654             
## coded 10 cols 5 levels 
##  ExterQual ExterCond BsmtQual BsmtCond HeatingQC KitchenQual FireplaceQu
##  5: 107    5:  12    5: 258   2: 104   5:1493    5: 205      5:  43     
##  2:  35    2:  67    2:  88   4: 122   2:  92    2:  70      2:  74     
##  4: 979    4: 299    4:1209   0:  82   4: 474    4:1151      4: 744     
##  3:1798    1:   3    0:  81   1:   5   1:   3    3:1493      0:1420     
##            3:2538    3:1283   3:2606   3: 857                1:  46     
##                                                              3: 592     
##  GarageQual GarageCond PoolQC  
##  5:   3     5:   3     5:   4  
##  2: 124     2:  74     2:   2  
##  4:  24     4:  15     4:   4  
##  0: 159     0: 159     0:2909  
##  1:   5     1:  14             
##  3:2604     3:2654
# removing encoded columns from the map
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL

The Utilities column was binary according the dataset.

factor_levels$Utilities
## [1] "AllPub" "NoSeWa"
levels(df$Utilities)
## [1] "AllPub" "NoSeWa"

However, the description file indicates that it has 4 possible values: ‘ELO’, ‘NoSeWa’, ‘NoSewr’, ‘AllPub’. Therefore, I encoded it as having 4 levels.

# in dataset only "AllPub" "NoSeWa", with 2 NAs
i <- 'Utilities'; lst <- c('ELO', 'NoSeWa', 'NoSewr', 'AllPub')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]], drop=FALSE], order = lst)
##   Utilities   
##  AllPub:2918  
##  NoSeWa:   1  
## coded 1 cols 4 levels 
##  Utilities
##  4:2918   
##  2:   1
factor_levels[[i]]<-NULL
factor_cols[[i]]<-NULL

Encoding Binary Columns

# find all the 2-level columns
lst <- lapply(factor_levels, length)
lst <- as.data.frame(lst)
lst <- colnames(lst[, lst == 2])
cat(lst)
## Street CentralAir

Function encode_binary

encode_binary() encodes binary data into 0 and 1, regardless of order.

# encode all the 2-level columns
i <- unlist(factor_cols[lst])
df[, i] <- encode_binary(df[, i, drop=FALSE])
##   Street     CentralAir
##  Grvl:  12   N: 196    
##  Pave:2907   Y:2723    
## coded 1 cols 
## coded 1 cols 
##  Street   CentralAir
##  0:  12   0: 196    
##  1:2907   1:2723
factor_levels[lst] <- NULL
factor_cols[lst] <- NULL

Encoding other categorical Columns

Although we may have found more ordinal columns, I wanted to speed up our process so I assumed that all the remaining categorical columns were not ordered.

Function encode_onehot

encode_onehot() encodes categorical data by One-hot encoding.

# encode all the other categorical Columns
i <- unlist(factor_cols)
df0 <- encode_onehot(df[, i, drop=FALSE])
##     MSZoning     Alley      LotShape   LandContour   LotConfig   
##  C (all):  25   Grvl: 120   IR1: 968   Bnk: 117    Corner : 511  
##  FV     : 139   NoA :2721   IR2:  76   HLS: 120    CulDSac: 176  
##  RH     :  26   Pave:  78   IR3:  16   Low:  60    FR2    :  85  
##  RL     :2269               Reg:1859   Lvl:2622    FR3    :  14  
##  RM     : 460                                      Inside :2133  
##                                                                  
##                                                                  
##  LandSlope   Neighborhood    Condition1     Condition2     BldgType   
##  Gtl:2778   NAmes  : 443   Norm   :2511   Norm   :2889   1Fam  :2425  
##  Mod: 125   CollgCr: 267   Feedr  : 164   Feedr  :  13   2fmCon:  62  
##  Sev:  16   OldTown: 239   Artery :  92   Artery :   5   Duplex: 109  
##             Edwards: 194   RRAn   :  50   PosA   :   4   Twnhs :  96  
##             Somerst: 182   PosN   :  39   PosN   :   4   TwnhsE: 227  
##             NridgHt: 166   RRAe   :  28   RRNn   :   2                
##             (Other):1428   (Other):  35   (Other):   2                
##    HouseStyle     RoofStyle       RoofMatl     Exterior1st  
##  1Story :1471   Flat   :  20   CompShg:2876   VinylSd:1026  
##  2Story : 872   Gable  :2310   Tar&Grv:  23   MetalSd: 450  
##  1.5Fin : 314   Gambrel:  22   WdShake:   9   HdBoard: 442  
##  SLvl   : 128   Hip    : 551   WdShngl:   7   Wd Sdng: 411  
##  SFoyer :  83   Mansard:  11   ClyTile:   1   Plywood: 221  
##  2.5Unf :  24   Shed   :   5   Membran:   1   CemntBd: 126  
##  (Other):  27                  (Other):   2   (Other): 243  
##   Exterior2nd     MasVnrType    Foundation   BsmtExposure BsmtFinType1
##  VinylSd:1015   BrkCmn :  25   BrkTil: 311   Av : 418     ALQ:429     
##  MetalSd: 447   BrkFace: 879   CBlock:1235   Gd : 276     BLQ:269     
##  HdBoard: 406   None   :1766   PConc :1308   Mn : 239     GLQ:849     
##  Wd Sdng: 391   Stone  : 249   Slab  :  49   No :1904     LwQ:154     
##  Plywood: 270                  Stone :  11   NoA:  82     NoA: 79     
##  CmentBd: 126                  Wood  :   5                Rec:288     
##  (Other): 264                                             Unf:851     
##  BsmtFinType2  Heating     Electrical   Functional    GarageType  
##  ALQ:  52     Floor:   1   FuseA: 188   Maj1:  19   2Types :  23  
##  BLQ:  68     GasA :2874   FuseF:  50   Maj2:   9   Attchd :1723  
##  GLQ:  34     GasW :  27   FuseP:   8   Min1:  65   Basment:  36  
##  LwQ:  87     Grav :   9   Mix  :   1   Min2:  70   BuiltIn: 186  
##  NoA:  80     OthW :   2   SBrkr:2672   Mod :  35   CarPort:  15  
##  Rec: 105     Wall :   6                Sev :   2   Detchd : 779  
##  Unf:2493                               Typ :2719   NoA    : 157  
##  GarageFinish PavedDrive   Fence      MiscFeature    SaleType   
##  Fin: 719     N: 216     GdPrv: 118   Gar2:   5   WD     :2526  
##  NoA: 159     P:  62     GdWo : 112   NoA :2814   New    : 239  
##  RFn: 811     Y:2641     MnPrv: 329   Othr:   4   COD    :  87  
##  Unf:1230                MnWw :  12   Shed:  95   ConLD  :  26  
##                          NoA  :2348   TenC:   1   CWD    :  12  
##                                                   ConLI  :   9  
##                                                   (Other):  20  
##  SaleCondition 
##  Abnorml: 190  
##  AdjLand:  12  
##  Alloca :  24  
##  Family :  46  
##  Normal :2402  
##  Partial: 245  
##                
## coded col MSZoning ; 5 levels 
## coded col Alley ; 3 levels 
## coded col LotShape ; 4 levels 
## coded col LandContour ; 4 levels 
## coded col LotConfig ; 5 levels 
## coded col LandSlope ; 3 levels 
## coded col Neighborhood ; 25 levels 
## coded col Condition1 ; 9 levels 
## coded col Condition2 ; 8 levels 
## coded col BldgType ; 5 levels 
## coded col HouseStyle ; 8 levels 
## coded col RoofStyle ; 6 levels 
## coded col RoofMatl ; 8 levels 
## coded col Exterior1st ; 15 levels 
## coded col Exterior2nd ; 16 levels 
## coded col MasVnrType ; 4 levels 
## coded col Foundation ; 6 levels 
## coded col BsmtExposure ; 5 levels 
## coded col BsmtFinType1 ; 7 levels 
## coded col BsmtFinType2 ; 7 levels 
## coded col Heating ; 6 levels 
## coded col Electrical ; 5 levels 
## coded col Functional ; 7 levels 
## coded col GarageType ; 7 levels 
## coded col GarageFinish ; 4 levels 
## coded col PavedDrive ; 3 levels 
## coded col Fence ; 5 levels 
## coded col MiscFeature ; 5 levels 
## coded col SaleType ; 9 levels 
## coded col SaleCondition ; 6 levels 
##      MSZoning_C (all)           MSZoning_FV           MSZoning_RH 
##                    25                   139                    26 
##           MSZoning_RL           MSZoning_RM            Alley_Grvl 
##                  2269                   460                   120 
##             Alley_NoA            Alley_Pave          LotShape_IR1 
##                  2721                    78                   968 
##          LotShape_IR2          LotShape_IR3          LotShape_Reg 
##                    76                    16                  1859 
##       LandContour_Bnk       LandContour_HLS       LandContour_Low 
##                   117                   120                    60 
##       LandContour_Lvl      LotConfig_Corner     LotConfig_CulDSac 
##                  2622                   511                   176 
##         LotConfig_FR2         LotConfig_FR3      LotConfig_Inside 
##                    85                    14                  2133 
##         LandSlope_Gtl         LandSlope_Mod         LandSlope_Sev 
##                  2778                   125                    16 
##  Neighborhood_Blmngtn  Neighborhood_Blueste   Neighborhood_BrDale 
##                    28                    10                    30 
##  Neighborhood_BrkSide  Neighborhood_ClearCr  Neighborhood_CollgCr 
##                   108                    44                   267 
##  Neighborhood_Crawfor  Neighborhood_Edwards  Neighborhood_Gilbert 
##                   103                   194                   165 
##   Neighborhood_IDOTRR  Neighborhood_MeadowV  Neighborhood_Mitchel 
##                    93                    37                   114 
##    Neighborhood_NAmes  Neighborhood_NoRidge  Neighborhood_NPkVill 
##                   443                    71                    23 
##  Neighborhood_NridgHt   Neighborhood_NWAmes  Neighborhood_OldTown 
##                   166                   131                   239 
##   Neighborhood_Sawyer  Neighborhood_SawyerW  Neighborhood_Somerst 
##                   151                   125                   182 
##  Neighborhood_StoneBr    Neighborhood_SWISU   Neighborhood_Timber 
##                    51                    48                    72 
##  Neighborhood_Veenker     Condition1_Artery      Condition1_Feedr 
##                    24                    92                   164 
##       Condition1_Norm       Condition1_PosA       Condition1_PosN 
##                  2511                    20                    39 
##       Condition1_RRAe       Condition1_RRAn       Condition1_RRNe 
##                    28                    50                     6 
##       Condition1_RRNn     Condition2_Artery      Condition2_Feedr 
##                     9                     5                    13 
##       Condition2_Norm       Condition2_PosA       Condition2_PosN 
##                  2889                     4                     4 
##       Condition2_RRAe       Condition2_RRAn       Condition2_RRNn 
##                     1                     1                     2 
##         BldgType_1Fam       BldgType_2fmCon       BldgType_Duplex 
##                  2425                    62                   109 
##        BldgType_Twnhs       BldgType_TwnhsE     HouseStyle_1.5Fin 
##                    96                   227                   314 
##     HouseStyle_1.5Unf     HouseStyle_1Story     HouseStyle_2.5Fin 
##                    19                  1471                     8 
##     HouseStyle_2.5Unf     HouseStyle_2Story     HouseStyle_SFoyer 
##                    24                   872                    83 
##       HouseStyle_SLvl        RoofStyle_Flat       RoofStyle_Gable 
##                   128                    20                  2310 
##     RoofStyle_Gambrel         RoofStyle_Hip     RoofStyle_Mansard 
##                    22                   551                    11 
##        RoofStyle_Shed      RoofMatl_ClyTile      RoofMatl_CompShg 
##                     5                     1                  2876 
##      RoofMatl_Membran        RoofMatl_Metal         RoofMatl_Roll 
##                     1                     1                     1 
##      RoofMatl_Tar&Grv      RoofMatl_WdShake      RoofMatl_WdShngl 
##                    23                     9                     7 
##   Exterior1st_AsbShng   Exterior1st_AsphShn   Exterior1st_BrkComm 
##                    44                     2                     6 
##   Exterior1st_BrkFace    Exterior1st_CBlock   Exterior1st_CemntBd 
##                    87                     2                   126 
##   Exterior1st_HdBoard   Exterior1st_ImStucc   Exterior1st_MetalSd 
##                   442                     1                   450 
##   Exterior1st_Plywood     Exterior1st_Stone    Exterior1st_Stucco 
##                   221                     2                    43 
##   Exterior1st_VinylSd   Exterior1st_Wd Sdng   Exterior1st_WdShing 
##                  1026                   411                    56 
##   Exterior2nd_AsbShng   Exterior2nd_AsphShn   Exterior2nd_Brk Cmn 
##                    38                     4                    22 
##   Exterior2nd_BrkFace    Exterior2nd_CBlock   Exterior2nd_CmentBd 
##                    47                     3                   126 
##   Exterior2nd_HdBoard   Exterior2nd_ImStucc   Exterior2nd_MetalSd 
##                   406                    15                   447 
##     Exterior2nd_Other   Exterior2nd_Plywood     Exterior2nd_Stone 
##                     1                   270                     6 
##    Exterior2nd_Stucco   Exterior2nd_VinylSd   Exterior2nd_Wd Sdng 
##                    47                  1015                   391 
##   Exterior2nd_Wd Shng     MasVnrType_BrkCmn    MasVnrType_BrkFace 
##                    81                    25                   879 
##       MasVnrType_None      MasVnrType_Stone     Foundation_BrkTil 
##                  1766                   249                   311 
##     Foundation_CBlock      Foundation_PConc       Foundation_Slab 
##                  1235                  1308                    49 
##      Foundation_Stone       Foundation_Wood       BsmtExposure_Av 
##                    11                     5                   418 
##       BsmtExposure_Gd       BsmtExposure_Mn       BsmtExposure_No 
##                   276                   239                  1904 
##      BsmtExposure_NoA      BsmtFinType1_ALQ      BsmtFinType1_BLQ 
##                    82                   429                   269 
##      BsmtFinType1_GLQ      BsmtFinType1_LwQ      BsmtFinType1_NoA 
##                   849                   154                    79 
##      BsmtFinType1_Rec      BsmtFinType1_Unf      BsmtFinType2_ALQ 
##                   288                   851                    52 
##      BsmtFinType2_BLQ      BsmtFinType2_GLQ      BsmtFinType2_LwQ 
##                    68                    34                    87 
##      BsmtFinType2_NoA      BsmtFinType2_Rec      BsmtFinType2_Unf 
##                    80                   105                  2493 
##         Heating_Floor          Heating_GasA          Heating_GasW 
##                     1                  2874                    27 
##          Heating_Grav          Heating_OthW          Heating_Wall 
##                     9                     2                     6 
##      Electrical_FuseA      Electrical_FuseF      Electrical_FuseP 
##                   188                    50                     8 
##        Electrical_Mix      Electrical_SBrkr       Functional_Maj1 
##                     1                  2672                    19 
##       Functional_Maj2       Functional_Min1       Functional_Min2 
##                     9                    65                    70 
##        Functional_Mod        Functional_Sev        Functional_Typ 
##                    35                     2                  2719 
##     GarageType_2Types     GarageType_Attchd    GarageType_Basment 
##                    23                  1723                    36 
##    GarageType_BuiltIn    GarageType_CarPort     GarageType_Detchd 
##                   186                    15                   779 
##        GarageType_NoA      GarageFinish_Fin      GarageFinish_NoA 
##                   157                   719                   159 
##      GarageFinish_RFn      GarageFinish_Unf          PavedDrive_N 
##                   811                  1230                   216 
##          PavedDrive_P          PavedDrive_Y           Fence_GdPrv 
##                    62                  2641                   118 
##            Fence_GdWo           Fence_MnPrv            Fence_MnWw 
##                   112                   329                    12 
##             Fence_NoA      MiscFeature_Gar2       MiscFeature_NoA 
##                  2348                     5                  2814 
##      MiscFeature_Othr      MiscFeature_Shed      MiscFeature_TenC 
##                     4                    95                     1 
##          SaleType_COD          SaleType_Con        SaleType_ConLD 
##                    87                     5                    26 
##        SaleType_ConLI        SaleType_ConLw          SaleType_CWD 
##                     9                     8                    12 
##          SaleType_New          SaleType_Oth           SaleType_WD 
##                   239                     7                  2526 
## SaleCondition_Abnorml SaleCondition_AdjLand  SaleCondition_Alloca 
##                   190                    12                    24 
##  SaleCondition_Family  SaleCondition_Normal SaleCondition_Partial 
##                    46                  2402                   245
df[, i] <- NULL
df <- cbind(df, df0)

Partitioning

Function partition_random

partition_random() partitions a dataset randomly.

# partition the dataset
df0 <- partition_random(df[!is.na(df$SalePrice),], train = 8, test = FALSE)
## Train: 80%, Validation: 20%, Test: 0%

The Log File

Let’s check the log file at the end.

cat(paste(readLines('log.txt'), collapse = '\n'))
## Columns Imputed by Mode:
##  MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC, Foundation, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageType, GarageFinish, PavedDrive, Fence, MiscFeature, SaleType, SaleCondition
## 
## Columns Imputed by Median:
##  Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, MiscVal, MoSold, YrSold
## 
## Columns:
##  ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC
## Scheme:
## NoA  Po  Fa  TA  Gd  Ex 
##   0   1   2   3   4   5 
## 
## Columns:
##  Utilities
## Scheme:
##           ELO NoSeWa NoSewr AllPub 
##      0      1      2      3      4 
## 
## Columns:
##  Street, CentralAir
## Scheme:
## Grvl Pave 
##    0    1 
## 
## Columns:
##  Street, CentralAir
## Scheme:
## N Y 
## 0 1 
## 
## Columns Encoded by Onehot:
##  MSZoning, Alley, LotShape, LandContour, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, Foundation, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, Electrical, Functional, GarageType, GarageFinish, PavedDrive, Fence, MiscFeature, SaleType, SaleCondition
## 
## Details:
## Template of New Column Names: oldname_level; Dropping 1st Level: FALSE
## 
## Columns Partitioned by Random:
##  Partition
## 
## Details:
## Train: 80%, Validation: 20%, Test: 0%

=== end ===