ProSGPV is a package that performs variable selection with Second-Generation P-Values (SGPV). This document illustrates how ProSGPV works with continuous outcomes in linear regression. Technical details about the algorithm can be found in Zuo, Stewart, and Blume (2020).
To install the ProSGPV package from CRAN, you can do
install.packages("ProSGPV")
Alternatively, a development version of ProSGPV can be installed.
Once the package is installed, we can load it into the current environment.
library(ProSGPV)
The data set is stored in the ProSGPV package under the name t.housing. The goal is to find important variables associated with the sale prices of real estate units and then build a prediction model. More details about data collection are available in Rafiei and Adeli (2016). There are 26 explanatory variables and one outcome; the variable descriptions are shown below.
| Category | Label | Description |
|---|---|---|
| Outcome | V9 | Actual sales price |
| Project physical and financial features | V2 | Total floor area of the building; Total preliminary estimated construction cost; Preliminary estimated construction cost; Equivalent preliminary estimated construction cost in a selected base year; Duration of construction; Price of the unit at the beginning of the project |
| Economic variables and indices | V11 | The number of building permits issued; Building services index for a pre-selected base year; Wholesale price index of building materials for the base year; Total floor areas of building permits issued by the city/municipality; Private sector investment in new buildings; Land price index for the base year; The number of loans extended by banks in a time resolution; The amount of loans extended by banks in a time resolution; The interest rate for loan in a time resolution; The average construction cost by private sector when completed; The average cost of buildings by private sector at the beginning; Official exchange rate with respect to dollars; Nonofficial (street market) exchange rate with respect to dollars; Consumer price index (CPI) in the base year; CPI of housing, water, fuel & power in the base year; Stock market index; Population of the city; Gold price per ounce |
We can load the data and pass it to the pro.sgpv function. By default, a two-stage algorithm is run, and printing the fitted object lists the selected variables.
x <- t.housing[, -ncol(t.housing)]
y <- t.housing$V9
sgpv.2s <- pro.sgpv(x, y)
sgpv.2s
#> Selected variables are V8 V12 V13 V15 V17 V26
We can print the summary of the linear regression on the selected variables with the S3 method summary.
summary(sgpv.2s)
#>
#> Call:
#> lm(formula = Response ~ ., data = data.d)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -1276.35   -75.59    -9.58    59.46  1426.22
#>
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  1.708e+02  3.471e+01   4.920 1.31e-06 ***
#> V8           1.211e+00  1.326e-02  91.277  < 2e-16 ***
#> V12         -2.737e+01  2.470e+00 -11.079  < 2e-16 ***
#> V13          2.185e+01  2.105e+00  10.381  < 2e-16 ***
#> V15          2.041e-03  1.484e-04  13.756  < 2e-16 ***
#> V17         -3.459e+00  8.795e-01  -3.934  0.00010 ***
#> V26         -4.683e+00  1.780e+00  -2.630  0.00889 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 194.8 on 365 degrees of freedom
#> Multiple R-squared:  0.9743, Adjusted R-squared:  0.9739
#> F-statistic:  2310 on 6 and 365 DF,  p-value: < 2.2e-16
Coefficient estimates can be extracted with the S3 method coef. Note that it returns a vector of length \(p\), with zeros for the variables that were not selected.
coef(sgpv.2s)
#>  [1]   0.000000000   0.000000000   0.000000000   0.000000000   0.000000000
#>  [6]   0.000000000   1.210755031   0.000000000 -27.367601037  21.853920174
#> [11]   0.000000000   0.002040784   0.000000000  -3.459496972   0.000000000
#> [16]   0.000000000   0.000000000   0.000000000   0.000000000   0.000000000
#> [21]   0.000000000   0.000000000  -4.683172725   0.000000000   0.000000000
#> [26]   0.000000000
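Because coef returns one entry per predictor, it can help to attach names and keep only the nonzero entries. The sketch below uses a small hand-made vector as a stand-in for coef(sgpv.2s); the names and values are illustrative assumptions, not output from t.housing.

```r
# Stand-in for coef(sgpv.2s): one entry per predictor, zero if not selected.
# Names and values here are made up for illustration.
est <- c(0, 1.21, 0, -3.46, 0)
names(est) <- c("a", "b", "c", "d", "e")

# Keep only the variables with a nonzero coefficient estimate.
selected <- est[est != 0]
print(selected)
```

With the objects from this document, the same pattern would presumably be est <- coef(sgpv.2s) followed by names(est) <- colnames(x), assuming x keeps its column names.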
In-sample predictions can be made with the S3 method predict. An external sample can be supplied through the newdata argument of predict to make out-of-sample predictions.
head(predict(sgpv.2s))
#>         1         2         3         4         5         6
#> 1565.7505 3573.7793  741.7576  212.1297 5966.1682 5724.0172
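The difference between in-sample and out-of-sample prediction can be sketched with a held-out set. To keep the example runnable without ProSGPV, it uses simulated data and lm as a stand-in for the fitted object; the data, the split, and the variable names are all assumptions made for illustration.

```r
# Simulated data; lm() is a stand-in for a fitted pro.sgpv object.
set.seed(42)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 + rnorm(n)

# Hold out the last 20 rows as an external sample.
train <- dat[1:80, ]
test <- dat[81:100, ]

fit <- lm(y ~ x1, data = train)

# In-sample predictions (no newdata) and out-of-sample predictions (newdata).
pred_in <- predict(fit)
pred_out <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows.
rmse_out <- sqrt(mean((test$y - pred_out)^2))
print(rmse_out)
```

With ProSGPV, the analogous call would be predict(sgpv.2s, newdata = x_new), where x_new is a hypothetical data frame with the same columns as x.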
The ProSGPV selection path can be plotted with the S3 method plot. The lambda.max argument controls the range of \(\lambda\). The black vertical dotted line is the \(\lambda\) selected by the generalized information criterion (Fan and Tang 2013). The null zone is the grey shaded region near 0. The blue labels on the Y-axis are the selected variables.
plot(sgpv.2s,lambda.max = 0.005)
By default, three lines per variable are drawn. You can also choose to view only one bound per variable by setting the lpv argument to 1, in which case the single line is the confidence bound that is closer to 0.
plot(sgpv.2s, lambda.max=0.005, lpv=1)
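To make "the confidence bound that is closer to 0" concrete, here is a minimal sketch. The helper closer_bound and the interval values are hypothetical and are not part of ProSGPV. How the package treats intervals that cover zero is not described here, so the sketch simply returns the bound with the smaller absolute value.

```r
# Hypothetical helper: return the interval bound with the smaller absolute value.
closer_bound <- function(lower, upper) {
  ifelse(abs(lower) < abs(upper), lower, upper)
}

# Illustrative intervals, not taken from the fitted model.
lower <- c(1.18, -32.2, 0.2)
upper <- c(1.24, -22.5, 0.9)
print(closer_bound(lower, upper))
```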
A one-stage algorithm is also available when \(n>p\), but it may have a lower support recovery rate and higher parameter estimation bias than the two-stage algorithm. Its advantages are fast computation and a result that is fixed for a given data set.
sgpv.1s <- pro.sgpv(x, y, stage = 1)
sgpv.1s
#> Selected variables are V8 V12 V13 V15 V17 V25 V26
Note that the one-stage algorithm selects one more variable than the two-stage algorithm.
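The difference between the two selections can be checked directly from the printed variable names. This short sketch copies the two sets by hand from the output shown in this document; the object names two_stage and one_stage are introduced here for illustration only.

```r
# Selected variables as printed for the two fits in this document.
two_stage <- c("V8", "V12", "V13", "V15", "V17", "V26")
one_stage <- c("V8", "V12", "V13", "V15", "V17", "V25", "V26")

# Variables chosen by the one-stage fit only, and by the two-stage fit only.
only_one_stage <- setdiff(one_stage, two_stage)
only_two_stage <- setdiff(two_stage, one_stage)
print(only_one_stage)
```

Here the one-stage selection is a superset of the two-stage selection, with V25 as the extra variable.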
The S3 methods summary, coef, predict, and plot are also available for the one-stage algorithm. In particular, plot(sgpv.1s) presents the variable selection results in the full model. Point estimates and 95% confidence intervals are shown for each variable, and the null bounds are shown as green vertical bars. Selected variables are colored in blue.
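The picture described above, with intervals compared against null bounds in the full model, can be mimicked with a simplified interval-based screen. This is only a sketch under stated assumptions: the data are simulated, delta is a hand-picked hypothetical null bound, and the rule is not claimed to match how pro.sgpv computes its null bounds or makes its selection.

```r
# Simplified interval-based screening on simulated data.
# delta is a hand-picked, hypothetical null bound, not the package's choice.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 * dat$x1 + rnorm(n)

# Fit the full model with every predictor.
full <- lm(y ~ ., data = dat)

# 95% confidence intervals for the slopes (drop the intercept row).
ci <- confint(full, level = 0.95)[-1, , drop = FALSE]

delta <- 0.5
# Keep a variable when its whole interval lies outside [-delta, delta].
keep <- ci[, 1] > delta | ci[, 2] < -delta
selected <- rownames(ci)[keep]
print(selected)
```

Only x1 has a true nonzero effect in this simulation, so it is the variable one would expect the screen to retain.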