The R package `surveyCV` carries out cross-validation for complex sample survey data.

It is a companion R package to our
SDSS 2021 presentation, and to our *Stat* article “K-fold cross-validation for
complex sample surveys” (published online Jan 12, 2022).

`surveyCV` is designed to work with the `survey` package to specify the sampling design (strata, clusters, sampling weights, etc.), and to account for this design when forming CV folds and estimating the CV test error.

The package currently handles the entire CV process for linear and logistic regression models. For other models, users can generate design-based CV folds with the `folds.svy()` function, then use these folds in their own custom training/testing CV loop.

Install a stable version of the package from CRAN:

`install.packages("surveyCV")`

Or, for the latest development version, install directly from GitHub:

```
# install.packages("remotes")
remotes::install_github("ColbyStatSvyRsch/surveyCV", build_vignettes = TRUE)
```

The function `cv.svy()` carries out K-fold CV on a dataset for a set of linear or logistic regression formulas, using specified strata, clusters, weights, and FPCs. For each model under consideration, it reports the design-based mean CV loss and a design-based estimate of its SE.

Use `nest = TRUE` only if cluster IDs are nested within strata (i.e., if clusters in different strata might reuse the same names).
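
For instance, here is a minimal sketch with hypothetical toy data (the `toy` data frame and its values are made up for illustration) showing the kind of ID scheme that calls for `nest = TRUE`:

```
# Hypothetical toy data: cluster IDs restart at 1 within each stratum,
# so they are only unique *within* strata, not across them.
toy <- data.frame(strata = rep(c("A", "B"), each = 4),
                  SECU   = rep(c(1, 1, 2, 2), times = 2))
# With nest = TRUE, cluster "1" in stratum "A" is correctly treated as
# a different cluster than cluster "1" in stratum "B".
```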

```
library(surveyCV)
library(splines)
data(NSFG_data)
cv.svy(NSFG_data, c("income ~ ns(age, df = 2)",
                    "income ~ ns(age, df = 3)",
                    "income ~ ns(age, df = 4)"),
       nfolds = 4,
       strataID = "strata", clusterID = "SECU",
       nest = TRUE, weightsID = "wgt")
#>           mean     SE
#> .Model_1 22616 756.02
#> .Model_2 22536 748.01
#> .Model_3 22559 766.89
# The 2nd model (spline with 3 df) had the lowest survey CV MSE,
# although it's well within one SE of the other models.
```

For convenience, the function `cv.svydesign()` only needs a `svydesign` object and a formula, and will parse the relevant survey design information before passing it to `cv.svy()`.

Similarly, the function `cv.svyglm()` only needs a `svyglm` object, and will parse both the formula and the survey design.

```
library(survey)
NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                         weights = ~wgt, data = NSFG_data)
cv.svydesign(formulae = c("income ~ ns(age, df = 2)",
                          "income ~ ns(age, df = 3)",
                          "income ~ ns(age, df = 4)"),
             design_object = NSFG.svydes, nfolds = 4)
#>           mean     SE
#> .Model_1 22576 744.59
#> .Model_2 22436 739.81
#> .Model_3 22577 752.62

NSFG.svyglm <- svyglm(income ~ ns(age, df = 3), design = NSFG.svydes)
cv.svyglm(glm_object = NSFG.svyglm, nfolds = 4)
#>           mean     SE
#> .Model_1 22411 741.93
```

The function `folds.svy()` generates design-based fold IDs for K-fold CV, using any specified strata and clusters.

(Briefly: For a stratified sample, each fold will contain data from each
stratum. For a cluster sample, a given cluster’s rows will all be
assigned to the same fold. See our *Stat* paper for
details.)
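
As a quick sanity check, here is a minimal sketch (our own illustration, not from the package documentation) confirming on the NSFG data that no cluster is split across folds:

```
# Sketch: each (stratum, cluster) combination should land in exactly one fold
ids <- folds.svy(NSFG_data, nfolds = 4,
                 strataID = "strata", clusterID = "SECU")
cl  <- interaction(NSFG_data$strata, NSFG_data$SECU)
all(tapply(ids, cl, function(f) length(unique(f))) == 1)  # should be TRUE
```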

Using these fold IDs, you can write your own CV loop for models that our package does not currently handle.

Here is an example of tuning the bin size for a design-based random forest, using the `rpms_forest()` function from the `rpms` package. Note that while `folds.svy()` accounts for the clustering, we also need to pass the cluster IDs and survey weights to `rpms_forest()` for design-consistent model-fitting, and we need to use the survey weights in the MSE calculations.

```
library(rpms)
data(CE)

# Generate fold IDs that account for clustering in the survey design
# for a subset of the CE dataset
nfolds <- 5
CEsubset <- CE[which(CE$IRAX > 0), ]
CEsubset$.foldID <- folds.svy(CEsubset, nfolds = nfolds, clusterID = "CID")

# Use CV to tune the bin_size parameter of rpms_forest()
bin_sizes <- c(10, 20, 50, 100, 250, 500)
SSEs <- rep(0, length(bin_sizes))
for(ff in 1:nfolds) {
  train <- subset(CEsubset, .foldID != ff)
  test  <- subset(CEsubset, .foldID == ff)
  for(bb in 1:length(bin_sizes)) {
    rf <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                      data = train,
                      weights = ~FINLWT21, clusters = ~CID,
                      bin_size = bin_sizes[bb], f_size = 50)
    yhat <- predict(rf, newdata = test)
    res2 <- (yhat - test$IRAX)^2
    # Sum up weighted SSEs, not MSEs yet,
    # b/c cluster sizes may differ across folds and b/c of survey weights
    SSEs[bb] <- SSEs[bb] + sum(res2 * test$FINLWT21)
  }
}
# Divide entire weighted sum by the sum of weights
MSEs <- SSEs / sum(CEsubset$FINLWT21)

# Show results
cbind(bin_sizes, MSEs)
#>      bin_sizes         MSEs
#> [1,]        10 204246617270
#> [2,]        20 202870633392
#> [3,]        50 201393921358
#> [4,]       100 201085838446
#> [5,]       250 201825549231
#> [6,]       500 204155844501
# Bin size 100 had the lowest survey-weighted CV MSE estimate,
# though sizes 50 and 250 were quite similar.
# Now we could fit a random forest with bin size 100 on the full `CEsubset` dataset.
```
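
Following the comment at the end of that loop, the final fit might look like this (a sketch that simply reuses the same `rpms_forest()` arguments as above with the tuned bin size):

```
# Sketch: refit on the full subset at the chosen bin size
rf_final <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                        data = CEsubset,
                        weights = ~FINLWT21, clusters = ~CID,
                        bin_size = 100, f_size = 50)
```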

Our GitHub repo includes R code to reproduce figures for our
*Stat* article “K-fold
cross-validation for complex sample surveys” (published online Jan
12, 2022).

Scripts for the PPI and NSFG examples are in the `data-raw` folder, in the `PPI_Zambia_plot.R` and `NSFG_plot.R` scripts. We cannot share the proprietary PPI dataset, but the preprocessed NSFG dataset is included in the package as `NSFG_data`, and instructions for preprocessing the NSFG data are in the same folder in the `NSFG_data.R` script.

Simulation code is in the `plots-for-Stat-paper` vignette.