surveyCV: Cross Validation Based on Survey Design

Cole Guerin, Thomas McMahon, Jerzy Wieczorek

Introduction to surveyCV

This package implements cross validation (CV) for complex survey data, by accounting for strata, clusters, FPCs, and survey weights when creating CV folds as well as when calculating test-set loss estimates (currently either mean squared error (MSE) for linear models, or binary cross-entropy for logistic models). It is meant to work smoothly with the survey package.

In addition, the function folds.svy() (which makes the design-based CV folds) can be used to code up custom CV loops to evaluate other models besides linear and logistic regression.

To understand why we believe it’s important to account for survey design when carrying out CV, please see our paper: Wieczorek, Guerin, and McMahon (2022), “K-Fold Cross-Validation for Complex Sample Surveys,” Stat <doi:10.1002/sta4.454>.
Briefly, the problem is that standard CV assumes exchangeable data, which is not true of most complex sampling designs. We argue (1) that CV folds should mimic the sampling design (so that we’re honest about how much the model-fits will vary across “samples like this one”), and (2) that test set MSEs should be averaged and weighted in ways that generalize to the full population from which the sample was drawn, assuming that is the population for which the models will be applied.

This vignette briefly illustrates how to use surveyCV with several common types of sampling design. We provide three equivalent ways to carry out CV: direct control with cv.svy, use of an existing survey design object with cv.svydesign, or use of an existing survey GLM object with cv.svyglm. We also illustrate how to write your own design-based CV loop, with a design-consistent random forest example from the rpms package.

Setup

# We set a random seed in this vignette only to ensure
#   that our discussion will match the CV output;
# otherwise, each time we reran the vignette, we'd get different CV folds
set.seed(2022)

library(surveyCV)
data("NSFG_data")
library(survey)
data("api")

Linear and logistic regression with cv.svy, cv.svydesign, and cv.svyglm

Direct control with cv.svy

First, we borrow some examples from the documentation for survey::svyglm(), based on the api data (the Academic Performance Index for California schools in the year 2000).

Say that we want to carry out complex survey CV for several linear models, and we are working with the stratified sample in apistrat. Using cv.svy(), we specify the dataset; the formulas for the models being compared; the number of folds; and any necessary information about the sampling design. For this particular stratified design, we must specify the strata variable, the weights variable, and the FPC variable.

#stratified sample
cv.svy(apistrat, c("api00~ell",
                   "api00~ell+meals",
                   "api00~ell+meals+mobility"),
       nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")
#>            mean     SE
#> .Model_1 9101.0 872.52
#> .Model_2 5333.9 504.76
#> .Model_3 5390.0 511.97

The 2nd model (api00~ell+meals) appears to have the lowest average test MSE (in the mean) column, although its performance does not differ much from the 3rd model and the difference between them is much smaller than either of their standard errors (in the SE column).

If we were to use the single-stage cluster sample in apiclus1 instead, the command would be similar but we would need to specify clusters instead of strata:

# one-stage cluster sample
cv.svy(apiclus1, c("api00~ell",
                   "api00~ell+meals",
                   "api00~ell+meals+mobility"),
       nfolds = 5, clusterID = "dnum", weightsID = "pw", fpcID = "fpc")
#>            mean      SE
#> .Model_1 8541.8 2127.89
#> .Model_2 3363.1  707.12
#> .Model_3 3473.8  726.64

On the other hand, if we are working with a simple random sample as in apisrs, we could just use standard CV instead of complex survey CV to generate the folds. However, complex survey CV may still be useful with a SRS if we want to account for the finite population correction (FPC) when estimating the SEs of our CV MSEs.

# simple random sample
cv.svy(apisrs, c("api00~ell",
                 "api00~ell+meals",
                 "api00~ell+meals+mobility"),
       nfolds = 5, fpcID = "fpc")
#>            mean      SE
#> .Model_1 9880.3  994.73
#> .Model_2 6720.3 1151.04
#> .Model_3 6869.2 1178.15

Note that our intent in this vignette is not to compare the stratified sample results vs. the cluster sample results vs. the SRS results. We show 3 approaches only to illustrate how the code depends on the sampling design. Each of these samples was taken in a different way, and so it should be analyzed in a different way. In most cases, the data analyst will have only one dataset, and they ought to choose the right form of CV based on that one dataset’s sampling design.

(But if we do have different datasets taken under different sampling designs from the same population, it’s plausible that the “best model” might differ across sampling designs. For instance, a stratified sample will likely have more precision than a cluster sample with the same number of ultimate observation units—so it’s possible that the stratified-sample’s CV would choose a larger model than the cluster-sample’s CV. That is, the stratified sample might have enough precision to fit a larger model without overfitting, while CV might show us that the cluster sample is overfitting when it tries to fit a larger model.)

Finally, we show an example from a different dataset, the 2015-2017 National Survey of Family Growth (NSFG), which uses clusters nested within strata as well as unequal sample weights. In this case we are using 4 folds, not 5 as above, because each stratum only has 4 clusters.

(Note: We use nest = TRUE only if cluster IDs are nested within strata, i.e., if clusters in different strata might reuse the same names.)

# complex sample from NSFG
library(splines)
cv.svy(NSFG_data, c("income ~ ns(age, df = 1)",
                    "income ~ ns(age, df = 2)",
                    "income ~ ns(age, df = 3)",
                    "income ~ ns(age, df = 4)",
                    "income ~ ns(age, df = 5)",
                    "income ~ ns(age, df = 6)"),
       nfolds = 4,
       strataID = "strata", clusterID = "SECU",
       nest = TRUE, weightsID = "wgt")
#>           mean     SE
#> .Model_1 22674 730.26
#> .Model_2 22450 733.46
#> .Model_3 22416 734.37
#> .Model_4 22468 745.27
#> .Model_5 22497 749.42
#> .Model_6 22517 739.71

Here we tend to see that the spline with 3 degrees of freedom fits a little better than the others, but the differences between models are mostly smaller than the SEs.

Using a survey design object with cv.svydesign

When using the survey package, we will often create a svydesign object in order to calculate means and totals, fit GLMs, etc. In that case, instead of using cv.svy(), it is more convenient to use cv.svydesign(), which will read the relevant information out of the svydesign object and internally pass it along to cv.svy() for us.

We repeat the api stratified, cluster, and SRS examples above using this cv.svydesign() approach.

#stratified sample
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dstrat, nfolds = 5)
#>            mean     SE
#> .Model_1 9138.9 879.50
#> .Model_2 5332.5 498.04
#> .Model_3 5416.8 502.89

# one-stage cluster sample
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dclus1, nfolds = 5)
#>            mean      SE
#> .Model_1 8992.7 2131.82
#> .Model_2 3351.6  648.11
#> .Model_3 3394.9  654.34

# simple random sample
dsrs <- svydesign(id = ~1, data = apisrs, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dsrs, nfolds = 5)
#>            mean     SE
#> .Model_1 9850.6 1009.0
#> .Model_2 7224.2 1302.3
#> .Model_3 7587.6 1360.2

Next, we repeat the NSFG example using this approach.

NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                         weights = ~wgt, data = NSFG_data)
cv.svydesign(formulae = c("income ~ ns(age, df = 1)",
                          "income ~ ns(age, df = 2)",
                          "income ~ ns(age, df = 3)",
                          "income ~ ns(age, df = 4)",
                          "income ~ ns(age, df = 5)",
                          "income ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4)
#>           mean     SE
#> .Model_1 22668 733.45
#> .Model_2 22464 740.10
#> .Model_3 22364 748.72
#> .Model_4 22374 731.03
#> .Model_5 22452 753.67
#> .Model_6 22523 735.79

Using a survey GLM object with cv.svyglm

Finally, if we have already fit a svyglm object, we can use cv.svyglm() which will read the formula as well as the survey design information and internally pass it along to cv.svy() for us. However, this approach only works for one GLM at time instead of comparing several models in a single function.

We repeat the api stratified, cluster, and SRS examples above using this cv.svyglm() approach, with the help of the surveydesign objects already created above.

#stratified sample
glmstrat <- svyglm(api00 ~ ell+meals+mobility, design = dstrat)
cv.svyglm(glmstrat, nfolds = 5)
#>            mean     SE
#> .Model_1 5632.6 548.47

# one-stage cluster sample
glmclus1 <- svyglm(api00 ~ ell+meals+mobility, design = dclus1)
cv.svyglm(glmclus1, nfolds = 5)
#>            mean     SE
#> .Model_1 3446.1 701.11

# simple random sample
glmsrs <- svyglm(api00 ~ ell+meals+mobility, design = dsrs)
cv.svyglm(glmsrs, nfolds = 5)
#>            mean     SE
#> .Model_1 6738.8 1106.1

Next, we repeat the NSFG example using this approach.

NSFG.svyglm <- svyglm(income ~ ns(age, df = 3), design = NSFG.svydes)
cv.svyglm(glm_object = NSFG.svyglm, nfolds = 4)
#>           mean     SE
#> .Model_1 22387 732.46

Fitting a logistic model instead of linear

If we have a binary response variable and wish to fit a logistic regression instead, we can use family = quasibinomial() in the call to svyglm() and feed that directly into cv.svyglm():

NSFG.svyglm.logistic <- svyglm(LBW ~ ns(age, df = 3), design = NSFG.svydes,
                               family = quasibinomial())
cv.svyglm(glm_object = NSFG.svyglm.logistic, nfolds = 4)
#>             mean     SE
#> .Model_1 0.34523 0.0177

In this case, the mean column shows the average of the binary cross-entropy loss \(-[y_i \log(\hat y_i) + (1-y_i)\log(1-\hat y_i)]\) rather than MSE.
As with MSE, lower values of this loss indicate better-fitting models.

Or we can set method = "logistic" in cv.svydesign() or in cv.svy():

cv.svydesign(formulae = c("LBW ~ ns(age, df = 1)",
                          "LBW ~ ns(age, df = 2)",
                          "LBW ~ ns(age, df = 3)",
                          "LBW ~ ns(age, df = 4)",
                          "LBW ~ ns(age, df = 5)",
                          "LBW ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4,
             method = "logistic")
#>             mean     SE
#> .Model_1 0.33997 0.0173
#> .Model_2 0.33973 0.0171
#> .Model_3 0.34199 0.0173
#> .Model_4 0.34470 0.0178
#> .Model_5 0.34635 0.0179
#> .Model_6 0.34689 0.0180

cv.svy(NSFG_data, c("LBW ~ ns(age, df = 1)",
                    "LBW ~ ns(age, df = 2)",
                    "LBW ~ ns(age, df = 3)",
                    "LBW ~ ns(age, df = 4)",
                    "LBW ~ ns(age, df = 5)",
                    "LBW ~ ns(age, df = 6)"),
       nfolds = 4,
       strataID = "strata", clusterID = "SECU",
       nest = TRUE, weightsID = "wgt",
       method = "logistic")
#>             mean     SE
#> .Model_1 0.34134 0.0175
#> .Model_2 0.34081 0.0174
#> .Model_3 0.34073 0.0174
#> .Model_4 0.34242 0.0174
#> .Model_5 0.34441 0.0176
#> .Model_6 0.34539 0.0176

These two examples are carrying out the same CV over the same data and same models; they have slightly different results only because of randomness in forming the CV folds for each example.

In these examples, model 2 or model 3 has the lowest loss, though any differences between models are well within one SE.

Other models with folds.svy and folds.svydesign

The function folds.svy() generates design-based fold IDs for K-fold CV, using any specified strata and clusters.
(Briefly: For a stratified sample, each fold will contain data from each stratum. For a cluster sample, a given cluster’s rows will all be assigned to the same fold. See our Stat paper for details.)

Using these fold IDs, you can write your own CV loop for models that our packages does not currently handle. These might be other models from the survey package (Poisson regression, Cox proportional hazards model, etc.) or from other design-based modeling packages such as rpms, mase, or others in the CRAN Task View on Official Statistics & Survey Statistics.

Here is an example of tuning the bin size parameter for a design-based random forest, using the rpms_forest() function from the rpms package. (I have also set the forest size to 50 trees instead of the default 500, purely in order to make the example run faster.)

Note that while folds.svy() accounts for the clustering in this survey design, we need to do a bit of extra work to ensure that the model training and testing steps also account for the survey design. In this particular example, we must pass the cluster IDs and survey weights to rpms_forest() for design-consistent model-fitting, and we must use the survey weights in the MSE calculations.

# Based on example("rpms"):
#   model the mean of retirement account value `IRAX` among households with 
#   reported retirement account values > 0,
#   predicted from householder education, age, and urban/rural location
library(rpms)
data(CE)

# Generate fold IDs that account for clustering in the survey design
# for the IRAX>0 subset of the CE dataset
nfolds <- 5
CEsubset <- CE[which(CE$IRAX > 0), ]
CEsubset$.foldID <- folds.svy(CEsubset, nfolds = nfolds, clusterID = "CID")

# Use CV to tune the bin_size parameter of rpms_forest()
bin_sizes <- c(10, 20, 50, 100, 250, 500)

# Create placeholder for weighted Sums of Squared Errors
SSEs <- rep(0, length(bin_sizes))

for(ff in 1:nfolds) { # For every fold...
  # Use .foldID to split data into training and testing sets
  train <- subset(CEsubset, .foldID != ff)
  test  <- subset(CEsubset, .foldID == ff)

  for(bb in 1:length(bin_sizes)) { # For every value of the tuning parameter...
    # Fit a new model
    rf <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN, 
                      data = train,
                      weights = ~FINLWT21, clusters = ~CID,
                      bin_size = bin_sizes[bb], f_size = 50)
    # Get predictions and squared errors
    yhat <- predict(rf, newdata = test)
    res2 <- (yhat - test$IRAX)^2
    # Sum up weighted SSEs, not MSEs yet,
    # b/c cluster sizes may differ across folds and b/c of survey weights
    SSEs[bb] <- SSEs[bb] + sum(res2 * test$FINLWT21)
  }
}

# Divide entire weighted sum by the sum of weights
MSEs <- SSEs / sum(CEsubset$FINLWT21)

# Show results
cbind(bin_sizes, MSEs)
#>      bin_sizes         MSEs
#> [1,]        10 204246617270
#> [2,]        20 202870633392
#> [3,]        50 201393921358
#> [4,]       100 201085838446
#> [5,]       250 201825549231
#> [6,]       500 204155844501

Bin size 100 had the lowest survey-weighted CV MSE estimate, although sizes 50 and 250 were quite similar.

Using this chosen tuning parameter value, the next step could be to fit a random forest with bin size 100 on the full CEsubset dataset.

In this case, rpms_forest() does not work with the survey package so there was no need to create a svydesign object for this analysis. But if we were working with a different model that expects us to create a svydesign object from CEsubset, we could have used folds.svydesign() instead, which would read out the cluster variable from the svydesign object automatically:

CE.svydes <- svydesign(id = ~CID, weights = ~FINLWT21, data = CEsubset)
# Use update() to add variables to a svydesign object
CE.svydes <- update(CE.svydes,
                    .foldID = folds.svydesign(CE.svydes, nfolds = nfolds))
# Now the fold IDs are available as CE.svydes$variables$.foldID
table(CE.svydes$variables$.foldID)
#>   1   2   3   4   5 
#> 813 923 928 885 960