This document presents cumulative distribution function (CDF) analysis of a GRTS survey design. The resource used in the analysis is small lakes in Florida, which is a finite resource. Lake basins are used to delineate basins among the lakes in the resource. The analysis will include calculation of CDF estimates and testing for difference between CDFs from subpopulations of the resource.

The initial step is to use the library function to load the `spsurvey`

package. After the package is loaded, a message is printed to the R console indicating that the `spsurvey`

package was loaded successfully.

Load the `spsurvey`

package:

The original Florida small lakes data file contains more than 3,800 records and 29 basins. To produce a more manageable number of records, only six basins were retained in the data that will be analyzed, which produced a file containing 930 records.

The next step is to load the data set, which includes both survey design variables and analytical variables. The data function is used to load the data set and assign it to a data frame named `FL_lakes`

. The `nrow`

function is used to determine the number of rows in the `FL_lakes`

data frame, and the resulting value is assigned to an object named `nr`

. Finally, the initial six lines and the final six lines in the `FL_lake`

s data frame are printed using the `head`

and `tail`

functions, respectively.

Load the survey design and analytical variables data set:

Display the initial six lines in the data file:

```
head(FL_lakes)
#> siteID xcoord ycoord wgt Basin Status TNT
#> 1 FLW03414-0014 8635535 12860896 5.369048 NWFWMD-1 Sampled Target
#> 2 FLW03414-0046 8636136 12886783 5.369048 NWFWMD-1 Physical_Barrier Target
#> 3 FLW03414-0062 8617834 12869126 5.369048 NWFWMD-1 NonTarget NonTarget
#> 4 FLW03414-0078 8673500 12883071 5.369048 NWFWMD-1 Physical_Barrier Target
#> 5 FLW03414-0086 8631884 12816428 5.369048 NWFWMD-1 NonTarget NonTarget
#> 6 FLW03414-0118 8607699 12856644 5.369048 NWFWMD-1 NonTarget NonTarget
#> pH_Cat Coliform_Cat Oxygen Turbidity
#> 1 (0,6] (0,5] 9.9 0.4
#> 2 <NA> <NA> NA NA
#> 3 <NA> <NA> NA NA
#> 4 <NA> <NA> NA NA
#> 5 <NA> <NA> NA NA
#> 6 <NA> <NA> NA NA
```

Display the final six lines in the data file:

```
tail(FL_lakes)
#> siteID xcoord ycoord wgt Basin Status TNT
#> 925 FLW03414-3878 8880656 12694963 4.80791 SWFWMD-4 Dry Target
#> 926 FLW03414-3886 8892406 12732977 4.80791 SWFWMD-4 Sampled Target
#> 927 FLW03414-3894 8836528 12723056 4.80791 SWFWMD-4 Dry Target
#> 928 FLW03414-3918 8923107 12725502 4.80791 SWFWMD-4 Landowner_Denial Target
#> 929 FLW03414-3926 8861298 12715824 4.80791 SWFWMD-4 Dry Target
#> 930 FLW03414-3950 8888601 12715641 4.80791 SWFWMD-4 NonTarget NonTarget
#> pH_Cat Coliform_Cat Oxygen Turbidity
#> 925 <NA> <NA> NA NA
#> 926 (6,8] (5,50] 1.98 8.2
#> 927 <NA> <NA> NA NA
#> 928 <NA> <NA> NA NA
#> 929 <NA> <NA> NA NA
#> 930 <NA> <NA> NA NA
```

The sample of small lakes in Florida is displayed below. The sample sites for each basin are displayed using a unique color.

CDF analysis will be investigated by examining the dissolved oxygen variable. The `summary`

function is used to summarize the data structure of the dissolved oxygen values.

```
cat("\nSummarize the data structure of the dissolved oxygen variable:\n")
#>
#> Summarize the data structure of the dissolved oxygen variable:
summary(FL_lakes$Oxygen)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.830 4.880 6.870 6.468 8.310 12.480 759
```

Note that there is an extensive number of missing (NA) values. The `cont.analysis`

function will be used to calculate CDF estimates. Four data frames constitute the primary input to the cont.analysis function. The first column (variable) in the four data frames provides the unique identifier (site ID) for each sample site and is used to connect records among the data frames. The siteID variable in the `FL_lakes`

data frame is assigned to the siteID variable in the data frames. The four data frames that will be created are named as follows: `sites`

, `subpop`

, `design`

, and `data.cont`

. The `sites`

data frame identifies sites to use in the analysis and contains two variables: (1) siteID - site ID values and (2) Use - a logical vector indicating which sites to use in the analysis. The Use logical variables in `sites`

is set equal to the value “Sampled” of the status variable, so that only sampled sites are used in the analysis. The `subpop`

data frame defines populations and, optionally, subpopulations for which estimates are desired. Unlike the `sites`

and `design`

data frames, the `subpop`

data frame can contain an arbitrary number of columns. The first variable in the `subpop`

data frame identifies site ID values and each subsequent variable identifies a type of population, where the variable name is used to identify type. A type variable identifies each site with a character value. If the number of unique values for a type variable is greater than one, then the set of values represent subpopulations of that type. When a type variable consists of a single unique value, then the type does not contain subpopulations. For this analysis, the `subpop`

data frame contains two variables: (1) siteID - site ID values and (2) Basin - which will be used to calculate estimates for each basin individually. The basin variable in the `FL_lakes`

data frame is assigned to the Basin variable in the `subpop`

data frame. The `design`

data frame consists of survey design variables. For the analysis under consideration, the `design`

data frame contains the following variables: (1) siteID - site ID values; (2) wgt - final, adjusted, survey design weights; (3) xcoord - x-coordinates for location; and (4) ycoord - y-coordinates for location. The wgt, xcoord, and ycoord variables in the `design`

data frame are assigned values using variables with the same names in the `FL_lakes`

data frame. Like the `subpop`

data frame, the data.cont data frame can contain an arbitrary number of columns. The first variable in the `data.cont`

data frame identifies site ID values and each subsequent variable identifies a response variable. For this analysis, the response variable is DissolvedOxygen, which is assigned variable oxygen in the `FL_lakes`

data frame.

Create the `sites`

data frame, which identifies sites to use in the analysis:

Create the subpop data frame:

Create the design data frame:

```
design <- data.frame(siteID=FL_lakes$siteID,
wgt=FL_lakes$wgt,
xcoord=FL_lakes$xcoord,
ycoord=FL_lakes$ycoord)
```

Create the data.cont data frame:

The frame is a data structure containing spatial location data in addition to other attributes regarding a resource of interest and is used to create a survey design. A frame often takes the form of a shapefile. The frame can be used to obtain size values (e.g., number of lakes) for the populations and subpopulations examined in an analysis. The popsize (population size) argument to `cont.analysis`

provides a mechanism for forcing the size estimates of the number of lakes for each basin to sum to a desired value, i.e., the known number of lakes in each basin. As a first step for use of the popsize argument, the `combine`

function is used to create a named vector of frame size values for each basin. Output from the `combine`

function is assigned to an object named framesize. The `popsize`

argument is a list, which is a particular type of R object. The `popsize`

list must include an entry for each population type included in the `subpop`

data frame, i.e., Basin for this analysis. Recall that the Basin population type contains subpopulations, i.e., basins. When a population type contains subpopulations, the entry in the `popsize`

list also is a list. The `as.list`

function is applied to framesize, and the result is assigned to the Basin entry in the `popsize`

list. The `popsize`

argument is included in the call to `cont.analysis`

.

Assign frame size values:

```
framesize <- c("NWFWMD-1"=451, "NWFWMD-2"=394, "SFWMD-9"=834, "SJRWMD-1"=1216,
"SRWMD-1"=1400, "SWFWMD-4"=851)
```

Use the `cont.analysis`

function to calculate CDF estimates for the quantitative variables:

```
CDF_Estimates <- cont.analysis(sites, subpop, design, data.cont,
popsize=list(Basin=as.list(framesize)))
```

The object produced by `cont.analysis`

is a list containing two objects: (1) `CDF`

, a data frame containing the CDF estimates and (2) `Pct`

, a data frame containing percentile estimates plus estimates of population values for mean, variance, and standard deviation. The `CDF`

data frame contains thirteen columns. The first five columns identify the population (Type), subpopulation (Subpopulation), response variable (Indicator), values at which the CDF was estimated (Value), and number of response variable values for each value at which the `CDF`

was estimated(NResp). The next four columns in the `CDF`

data frame provide results for the proportion (percent scale) estimates, i.e., the CDF expressed as percentage values: the proportion estimate (Estimate.P), standard error of the estimate (StdError.P), lower confidence bound (LCB95Pct.P), and upper confidence bound (UCB95Pct.P). Argument conf for `cont.analysis`

allows control of the confidence bound level. The default value for conf is 95, hence the column names for confidence bounds contain the value 95. Supplying a different value to the conf argument will be reflected in the confidence bound names. Confidence bounds are obtained using the standard error and the Normal distribution multiplier corresponding to the confidence level. The final four columns in the data frame provide results for the size (units scale) estimates, i.e., the CDF expressed in terms of number of lakes: the size estimate (Estimate.U), standard error of the estimate (StdError.U), lower confidence bound (LCB95Pct.U), and upper confidence bound (UCB95Pct.U).

Unlike the `CDF`

data frame, the `Pct`

data frame contains only nine columns since there is a single set of estimates rather than two sets of estimates. In addition, the fourth column is labeled Statistic and identifies either a percentile or the mean, variance, or standard deviation estimate. Finally, since percentile estimates are obtained by inverting the CDF estimate, the percentile estimates do not have a standard error value associated with them.

Use the `write.csv`

function to write the CDF estimates as a csv file:

The `cont.cdfplot`

function in `spsurvey`

can be used to produce a PDF file containing plots of the CDF estimates. The primary arguments to `cont.cdfplot`

are a character string containing a name for the PDF file and the CDF data frame in the `CDF_Estimates`

object.

Produce a PDF file containing plots of the CDF estimates:

The `cont.cdftest`

function in `spsurvey`

can be used to test for statistical difference between the CDFs from subpopulations. For this analysis we will test for statistical difference between the CDFs from the six basins. The `cont.cdftest`

function will test all possible pairs of basins. Arguments to `cont.cdftest`

are the same as arguments to `cont.analysis`

.

The `print`

function is used to display results for dissolved oxygen of the statistical tests for difference between CDFs for basins. The object produced by `cont.cdftest`

is a data frame containing eight columns. The first column (Type) identifies the population. The second and third columns (Subpopulation_1 and Subpopulation_2) identify the subpopulations. The fourth column (Indicator) identifies the response variable. Column five contains values of the test statistic. Six test statistics are available, and the default statistic is an F-distribution version of the Wald statistic, which is identified in the data frame as “Wald_F”. The default statistic is used in this analysis. For further information about the test statistics see the Appendix. Columns six and seven (Degrees_of_Freedom_1 and Degrees_of_Freedom_2) provide the numerator and denominator degrees of freedom for the Wald test. The final column (p_Value) provides the p-value for the test.

```
print(CDF_Tests, digits=3)
#> Type Subpopulation_1 Subpopulation_2 Indicator Wald_F
#> 1 Basin NWFWMD-1 NWFWMD-2 DissolvedOxygen 3.144
#> 2 Basin NWFWMD-1 SFWMD-9 DissolvedOxygen 4.480
#> 3 Basin NWFWMD-1 SJRWMD-1 DissolvedOxygen 20.292
#> 4 Basin NWFWMD-1 SRWMD-1 DissolvedOxygen 0.305
#> 5 Basin NWFWMD-1 SWFWMD-4 DissolvedOxygen 10.668
#> 6 Basin NWFWMD-2 SFWMD-9 DissolvedOxygen 2.609
#> 7 Basin NWFWMD-2 SJRWMD-1 DissolvedOxygen 6.161
#> 8 Basin NWFWMD-2 SRWMD-1 DissolvedOxygen 2.819
#> 9 Basin NWFWMD-2 SWFWMD-4 DissolvedOxygen 3.822
#> 10 Basin SFWMD-9 SJRWMD-1 DissolvedOxygen 12.760
#> 11 Basin SFWMD-9 SRWMD-1 DissolvedOxygen 6.088
#> 12 Basin SFWMD-9 SWFWMD-4 DissolvedOxygen 14.118
#> 13 Basin SJRWMD-1 SRWMD-1 DissolvedOxygen 16.973
#> 14 Basin SJRWMD-1 SWFWMD-4 DissolvedOxygen 5.237
#> 15 Basin SRWMD-1 SWFWMD-4 DissolvedOxygen 6.409
#> Degrees_of_Freedom_1 Degrees_of_Freedom_2 p_Value
#> 1 2 55 5.09e-02
#> 2 2 57 1.56e-02
#> 3 2 57 2.21e-07
#> 4 2 54 7.39e-01
#> 5 2 51 1.35e-04
#> 6 2 55 8.27e-02
#> 7 2 55 3.85e-03
#> 8 2 52 6.88e-02
#> 9 2 49 2.87e-02
#> 10 2 57 2.63e-05
#> 11 2 54 4.13e-03
#> 12 2 51 1.32e-05
#> 13 2 54 1.91e-06
#> 14 2 51 8.54e-03
#> 15 2 48 3.41e-03
```

Since there is a fairly large number of CDFs being compared (15), it is reasonable to use a conservative p-value for assessing whether a pair of CDFs is significantly differenct. We will use a p-value no larger than 0.01 to test for significant difference. Using that value, basins with significanly different CDFs are displayed in the figure below using “X” to indicate CDFs that are different. Note that only unique pairs of basins are displayed in the figure, which is reflected in the figure by the red line. Significantly different CDFs for pairs of basins located to the right of the red line are displayed in the figure. Nine pairs of CDFs are different: (1) basin NWFWMD-1 is different from basins SJRWMD-1 and SWFWMD-4; (2) basin NWFWMD-2 is different from basin SJRWMD-1; (3) SFWMD-9 is different from basins SJRWMD-1, SRWMD-1, and SWFWMD-4; (4) SJRWMD-1 is different from basins SRWMD-1 and SWFWMD-4; and (5) SRWMD-1 is different from basin SWFWMD-4. The majority of the significantly different CDFs are attributable to two basins. The CDF for basin SJRWMD-1 is different from the CDFs for all of the other basins. The CDF for basin SWFWMD-4 is different from the CDFs for all of the other basins except basin NWFWMD-2. The only other pair of basins with different CDFs are basins SFWMD-9 and SRWMD-1.

```
n1 <- length(levels(CDF_Tests$Subpopulation_1))
n2 <- length(levels(CDF_Tests$Subpopulation_2))
plot(1:n2, 1:n1, type="n", xlab="Second Basin", ylab="First Basin", xaxt="n",
yaxt="n")
count=1
for(i in 1:n1) {
for(j in i:n2) {
text(j, i, ifelse(CDF_Tests$p_Value[count] < 0.01, "X", " "))
count <- count+1
}
}
axis(side=1, at=1:n2, labels=levels(CDF_Tests$Subpopulation_2), cex.axis=0.75)
axis(side=2, at=1:n1, labels=levels(CDF_Tests$Subpopulation_1), cex.axis=0.75)
title("Significantly Different CDFs")
abline(1, 1, col="red", lwd=2)
```

Use the `write.csv`

function to write CDF test results as a csv file:

The inferential procedure employed to test for differences between CDFs partitions the two CDFs that are being tested into a discrete set of intervals (classes) and then utilizes procedures that have been developed for analysis of categorical data from probability surveys (Kincaid 2000). The `cdf.est`

function, which is called by the `cont.cdftest`

function, calculates the Wald, Rao-Scott first order corrected (mean eigenvalue corrected), and Rao-Scott second order corrected (Satterthwaite corrected) test statistics (Rao and Scott 1981; Wald 1943). Both standard versions of the three statistics, which are distributed as Chi-squared random variables (Rao and Thomas 1988), and alternate versions of the statistics, which are distributed as F random variables (Thomas, Roberts, and Singh 1996), are calculated by the `cdf.test`

function. The `cont.cdftest`

function includes an argument (testname) that identifies the test statistic that will be reported in the output object created by the function. The Horvitz-Thompson ratio estimator, i.e., the ratio of two Horvitz-Thompson estimators (Horvitz and Thompson 1952), is used by `cdf.test`

to calculate estimates of the class proportions for the CDFs. Variance estimates for the test statistics are calculated using either the local mean variance estimator (Stevens and Olsen 2003) or the simple random sampling (SRS) variance estimator. The choice of variance estimator is subject to user control. The SRS variance estimator uses the independent random sample approximation to calculate joint inclusion probabilities. For additional discussion of CDF estimation and hypothesis testing in the context of natural resource monitoring programs see Kincaid and Olsen (Kincaid and Olsen 2012).

Horvitz, D. G., and D. J. Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” *Journal of the American Statistical Association* 47: 663–85.

Kincaid, Thomas M. 2000. “Testing for Differences Between Cumulative Distribution Functions from Complex Environmental Sampling Surveys.” In *Proceeding of the Section on Statistics and the Environment*. Alexandria, VA: American Statistical Association.

Kincaid, Thomas M., and Anthony R. Olsen. 2012. “Survey Analysis in Natural Resource Monitoring Programs with a Focus on Cumulative Distribution Functions.” In *Design and Analysis of Long-Term Ecological Monitoring Studies*, edited by Robert A. Gitzen, Joshua J. Millspaugh, Andrew B. Cooper, and Daniel S. Licht, 313–24. Cambridge University Press.

Rao, J. N. K., and A. J. Scott. 1981. “The Analysis of Categorical Data from Complex Sample Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables.” *Journal of the American Statistical Association* 76: 221–30.

Rao, J. N. K., and D. Roland Thomas. 1988. “The Analysis of Cross-Classified Categorical Data from Complex Sample Surveys.” *Sociological Methodology* 18: 213–69.

Stevens, D. L., and A. R. Olsen. 2003. “Variance Estimation for Spatially Balanced Samples of Environmental Resources.” *Environmetrics* 14: 593–610.

Thomas, D. Roland, G. R. Roberts, and A. C. Singh. 1996. “Tests of Independence on Two-Way Tables Under Cluster Sampling.” *International Statistical Review* 64: 295–311.

Wald, A. 1943. “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations Is Large.” *Transactions of the American Mathematical Society* 54: 426–82.