To understand data, one usually needs not only the data itself but also so-called meta data. Meta data usually includes variable labels, value labels, and codes that mark missing values.
While the data.frame class in R supports value labels to a certain degree through the factor class, its functionality is limited. Other data formats like .xlsx or .csv support no meta data at all. Commercial software like SPSS provides such functionality but cannot compete with the variety of data analysis tools that R provides.
eatGADS is an R package that was developed to bridge this gap. Its main purpose is to provide a data format in R specifically designed for storing meta data together with data in one place. To this end, it provides an S3 class called GADSdat. This vignette concentrates on how to import data into the GADSdat format and work with it in the R environment. In collaboration with the IQB Forschungsdatenzentrum (FDZ), the package can also be used to distribute data.
Note that eatGADS also supports the handling of large hierarchical data structures via relational databases. This functionality is explained in more detail in an additional vignette.
The package can be installed from GitHub. Note that older R versions had issues with installations from online repositories like GitHub. R versions > 3.6.0 should work without any issues.
devtools::install_github("beckerbenj/eatGADS")
# loading the package
library(eatGADS)
GADSdat format
R offers a variety of tools for importing data from all sorts of data formats. SPSS data (.sav files) can be imported directly into the GADSdat format, with haven used as a backend. Note that this is the easiest way to import data into the GADSdat format.
# importing an SPSS file
gads <- import_spss("path/example.sav")
All other file types should be imported into R first and then supplied as data.frames to import_raw. Below is a small selection of functions that import data as data.frames. For an extensive overview of importing functions using the package readr see also this book chapter, while the package readxl is explained in more detail on this homepage (https://readxl.tidyverse.org/). As these files are plain data files, meta data has to be supplied as separate data sheets.
Note that none of the data.frames can contain variables of the class factor, as this in itself constitutes meta data. If you use base R to import data, make sure to set the argument stringsAsFactors = FALSE. If necessary, convert factors to character via as.character.
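As a minimal sketch of the conversion mentioned above, the following base R snippet turns every factor column of a data.frame into a character column before the data is passed on. The data set `raw` and its columns are made up for illustration.

```r
# Hypothetical example data in which one column ended up as a factor
raw <- data.frame(ID = 1:3,
                  school = factor(c("A", "B", "A")),
                  stringsAsFactors = FALSE)

# Identify factor columns and convert them to character
is_fac <- vapply(raw, is.factor, logical(1))
raw[is_fac] <- lapply(raw[is_fac], as.character)

# Check the resulting column classes
vapply(raw, class, character(1))
```

Converting via as.character keeps the level labels as text, which is usually what is wanted here; the numeric level codes of a factor are not preserved.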
# importing text files
input_txt <- read.table("path/example.txt", stringsAsFactors = FALSE)
# importing German csv files (; separated)
input_csv <- read.csv2("path/example.csv", stringsAsFactors = FALSE)
# importing Excel files
input_xlsx <- readxl::read_excel("path/example.xlsx")
import_raw takes three separate data.frames as input: the actual data set (df), the variable labels (varLabels) and the value labels (valLabels). These three objects have to be supplied in a very specific format.
The varLabels object has to contain two variables: varName, which should exactly correspond to the variable names in df, and varLabel, which should contain the desired variable labels as strings. Note that this data.frame should contain as many rows as there are variables in df.
The optional valLabels object has to contain four variables: varName, which should exactly correspond to the variable names in df; value, which should correspond to the respective values in df and has to be a numeric vector (labels for character vectors are currently not supported); valLabel, which should contain the value labels as strings; and missings, a column indicating whether the value is a missing code. Valid values for missings are "valid" (no missing code) and "miss" (missing code). Note that this data.frame cannot contain any variable names that are not variables in df. However, not all variables in df have to occur in valLabels.
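The format requirements above can be checked before importing. The following is only a sketch in base R: the helper `check_label_frames` is hypothetical and not part of eatGADS, and it assumes the column names used in the example data of this vignette (varName, varLabel, value, valLabel, missings).

```r
# Hypothetical helper (not part of eatGADS): checks the label tables
# against the format requirements described in the text
check_label_frames <- function(df, varLabels, valLabels = NULL) {
  stopifnot(
    is.data.frame(df), is.data.frame(varLabels),
    all(c("varName", "varLabel") %in% names(varLabels)),
    # one row per variable in df, with matching names
    nrow(varLabels) == ncol(df),
    setequal(varLabels$varName, names(df))
  )
  if (!is.null(valLabels)) {
    stopifnot(
      all(c("varName", "value", "valLabel", "missings") %in% names(valLabels)),
      is.numeric(valLabels$value),
      all(valLabels$missings %in% c("valid", "miss")),
      # value labels may only refer to variables that exist in df
      all(valLabels$varName %in% names(df))
    )
  }
  invisible(TRUE)
}

# Made-up miniature inputs
d <- data.frame(ID = 1:2, sex = c(0, 1))
var_l <- data.frame(varName = c("ID", "sex"),
                    varLabel = c("Person Identifier", "Sex"),
                    stringsAsFactors = FALSE)
val_l <- data.frame(varName = "sex", value = c(0, 1),
                    valLabel = c("male", "female"),
                    missings = "valid", stringsAsFactors = FALSE)

check_label_frames(d, var_l, val_l)
```

The call stops with an error as soon as one requirement is violated, for example when valLabels refers to a variable that does not exist in df.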
# Example data set
df <- data.frame(ID = 1:4, sex = c(0, 0, 1, 1),
                 forename = c("Tim", "Bill", "Ann", "Chris"), stringsAsFactors = FALSE)
# Example variable labels
varLabels <- data.frame(varName = c("ID", "sex", "forename"),
                        varLabel = c("Person Identifier", "Sex as self reported",
                                     "first name as reported by teacher"),
                        stringsAsFactors = FALSE)
# Example value labels
valLabels <- data.frame(varName = rep("sex", 3),
                        value = c(0, 1, -99),
                        valLabel = c("male", "female", "missing - omission"),
                        missings = c("valid", "valid", "miss"), stringsAsFactors = FALSE)
df
#>   ID sex forename
#> 1  1   0      Tim
#> 2  2   0     Bill
#> 3  3   1      Ann
#> 4  4   1    Chris
varLabels
#>    varName                          varLabel
#> 1       ID                 Person Identifier
#> 2      sex              Sex as self reported
#> 3 forename first name as reported by teacher
valLabels
#>   varName value           valLabel missings
#> 1     sex     0               male    valid
#> 2     sex     1             female    valid
#> 3     sex   -99 missing - omission     miss
# import
gads <- import_raw(df = df, varLabels = varLabels, valLabels = valLabels)
GADSdat class
The resulting object is of the class GADSdat and contains a data sheet and a meta data sheet.
# Inspect resulting object
gads
#> $dat
#>   ID sex forename
#> 1  1   0      Tim
#> 2  2   0     Bill
#> 3  3   1      Ann
#> 4  4   1    Chris
#>
#> $labels
#>    varName                          varLabel format display_width labeled value
#> 1       ID                 Person Identifier   <NA>            NA      no    NA
#> 2      sex              Sex as self reported   <NA>            NA     yes   -99
#> 3      sex              Sex as self reported   <NA>            NA     yes     0
#> 4      sex              Sex as self reported   <NA>            NA     yes     1
#> 5 forename first name as reported by teacher   <NA>            NA      no    NA
#>             valLabel missings
#> 1               <NA>     <NA>
#> 2 missing - omission     miss
#> 3               male    valid
#> 4             female    valid
#> 5               <NA>     <NA>
#>
#> attr(,"class")
#> [1] "GADSdat" "list"
GADSdat objects
GADSdat objects can, for example, be saved as RDS files. This is also the preferred data format for distributing GADSdat objects to the FDZ.
# Save the GADSdat object as an RDS file
saveRDS(gads, "path/gads.RDS")
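For readers unfamiliar with RDS files, here is a minimal round trip in base R. A plain list stands in for the GADSdat object so that the snippet runs without eatGADS; the object and the temporary file path are made up for illustration.

```r
# Stand-in object; with eatGADS this would be the GADSdat object 'gads'
obj <- list(dat = data.frame(ID = 1:2),
            labels = data.frame(varName = "ID", stringsAsFactors = FALSE))

# Write the object to a temporary file and read it back
path <- tempfile(fileext = ".RDS")
saveRDS(obj, path)
restored <- readRDS(path)

# Compare the restored object with the original
all.equal(obj, restored)
```

Because an RDS file stores a single R object including its attributes, the class attribute of an object is kept when it is read back with readRDS.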
GADSdat objects in R
eatGADS provides convenient functions for extracting data and meta data from GADSdat objects. extractMeta is used to access the meta data for specific variables (or all variables, if no specific variable name is provided).
# Extract meta data for a specific variable and for all variables
extractMeta(gads, vars = c("sex"))
#>   varName             varLabel format display_width labeled value
#> 2     sex Sex as self reported   <NA>            NA     yes   -99
#> 3     sex Sex as self reported   <NA>            NA     yes     0
#> 4     sex Sex as self reported   <NA>            NA     yes     1
#>             valLabel missings
#> 2 missing - omission     miss
#> 3               male    valid
#> 4             female    valid
extractMeta(gads)
#>    varName                          varLabel format display_width labeled value
#> 1       ID                 Person Identifier   <NA>            NA      no    NA
#> 2      sex              Sex as self reported   <NA>            NA     yes   -99
#> 3      sex              Sex as self reported   <NA>            NA     yes     0
#> 4      sex              Sex as self reported   <NA>            NA     yes     1
#> 5 forename first name as reported by teacher   <NA>            NA      no    NA
#>             valLabel missings
#> 1               <NA>     <NA>
#> 2 missing - omission     miss
#> 3               male    valid
#> 4             female    valid
#> 5               <NA>     <NA>
extractData is used to extract data. Its arguments define the structure of the resulting data. If convertMiss = TRUE (the default), values that are listed as missing codes are recoded to NA. The convertLabels argument specifies how value labels are used. If set to "character", all labeled values are recoded to character; the same applies to "factor". If set to "numeric", the value labels are not applied.
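To make the two arguments more concrete, the following base R sketch mimics the described behavior on a single made-up vector. It is an illustration of the idea only and makes no claim about how eatGADS implements it internally; all object names are hypothetical.

```r
# Illustration only: made-up values and labels, not eatGADS internals
sex <- c(0, 1, -99, 1)
labels <- data.frame(value    = c(0, 1, -99),
                     valLabel = c("male", "female", "missing - omission"),
                     missings = c("valid", "valid", "miss"),
                     stringsAsFactors = FALSE)

# Step 1 (cf. convertMiss = TRUE): recode missing codes to NA
miss_codes <- labels$value[labels$missings == "miss"]
sex_num <- replace(sex, sex %in% miss_codes, NA)

# Step 2 (cf. convertLabels = "character"): replace values by their labels
sex_chr <- labels$valLabel[match(sex_num, labels$value)]

# cf. convertLabels = "factor": same labels, stored as a factor
sex_fac <- factor(sex_chr, levels = labels$valLabel[labels$missings == "valid"])

sex_num
sex_chr
```

Skipping step 2 corresponds to the "numeric" setting: the values stay as they are and only the missing codes are replaced.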
# Extract data without applying labels
dat1 <- extractData(gads, convertMiss = TRUE, convertLabels = "numeric")
dat1
#>   ID sex forename
#> 1  1   0      Tim
#> 2  2   0     Bill
#> 3  3   1      Ann
#> 4  4   1    Chris
# Extract data with value labels applied as character
dat2 <- extractData(gads, convertMiss = TRUE, convertLabels = "character")
dat2
#>   ID    sex forename
#> 1  1   male      Tim
#> 2  2   male     Bill
#> 3  3 female      Ann
#> 4  4 female    Chris
GADSdat objects
GADSdat objects can also be modified, although only a limited set of operations is supported. For smaller changes to the meta data and data, a number of convenience functions exist. These functions allow modifying variable labels (changeVarLabels), modifying variable names (changeVarNames) and recoding values (recodeGADS).
### wrapper functions
# Modify variable labels
gads2 <- changeVarLabels(gads, varName = c("ID"), varLabel = c("Test taker ID"))
extractMeta(gads2, vars = "ID")
#>   varName      varLabel format display_width labeled value valLabel missings
#> 1      ID Test taker ID   <NA>            NA      no    NA     <NA>     <NA>
# Modify variable name
gads3 <- changeVarNames(gads, oldNames = c("ID"), newNames = c("idstud"))
extractMeta(gads3, vars = "idstud")
#>   varName          varLabel format display_width labeled value valLabel
#> 1  idstud Person Identifier   <NA>            NA      no    NA     <NA>
#>   missings
#> 1     <NA>
extractData(gads3)
#>   idstud    sex forename
#> 1      1   male      Tim
#> 2      2   male     Bill
#> 3      3 female      Ann
#> 4      4 female    Chris
# Recode values
gads4 <- recodeGADS(gads, varName = "sex", oldValues = c(0, 1, -99), newValues = c(1, 2, 99))
extractMeta(gads4, vars = "sex")
#>   varName             varLabel format display_width labeled value
#> 2     sex Sex as self reported   <NA>            NA     yes     1
#> 3     sex Sex as self reported   <NA>            NA     yes     2
#> 4     sex Sex as self reported   <NA>            NA     yes    99
#>             valLabel missings
#> 2               male    valid
#> 3             female    valid
#> 4 missing - omission     miss
extractData(gads4, convertLabels = "numeric")
#>   ID sex forename
#> 1  1   1      Tim
#> 2  2   1     Bill
#> 3  3   2      Ann
#> 4  4   2    Chris
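The recoding step can be pictured as a lookup that is applied to the data column and to the value column of the meta data alike, so that labels keep pointing at the right values. The base R sketch below illustrates this idea with made-up vectors; the function `recode` is hypothetical and not the eatGADS implementation.

```r
# Illustration only: a recoding rule like the one used in the example
old_values <- c(0, 1, -99)
new_values <- c(1, 2, 99)

sex <- c(0, 0, 1, 1)        # data column
meta_values <- c(-99, 0, 1) # labeled values in the meta data

# Look up each value's position in old_values and take the matching
# new value; values without a recoding rule are left unchanged
recode <- function(x) {
  idx <- match(x, old_values)
  ifelse(is.na(idx), x, new_values[idx])
}

recode(sex)
recode(meta_values)
```

Applying the same lookup to both vectors is what keeps data and meta data consistent after recoding.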
For simultaneous changes to multiple variables, a set of functions is implemented that extracts a change table and applies the changes written into it. For an easier workflow, the change table can also be saved as an Excel file, modified in Excel and imported back into R. See the help pages of the respective functions for more details.
# extract changeTable
varChanges <- getChangeMeta(gads, level = "variable")
# modify changeTable
varChanges[varChanges$varName == "ID", "varLabel_new"] <- "Test taker ID"
# Apply changes
gads5 <- applyChangeMeta(varChanges, gads)
extractMeta(gads5, vars = "ID")
#>   varName      varLabel format display_width labeled value valLabel missings
#> 1      ID Test taker ID   <NA>            NA      no    NA     <NA>     <NA>
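The change table idea can be sketched in base R with a hand-built table. Everything below is hypothetical and only mirrors the pattern visible in the example (an existing column paired with a `_new` column); it does not reproduce the actual structure returned by getChangeMeta. A CSV file stands in for the Excel file so that no additional package is needed.

```r
# Illustration only: made-up meta data and a hand-built change table
meta <- data.frame(varName = c("ID", "sex"),
                   varLabel = c("Person Identifier", "Sex as self reported"),
                   stringsAsFactors = FALSE)

# Pair each current entry with an initially empty '_new' column
changes <- data.frame(varName = meta$varName,
                      varLabel = meta$varLabel,
                      varLabel_new = NA_character_,
                      stringsAsFactors = FALSE)

# Optional round trip through a file, here CSV instead of Excel
path <- tempfile(fileext = ".csv")
write.csv(changes, path, row.names = FALSE)
changes <- read.csv(path, stringsAsFactors = FALSE, colClasses = "character")

# Enter the desired change
changes[changes$varName == "ID", "varLabel_new"] <- "Test taker ID"

# Apply: overwrite only those entries for which a new value was entered
to_change <- !is.na(changes$varLabel_new)
meta$varLabel[to_change] <- changes$varLabel_new[to_change]

meta
```

Keeping the old and the new value side by side makes the table easy to review before the changes are applied.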
Objects of the class GADSdat can also be exported into the SPSS format, utilizing haven. Note that this function is slightly experimental and problems with specific character strings might occur.
write_spss(gads, "path/example_out.sav")
If the haven format is preferred for working in R, a GADSdat object can also be transformed to its equivalent tibble format, as if the data had been imported from SPSS via haven.
haven_dat <- export_tibble(gads)
haven_dat
#> # A tibble: 4 x 3
#>      ID        sex forename
#>   <int>  <dbl+lbl> <chr>
#> 1     1 0 [male]   Tim
#> 2     2 0 [male]   Bill
#> 3     3 1 [female] Ann
#> 4     4 1 [female] Chris
lapply(haven_dat, attributes)
#> $ID
#> $ID$label
#> [1] "Person Identifier"
#>
#>
#> $sex
#> $sex$label
#> [1] "Sex as self reported"
#>
#> $sex$na_values
#> [1] -99
#>
#> $sex$class
#> [1] "haven_labelled_spss" "haven_labelled"
#>
#> $sex$labels
#> missing - omission               male             female
#>                -99                  0                  1
#>
#>
#> $forename
#> $forename$label
#> [1] "first name as reported by teacher"