Tidying Data for Analysis

The bwsTools package requires data to be in a specified format—but your data might not look like this when you first get it. For more on this format, see the paper introducing this package as well as the documentation for what specific functions require.

This vignette looks at two ways best-worst scaling data may be formatted initially and shows how to use the tidyverse packages dplyr and tidyr to get your data into the format that is required for the bwsTools functions.

First, we load the packages we need:

library(bwsTools)
library(dplyr)
library(tidyr)

For simplicity’s sake, imagine a simple scenario where three survey respondents respond to a best-worst scaling design that has four items that appear over four trials of three options. The data appear in “wide” format, where each row is a participant (pid being the unique identifier). The column names follow a standardized format: qX_Z, where X denotes what trial it was (one through four) and Z indicates the item shown (the first, second, or third item). Values are 2 if it was chosen as best, 1 if it was chosen as worst, and NA if it was not chosen. I find that this is how data come down from most common platforms, such as Qualtrics. Example data may look like:

dat1
#> # A tibble: 3 x 13
#>     pid  q1_1  q1_2  q1_3  q2_1  q2_2  q2_3  q3_1  q3_2  q3_3  q4_1  q4_2  q4_3
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     1     1     2    NA    NA     2     1    NA     2     1
#> 2     2     2    NA     1     1    NA     2     1     2    NA     2     1    NA
#> 3     3     2     1    NA    NA     1     2     2     1    NA     2     1    NA

We can see that, in the first trial, respondents 1 and 2 both chose the third option as worst and first option as best In the fourth trial, respondents 2 and 3 both chose the second option as worst and the first option as best. But what do these refer to?

I like to have a data.frame that I call key. This tells me what each option in each trial refers to.

key
#> # A tibble: 12 x 2
#>    q     label        
#>    <chr> <chr>        
#>  1 q1_1  Steak N Shake
#>  2 q1_2  Shake Shack  
#>  3 q1_3  Whataburger  
#>  4 q2_1  Culvers      
#>  5 q2_2  Shake Shack  
#>  6 q2_3  Whataburger  
#>  7 q3_1  Steak N Shake
#>  8 q3_2  Culvers      
#>  9 q3_3  Shake Shack  
#> 10 q4_1  Steak N Shake
#> 11 q4_2  Culvers      
#> 12 q4_3  Whataburger

This allows us to interpret the above data better: respondents 1 and 2 both chose Whataburger as worst and Steak N Shake as best in the first trial.

I assume some knowledge of using the tidyverse, but I suggest checking out R for Data Science for a gentle introduction to it if you are not.

The following pipe chain shows how to take the data.frame above and format it how the package would like for calculating individual-level scores. See comments for additional explanation.

dat1_i <- dat1 %>% 
  # 1. collect all of the non-pid columns, where variable names are filled into
  #    the column q, and the values are in column resp
  gather("q", "resp", -pid) %>%
  mutate(
    resp = case_when(  # 2. recode resp such that
      resp == 2 ~ 1,   #    if resp is 2, make it 1
      resp == 1 ~ -1,  #    if resp is 1, make it -1
      is.na(resp) ~ 0  #    if resp is NA, make it zero
    )
  ) %>% 
  # 3. join with the key data.frame by the column q
  left_join(key, by = "q") %>% 
  # 4. separate the q column into the block number and the item number
  #    by the underscore
  separate(q, c("block", "temp"), sep = "_") %>% 
  # 5. unselect the item number, since it is no longer needed
  #    as you have the item name now
  select(-temp)

This now follows the tidy format that bwsTools requires. One column indicates the unique identifier for the respondent, another the trial (or block) that the choices were presented in, the response (as 1 if best, -1 if worst, and 0 as unselected), and the name of the item.

dat1_i
#> # A tibble: 36 x 4
#>      pid block  resp label        
#>    <int> <chr> <dbl> <chr>        
#>  1     1 q1        1 Steak N Shake
#>  2     2 q1        1 Steak N Shake
#>  3     3 q1        1 Steak N Shake
#>  4     1 q1        0 Shake Shack  
#>  5     2 q1        0 Shake Shack  
#>  6     3 q1       -1 Shake Shack  
#>  7     1 q1       -1 Whataburger  
#>  8     2 q1       -1 Whataburger  
#>  9     3 q1        0 Whataburger  
#> 10     1 q2       -1 Culvers      
#> # … with 26 more rows

And the following code demonstrates how to do so if one wanted to calculate aggregate scores. Note that it starts with the individual-level data, dat1_i.

dat1_a <- dat1_i %>% 
  # 1. group by the label
  group_by(label) %>% 
  # 2. summarise such that...
  summarise(
    total = n(),              # n() shows number of times the item appeared
    best = sum(resp == 1),    # sum up the number of times it was selected best
    worst = sum(resp == -1),  # and sum up the times it was selected as worst
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

And you can see below that these now run without any errors.

res1_i <- e_bayescoring(dat1_i, "pid", "block", "label", "resp")
head(res1_i)
#> # A tibble: 6 x 3
#>     pid label         b_ebayes
#>   <int> <chr>            <dbl>
#> 1     1 Culvers          0.111
#> 2     1 Shake Shack     -0.223
#> 3     1 Steak N Shake    0.956
#> 4     1 Whataburger     -0.821
#> 5     2 Culvers         -0.571
#> 6     2 Shake Shack     -0.223

By default, it gets returned in the tidy format. This makes it perfect for plotting. But let’s say you wanted to join it back to your original data, because you want to correlate scores with, for example, age. You could specify the format to return as wide and then join back to the original data, dat1.

dat1 <- e_bayescoring(dat1_i, "pid", "block", "label", "resp", wide = TRUE) %>% 
  left_join(dat1, by = "pid")
head(dat1)
#> # A tibble: 3 x 17
#>     pid Culvers `Shake Shack` `Steak N Shake` Whataburger  q1_1  q1_2  q1_3
#>   <int>   <dbl>         <dbl>           <dbl>       <dbl> <dbl> <dbl> <dbl>
#> 1     1   0.111        -0.223           0.956      -0.821     2    NA     1
#> 2     2  -0.571        -0.223           0.956      -0.111     2    NA     1
#> 3     3  -0.956        -0.956           1.65        0.223     2     1    NA
#> # … with 9 more variables: q2_1 <dbl>, q2_2 <dbl>, q2_3 <dbl>, q3_1 <dbl>,
#> #   q3_2 <dbl>, q3_3 <dbl>, q4_1 <dbl>, q4_2 <dbl>, q4_3 <dbl>

That key above assumes that everyone is seeing the same design. What if they aren’t? Data in this situation might look like it does below. There are now two columns for each decision: one indicating if it was selected best, worst, or unselected (qX_iZ_y below), and one indicating what the item read (qX_iZ_t) where X again represents what trial (i.e., block) the respondent was on and the Z represents the item number. Then _y indicates it is the response and _t indicates it is the label. These data may look like:

dat2
#> # A tibble: 3 x 25
#>     pid q1_i1_y q1_i1_t q1_i2_y q1_i2_t q1_i3_y q1_i3_t q2_i1_y q2_i1_t q2_i2_y
#>   <int>   <dbl> <chr>     <dbl> <chr>     <dbl> <chr>     <dbl> <chr>     <dbl>
#> 1     1       2 Steak …      NA Shake …       1 Whatab…       1 Culvers       2
#> 2     2       2 Steak …      NA Shake …       1 Whatab…       1 Culvers      NA
#> 3     3       2 Steak …       1 Shake …      NA Whatab…      NA Culvers       1
#> # … with 15 more variables: q2_i2_t <chr>, q2_i3_y <dbl>, q2_i3_t <chr>,
#> #   q3_i1_y <dbl>, q3_i1_t <chr>, q3_i2_y <dbl>, q3_i2_t <chr>, q3_i3_y <dbl>,
#> #   q3_i3_t <chr>, q4_i1_y <dbl>, q4_i1_t <chr>, q4_i2_y <dbl>, q4_i2_t <chr>,
#> #   q4_i3_y <dbl>, q4_i3_t <chr>

The tidying procedure is similar to that above. See comments below.

dat2_i <- dat2 %>% 
  # 1. collect all of the non-pid columns, where the column name is now called
  #    temp, and the values in those columns are now all in the value column
  gather("temp", "value", -pid) %>% 
  # 2. break apart the temp column by the underscore, so now it indicates
  #    the block in the block column, the item number in the item column,
  #    and whether or not the value refers to the label (t) or response (y)
  #    in column t_or_y
  separate(temp, c("block", "item", "t_or_y"), sep = "_") %>% 
  # 3. spread out t_or_y, filling it with the values of value
  spread(t_or_y, value) %>% 
  # 4. recode answers, just like in the above example
  mutate(
    y = case_when(
      y == 2 ~ 1,
      y == 1 ~ -1,
      is.na(y) ~ 0
    )
  ) %>% 
  # 5. remove the item number column, as it is not needed anymore
  select(-item)

Now, the data are in the correct format for using the bwsTools individual scoring functions.

dat2_i
#> # A tibble: 36 x 4
#>      pid block t                 y
#>    <int> <chr> <chr>         <dbl>
#>  1     1 q1    Steak N Shake     1
#>  2     1 q1    Shake Shack       0
#>  3     1 q1    Whataburger      -1
#>  4     1 q2    Culvers          -1
#>  5     1 q2    Shake Shack       1
#>  6     1 q2    Whataburger       0
#>  7     1 q3    Steak N Shake     0
#>  8     1 q3    Culvers           1
#>  9     1 q3    Shake Shack      -1
#> 10     1 q4    Steak N Shake     0
#> # … with 26 more rows

Using these individual-level data to aggregate up follows the same procedure as above.