Duplication analysis

library(scrutiny)

You can use scrutiny to analyze duplicate values in data. Checking for duplications can go a long way in assessing the reliability of published research.

This vignette walks you through scrutiny’s tools for finding, counting, and summarizing duplications. It uses the pigs4 dataset as a simple example:

pigs4
#> # A tibble: 5 × 3
#>   snout tail  wings
#>   <chr> <chr> <chr>
#> 1 4.73  6.88  6.09 
#> 2 8.13  7.33  8.27 
#> 3 4.22  5.17  4.40 
#> 4 4.22  7.57  5.92 
#> 5 5.17  8.13  5.17

Ranked counting with duplicate_count()

A good first step is to create a frequency table. To do so, use duplicate_count():

pigs4 %>% 
  duplicate_count()
#> # A tibble: 11 × 4
#>    value count locations          locations_n
#>    <chr> <int> <chr>                    <int>
#>  1 5.17      3 snout, tail, wings           3
#>  2 4.22      2 snout                        1
#>  3 8.13      2 snout, tail                  2
#>  4 4.73      1 snout                        1
#>  5 6.88      1 tail                         1
#>  6 7.33      1 tail                         1
#>  7 7.57      1 tail                         1
#>  8 4.40      1 wings                        1
#>  9 5.92      1 wings                        1
#> 10 6.09      1 wings                        1
#> 11 8.27      1 wings                        1

It returns a tibble (data frame) that lists all unique values. The tibble is sorted by count in descending order, so the values that appear most often in the input data frame are at the top. The locations are the names of the column or columns in which a given value appears, and locations_n is the number of these columns.
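
To make the counting logic concrete, here is a minimal base R sketch that rebuilds the core of the table above by hand. It is not how scrutiny implements duplicate_count(). The pigs4_copy data frame is a hand-typed copy of the values printed earlier, and the names all_values, freq, and locations are made up for illustration.

```r
# Hand-typed copy of pigs4 (an assumption; the real object ships with scrutiny)
pigs4_copy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# Pool all columns into one character vector and count each unique value,
# most frequent first
all_values <- unlist(pigs4_copy, use.names = FALSE)
freq <- sort(table(all_values), decreasing = TRUE)

# For each unique value, collect the names of the columns that contain it
locations <- vapply(names(freq), function(value) {
  found <- vapply(pigs4_copy, function(column) value %in% column, logical(1))
  paste(names(pigs4_copy)[found], collapse = ", ")
}, character(1))

freq[["5.17"]]       # how often 5.17 appears in the whole data frame
locations[["5.17"]]  # the columns in which 5.17 appears
```

Keeping the values as strings, as pigs4 does, means that "4.40" and "4.4" would be treated as different values in this sketch.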

For larger datasets, summary statistics can be helpful. Just run audit() after duplicate_count():

pigs4 %>% 
  duplicate_count() %>% 
  audit()
#> # A tibble: 2 × 8
#>   term         mean    sd median   min   max na_count na_rate
#>   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 count        1.36 0.674      1     1     3        0       0
#> 2 locations_n  1.27 0.647      1     1     3        0       0

Counting by column pair with duplicate_count_colpair()

Sometimes, a sequence of data may be repeated in multiple columns. duplicate_count_colpair() helps find such cases:

pigs4 %>% 
  duplicate_count_colpair()
#> # A tibble: 3 × 7
#>   x     y     count total_x total_y rate_x rate_y
#>   <chr> <chr> <int>   <int>   <int>  <dbl>  <dbl>
#> 1 snout tail      2       5       5    0.4    0.4
#> 2 snout wings     1       5       5    0.2    0.2
#> 3 tail  wings     1       5       5    0.2    0.2

x and y represent all combinations of columns in pigs4. The count is the number of values that appear in both columns of a given pair. This is different from duplicate_count(), where count displays total frequencies.

snout and tail are the column pair with the most overlap: 2 out of 5 values are the same, a rate of 0.4. If there are no missing values, total_x and total_y are the same. The same applies to rate_x and rate_y.
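
As a rough cross-check of the snout and tail row, the sketch below counts shared values in base R. It reflects one plausible reading of count (values of one column that also occur in the other) and is not taken from scrutiny's source, so the package may treat repeated values within a column differently. The vectors are hand-typed from the pigs4 printout, and all variable names are illustrative.

```r
# Values copied by hand from the snout and tail columns of pigs4
snout_values <- c("4.73", "8.13", "4.22", "4.22", "5.17")
tail_values  <- c("6.88", "7.33", "5.17", "7.57", "8.13")

# Count the snout values that also occur somewhere in tail
shared_count <- sum(snout_values %in% tail_values)

# Divide by the number of non-missing values in each column
rate_snout <- shared_count / sum(!is.na(snout_values))
rate_tail  <- shared_count / sum(!is.na(tail_values))

shared_count
rate_snout
rate_tail
```

Because neither column has missing values here, both denominators are 5 and the two rates coincide.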

Again, you can get summary statistics with audit():

pigs4 %>% 
  duplicate_count_colpair() %>% 
  audit()
#> # A tibble: 5 × 8
#>   term     mean    sd median   min   max na_count na_rate
#>   <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 count   1.33  0.577    1     1     2          0       0
#> 2 total_x 5     0        5     5     5          0       0
#> 3 total_y 5     0        5     5     5          0       0
#> 4 rate_x  0.267 0.115    0.2   0.2   0.4        0       0
#> 5 rate_y  0.267 0.115    0.2   0.2   0.4        0       0

Counting by observation with duplicate_tally()

Unlike the other two functions, duplicate_tally() preserves the structure of the original data frame. It adds an _n column next to each original column. The newly added columns count how often each value appears in the data frame as a whole:

pigs4 %>% 
  duplicate_tally()
#> # A tibble: 5 × 6
#>   snout snout_n tail  tail_n wings wings_n
#>   <chr>   <int> <chr>  <int> <chr>   <int>
#> 1 4.73        1 6.88       1 6.09        1
#> 2 8.13        2 7.33       1 8.27        1
#> 3 4.22        2 5.17       3 4.40        1
#> 4 4.22        2 7.57       1 5.92        1
#> 5 5.17        3 8.13       2 5.17        3

In snout, for example, 4.22 appears twice, so its entries in snout_n are 2. Likewise, 8.13 appears once in snout and once in tail, so both of these observations are marked 2 in the respective _n columns.
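
The same tallies can be approximated in base R, which may help clarify that each count refers to the whole data frame, not to a single column. This is a sketch under the assumption that pigs4_copy, typed in by hand from the printout, matches pigs4; it does not show scrutiny's own implementation, and overall_counts and tally are illustrative names.

```r
# Hand-typed copy of pigs4 (an assumption; the real object ships with scrutiny)
pigs4_copy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# Frequency of every value across all columns combined
overall_counts <- table(unlist(pigs4_copy, use.names = FALSE))

# Look up each observation's overall frequency, column by column
tally <- lapply(pigs4_copy, function(column) as.integer(overall_counts[column]))
names(tally) <- paste0(names(tally), "_n")

tally$snout_n
tally$tail_n
tally$wings_n
```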

When you follow duplicate_tally() up with audit(), the output shows summary statistics for each _n column. The last row, .total, summarizes all of these columns together:

pigs4 %>% 
  duplicate_tally() %>% 
  audit()
#> # A tibble: 4 × 8
#>   term    mean    sd median   min   max na_count na_rate
#>   <chr>  <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 snout   2    0.707      2     1     3        0       0
#> 2 tail    1.6  0.894      1     1     3        0       0
#> 3 wings   1.4  0.894      1     1     3        0       0
#> 4 .total  1.67 0.816      1     1     3        0       0
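
The figures in this summary can be cross-checked with base R, which also shows what the .total row pools together. The vectors below are copied by hand from the _n columns of the duplicate_tally() output, and summarize_counts() is a hypothetical helper written for this sketch, not a scrutiny function.

```r
# Tallies copied by hand from the duplicate_tally() output
snout_n <- c(1, 2, 2, 2, 3)
tail_n  <- c(1, 1, 3, 1, 2)
wings_n <- c(1, 1, 1, 1, 3)

# The .total row treats all _n columns as a single vector
all_n <- c(snout_n, tail_n, wings_n)

# Hypothetical helper: a subset of the statistics that audit() reports
summarize_counts <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
}

round(summarize_counts(snout_n), 3)
round(summarize_counts(all_n), 3)
```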