The implementation of error detection techniques in scrutiny rests on a foundation of specialized helper functions. Some of these are exported because they might be helpful in error detection more broadly, or perhaps even in other contexts.

This vignette provides an overview of scrutiny’s miscellaneous
infrastructure for implementing error detection techniques. For more
specific articles, see `vignette("rounding")`

or
`vignette("consistency-tests")`

.

Large parts of the package ultimately rest on either of two functions that simply count decimal places. These are digits after a number’s decimal point or some other separator. Both functions also take strings.

`decimal_places()`

is vectorized:

```
decimal_places("2.80")
#> [1] 2
decimal_places(c(55.1, 6.493, 8))
#> [1] 1 3 0
vec1 <- iris %>%
dplyr::slice(1:10) %>%
dplyr::pull(Sepal.Length)
vec1
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
vec1 %>%
decimal_places()
#> [1] 1 1 1 1 0 1 1 0 1 1
```

Using strings (that are coercible to numeric) is recommended in an error detection context because trailing zeros can be crucial here. Numeric values drop trailing zeros, whereas strings preserve them:

`decimal_places_scalar()`

is faster than
`decimal_places()`

but only takes a single number or string.
This makes it suitable as a helper within other single-case
functions.

When dealing with numbers that used to have trailing zeros but lost
them from being registered as numeric, call `restore_zeros()`

to format them correctly. This can be relevant within functions that
create vectors where trailing zeros matter, such as the
`seq_*()`

functions presented in the next section.

Suppose all of the following numbers originally had one decimal place, but some no longer do:

Now, get them back with `restore_zeros()`

:

```
vec2 %>%
restore_zeros()
#> [1] "4.0" "6.9" "5.0" "4.2" "4.8" "7.0" "4.0"
vec2 %>%
restore_zeros() %>%
decimal_places()
#> [1] 1 1 1 1 1 1 1
```

This uses the default of going by the longest mantissa and padding
the other strings with decimal zeros until they have that many decimal
places. However, this is just a heuristic: The longest mantissa might
itself have lost decimal places. Specify the `width`

argument
to explicitly state the desired mantissa length:

`base::seq()`

offers a flexible way to generate sequences,
but it is not cut out for working with decimal numbers. The
`by`

argument only allows for manual specifications of the
step size, i.e., the difference between two consecutive output values.
In an error detection context, there is also the problem of trailing
zeros in numeric values.

Use scrutiny’s `seq_*()`

functions to automatically
determine step size from the input numbers and, by default, to supply
missing trailing zeros via `restore_zeros()`

. Output will
then naturally be string.

Why are there multiple such functions? The first two disentangle the
two different ways in which `seq()`

can be used. A third
function adds a way of generating sequences not directly covered by
`seq()`

.

`seq_endpoint()`

takes two main arguments,`from`

and`to`

. It creates a sequence between the two, inferring step size from the greater number of decimal places among them. This corresponds to a`seq()`

call in which`to`

was specified.`seq_distance()`

takes a`from`

argument, uses it to infer the step size, and creates a sequence of a length specified by the`length_out`

argument (default is`10`

). This corresponds to a`seq()`

call in which`length.out`

was specified.Finally,

`seq_disperse()`

creates a sequence centered around`from`

.

Each of these functions has a `*_df()`

variant that embeds
the sequence as a tibble column.

The `seq_*()`

functions have some more features, such as
offsets and direction reversal, but I’ll focus on the basics here.

Call `seq_endpoint()`

to bridge two numbers at the correct
decimal level:

```
seq_endpoint(from = 4.1, to = 6)
#> [1] "4.1" "4.2" "4.3" "4.4" "4.5" "4.6" "4.7" "4.8" "4.9" "5.0" "5.1" "5.2"
#> [13] "5.3" "5.4" "5.5" "5.6" "5.7" "5.8" "5.9" "6.0"
seq_endpoint(from = 4.1, to = 4.15)
#> [1] "4.10" "4.11" "4.12" "4.13" "4.14" "4.15"
```

Call `seq_distance()`

to get a sequence of desired
length:

```
seq_distance(from = 4.1, length_out = 3)
#> [1] "4.1" "4.2" "4.3"
# Default for `length_out` is `10`:
seq_distance(from = 4.1)
#> [1] "4.1" "4.2" "4.3" "4.4" "4.5" "4.6" "4.7" "4.8" "4.9" "5.0"
```

Finally, call `seq_disperse()`

to construct a sequence
around `from`

:

```
seq_disperse(from = 4.1, dispersion = 1:3)
#> [1] "3.8" "3.9" "4.0" "4.1" "4.2" "4.3" "4.4"
# Default for `dispersion` if `1:5`:
seq_disperse(from = 4.1)
#> [1] "3.6" "3.7" "3.8" "3.9" "4.0" "4.1" "4.2" "4.3" "4.4" "4.5" "4.6"
```

`seq_disperse()`

is a hybrid between the two
`seq()`

wrappers explained above and the
`disperse*()`

functions introduced next.

Four predicate functions test whether a vector `x`

represents particular kinds of sequences. These testing functions can be
used as helpers, but they are also analytic tools in their own
right.

`is_seq_linear()`

returns `TRUE`

if the
difference between all neighboring values is the same:

```
is_seq_linear(x = 8:15)
#> [1] TRUE
is_seq_linear(x = c(8:15, 16))
#> [1] TRUE
is_seq_linear(x = c(8:15, 17))
#> [1] FALSE
```

`is_seq_ascending()`

tests whether that difference is
always positive…

```
is_seq_ascending(x = 8:15)
#> [1] TRUE
is_seq_ascending(x = 15:8)
#> [1] FALSE
# Default also tests for linearity:
is_seq_ascending(x = c(8:15, 17))
#> [1] FALSE
is_seq_ascending(x = c(8:15, 17), test_linear = FALSE)
#> [1] TRUE
```

…whereas `is_seq_descending()`

tests whether it is always
negative:

```
is_seq_descending(x = 8:15)
#> [1] FALSE
is_seq_descending(x = 15:8)
#> [1] TRUE
# Default also tests for linearity:
is_seq_descending(x = c(15:8, 2))
#> [1] FALSE
is_seq_descending(x = c(15:8, 2), test_linear = FALSE)
#> [1] TRUE
```

`is_seq_dispersed()`

tests whether the vector is grouped
around its `from`

argument:

```
is_seq_dispersed(x = 3:7, from = 2)
#> [1] FALSE
# Direction doesn't matter here:
is_seq_dispersed(x = 3:7, from = 5)
#> [1] TRUE
is_seq_dispersed(x = 7:3, from = 5)
#> [1] TRUE
# Dispersed from `50`, but not linear:
x_nonlinear <- c(49, 42, 47, 44, 50, 56, 53, 58, 51)
# Default also tests for linearity:
is_seq_dispersed(x = x_nonlinear, from = 50)
#> [1] FALSE
is_seq_dispersed(x = x_nonlinear, from = 50, test_linear = FALSE)
#> [1] TRUE
```

`NA`

handlingAll the `is_seq_*()`

functions take special care with
missing values. If one or more elements of `x`

are
`NA`

, this doesn’t necessarily mean that it’s unknown whether
or not `x`

might possibly represent the kind of sequence in
question.

In these examples, it is genuinely unclear whether `x`

is
linear:

Linearity thus depends on the unknown, missing value behind
`NA`

:

```
is_seq_linear(x = c(1, 2, 3, 4))
#> [1] TRUE
is_seq_linear(x = c(1, 2, 7, 4))
#> [1] FALSE
is_seq_linear(x = c(1, 2, 3, 4, 5, 6))
#> [1] TRUE
is_seq_linear(x = c(1, 2, 17, 29, 32, 6))
#> [1] FALSE
```

Sometimes, however, `x`

cannot possibly represent the
tested kind of sequence, independently of the hypothetical numbers
substituted for `NA`

elements. In such cases, scrutiny’s
`is_seq_*()`

functions will always return
`FALSE`

:

```
is_seq_linear(x = c(1, 2, NA, 10))
#> [1] FALSE
is_seq_linear(x = c(1, 2, NA, NA, NA, 10))
#> [1] FALSE
```

This is very much in the spirit of consistency testing. Even if
certain data are unknown, it still makes sense to check whether or not
*any* data could possibly fill in the gaps. The
`is_seq_*()`

functions effectively ask: Are the numbers left
and right of the `NA`

s consistent with each other, given
their index positions?

It is worth emphasizing that this behavior is not exotic, or specific
to scrutiny. It simply asserts the fundamental ideas of `NA`

propagation in R. For example,
`is_seq_ascending(x = c(1, 2, NA, 1))`

is `FALSE`

for the same reason that `NA & FALSE`

is
`FALSE`

: The outcome is the same for all possible values of
`NA`

(Wickham 2019,
ch. 3.2.3).

Leading and trailing `NA`

s are mostly ignored when
determining whether `x`

*might* be the kind of
sequence in question:

```
is_seq_linear(x = c(NA, NA, 1, 2, 3, 4, NA))
#> [1] NA
is_seq_linear(x = c(NA, NA, 1, 2, NA, 4, NA))
#> [1] NA
```

The only exception, `is_seq_dispersed()`

, is particularly
sensitive to `NA`

values:

```
# `TRUE` because `x` is symmetrically dispersed
# from 5 and contains no `NA` values:
is_seq_dispersed(x = c(3:7), from = 5)
#> [1] TRUE
# `NA` because it might be dispersed from 5,
# depending on the values hidden behind the `NA`s:
is_seq_dispersed(x = c(NA, 3:7, NA), from = 5)
#> [1] NA
is_seq_dispersed(x = c(NA, NA, 3:7, NA, NA), from = 5)
#> [1] NA
# `FALSE` because it's not symmetrically dispersed
# around 5, no matter what the `NA`s stand in for:
is_seq_dispersed(x = c(NA, 3:7), from = 5)
#> [1] FALSE
is_seq_dispersed(x = c(3:7, NA), from = 5)
#> [1] FALSE
is_seq_dispersed(x = c(3:7, NA, NA), from = 5)
#> [1] FALSE
is_seq_dispersed(x = c(NA, NA, 3:7), from = 5)
#> [1] FALSE
```

`disperse_total()`

Briefly, `disperse_total()`

checks if an input total is
even or odd, cuts it in half, and creates “dispersed” group sizes going
out from there, with each pair of group sizes adding up to the input
total. This works naturally with even totals. For odd totals, it starts
with the two integers closest to half.

The function internally calls either of `disperse()`

and
`disperse2()`

, but I recommend simply using the higher-level
`disperse_total()`

. Here are two basic examples:

```
# With an even total...
disperse_total(n = 70)
#> # A tibble: 12 × 2
#> n n_change
#> <dbl> <dbl>
#> 1 35 0
#> 2 35 0
#> 3 34 -1
#> 4 36 1
#> 5 33 -2
#> 6 37 2
#> 7 32 -3
#> 8 38 3
#> 9 31 -4
#> 10 39 4
#> 11 30 -5
#> 12 40 5
# ...and with an odd total:
disperse_total(n = 83)
#> # A tibble: 12 × 2
#> n n_change
#> <dbl> <dbl>
#> 1 41 0
#> 2 42 0
#> 3 40 -1
#> 4 43 1
#> 5 39 -2
#> 6 44 2
#> 7 38 -3
#> 8 45 3
#> 9 37 -4
#> 10 46 4
#> 11 36 -5
#> 12 47 5
```

Starting with `is_subset_of()`

, scrutiny features a
distinctive family of predicate functions that test whether one vector
`x`

is a subset of another vector `y`

, whether
`x`

is a superset of `y`

(i.e. the reverse of a
subset), or whether `x`

and `y`

are equal
sets.

As a teaser: These functions are divided into three subgroups based
on the way the second vector, `y`

, is constituted. For
example, you might test if `x`

is a subset of multiple other
vectors taken together, or a superset of a vector `y`

that
consists of multiple values entered along with `x`

.

Functions from this family are not currently used as helpers inside other scrutiny functions, but that may well change. Use elsewhere is also conceivable.

Wickham, Hadley. 2019. *Advanced r*. Second edition. Boca Raton:
CRC Press/Taylor; Francis Group.