```
library(seqgendiff)
library(limma)
set.seed(31) ## for reproducibility
```

We demonstrate how to apply the different thinning functions available in seqgendiff. The following contains guidelines for the workflow of a single repetition in a simulation study.

The methods used here are described in Gerard (2020).

We will use simulated data for this vignette. Though in practice you would obtain the RNA-seq counts from a real dataset.

```
<- 100
nsamp <- 10000
ngene <- matrix(stats::rpois(n = nsamp * ngene, lambda = 100),
mat nrow = ngene,
ncol = nsamp)
```

Each repetition of a simulation study, you should randomly subset your RNA-seq data so that your results are not dependent on the quirks of a few individuals/genes. The function to do this is `select_counts()`

.

`<- select_counts(mat = mat, nsamp = 6, ngene = 1000) submat `

The default is to randomly select the samples and genes. Though there are many options available on how to select the genes using the `gselect`

argument.

If the row and column names of the original matrix are `NULL`

, then the row and column names of the returned submatrix contain the indices of the selected rows and columns of the original matrix.

```
head(rownames(submat))
#> [1] "23" "25" "38" "39" "68" "107"
head(colnames(submat))
#> [1] "18" "44" "62" "69" "92" "99"
```

If you are exploring the effects of heterogeneous library sizes, use `thin_lib()`

. You specify the thinning factor on the log-2 scale, so a value of 0 means no thinning, a value of 1 means thin by half, a value of 2 means thin by 1/4, etc. You need to specify this scaling factor for all samples.

```
<- seq_len(ncol(submat))
scaling_factor
scaling_factor#> [1] 1 2 3 4 5 6
<- thin_lib(mat = submat, thinlog2 = scaling_factor) thout
```

We can verify that thinning was performed correctly by looking at the empirical thinning amount.

```
## Empirical thinning
colSums(thout$mat) / colSums(submat)
#> 18 44 62 69 92 99
#> 0.49887721 0.24906136 0.12624313 0.06288817 0.03108922 0.01598802
## Specified thinning
2 ^ -scaling_factor
#> [1] 0.500000 0.250000 0.125000 0.062500 0.031250 0.015625
```

A similar function exists to thin gene-wise rather than sample-wise: `thin_gene()`

.

To uniformly thin all counts, use `thin_all()`

. This might be useful for determining read-depth suggestions. It takes as input a single universal scaling factor on the log-2 scale.

`<- thin_all(mat = submat, thinlog2 = 1) thout `

We can verify that we approximately halved all counts:

```
sum(thout$mat) / sum(submat)
#> [1] 0.4998192
```

Use `thin_diff()`

for general thinning. For this function, you need to specify both the coefficient matrix and the design matrix.

```
<- cbind(rep(c(0, 1), each = ncol(submat) / 2),
designmat rep(c(0, 1), length.out = ncol(submat)))
designmat#> [,1] [,2]
#> [1,] 0 0
#> [2,] 0 1
#> [3,] 0 0
#> [4,] 1 1
#> [5,] 1 0
#> [6,] 1 1
<- matrix(stats::rnorm(ncol(designmat) * nrow(submat)),
coefmat ncol = ncol(designmat),
nrow = nrow(submat))
head(coefmat)
#> [,1] [,2]
#> [1,] 0.2554242 -1.3119342
#> [2,] 0.6377614 0.5336341
#> [3,] 1.3052848 0.6763963
#> [4,] 0.1383671 -0.9041101
#> [5,] -0.1251100 1.3601036
#> [6,] -0.8867430 -0.5249070
```

Once we have the coefficient and design matrices, we can thin.

```
<- thin_diff(mat = submat,
thout design_fixed = designmat,
coef_fixed = coefmat)
```

We can verify that we thinned correctly using the voom-limma pipeline.

```
<- cbind(thout$design_obs, thout$designmat)
new_design <- limma::voom(counts = thout$mat, design = new_design)
vout <- limma::lmFit(vout)
lout <- coef(lout)[, -1, drop = FALSE] coefhat
```

We’ll plot the true coefficients against their estimates.

```
<- par(mar = c(2.5, 2.5, 1, 0) + 0.1,
oldpar mgp = c(1.5, 0.5, 0))
plot(x = coefmat[, 1],
y = coefhat[, 1],
xlab = "True Coefficient",
ylab = "Estimated Coefficient",
main = "First Variable",
pch = 16)
abline(a = 0,
b = 1,
lty = 2,
col = 2,
lwd = 2)
```

```
plot(x = coefmat[, 2],
y = coefhat[, 2],
xlab = "True Coefficient",
ylab = "Estimated Coefficient",
main = "Second Variable",
pch = 16)
abline(a = 0,
b = 1,
lty = 2,
col = 2,
lwd = 2)
```

`par(oldpar)`

The difference between the `design_fixed`

and `design_perm`

arguments is that the rows in `design_perm`

are permuted before applying thinning. Without any other arguments, this makes the design variables independent of any surrogate variables. With the additional specification of the `target_cor`

argument, we try to control the amount of correlation between the design variables in `design_perm`

and any surrogate variables.

Let’s target for a correlation of 0.9 between the first surrogate variable and the first design variable, and a correlation of 0 between the first surrogate variable and the second design variable.

```
<- matrix(c(0.9, 0), nrow = 2)
target_cor
target_cor#> [,1]
#> [1,] 0.9
#> [2,] 0.0
<- thin_diff(mat = submat,
thout_cor design_perm = designmat,
coef_perm = coefmat,
target_cor = target_cor)
```

The first variable is indeed more strongly correlated with the estimated surrogate variable:

```
cor(thout_cor$designmat, thout_cor$sv)
#> [,1]
#> P1 0.7473094
#> P2 0.2976213
```

The actual correlation between the permuted design matrix and the surrogate variables will not be the target correlation. But we can estimate what the actual correlation is using the function `effective_cor()`

.

```
<- effective_cor(design_perm = designmat,
eout sv = thout_cor$sv,
target_cor = target_cor,
iternum = 50)
eout#> [,1]
#> [1,] 0.7113003
#> [2,] 0.1399246
```

I am only using 50 iterations here for speed reasons, but you should stick to the defaults for `iternum`

.

For the special case when your design matrix is just a group indicator (that is, you have two groups of individuals), you can use the function `thin_2group()`

. Let’s generate data from the two-group model where 90% of genes are null and the non-null effects are gamma-distributed.

```
<- thin_2group(mat = submat,
thout prop_null = 0.9,
signal_fun = stats::rgamma,
signal_params = list(shape = 1, rate = 1))
```

We can again verify that we thinned appropriately using the voom-limma pipeline:

```
<- cbind(thout$design_obs, thout$designmat)
new_design
new_design#> (Intercept) P1
#> [1,] 1 0
#> [2,] 1 0
#> [3,] 1 0
#> [4,] 1 1
#> [5,] 1 1
#> [6,] 1 1
<- limma::voom(counts = thout$mat, design = new_design)
vout <- limma::lmFit(vout)
lout <- coef(lout)[, 2, drop = FALSE] coefhat
```

And we can plot the results

```
<- par(mar = c(2.5, 2.5, 1, 0) + 0.1,
oldpar mgp = c(1.5, 0.5, 0))
plot(x = thout$coefmat,
y = coefhat,
xlab = "True Coefficient",
ylab = "Estimated Coefficient",
main = "First Variable",
pch = 16)
abline(a = 0,
b = 1,
lty = 2,
col = 2,
lwd = 2)
```

- Gerard, D (2020). “Data-based RNA-seq simulations by binomial thinning.”
*BMC Bioinformatics*. 21(1), 206. doi: 10.1186/s12859-020-3450-9.