Some best practices for anticlustering

Martin Papenberg

This vignette documents some “best practices” for anticlustering using the R package anticlust. In many cases, the suggestions pertain to overriding the default values of arguments of anticlustering(), which seems to be a difficult decision for users. However, I advise you: Do not stick with the defaults; check out the results of different anticlustering specifications; repeat the process; play around; read the documentation (especially ?anticlustering); change arguments arbitrarily; compare the output. Nothing can break.1

This document uses somewhat imperative language; nuance and explanations are given in the package documentation, the other vignettes, and the papers by Papenberg and Klau (2021; https://doi.org/10.1037/met0000301) and Papenberg (2024; https://doi.org/10.1111/bmsp.12315). Note that deciding which anticlustering objective to use usually requires substantial content considerations and cannot be reduced to “which one is better”. However, some hints are given below.

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77 (1), 80–102. https://doi.org/10.1111/bmsp.12315


  1. Well, actually your R session can break if you use an optimal method (method = "ilp") with a data set that is too large.

  2. You might ask why standardize = TRUE is not the default. Actually, there are two reasons. First, the argument was not always available in anticlust and changing the default behaviour of a function when releasing a new version is oftentimes undesirable. Second, it seems like a big decision to me to just change users’ data by default (which is done when standardizing the data). In doubt, just compare the results of using standardize = TRUE and standardize = FALSE and decide for yourself which you like best. Standardization may not be the best choice in all settings.

  3. What is considered very large may depend on the specifications of your computer. I used to have problems computing a distance matrix for \(N > 20000\).

  4. The maximum dispersion problem can be solved optimally for rather large data sets, especially for \(K = 2\). For \(K = 3\) and \(K = 4\), several 100 elements can usually be processed, especially when installing the SYMPHONY solver.