The goal of **distops** is to provide a set of functions to compute distances between observations in a sample and to perform operations on distance matrices.

You can install the development version of distops from GitHub with:

We provide two functions for package developers to help with defining efficient implementations of the `dist` functions for custom distances. Namely:

- `use_distops()` sets up a package to use **distops** for computing distances. In particular, it creates a `src/` directory with a `Makevars` file and a `Makevars.win` file. It also creates an `R/distops-package.R` file with the appropriate **roxygen2** tags so that the `NAMESPACE` file is modified to add the `importFrom()` directives for the **Rcpp** and **RcppParallel** packages and the `useDynLib()` directive for packages with compiled code. Finally, it modifies the `DESCRIPTION` file to add **Rcpp**, **RcppParallel** and **distops** to the `Imports` and `LinkingTo` fields and GNU make to the `SystemRequirements` field.
- `use_distance()` creates R and C++ files for easy implementation of custom distances.

Let us compute the Euclidean distance matrix for the `iris` dataset:
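A minimal sketch of this step using base R's `stats::dist()`, which computes Euclidean distances by default. Restricting `iris` to its four numeric columns is an assumption, although it is consistent with the distances printed in the subsetting example of this README.

```r
# Assumption: distances are computed on the four numeric columns of iris
# (the fifth column, Species, is a factor and is left out).
D <- stats::dist(iris[, 1:4], method = "euclidean")

# D is a compact object of class "dist": it stores only the lower triangle
# of the 150 x 150 distance matrix as a numeric vector.
class(D)
attr(D, "Size")
length(D)
```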

We can subset this matrix using the `[` operator. We can either provide the same indices for rows and columns, in which case it returns another object of class `dist`:

Or we can provide different indices for rows and columns in which case it returns a dense matrix:

```
D[2:3, 7:12]
#>           7         8         9        10        11        12
#> 2 0.5099020 0.4242641 0.5099020 0.1732051 0.8660254 0.4582576
#> 3 0.2645751 0.4123106 0.4358899 0.3162278 0.8831761 0.3741657
```

The subsetting operation is fully parallelized using the **RcppParallel** package. It is also memory efficient as it does not copy the original distance matrix.
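To illustrate why a rectangular subset can be extracted without first expanding the whole matrix, here is a sequential base R sketch. It is not the package's parallel C++ implementation: `dist_lookup()` is a hypothetical helper written for this example, and it relies only on the documented layout of `dist` objects, which store the lower triangle column by column.

```r
# Euclidean distances on the numeric columns of iris (base R).
D <- stats::dist(iris[, 1:4])

# Hypothetical helper (not part of distops): read d(i, j) straight from the
# packed lower-triangle vector instead of building the full n x n matrix.
dist_lookup <- function(D, rows, cols) {
  n <- attr(D, "Size")
  v <- unclass(D)  # plain numeric vector of length n * (n - 1) / 2
  out <- matrix(0, nrow = length(rows), ncol = length(cols),
                dimnames = list(as.character(rows), as.character(cols)))
  for (a in seq_along(rows)) {
    for (b in seq_along(cols)) {
      i <- max(rows[a], cols[b])  # larger index: row in the lower triangle
      j <- min(rows[a], cols[b])  # smaller index: column in the lower triangle
      if (i != j) {
        # Position of d(i, j), i > j, in the column-wise packed vector.
        k <- n * (j - 1) - j * (j - 1) / 2 + i - j
        out[a, b] <- v[k]
      }
      # When i == j the distance is zero, which is the initial value.
    }
  }
  out
}

dist_lookup(D, 2:3, 7:12)
```

Each requested entry costs one index computation, so the work scales with the size of the subset rather than with the size of the full matrix.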

The medoid of a sample is the observation that minimizes the sum of distances to all other observations. The `find_medoids()` function computes the medoid of a sample for a given distance. It takes advantage of the **RcppParallel** package to compute the medoid in parallel.

If the `memberships` argument is provided, it returns the medoid for each cluster.
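The definition above can be sketched in plain base R. This is a sequential illustration of the medoid criterion only, not the `find_medoids()` implementation, and the cluster labels used here (the `Species` column) are an assumption made for the example.

```r
# Full distance matrix for the numeric columns of iris (base R).
D <- stats::dist(iris[, 1:4])
M <- as.matrix(D)

# Medoid of the whole sample: the observation whose summed distance to all
# other observations is smallest.
medoid_all <- unname(which.min(rowSums(M)))

# Assumed cluster labels for illustration: one cluster per species.
memberships <- as.integer(iris$Species)

# Medoid of each cluster: apply the same criterion to the distances between
# members of that cluster only, and report the index in the original sample.
medoid_by_cluster <- vapply(
  split(seq_along(memberships), memberships),
  function(idx) idx[which.min(rowSums(M[idx, idx, drop = FALSE]))],
  integer(1)
)

medoid_all
medoid_by_cluster
```

Both results are row indices into `iris`, so `iris[medoid_by_cluster, ]` gives the representative observation of each cluster.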

- Pass a list instead of a matrix to be more general?
- Use the Arrow Parquet format to store the distance matrix in multiple files when the sample size exceeds 10,000 or something like that.
- Use an Arrow connection to read in large data.
- Add a progress bar.