Introduction to Vector Comprehension

Christopher Mann

2021-01-22

List comprehension is an alternative way of writing for loops or lapply (and similar) functions that is readable, quick, and easy to use. Many newcomers to R struggle with the lapply family of functions and default to using for. However, for loops in R are often slower and more verbose than their apply-style counterparts, particularly when the result is grown one element at a time. Comprehensions in the eList package bridge the gap by allowing users to write lapply or sapply calls using for loop syntax.

This vignette describes how to use list and other comprehensions. It also explores comprehensions with multiple variables, if-else conditions, nested comprehensions, and parallel comprehensions.

List Comprehension

Let us start with a simple example: we have a vector of fruit names and want to create a sequence for each based on its number of characters. One common method is to write a for statement where we loop across each fruit name, create a sequence from 1 to the number of characters, and place it in a list.

x <- c("apples", "bananas", "pears")

fruit_chars <- vector("list", length(x))
for (i in seq_along(x)){
  fruit_chars[[i]] <- 1:nchar(x[i])
}
fruit_chars
#> [[1]]
#> [1] 1 2 3 4 5 6
#> 
#> [[2]]
#> [1] 1 2 3 4 5 6 7
#> 
#> [[3]]
#> [1] 1 2 3 4 5

Alternatively, we could use a faster, vectorized function such as lapply. This is similar to the for loop, except that a specified function is applied to each fruit name and the results are wrapped in a list.

fruit_chars <- lapply(x, function(i) 1:nchar(i))
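As a quick check, the sketch below confirms that the two approaches build the same list. It uses seq_len() in place of 1:n, a substitution made here (not in the original example) because 1:0 would count down to c(1, 0) if a string were empty.

```r
x <- c("apples", "bananas", "pears")

# Loop version: pre-allocate the list, then fill it by index.
loop_result <- vector("list", length(x))
for (i in seq_along(x)) {
  loop_result[[i]] <- seq_len(nchar(x[i]))
}

# lapply version: apply the function to each element directly.
apply_result <- lapply(x, function(i) seq_len(nchar(i)))

# Both versions produce the same unnamed list of integer sequences.
identical(loop_result, apply_result)
```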

The eList package allows for a hybrid variant of the two using the List() function. The for loop evaluates the expression and merges the results into a list. Behind the scenes, List parses the statement into an lapply call and evaluates it. As we will see, though, List has additional features.

library(eList)
#> 
#> Attaching package: 'eList'
#> The following object is masked from 'package:utils':
#> 
#>     zip
fruit_chars <- List(for (i in x) 1:nchar(i))

If you are coming from Python or another language that uses list comprehension, you will notice that this syntax is a bit different; it is written more like a standard for loop in R. The for statement comes first, then the variable name and sequence are placed in parentheses. Finally, the expression is placed afterwards. The expression can be wrapped in braces {} if desired.

Assigning Names

The lists of numbers in the previous examples may be confusing. Luckily, List and other comprehension functions allow us to assign names to each result using the notation: name = expr.

List(for (i in x) i = 1:nchar(i))
#> $apples
#> [1] 1 2 3 4 5 6
#> 
#> $bananas
#> [1] 1 2 3 4 5 6 7
#> 
#> $pears
#> [1] 1 2 3 4 5

The name can be any type of expression, though complex expressions should be wrapped in parentheses so that the parser can tell which expression the = sign belongs to.

List(for (i in x) (if (i == "bananas") "good" else "bad") = 1:nchar(i))
#> $bad
#> [1] 1 2 3 4 5 6
#> 
#> $good
#> [1] 1 2 3 4 5 6 7
#> 
#> $bad
#> [1] 1 2 3 4 5
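For comparison, a base R sketch of the same naming step is shown here. It attaches names with setNames() after the fact; this is only an illustration of the result, not a description of how eList assigns names internally, and the object names are made up for the example.

```r
x <- c("apples", "bananas", "pears")
seqs <- lapply(x, function(i) seq_len(nchar(i)))

# Name each result after the fruit it came from.
named <- setNames(seqs, x)
names(named)

# Names can also come from an expression; lists allow duplicate names.
labels <- ifelse(x == "bananas", "good", "bad")
labelled <- setNames(seqs, labels)
names(labelled)
```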

Conditions - if-else

The results of a comprehension can be filtered using standard if statements after naming the variables and sequence, making sure that the condition is placed in parentheses. The statement below only returns the sequence for "bananas".

List(for (i in x) if (i == "bananas") 1:nchar(i))
#> [[1]]
#> [1] 1 2 3 4 5 6 7

Any statement that evaluates to NULL is automatically filtered from the results. else statements can also be included so that the results are not filtered out (unless the else statement evaluates to NULL).

List(for (i in x) if (i == "bananas") "delicious" else "ewww")
#> [[1]]
#> [1] "ewww"
#> 
#> [[2]]
#> [1] "delicious"
#> 
#> [[3]]
#> [1] "ewww"
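The filtering rule can be pictured in base R as an lapply followed by dropping NULL entries. The sketch below is an approximation written for illustration, not eList's actual implementation: an if without an else returns NULL when its condition is FALSE, and Filter() then removes those entries.

```r
x <- c("apples", "bananas", "pears")

# `if` with no `else` yields NULL for non-matching fruit.
raw <- lapply(x, function(i) if (i == "bananas") seq_len(nchar(i)))

# Drop the NULL entries, keeping only the results that passed the condition.
filtered <- Filter(Negate(is.null), raw)
length(filtered)

# With an `else` branch nothing is NULL, so nothing is dropped.
rated <- lapply(x, function(i) if (i == "bananas") "delicious" else "ewww")
length(Filter(Negate(is.null), rated))
```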

Each entry can still be assigned a name. Furthermore, the expression can be as complex as necessary for the task with any number of else if checks.

List(for (i in x) i = {
  n <- nchar(i)
  if (n > 6) "delicious"
  else if (n > 5) "ok"
  else "ewww"
})
#> $apples
#> [1] "ok"
#> 
#> $bananas
#> [1] "delicious"
#> 
#> $pears
#> [1] "ewww"

Multiple Variables

Comprehensions can have multiple variables if the variables are separated using "." within a single name. To see how this works, let us use the enum function in the eList package. When enum is used on a variable, the first value becomes its index in the loop and the second is the value of the vector at that index. Now, when we use (i.j in enum(x)), i = the index number of each item in x, while j = the value of the item in x (the name of the fruit).

List(for (i.j in enum(x)) j = i)
#> $apples
#> [1] 1
#> 
#> $bananas
#> [1] 2
#> 
#> $pears
#> [1] 3

Let us inspect enum(x) to see what is going on beneath the surface. enum took the vector x and created a new list at each index. The first list contains two elements: the first being 1 (the index number), the second being "apples" (the first value in x).

enum(x)
#> [[1]]
#> [[1]][[1]]
#> [1] 1
#> 
#> [[1]][[2]]
#> [1] "apples"
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] 2
#> 
#> [[2]][[2]]
#> [1] "bananas"
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] 3
#> 
#> [[3]][[2]]
#> [1] "pears"
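A structure of the same shape can be built in base R with Map(), which may help when reasoning about what the loop variables receive. This is a hand-rolled stand-in for illustration only; enum itself may be implemented differently.

```r
x <- c("apples", "bananas", "pears")

# Pair each index with its value: list(list(1, "apples"), list(2, "bananas"), ...)
pairs <- Map(list, seq_along(x), x)

# Unpack each pair by position: the first element is the index, the second the value.
idx <- lapply(pairs, function(p) p[[1]])
fruit <- vapply(pairs, function(p) p[[2]], character(1))

# Name each index by its fruit, matching the shape of the named result shown earlier.
result <- setNames(idx, fruit)
str(result)
```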

When multiple variables like i.j are passed to the for loop, i is assigned to the first element and j to the second element of each item in the sequence. There can be any number of variables as long as they are separated by ".". There does not have to be a variable for each element, but each item does need an element for every variable or else an “out-of-bounds” error will be produced.

y <- list(a = 1:3, b = 4:6, c = 7:9)
List(for (i.j.k in y) (i+j)/k)
#> $a
#> [1] 1
#> 
#> $b
#> [1] 1.5
#> 
#> $c
#> [1] 1.666667
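The positional unpacking can be written out by hand in base R, which also shows where the out-of-bounds error comes from. The sketch below is an illustration of the idea under that reading, not eList's internal code.

```r
y <- list(a = 1:3, b = 4:6, c = 7:9)

# Assign the first, second, and third elements of each item to i, j, and k.
res <- lapply(y, function(item) {
  i <- item[[1]]
  j <- item[[2]]
  k <- item[[3]]
  (i + j) / k
})
unlist(res)

# Asking for a fourth element of a three-element item fails.
msg <- tryCatch(y$a[[4]], error = function(e) conditionMessage(e))
msg
```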

Similarly, elements can be skipped by leaving a variable name empty. The comprehension below extracts the first and third elements of each item in the list.

List(for (i..j in y) c(i, j))
#> $a
#> [1] 1 3
#> 
#> $b
#> [1] 4 6
#> 
#> $c
#> [1] 7 9

If only the first or second item is needed, though, you should use two dots: i.. for the first, ..i for the second.

Another convenient function that can be used on the sequence is items(). It is similar to enum, except that the name of each item is used instead of its index number. See the documentation for eList for other helper functions.

List(for (i.j in items(y)) paste0(i, j))
#> [[1]]
#> [1] "a1" "a2" "a3"
#> 
#> [[2]]
#> [1] "b4" "b5" "b6"
#> 
#> [[3]]
#> [1] "c7" "c8" "c9"

Variables can also be separated using a comma as long as the variables are surrounded in backticks.

z1 <- 1:3
z2 <- 4:6
List(for (`i, j` in zip(z1, z2)) i + j)
#> [[1]]
#> [1] 5
#> 
#> [[2]]
#> [1] 7
#> 
#> [[3]]
#> [1] 9
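Base R's mapply() offers a similar positional pairing, which can be a useful mental model for zip and items. The sketch below only mirrors the results shown above; it does not call eList and makes no claim about how those helpers are written.

```r
y <- list(a = 1:3, b = 4:6, c = 7:9)
z1 <- 1:3
z2 <- 4:6

# Pair elements of z1 and z2 by position and add them.
sums <- mapply(function(i, j) i + j, z1, z2)
sums

# Pair each name in y with its value, as items() does for the loop variables.
pasted <- mapply(function(nm, val) paste0(nm, val),
                 names(y), y,
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)
pasted[[1]]
```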

Other Return Types

All of the examples so far have returned a list. However, eList supports a variety of different types of comprehension. For example, Num returns a numeric vector, while Chr returns a character vector. Env can be used to produce an environment, similar to dictionary comprehensions in Python. Note that each entry in an environment must have a unique name. These functions work by coercing the result into a particular type, producing an error if they are unable to do so.

Num(for (i.j.k in y) (i+j)/k)
#>        a        b        c 
#> 1.000000 1.500000 1.666667
Chr(for (i.j.k in y) (i+j)/k)
#>                  a                  b                  c 
#>                "1"              "1.5" "1.66666666666667"
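The idea of typed results has a loose base R analogue in vapply() and list2env(). The sketch below is offered as an analogy only: vapply() checks the type of each result rather than coercing it, so its error behavior is stricter than, and not necessarily the same as, that of Num or Chr.

```r
y <- list(a = 1:3, b = 4:6, c = 7:9)
vals <- lapply(y, function(item) (item[[1]] + item[[2]]) / item[[3]])

# Numeric and character versions of the same results; names are preserved.
num <- vapply(vals, as.numeric, numeric(1))
chr <- vapply(vals, as.character, character(1))

# An environment built from a named list; every entry needs a unique name.
env <- list2env(vals)
sort(ls(env))

# A result of the wrong type is rejected with an error.
bad <- tryCatch(vapply(list("a", "b"), identity, numeric(1)),
                error = function(e) "error")
bad
```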

One convenience function is named .. (two dots). It can be used as either ..[for ...] or ..(for ...). By default, it attempts to simplify the results into an array, but it can mimic any other type of comprehension.

..[for (i.j.k in y) (i+j)/k]
#> [1] 1.000000 1.500000 1.666667

Nested Loops

Multiple for loops can be used within a comprehension. As long as each subsequent for statement immediately follows the variables and sequence of the previous one, or immediately follows an if-else statement, it will be parsed into a vectorized lapply-style function. Nested loops should be avoided unless necessary since they can be difficult to understand. The following are a couple of examples using matrix and numeric comprehension.

Mat(for (i in 1:3) for (j in 1:6) i*j)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    2    4    6
#> [3,]    3    6    9
#> [4,]    4    8   12
#> [5,]    5   10   15
#> [6,]    6   12   18
Num(for (i in 1:3) for (j in 1:6) if (i==j) i*j)
#> [1] 1 4 9
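Written as nested apply calls in base R, the two examples look roughly like the sketch below. It reproduces the same values for comparison and should be read as an approximation of the idea, not as the code that eList generates.

```r
# Inner loop over j yields a length-6 vector; the outer sapply binds
# one such vector per i as a column, giving a 6 x 3 matrix.
m <- sapply(1:3, function(i) sapply(1:6, function(j) i * j))
dim(m)

# With a condition, non-matching combinations return NULL,
# and unlist() drops them while flattening the nested list.
squares <- unlist(lapply(1:3, function(i) {
  lapply(1:6, function(j) if (i == j) i * j)
}))
squares
```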

Vector comprehensions can also be nested within each other. The code below nests a numeric comprehension within a character comprehension.

Chr(for (i in Num(for (j in 1:3) j^2)) paste0(letters[sqrt(i)], i))
#> [1] "a1" "b4" "c9"

Parallelization

Vector comprehensions also allow for parallel computations using the parallel package. All comprehensions allow the user to supply a cluster, which activates the parallel version of sapply or lapply. eList comes with a function, auto_cluster, that quickly creates a cluster based on the user’s operating system and the auto-detected number of available cores. Users are encouraged to create a cluster explicitly through the functions in the parallel package unless only a quick parallel calculation is needed.

cluster <- auto_cluster(2)
Num(for (i in 1:100) sample(1:100, 1), clust=cluster)
close_cluster(cluster)

Note that the additional overhead involved with parallelization means that the gain in performance will be relatively small (and often negative) unless the sequence is sufficiently large. Large comprehensions, though, can experience significant gains by specifying a cluster.
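For readers who prefer to manage the cluster themselves, a minimal sketch using only the parallel package follows. The worker count of 2 is arbitrary, and it is an assumption here that a cluster created this way can be supplied to a comprehension through the clust argument in the same way as one from auto_cluster.

```r
library(parallel)

# Start two worker processes explicitly.
cl <- makeCluster(2)

# Parallel counterpart of sapply: one random draw per element of 1:100.
draws <- parSapply(cl, 1:100, function(i) sample(1:100, 1))

# Always release the workers when finished.
stopCluster(cl)

length(draws)
```

For a task this small, the time spent starting workers and moving data will usually outweigh any gain, which is the overhead described above.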

Summary Comprehensions

The eList package contains a number of summary functions that accept vector comprehensions and/or normal vectors. These functions allow users to combine multiple comprehensions and other vectors in a single function, then apply a summary function to all entries. Each summary comprehension accepts a cluster for parallel computations and the na.rm argument. Some examples:

# Are all values greater than 0?
All(for (i in 1:10) i > 0)
#> [1] TRUE
# Are no values TRUE? (combines comprehension with other values)
None(for (i in 1:10) i < 0, TRUE, FALSE)
#> [1] FALSE
# factorial(5)
Prod(for (i in 1:5) i)
#> [1] 120
# Summary statistics from a random draw of 1000 observations from normal distribution
Stats(for (i in rnorm(1000)) i)
#> $min
#> [1] -2.975354
#> 
#> $q1
#> [1] -0.6803642
#> 
#> $med
#> [1] 0.09313939
#> 
#> $q3
#> [1] 0.809412
#> 
#> $max
#> [1] 3.731964
#> 
#> $mean
#> [1] 0.06918359
#> 
#> $sd
#> [1] 1.04056
# Every other letter in the alphabet as a single string
Paste(for (i in seq(1,26,2)) letters[i], collapse=", ")
#> [1] "a, c, e, g, i, k, m, o, q, s, u, w, y"
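For reference, the deterministic examples above correspond roughly to the following base R expressions. This sketch is included only to show what each summary computes; it does not cover the cluster or na.rm arguments.

```r
# All(): every comparison must be TRUE.
all_pos <- all(vapply(1:10, function(i) i > 0, logical(1)))

# None(): no value may be TRUE, including the extra values appended.
none_true <- !any(c(vapply(1:10, function(i) i < 0, logical(1)), TRUE, FALSE))

# Prod(): product of the generated values.
fact5 <- prod(vapply(1:5, function(i) i, numeric(1)))

# Paste(): collapse the generated strings into one.
odds <- paste(vapply(seq(1, 26, 2), function(i) letters[i], character(1)),
              collapse = ", ")

c(all_pos, none_true)
fact5
odds
```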