Distributional semantic models (DSMs) represent the meaning of a
target term (which can be a word form, lemma, morpheme, word pair, etc.)
in the form of a feature vector that records either co-occurrence
frequencies of the target term with a set of feature terms
(*term-term model*) or its distribution across textual units
(*term-context model*). Such DSMs have become an indispensable
ingredient in many NLP applications that require flexible broad-coverage
lexical semantics.

Distributional modelling is an empirical science. DSM representations are determined by a wide range of parameters such as size and type of the co-occurrence context, feature selection, weighting of co-occurrence frequencies (often with statistical association measures), distance metric, dimensionality reduction method and the number of latent dimensions used. Despite recent efforts to carry out systematic evaluation studies, the precise effects of these parameters and their relevance for different application settings are still poorly understood.

The **wordspace** package aims to provide a flexible,
powerful and easy to use “interactive laboratory” that enables its users
to build DSMs and experiment with them, but that also scales up to the
large models required by real-life applications.

Further background information and references can be found in:

Evert, Stefan (2014). Distributional semantics in R with the wordspace package. In

Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 110–114, Dublin, Ireland.

Before continuing with this tutorial, load the package with

`library(wordspace)`

The most general representation of a distributional model takes the
form of a sparse matrix, with entries specified as a triplet of row
label (*target term*), column label (*feature term*) and
co-occurrence frequency. A sample of such a table is included in the
package under the name `DSM_VerbNounTriples_BNC`

, listing
syntactic verb-noun co-occurrences in the British National Corpus:

noun | rel | verb | f | mode | |
---|---|---|---|---|---|

461973 | horse | subj | pick | 5 | written |

546678 | leg | subj | say | 9 | written |

608545 | mick | subj | say | 8 | spoken |

611816 | mill | subj | manage | 6 | written |

714162 | person | subj | begin | 11 | written |

726716 | place | subj | determine | 6 | written |

970619 | tax | subj | work | 7 | written |

992648 | time | subj | book | 6 | written |

1387925 | discussion | obj | facilitate | 14 | written |

1656006 | lee | obj | give | 7 | written |

The `wordspace`

package creates DSM objects from such
triplet representations, which can easily be imported into R from a wide
range of file and database formats. Ready-made import functions are
provided for TAB-delimited text files (as used by DISSECT), which
may be compressed to save disk space, and for term-document models
created by the text-mining package `tm`

.

The native input format is a pre-compiled sparse matrix representation generated by the UCS toolkit. In this way, UCS serves as a hub for the preparation of co-occurrence data, which can be collected from dependency pairs, extracted from a corpus indexed with the IMS Corpus Workbench or imported from various other formats.

The first step in the creation of a distributional semantic model is the compilation of a co-occurrence matrix. Let us illustrate the procedure for verb-noun co-occurrences from the written part of the British National Corpus. First, we extract relevant rows from the table above.

`<- subset(DSM_VerbNounTriples_BNC, mode == "written") Triples `

Note that many verb-noun pairs such as *(walk, dog)* still
have multiple entries in `Triples`

: *dog* can appear
either as the subject or as the object of *walk*.

`subset(Triples, noun == "dog" & verb == "walk")`

```
## noun rel verb f mode
## 295011 dog subj walk 20 written
## 1398669 dog obj walk 87 written
```

There are two ways of dealing with such cases: we can either add up
the frequency counts (a *dependency-filtered model*) or treat
“dog-as-subject” and “dog-as-object” as two different terms (a
*dependency-structured model*). We opt for a dependency-filtered
model in this example – can you work out how to compile the
corresponding dependency-structured DSM in R, either for verbs of for
nouns as target terms?

The `dsm`

constructor function expects three vectors of
the same length, containing row label (target term), column label
(feature term) and co-occurrence count (or pre-weighted score) for each
nonzero cell of the co-occurrence matrix. In our example, we use nouns
as targets and verbs as features. Note the option
`raw.freq=TRUE`

to indicate that the matrix contains raw
frequency counts.

```
<- dsm(target=Triples$noun, feature=Triples$verb, score=Triples$f, raw.freq=TRUE)
VObj dim(VObj)
```

`## [1] 10940 3149`

The constructor automatically computes marginal frequencies for the
target and feature terms by summing over rows and columns of the matrix
respectively. The information is collected in data frames
`VObj$rows`

and `VObj$cols`

, together with the
number of nonzero elements in each row and column:

`subset(VObj$rows, rank(-f) <= 6) # 6 most frequent nouns`

```
## term nnzero f
## 5395 man 907 51976
## 6491 people 911 72832
## 6951 problem 402 29002
## 8928 thing 442 34104
## 8975 time 658 46500
## 9613 way 738 46088
```

This way of computing marginal frequencies is appropriate for
syntactic co-occurrence and term-document models. In the case of surface
co-occurrence based on token spans, the correct marginal frequencies
have to be provided separately in the `rowinfo=`

and
`colinfo=`

arguments (see `?dsm`

for details).

The actual co-occurrence matrix is stored in `VObj$M`

.
Since it is too large to display on screen, we extract the top left
corner with the `head`

method for DSM objects. Note that you
can also use `head(VObj, Inf)`

to extract the full
matrix.

`head(VObj)`

```
## 6 x 6 sparse Matrix of class "dgCMatrix"
## be have say give take achieve
## aa 7 5 12 . . .
## abandonment 14 . . . . .
## abbey 45 13 6 . . .
## abbot 23 7 10 5 5 .
## abbreviation 9 . . . . .
## abc 6 . . . . .
```

Rows and columns with few nonzero cells provide unreliable semantic
information and can lead to numerical problems (e.g. because a sparse
association score deletes the remaining nonzero entries). It is
therefore common to apply frequency thresholds both on rows and columns,
here in the form of requiring at least 3 nonzero cells. The option
`recursive=TRUE`

guarantees that both criteria are satisfied
by the final DSM when rows and columns are filtered at the same time
(see the examples in `?subset.dsm`

for an illustration).

```
<- subset(VObj, nnzero >= 3, nnzero >= 3, recursive=TRUE)
VObj dim(VObj)
```

`## [1] 6087 2139`

If you want to filter *only* columns or rows, you can pass the
constraint as a named argument: `subset=(nnzero >= 3)`

for
rows and `select=(nnzero >= 3)`

for columns.

The next step is to weight co-occurrence frequency counts. Here, we
use the *simple log-likelihood* association measure with an
additional logarithmic transformation, which has shown good results in
evaluation studies. The `wordspace`

package computes
*sparse* (or “positive”) versions of all association measures by
default, setting negative associations to zero. This guarantees that the
sparseness of the co-occurrence matrix is preserved. We also normalize
the weighted row vectors to unit Euclidean length
(`normalize=TRUE`

).

`<- dsm.score(VObj, score="simple-ll", transform="log", normalize=TRUE, method="euclidean") VObj `

Printing a DSM object shows information about the dimensions of the
co-occurrence matrix and whether it has already been scored. Note that
the scored matrix does not replace the original co-occurrence counts, so
`dsm.score`

can be executed again at any time with different
parameters.

` VObj`

```
## Distributional Semantic Model with 6087 rows x 2139 columns
## * raw co-occurrence matrix M available
## - sparse matrix with 191.4k / 13.0M nonzero entries (fill rate = 1.47%)
## - in canonical format
## - known to be non-negative
## - sample size of underlying corpus: 5010.1k tokens
## * scored matrix S available
## - sparse matrix with 153.7k / 13.0M nonzero entries (fill rate = 1.18%)
## - in canonical format
## - known to be non-negative
```

Most distributional models apply a dimensionality reduction technique
to make data sets more manageable and to refine the semantic
representations. A widely-used technique is singular value decomposition
(SVD). Since `VObj`

is a sparse matrix,
`dsm.projection`

automatically applies an efficient algorithm
from the `sparsesvd`

package.

```
<- dsm.projection(VObj, method="svd", n=300)
VObj300 dim(VObj300)
```

`## [1] 6087 300`

`VObj300`

is a dense matrix with 300 columns, giving the
coordinates of the target terms in 300 latent dimensions. Its attribute
`"R2"`

shows what proportion of information from the original
matrix is captured by each latent dimension.

`plot(attr(VObj300, "R2"), type="h", xlab="latent dimension", ylab="R2")`

The primary goal of a DSM is to determine “semantic” distances
between pairs of words. The arguments to `pair.distances`

can
also be parallel vectors in order to compute distances for a large
number of word pairs efficiently.

`pair.distances("book", "paper", VObj300, method="cosine")`

```
## book/paper
## 45.07627
## attr(,"similarity")
## [1] FALSE
```

By default, the function converts similarity measures into an
equivalent distance metric – the angle between vectors in the case of
cosine similarity. If you want the actual similarity values, specify
`convert=FALSE`

:

`pair.distances("book", "paper", VObj300, method="cosine", convert=FALSE)`

```
## book/paper
## 0.7061649
## attr(,"similarity")
## [1] TRUE
```

We are often interested in finding the nearest neighbours of a given term in the DSM space:

`nearest.neighbours(VObj300, "book", n=14) # reduced space`

```
## paper article poem works magazine novel text guide
## 45.07627 51.92011 53.48027 53.91556 53.94824 54.40451 55.13910 55.27027
## newspaper document item essay leaflet letter
## 55.51492 55.62521 56.28246 56.29539 56.49145 58.04178
```

The return value is actually a vector of distances to the nearest neighbours, labelled with the corresponding terms. Here is how you obtain the actual neighbour terms:

```
<- nearest.neighbours(VObj, "book", n=15) # unreduced space
nn names(nn)
```

```
## [1] "paper" "guide" "works" "novel" "magazine" "article"
## [7] "document" "poem" "diary" "essay" "item" "text"
## [13] "booklet" "leaflet" "newspaper"
```

The neighbourhood plot visualizes nearest neighbours as a semantic
network based on their mutual distances. This often helps interpretation
by grouping related neighbours. The network below shows that
*book* as a text type is similar to *novel*,
*essay*, *poem* and *article*; as a form of
document it is similar to *paper*, *letter* and
*document*; and as a publication it is similar to
*leaflet*, *magazine* and *newspaper*.

```
<- nearest.neighbours(VObj300, "book", n=15, dist.matrix=TRUE)
nn.mat plot(nn.mat)
```

A straightforward way to evaluat distributional representations is to
compare them with human judgements of the semantic similarity between
word pairs. The `wordspace`

package includes to well-known
data sets of this type: Rubenstein-Goodenough (`RG65`

) and
`WordSim353`

(a superset of `RG65`

with judgements
from new test subjects).

word1 | word2 | score | |
---|---|---|---|

5 | autograph_N | shore_N | 0.06 |

15 | monk_N | slave_N | 0.57 |

25 | forest_N | graveyard_N | 1.00 |

35 | cemetery_N | mound_N | 1.69 |

45 | brother_N | monk_N | 2.74 |

55 | autograph_N | signature_N | 3.59 |

65 | gem_N | jewel_N | 3.94 |

There is also a ready-made evaluation function, which computes
Pearson and rank correlation between the DSM distances and human
subjects. The option `format="HW"`

adjusts the
POS-disambiguated notation for terms in the data set
(e.g. `book_N`

) to the format used by our distributional
model (`book`

).

`eval.similarity.correlation(RG65, VObj300, convert=FALSE, format="HW")`

```
## rho p.value missing r r.lower r.upper
## RG65 0.3076154 0.01267694 20 0.3735435 0.1426399 0.5658865
```

Evaluation results can also be visualized in the form of a scatterplot with a trend line.

`plot(eval.similarity.correlation(RG65, VObj300, convert=FALSE, format="HW", details=TRUE))`

The rank correlation of 0.308 is very poor, mostly due to the small
amount of data on which our DSM is based. Much better results are
obtained with pre-compiled DSM vectors from large Web corpus, which are
also included in the package. Note that target terms are given in a
different format there (which corresponds to the format in
`RG65`

).

`plot(eval.similarity.correlation(RG65, DSM_Vectors, convert=FALSE, details=TRUE))`

Schütze (1998) used DSM representations for word sense disambiguation
(or, more precisely, word sense induction) based on a clustering of the
sentence contexts of an ambiguous word. The `wordspace`

package includes a small data set with such contexts for a selection of
English words. Let us look at the noun *vessel* as an example,
which has two main senses (“ship” and “blood vessel”):

```
<- subset(SemCorWSD, target == "vessel" & pos == "n")
Vessel table(Vessel$gloss)
```

```
##
## a craft designed for water transportation
## 6
## a tube in which a body fluid circulates
## 6
```

Sentence contexts are given as tokenized strings
(`$sentence`

), in lemmatized form (`$hw`

) and as
lemmas annotated with part-of-speech codes (`$lemma`

). Choose
the version that matches the representation of target terms in your
DSM.

sense | sentence |
---|---|

vessel.n.02 | The spraying operation was conducted from the rear deck of a small Naval vessel , cruising two miles off-shore and vertical to an on-shore breeze . |

vessel.n.01 | Such a dual derivation was strikingly demonstrated during the injection process where initial filling would be noted to occur in several isolated pleural vessels at once . |

vessel.n.01 | This vessel could be followed to the parenchyma where it directly provided bronchial arterial blood to the alveolar capillary bed ( figs. 17 , 18 ) . |

vessel.n.01 | However , this artery is known to be a nutrient vessel with a distribution primarily to the proximal airways and supportive tissues of the lung . |

vessel.n.01 | It is distinctly possible , therefore , that simultaneous pressures in all three vessels would have rendered the shunts inoperable and hence , uninjectable . |

vessel.n.01 | A careful search failed to show occlusion of any of the mesenteric vessels . |

vessel.n.01 | Some of the small vessels were filled with fibrin thrombi , and there was extensive interstitial hemorrhage . |

vessel.n.02 | He appeared to be peering haughtily down his nose at the crowded and unclean vessel that would carry him to freedom . |

vessel.n.02 | However , we sent a third vessel out , a much smaller and faster one than the first two . |

vessel.n.02 | Upon reaching the desired speed , the automatic equipment would cut off the drive , and the silent but not empty vessel would hurl towards the star which was its journey ’s end . |

vessel.n.02 | Then , after slowing the vessel considerably , the drive would adjust to a one gee deceleration . |

vessel.n.02 | To round out the blockading force , submarines would be needed - to locate , identify and track approaching vessels . |

Following Schütze, each context is represented by a centroid vector obtained by averaging over the DSM vectors of all context words.

`<- context.vectors(DSM_Vectors, Vessel$lemma, row.names=Vessel$id) centroids `

This returns a matrix of centroid vectors for the 12 sentence
contexts of *vessel* in the data set. The vectors can now be
clustered and analyzed using standard R functions. Partitioning around
medoids (PAM) has shown good and robust performance in evaluation
studies.

```
library(cluster) # clustering algorithms of Kaufman & Rousseeuw (1990)
<- pam(dist.matrix(centroids), 2, diss=TRUE, keep.diss=TRUE)
res plot(res, col.p=factor(Vessel$sense), shade=TRUE, which=1, main="WSD for 'vessel'")
```

Colours in the plot above indicate the gold standard sense of each
instance of *vessel*. A confusion matrix confirms perfect
clustering of the two senses:

`## table(res$clustering, Vessel$sense)`

vessel.n.01 | vessel.n.02 |
---|---|

0 | 6 |

6 | 0 |

We can also use a pre-defined function for the evaluation of
clustering tasks, which is convenient but does not produce a
visualization of the clusters. Note that the “target terms” of the task
must correspond to the row labels of the centroid matrix, which we have
set to sentence IDs (`Vessel$id`

) above.

`eval.clustering(Vessel, M=centroids, word.name="id", class.name="sense")`

```
## purity entropy missing
## Vessel 100 0 0
```

As a final example, let us look at a simple approach to compositional distributional semantics, which computes the compositional meaning of two words as the element-wise sum or product of their DSM vectors.

```
<- VObj300["mouse", ] # extract row vectors from matrix
mouse <- VObj300["computer", ] computer
```

The nearest neighbours of mouse are problematic, presumably because the type vector represents a mixture of the two senses that is not close to either meaning in the semantic space.

`nearest.neighbours(VObj300, "mouse", n=12)`

```
## prop isotope carbon serum transformer sponge
## 53.72837 55.08473 55.29022 56.84607 57.20004 57.25144
## thermometer hoop loudspeaker razor mount implant
## 57.33371 57.38682 57.38682 57.43123 57.47889 57.49171
```

By adding the vectors of *mouse* and *computer*, we
obtain neighbours that seem to fit the “computer mouse” sense very
well:

`nearest.neighbours(VObj300, mouse + computer, n=12)`

```
## mouse computer program processor software tool machine
## 36.69755 41.47028 54.15183 54.37412 54.55897 55.07898 55.74764
## transistor mix keyboard prop sponge
## 58.51586 58.93807 59.15682 59.18378 59.34185
```

Note that the target is specified as a distributional vector rather than a term in this case. Observations from the recent literature suggest that element-wise multiplication is not compatible with non-sparse SVD-reduced DSMs, so it is not surprising to find completely unrelated nearest neighbours in our example:

`nearest.neighbours(VObj300, mouse * computer, n=12)`

```
## kylie barry coward leo wednesday emma baron leech
## 52.70435 52.92230 53.61260 53.81451 54.01588 54.33543 54.53792 55.07008
## friday frank colonel bruce
## 55.28301 55.39203 55.44021 55.47596
```