Provides the hyphenation algorithm used for ‘TeX’/‘LaTeX’ and similar software.
hyphen() takes vectors of character strings (i.e., single words) and applies a hyphenation algorithm (Liang, 1983) to each word. This algorithm was originally developed for automatic word hyphenation in LaTeX, and is gracefully misused here to be of a slightly different service.[1]
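To make the idea behind the algorithm more tangible, here is a minimal base R sketch of Liang-style pattern matching. It is not the code sylly uses: the function names parse_pattern() and hyphenate(), the min_left/min_right arguments, and the small pattern subset are assumptions made for illustration only. Each pattern carries digits between its letters; all patterns matching a substring of the dot-wrapped word are overlaid, the highest digit per gap wins, and odd values allow a break.

```r
# Illustrative sketch only; NOT the implementation used by sylly.
# Split a pattern such as "hy3ph" into its letters ("hyph") and the digit
# for each gap (0 where the pattern gives no digit).
parse_pattern <- function(pattern) {
  chars <- strsplit(pattern, "")[[1]]
  letters_only <- character(0)
  scores <- 0L  # scores[i] belongs to the gap before letter i
  for (ch in chars) {
    if (grepl("[0-9]", ch)) {
      scores[length(scores)] <- as.integer(ch)
    } else {
      letters_only <- c(letters_only, ch)
      scores <- c(scores, 0L)
    }
  }
  list(key = paste(letters_only, collapse = ""), scores = scores)
}

# min_left and min_right are hypothetical names for the minimum number of
# letters that must remain before and after a break.
hyphenate <- function(word, patterns, min_left = 2, min_right = 2) {
  parsed <- lapply(patterns, parse_pattern)
  dotted <- paste0(".", tolower(word), ".")  # dots mark the word boundaries
  n <- nchar(dotted)
  points <- integer(n + 1)  # points[i]: score of the gap before character i
  for (p in parsed) {
    k <- nchar(p$key)
    if (k > n) next
    for (start in seq_len(n - k + 1)) {
      if (substr(dotted, start, start + k - 1) == p$key) {
        idx <- start:(start + k)
        points[idx] <- pmax(points[idx], p$scores)  # highest digit wins
      }
    }
  }
  m <- nchar(word)
  letters_vec <- strsplit(word, "")[[1]]
  out <- character(0)
  for (j in seq_len(m)) {
    out <- c(out, letters_vec[j])
    # the gap after letter j of the word sits before character j + 2 of dotted
    if (j < m && j >= min_left && (m - j) >= min_right &&
        points[j + 2] %% 2 == 1) {
      out <- c(out, "-")
    }
  }
  paste(out, collapse = "")
}

# A small pattern subset chosen for this example
patterns <- c("hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at",
              "1tio", "2io", "o2n")
hyphenate("hyphenation", patterns)  # "hy-phen-ation"
```

The number of syllables then follows from the number of inserted hyphens plus one. A complete pattern set contains thousands of such entries, which is why hyphen() needs a pattern object per language.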
hyphen() needs a set of hyphenation patterns for each language it should analyze. If you’re lucky, there’s already a pre-built package in the official l10n repository for your language of interest that you only need to install and load. These packages are called sylly.XX, where XX is a two-letter abbreviation for the particular language. For instance, sylly.de adds support for German, whereas sylly.en adds support for English:
    sampleText <- c("This", "is", "a", "rather", "stupid", "demonstration")
    library(sylly.en)
    hyph.txt.en <- hyphen(sampleText, hyph.pattern="en")
The method has a parameter called as, which defines the object class of the returned results. It defaults to the S4 class kRp.hyphen. In addition to the hyphenated tokens, it includes various statistics and metadata, like the language of the text. These objects were designed to integrate seamlessly with the methods and functions of the koRpus package.
When all you need is the actual data frame with hyphenated text, you could call hyphenText() on the kRp.hyphen object. But you could also set as="data.frame" accordingly in the first place. Alternatively, using the shortcut method hyphen_df() instead of hyphen() will also return a simple data frame.
If you’re only interested in the numeric results, you can set as="numeric" (or use hyphen_c()), which will strip down the results to just the numeric vector of syllable counts.
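The return formats described above can be compared side by side. The following is a sketch that assumes both sylly and sylly.en are installed; the requireNamespace() guard and the object names hyph.df, hyph.df2 and hyph.num are additions made here so the example degrades gracefully when the language package is missing.

```r
sampleText <- c("This", "is", "a", "rather", "stupid", "demonstration")

# Guard added for this sketch: only run if the English patterns are installed
if (requireNamespace("sylly.en", quietly = TRUE)) {
  library(sylly)
  library(sylly.en)

  # data frame, requested via the "as" argument
  hyph.df <- hyphen(sampleText, hyph.pattern = "en", as = "data.frame")
  # the same kind of result via the shortcut method
  hyph.df2 <- hyphen_df(sampleText, hyph.pattern = "en")
  # numeric vector of syllable counts only
  hyph.num <- hyphen_c(sampleText, hyph.pattern = "en")
}
```

Which variant is preferable depends on what happens next: the full kRp.hyphen object keeps the metadata needed by other methods, whereas the data frame and numeric forms are easier to feed into ordinary R code.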
    url.is.pattern <- url("http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-is.pat.txt")
    hyph.is <- read.hyph.pat(url.is.pattern, lang="is")
    close(url.is.pattern)
    hyph.txt.is <- hyphen(icelandicSampleText, hyph.pattern=hyph.is)
hyphen() might not produce perfect results. As a rule of thumb, when in doubt it seems to behave rather conservatively, that is, it might underestimate the real number of syllables in a text.
Depending on your use case, the more accurate the end results should be, the less you should rely on automatic hyphenation alone. But it sure is a good starting point, for there is a method called correct.hyph() to help you clean these results of errors later on. The most comfortable way to do this is to call hyphenText(hyph.txt.en), which will get you a data frame with two columns, word (the hyphenated words) and syll (the number of syllables), and open it in a spreadsheet editor:[4]
    ##    syll     word
    [...]
    ## 20    1    first
    ## 21    1    place
    ## 22    1  primary
    ## 23    2 de-fense
    ## 24    1      and
    [...]
You can then manually correct wrong hyphenations by removing or inserting "-" as hyphenation indicators, and call the method on the corrected object without further arguments, which will cause it to recount all syllables and update the statistics.
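What "recounting" amounts to can be sketched in base R. This is an illustration of the principle only, not the code of correct.hyph(): the data frame below is made up to resemble the excerpt shown above, with one entry already corrected by hand, and the syllable count is simply derived from the number of "-" marks.

```r
# Hypothetical excerpt after manual editing: "primary" was changed to
# "pri-ma-ry", but its syll value is still the outdated 1
hyph.df <- data.frame(
  syll = c(1, 1, 1, 2, 1),
  word = c("first", "place", "pri-ma-ry", "de-fense", "and"),
  stringsAsFactors = FALSE
)

# Each "-" separates two syllables, so the number of pieces is the count
hyph.df$syll <- lengths(strsplit(hyph.df$word, "-", fixed = TRUE))
hyph.df$syll  # 1 1 3 2 1
```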
The method can also be used to alter entries directly:
    hyph.txt.en <- correct.hyph(hyph.txt.en, word="primary", hyphen="pri-ma-ry")
    ## Changed
    ##
    ##    syll    word
    ## 22    1 primary
    ##
    ##   into
    ##
    ##    syll      word
    ## 22    3 pri-ma-ry
Once you have corrected the hyphenation of a token, sylly will also update its cache (see below) and use the corrected format from now on.
hyphen() caches the results of each token it has analyzed internally for the running R session, and also checks its cache for each token it is called on. This speeds up the process, because it only has to split the token and look up matching patterns once. If for some reason you don’t want this (e.g., if it uses too much memory), you can turn caching off by setting cache=FALSE when calling hyphen().
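The effect of such a per-session cache can be illustrated with a few lines of base R. This is a conceptual sketch, not sylly's internal cache: slow_count() is a made-up stand-in for the expensive pattern lookup (it merely counts vowel groups), and cached_count() with its use_cache argument is a hypothetical wrapper.

```r
# Conceptual sketch only; all names below are hypothetical.
cache <- new.env(hash = TRUE)  # lives for the running R session
calls <- 0L                    # counts how often the slow path is taken

# Stand-in for the expensive analysis: counts groups of vowels
slow_count <- function(token) {
  calls <<- calls + 1L
  lengths(regmatches(token, gregexpr("[aeiouy]+", token)))
}

cached_count <- function(token, use_cache = TRUE) {
  # reuse a stored result if this token has been seen before
  if (use_cache && exists(token, envir = cache, inherits = FALSE)) {
    return(get(token, envir = cache))
  }
  result <- slow_count(token)
  if (use_cache) assign(token, result, envir = cache)
  result
}

tokens <- c("demonstration", "is", "demonstration", "is", "rather")
results <- vapply(tokens, cached_count, integer(1))
calls  # 3: the two repeated tokens were served from the cache
```

Since natural text repeats a small set of tokens very often, most lookups end up being cache hits. The price is the memory needed to keep every distinct token around.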
If, on the other hand, you would like to preserve and re-use the cache, you can also configure sylly to write it to a file. To do so, you should set hyph.cache.file using set.sylly.env().
The file will be created dynamically the first time it is needed, should it not exist already. You can use the same cache file for multiple languages. Furthermore, since most settings made with set.sylly.env() are stored in your session’s options(), you can also define this file permanently by adding something like the following to your R startup file (e.g., .Rprofile):
    options(
      sylly=list(
        hyph.cache.file="~/sylly_cache.RData"
      )
    )
This will cause sylly to always use this cache file by default. One of the main benefits of this, next to boosting speed, is the fact that corrections you have made in the past will be preserved for future sessions. In other words, if you fix incorrect hyphenation results from time to time, the overall accuracy of your results will improve constantly.
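If other sylly settings are already stored in options(), replacing the whole list would discard them. A cautious variant, sketched here in base R under the assumption that the option is an ordinary named list, merges the new entry into whatever is already there. The temporary path and the names cache.file and sylly.opts are placeholders for this example.

```r
# Placeholder path for this example; use a permanent location in practice
cache.file <- file.path(tempdir(), "sylly_cache.RData")

# Fetch the current settings (an empty list if none were set so far),
# change only the cache file entry, and write the list back
sylly.opts <- getOption("sylly", default = list())
sylly.opts$hyph.cache.file <- cache.file
options(sylly = sylly.opts)

getOption("sylly")$hyph.cache.file
```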
The APA style used in this vignette was kindly provided by the CSL project, licensed under Creative Commons Attribution-ShareAlike 3.0 Unported license.
Liang, F. M. (1983). Word Hy-phen-a-tion by Com-put-er (PhD thesis). Stanford University, Dept. Computer Science, Stanford.
1. The hyphen() method was originally implemented as part of the koRpus package, but was later split off into its own package, which is sylly. koRpus adds further hyphen() methods so they can be used on tokenized and POS-tagged objects directly.
2. See the *.pat.txt files at http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/
3. You can also use the private method sylly:::sylly_langpack() to generate an R package skeleton for this language, but it requires you to look at the sylly source code, as the commented code is the only documentation. The results of this method are optimized to be packaged with roxyPackage (https://github.com/unDocUMeantIt/roxyPackage). In this combination, generating new language packages can be almost fully automated.