_____     .__  .__   __                   __
_/ ____\_ __|  | |  |_/  |_  ____ ___  ____/  |_
\   __\  |  \  | |  |\   __\/ __ \\  \/  /\   __\
 |  | |  |  /  |_|  |_|  | \  ___/ >    <  |  |
 |__| |____/|____/____/__|  \___  >__/\_ \ |__|
                                \/      \/

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Get full text articles from lots of places

Check out the fulltext manual to get started.


rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.

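fulltext makes it easy to do text-mining by supporting steps including: searching for articles with ft_search(), getting links to full text (xml and pdf) with ft_links(), fetching full or partial text with ft_get(), and extracting text from PDFs with ft_extract().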

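Previously supported use cases have been extracted out to other packages. For example, pulling specific sections out of articles is handled by pubchunks, as used in the "Extract chunks" examples below.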

It’s easy to go from the outputs of ft_get to text-mining packages such as tm and quanteda.

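Data sources in fulltext include PLoS, BMC, Crossref, Entrez, arXiv, bioRxiv, Europe PMC, Scopus, and Microsoft, which are the sources reported in the ft_search() output below.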

Authentication: A number of publishers require authentication via API key, and some use even more draconian authentication processes that involve checking IP addresses. We are working on supporting the various authentication schemes for different publishers, but all open access (OA) content is already easily available. See the Authentication section in ?fulltext-package after loading the package.
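
One common way to handle API keys is to keep them in environment variables rather than in scripts. The following is a minimal base R sketch of that pattern, not fulltext's own mechanism: `MY_PUBLISHER_KEY` and `get_api_key()` are hypothetical names used only for illustration, and what fulltext actually expects for each publisher is described in the Authentication section of ?fulltext-package.

```r
# Hypothetical helper (not part of fulltext): read an API key from an
# environment variable and fail early with a clear message if it is missing.
get_api_key <- function(var) {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found in environment variable '", var, "'", call. = FALSE)
  }
  key
}

# Set a placeholder value for the current session only.
# MY_PUBLISHER_KEY is an assumed name, and the value is not a real key.
Sys.setenv(MY_PUBLISHER_KEY = "not-a-real-key")

key <- get_api_key("MY_PUBLISHER_KEY")
```

For keys that should persist across sessions, the same variable can be defined in a user-level .Renviron file instead of calling Sys.setenv() in a script.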

We’d love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd

Installation

Stable version from CRAN

install.packages("fulltext")

Development version from GitHub

devtools::install_github("ropensci/fulltext")

Load library

library('fulltext')

ft_search() - get metadata on a search query.

ft_search(query = 'ecology', from = 'crossref')
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 193485; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]

ft_links() - get links for articles (xml and pdf).

res1 <- ft_search(query = 'biology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 3 
#> [IDs] ID_31091453 ID_28140597 ID_24657234 ...

Or pass in DOIs directly

ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 3 
#> [IDs] ID_31091453 ID_28140597 ID_24657234 ...
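
When DOIs come from somewhere other than ft_search() (a spreadsheet, a reference manager), a quick sanity check before passing them on can catch malformed entries. This is a rough base R sketch: `is_doi()` is a hypothetical helper, and the regular expression is a loose assumption about what a DOI looks like, not an official validator.

```r
# Hypothetical helper: loose check that a string looks like a DOI
# ("10.", a 4-9 digit registrant code, a slash, then a non-space suffix).
is_doi <- function(x) grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x)

candidates <- c(
  "10.7717/peerj.228",
  "10.7554/eLife.03032",
  "not-a-doi",
  " 10.7554/eLife.32763 "   # stray whitespace, as often seen in pasted data
)

cleaned <- trimws(candidates)      # drop leading/trailing whitespace
dois <- cleaned[is_doi(cleaned)]   # keep only entries that look like DOIs
dois
```

The resulting `dois` vector holds only the well-formed entries and could then be passed to ft_links() or ft_get() as in the examples here.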

Get full text

ft_get() - get full or partial text of articles.

ft_get('10.7717/peerj.228')
#> <fulltext text>
#> [Docs] 1 
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext 
#> [IDs] 10.7717/peerj.228 ...

Extract chunks

library(pubchunks)
x <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
x %>% ft_collect() %>% pub_chunks("publisher") %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife
#> 
#> $elife$`10.7554/eLife.32763`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife

Get multiple fields at once

x %>% ft_collect() %>% pub_chunks(c("doi","publisher")) %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd      elife
#> 
#> $elife$`10.7554/eLife.32763`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd      elife

Pull out the data.frames

x %>%
  ft_collect() %>% 
  pub_chunks(c("doi", "publisher", "author")) %>%
  pub_tabularize() %>%
  .$elife
#> $`10.7554/eLife.03032`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd                  Ya
#>   authors.surname authors.given_names.1 authors.surname.1
#> 1            Zhao                 Jimin               Lin
#>   authors.given_names.2 authors.surname.2 authors.given_names.3
#> 1               Beiying                Xu                  Sida
#>   authors.surname.3 authors.given_names.4 authors.surname.4
#> 1                Hu                   Xue             Zhang
#>   authors.given_names.5 authors.surname.5 .publisher
#> 1                Ligang                Wu      elife
#> 
#> $`10.7554/eLife.32763`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd             Natasha
#>   authors.surname authors.given_names.1 authors.surname.1
#> 1          Mhatre                Robert            Malkin
#>   authors.given_names.2 authors.surname.2 authors.given_names.3
#> 1                Rittik               Deb                Rohini
#>   authors.surname.3 authors.given_names.4 authors.surname.4 .publisher
#> 1      Balakrishnan                Daniel            Robert      elife
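
The author columns above are spread across one wide row (authors.given_names, authors.given_names.1, and so on). For analysis it is often easier to have one row per author. The following base R sketch assumes the column naming shown in the printed output; `authors_long()` is a hypothetical helper, and the small data.frame is typed in by hand from the first three authors above rather than fetched.

```r
# One-row data.frame mimicking the printed output (first three authors only)
df <- data.frame(
  doi = "10.7554/eLife.03032",
  authors.given_names = "Ya",
  authors.surname = "Zhao",
  authors.given_names.1 = "Jimin",
  authors.surname.1 = "Lin",
  authors.given_names.2 = "Beiying",
  authors.surname.2 = "Xu",
  stringsAsFactors = FALSE
)

# Hypothetical helper: reshape the repeated author columns into one row per author.
# Assumes a single-row input whose given-name and surname columns appear in matching order.
authors_long <- function(df) {
  given_cols <- grep("^authors\\.given_names", names(df), value = TRUE)
  surname_cols <- grep("^authors\\.surname", names(df), value = TRUE)
  data.frame(
    doi = df$doi[1],
    given_names = unlist(df[1, given_cols], use.names = FALSE),
    surname = unlist(df[1, surname_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

authors <- authors_long(df)
authors
```

With a list of data.frames like the one returned by `.$elife`, the helper could be applied to each element with lapply().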

Extract text from PDFs

Some results you find with ft_search() have full text available in text, xml, or other machine readable formats, while others are open access but only available in pdf format. This package includes a series of convenience functions to help extract text from pdfs, both locally and remotely.

Locally, using code adapted from the tm package and two PDF-to-text parsing backends:

pdf <- system.file("examples", "example2.pdf", package = "fulltext")
ft_extract(pdf)
#> <document>/Library/Frameworks/R.framework/Versions/3.6/Resources/library/fulltext/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18
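
When a folder holds a mix of formats, it can help to separate the PDFs (which need text extraction) from files that are already machine readable. A minimal base R sketch follows; the file names are hypothetical stand-ins for a download folder, and nothing here calls fulltext itself.

```r
# Hypothetical file names standing in for a mixed download folder
files <- c("elife.03032.xml", "elife.32763.PDF", "pone.0107412.pdf", "notes.txt")

# tools::file_ext() returns the extension without the dot;
# lower-case it so ".PDF" and ".pdf" are treated the same.
is_pdf <- tolower(tools::file_ext(files)) == "pdf"

pdf_files <- files[is_pdf]      # candidates for text extraction
other_files <- files[!is_pdf]   # already machine readable, or not articles at all
```

Each path in `pdf_files` could then be passed to ft_extract(), as in the example above.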

Interoperability with other packages downstream

cache_options_set(path = (td <- 'foobar'))
#> $cache
#> [1] TRUE
#> 
#> $backend
#> [1] "ext"
#> 
#> $path
#> [1] "/Users/sckott/Library/Caches/R/foobar"
#> 
#> $overwrite
#> [1] FALSE
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), type = "pdf")
library(readtext)
x <- readtext::readtext(file.path(cache_options_get()$path, "*.pdf"))
library(quanteda)
quanteda::corpus(x)
#> Corpus consisting of 2 documents and 0 docvars.
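
Once a corpus exists, a typical next step is tokenizing it and building a document-feature matrix. The sketch below uses two short made-up strings in place of real article text so that it runs without downloading anything. It assumes quanteda is installed and is skipped otherwise.

```r
if (requireNamespace("quanteda", quietly = TRUE)) {
  # Two made-up texts standing in for extracted article text
  txt <- c(
    doc1 = "Full text mining starts with getting the text.",
    doc2 = "Text mining tools need a corpus of text."
  )

  corp <- quanteda::corpus(txt)                        # named vector -> corpus
  toks <- quanteda::tokens(corp, remove_punct = TRUE)  # split into word tokens
  mat <- quanteda::dfm(toks)                           # document-feature matrix (lower-cased by default)

  # Most frequent features across both documents
  print(quanteda::topfeatures(mat, n = 3))
}
```

The same tokens() and dfm() calls should work on the corpus built from the readtext output above, since both are quanteda corpus objects.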

Contributors

Meta
