What is rapidraker?

rapidraker implements the same keyword extraction algorithm (RAKE) as slowraker, but it’s written in Java instead of R, which makes it considerably faster than slowraker (see the performance comparison below).

Installation

You can get the stable version from CRAN:

install.packages("rapidraker")

The development version of the package requires you to compile the latest Java source code in rapidrake-java, so installing it is not as simple as making a call to devtools::install_github().

Basic usage

library(slowraker)
library(rapidraker)

data("dog_pubs")
rakelist <- rapidrake(txt = dog_pubs$abstract[1:5])
head(rbind_rakelist(rakelist))
#>   doc_id                            keyword freq score                   stem
#> 1      1 assistance dog identification tags    1  10.8 assist dog identif tag
#> 2      1          animal control facilities    1   9.0     anim control facil
#> 3      1          emotional support animals    1   9.0      emot support anim
#> 4      1                   small body sizes    1   9.0        small bodi size
#> 5      1       seemingly inappropriate dogs    1   7.9    seem inappropri dog
#> 6      1            assistance dogs sharply    1   7.3     assist dog sharpli
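
The result of rbind_rakelist() is an ordinary data frame, so it can be filtered with base R. A minimal sketch, assuming the column names shown in the output above; the rows are copied by hand from that output so the example runs without rapidraker installed, and the score cutoff of 9 is an arbitrary illustration:

```r
# Three rows copied from the sample output above (not a fresh rapidrake() run).
keywords <- data.frame(
  doc_id = c(1, 1, 1),
  keyword = c(
    "assistance dog identification tags",
    "animal control facilities",
    "assistance dogs sharply"
  ),
  freq = c(1, 1, 1),
  score = c(10.8, 9.0, 7.3),
  stem = c("assist dog identif tag", "anim control facil", "assist dog sharpli"),
  stringsAsFactors = FALSE
)

# Keep only keywords whose score meets an (arbitrary) cutoff of 9.
top <- keywords[keywords$score >= 9, c("doc_id", "keyword", "score")]
print(top)
```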

Performance comparison

txt <- rep(dog_pubs$abstract, 20)

sr_time <- system.time(slowrake(txt))[["elapsed"]]
rr_time <- system.time(rapidrake(txt))[["elapsed"]]

In this example, rapidrake() took 4.09 seconds to execute while slowrake() took 101.04 seconds, making the Java version about 25 times faster.
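
The speedup is the ratio of the two elapsed times. A small sketch using the timings reported above as hard-coded values (your own sr_time and rr_time will differ by machine and run):

```r
# Elapsed times reported in this README, in seconds.
sr_time <- 101.04  # slowrake()
rr_time <- 4.09    # rapidrake()

# How many times faster rapidrake() was in this run.
speedup <- sr_time / rr_time
round(speedup, 1)
#> [1] 24.7
```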

Making rapidrake() even faster

We can parallelize extraction across documents like so:

# The following code was run on aarch64-apple-darwin20, 12 cores
library(parallel)
library(doParallel)
library(foreach)

cores <- detectCores()
# Make txt vector larger so we can more easily see the speed improvement of parallelization
txt2 <- rep(txt, cores * 3) 
by <- floor(length(txt2) / cores)

cl <- makeCluster(cores)
registerDoParallel(cl)

rr_par_time <- system.time(
  foreach(i = 1:cores) %dopar% {
    start <- (i - 1) * by + 1
    finish <- start + by - 1
    rapidraker::rapidrake(txt2[start:finish])
  }
)[["elapsed"]]

stopCluster(cl)
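
The floor()-based split above covers every document here because length(txt2) is a multiple of cores. In general it would leave the trailing length(txt2) %% cores documents unprocessed. A minimal sketch of one alternative, assuming parallel::splitIndices() is used to build the chunks; docs and n_workers are hypothetical stand-ins for txt2 and cores:

```r
library(parallel)

# Hypothetical stand-in for a character vector of documents.
docs <- paste("document", 1:10)
# Assumed worker count, chosen so it does not divide 10 evenly.
n_workers <- 3

# splitIndices() returns a list of index vectors that together cover
# every position in 1:length(docs), even when the split is uneven.
chunks <- splitIndices(length(docs), n_workers)
n_covered_split <- sum(lengths(chunks))

# For comparison: equal-sized chunks of floor(n / workers) documents
# cover fewer documents when the division leaves a remainder.
by <- floor(length(docs) / n_workers)
n_covered_floor <- by * n_workers

c(splitIndices = n_covered_split, floor = n_covered_floor)
```

Inside the foreach loop, each worker could then call rapidraker::rapidrake(txt2[chunks[[i]]]) instead of computing start and finish by hand.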

The sequential version of rapidrake() took 4.09 seconds to extract keywords for 600 documents, while the parallel version took 28.62 seconds for 21600 documents. Scaled to the same 21600 documents (36 times as many), the sequential version would need roughly 147 seconds, which suggests that the parallel version was about 5 times faster.
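
Because the two runs processed different numbers of documents, it can be easier to compare them as throughput (documents per second). A sketch using the figures reported above as hard-coded values; the per-core efficiency assumes all 12 cores noted in the code comment were in use, which is an assumption rather than something measured here:

```r
# Figures reported in this README.
seq_docs <- 600
seq_time <- 4.09    # seconds, sequential rapidrake()
par_docs <- 21600
par_time <- 28.62   # seconds, parallel run
cores <- 12         # assumed from the code comment above

# Documents processed per second in each run.
seq_rate <- seq_docs / seq_time
par_rate <- par_docs / par_time

# Speedup of the parallel run, and how much of the ideal 12x was realized.
speedup <- par_rate / seq_rate
efficiency <- speedup / cores

round(c(speedup = speedup, efficiency = efficiency), 2)
```

Under these assumptions the speedup is well below the core count, which is common once worker startup and data transfer costs are included in the timing.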