What is rapidraker?

rapidraker is an R package that implements the same keyword extraction algorithm (RAKE) as slowraker. The difference is that rapidraker::rapidrake() is written mostly in Java, while slowraker::slowrake() is written mostly in R, so you can expect rapidrake() to be considerably faster than slowrake().

Installation

You can get the stable version from CRAN:

install.packages("rapidraker")

The development version of the package requires you to compile the latest Java source code in rapidrake-java, so installing it is not as simple as making a call to devtools::install_github().

Basic usage

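# slowraker supplies the dog_pubs data set and rbind_rakelist()
library(slowraker)
library(rapidraker)

data("dog_pubs")
# Extract keywords from the first five abstracts
rakelist <- rapidrake(txt = dog_pubs$abstract[1:5])
# Combine the per-document results into a single data frame
head(rbind_rakelist(rakelist))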

Performance of slowraker vs rapidraker

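# Repeat the abstracts 20 times to get a 600-document benchmark
txt <- rep(dog_pubs$abstract, 20)

# Wall-clock time, in seconds, for each implementation
sr_time <- system.time(slowrake(txt))[["elapsed"]]
rr_time <- system.time(rapidrake(txt))[["elapsed"]]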

In this example, slowrake() took 88.86 seconds to execute while rapidrake() took 14.58 seconds. In other words, rapidrake() was about 6 times faster than slowrake().
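The ratio quoted above can be reproduced directly from the two timings. This is only a worked check using the figures reported here; the values are hard-coded for illustration, and your own sr_time and rr_time will differ by machine.

```r
# Timings reported above, in seconds (hard-coded for illustration)
sr_time <- 88.86
rr_time <- 14.58

# How many times faster rapidrake() was than slowrake()
speedup <- sr_time / rr_time
round(speedup, 1)  # 6.1
```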

Making rapidrake() even faster

Executing RAKE is an example of an embarrassingly parallel problem. As such, it doesn’t take much to parallelize rapidrake():

# The following code was run on an 8-core Windows 7 machine
library(parallel)
library(doParallel)
library(foreach)

cores <- detectCores()
# Make the txt vector larger so the speed improvement from
# parallelization is easier to see
txt2 <- rep(txt, cores)
# Number of documents handled by each worker
by <- floor(length(txt2) / cores)

cl <- makeCluster(cores)
registerDoParallel(cl)

# Each worker runs rapidrake() on its own contiguous slice of txt2
rr_par_time <- system.time(
  foreach(i = seq_len(cores)) %dopar% {
    start <- (i - 1) * by + 1
    finish <- start + by - 1
    rapidraker::rapidrake(txt2[start:finish])
  }
)[["elapsed"]]

stopCluster(cl)
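The loop above computes the slice boundaries by hand, which works because length(txt2) is an exact multiple of cores. When that is not the case, a minimal sketch like the following may help. It uses splitIndices() from the parallel package to build the chunks; docs and n_chunks are hypothetical stand-ins for txt2 and cores, and no keyword extraction is run here.

```r
library(parallel)

# Hypothetical stand-in for txt2: 10 documents, not a multiple of 3
docs <- paste("document number", 1:10)
n_chunks <- 3

# splitIndices() returns a list of n_chunks index vectors that
# together cover 1..length(docs)
idx <- splitIndices(length(docs), n_chunks)

# One character vector of documents per worker
chunks <- lapply(idx, function(i) docs[i])

# Every document lands in exactly one chunk, in the original order
stopifnot(identical(unlist(chunks), docs))
lengths(chunks)
```

Each element of chunks could then be passed to rapidraker::rapidrake() inside foreach() or parLapply(), in place of the manual start and finish arithmetic.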

The sequential version of rapidrake() took 14.58 seconds to extract keywords for 600 documents, while the parallel version took 31.15 seconds to extract keywords for 4800 documents. Because the parallel run processed 8 times as many documents, the fair comparison is per document: on that basis, the parallel version was about 3.7 times faster than the sequential version.
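Because the two runs processed different numbers of documents, the comparison is easiest to check as throughput (documents per second). The sketch below reproduces that arithmetic from the figures reported above; the variable names are illustrative, and the numbers are hard-coded rather than measured.

```r
# Timings (seconds) and document counts reported above
seq_secs <- 14.58
seq_docs <- 600
par_secs <- 31.15
par_docs <- 4800

# Throughput of each run, in documents per second
seq_rate <- seq_docs / seq_secs
par_rate <- par_docs / par_secs

# Speedup of the parallel run over the sequential run
speedup <- par_rate / seq_rate
round(speedup, 1)  # 3.7
```

A speedup below the 8 available cores is not surprising, since starting the workers and shipping the text to them adds overhead that the sequential run does not pay.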