rapidraker
rapidraker provides an implementation of the same keyword extraction algorithm (RAKE) that slowraker does, but it’s written in Java instead of R. This makes it faster than slowraker.
You can get the stable version from CRAN:
install.packages("rapidraker")
The development version of the package requires you to compile the latest Java source code in rapidrake-java, so installing it is not as simple as making a call to devtools::install_github().
library(slowraker)
library(rapidraker)
data("dog_pubs")
rakelist <- rapidrake(txt = dog_pubs$abstract[1:5])
head(rbind_rakelist(rakelist))
#>   doc_id                            keyword freq score                   stem
#> 1      1 assistance dog identification tags    1  10.8 assist dog identif tag
#> 2      1          animal control facilities    1   9.0     anim control facil
#> 3      1          emotional support animals    1   9.0      emot support anim
#> 4      1                   small body sizes    1   9.0        small bodi size
#> 5      1       seemingly inappropriate dogs    1   7.9    seem inappropri dog
#> 6      1            assistance dogs sharply    1   7.3     assist dog sharpli
txt <- rep(dog_pubs$abstract, 20)
sr_time <- system.time(slowrake(txt))[["elapsed"]]
rr_time <- system.time(rapidrake(txt))[["elapsed"]]
In this example, rapidrake() took 4.09 seconds to execute while slowrake() took 101.04 seconds, making the Java version about 25 times faster.
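The speedup figure is just the ratio of the two elapsed times. As a minimal sketch, the snippet below hard-codes the timings quoted in the text rather than re-running the benchmark; the `speedup` variable is introduced here for illustration only, and your own timings will vary by machine.

```r
# Elapsed times (in seconds) copied from the text; these are not re-measured here.
sr_time <- 101.04  # slowrake()
rr_time <- 4.09    # rapidrake()

# Speedup of the Java implementation relative to the R implementation
speedup <- sr_time / rr_time
round(speedup, 1)
#> [1] 24.7
```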
Making rapidrake() even faster
We can parallelize extraction across documents like so:
# The following code was run on aarch64-apple-darwin20, 12 cores
library(parallel)
library(doParallel)
library(foreach)
cores <- detectCores()
# Make txt vector larger so we can more easily see the speed improvement of parallelization
txt2 <- rep(txt, cores * 3)
by <- floor(length(txt2) / cores)
cl <- makeCluster(cores)
registerDoParallel(cl)
rr_par_time <- system.time(
  foreach(i = 1:cores) %dopar% {
    start <- (i - 1) * by + 1
    finish <- start + by - 1
    rapidraker::rapidrake(txt2[start:finish])
  }
)[["elapsed"]]
stopCluster(cl)
The sequential version of rapidrake() took 4.09 seconds to extract keywords for 600 documents (about 147 documents per second), while the parallel version took 28.62 seconds for 21600 documents (about 755 documents per second). Comparing these per-document rates suggests that the parallel version was about 5 times faster than the sequential version.
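Because the two runs processed different numbers of documents, the comparison has to be made on throughput rather than on raw elapsed time. The sketch below works through that arithmetic using the figures quoted in the text; the variable names (`seq_docs`, `par_docs`, and so on) are hypothetical and the numbers are copied, not re-measured.

```r
# Figures copied from the text (12-core machine); not re-measured here.
seq_docs <- 600     # documents in the sequential run
seq_time <- 4.09    # seconds
par_docs <- 21600   # documents in the parallel run (600 * 12 cores * 3)
par_time <- 28.62   # seconds

# Throughput in documents per second for each run
seq_rate <- seq_docs / seq_time
par_rate <- par_docs / par_time

# Relative speedup of the parallel run, per document
par_speedup <- par_rate / seq_rate
round(par_speedup, 1)
#> [1] 5.1
```

A speedup of roughly 5 on 12 cores is well short of linear scaling, which is common once cluster startup and data transfer to the workers are counted in the elapsed time.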