The Rapid Automatic Keyword Extraction (RAKE) algorithm was first described in Rose et al. as a way to quickly extract keywords from documents. The algorithm involves two main steps:
1. Candidate keywords are identified. A candidate keyword is any set of contiguous words (i.e., any n-gram) that doesn’t contain a phrase delimiter or a stop word. A phrase delimiter is a punctuation character that marks the beginning or end of a phrase (e.g., a period or a comma). Splitting up text based on phrase delimiters/stop words is the essential idea behind RAKE. According to the authors:
“RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words and, the, and of, or other words with minimal lexical meaning.”
In addition to using stop words and phrase delimiters to identify candidate keywords, slowrake() also allows you to use a word’s part-of-speech (POS) to mark it as a potential delimiter. For example, most keywords don’t contain verbs, so you may want to treat verbs as phrase delimiters. You can use the stop_pos parameter to choose which parts-of-speech to exclude from your candidate keywords.
2. Keywords get scored. A keyword’s score (i.e., its degree of “keywordness”) is the sum of its member word scores. For example, the score for the keyword “dog leash” is calculated by adding the score for the word “dog” to the score for the word “leash.” A member word’s score is equal to its degree divided by its frequency (degree/frequency), where degree is the number of times that the word co-occurs with another word within a candidate keyword, and frequency is the total number of times that the word occurs overall (i.e., including keywords that only have one member word, like “dog”).
See Rose et al. for more details on how RAKE works.
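The two steps above can be sketched in a few lines of base R. This is only an illustration of the idea, not slowraker's implementation: the toy sentence and the four-word stop list are made up for the example, there is no stemming or POS handling, and degree follows the wording used above (co-occurrences with other words in the same candidate keyword), whereas some RAKE implementations also count the word itself.

```r
# Toy text and a tiny stop word list (hypothetical; not slowraker's smart_words)
txt <- "The dog leash is red, and the dog is happy with the leash law."
stop_words <- c("the", "is", "and", "with")

# Step 1: identify candidate keywords.
# Split on punctuation (phrase delimiters), then break each phrase at stop words.
phrases <- unlist(strsplit(tolower(txt), "[[:punct:]]+"))
candidates <- list()
for (p in phrases) {
  tokens <- strsplit(trimws(p), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) next
  is_stop <- tokens %in% stop_words
  # Consecutive non-stop words share a run id and form one candidate keyword
  run_id <- cumsum(is_stop)
  runs <- split(tokens[!is_stop], run_id[!is_stop])
  candidates <- c(candidates, unname(runs))
}

# Step 2: score member words, then sum them per keyword.
words <- unlist(candidates)
# frequency: how often each word occurs across all candidate keywords
freq <- table(words)
# degree: for each occurrence, the number of other words in the same candidate
cooc <- unlist(lapply(candidates, function(k) rep(length(k) - 1, length(k))))
degree <- tapply(cooc, words, sum)
word_score <- as.numeric(degree[names(freq)]) / as.numeric(freq)
names(word_score) <- names(freq)

# A keyword's score is the sum of its member word scores
keyword_score <- vapply(candidates, function(k) sum(word_score[k]), numeric(1))
names(keyword_score) <- vapply(candidates, paste, character(1), collapse = " ")
sort(keyword_score, decreasing = TRUE)
```

With this toy input, "dog" occurs twice but co-occurs with another word only once, so its word score is 0.5, while "leash" scores 1. The keyword "dog leash" therefore scores 1.5 and "leash law" scores 2.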
RAKE is completely unsupervised, so it requires no training data and is relatively quick to get started with. Let’s take a look at a few examples that demonstrate slowrake()’s parameters.
library(slowraker)
txt <- "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."

# Don't stem keywords before aggregating them
slowrake(txt, stem = FALSE)[[1]]

# Add the word "diophantine" to the stop word list
slowrake(txt, stop_words = c(smart_words, "diophantine"))[[1]]

# Don't use a word's part-of-speech to decide whether it is a delimiter
slowrake(txt, stop_pos = NULL)[[1]]

# Treat every tag that doesn't start with "N" (i.e., non-nouns) as a delimiter
slowrake(txt, stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)])[[1]]

# Sum keyword frequencies by keyword and stem, then sort by frequency
res <- slowrake(txt)[[1]]
res2 <- aggregate(freq ~ keyword + stem, data = res, FUN = sum)
res2[order(res2$freq, decreasing = TRUE), ]
# Run RAKE on a vector of documents (the first 10 abstracts in dog_pubs)
slowrake(txt = dog_pubs$abstract[1:10])
#>
#> # A rakelist containing 10 data frames:
#>  $ :'data.frame': 61 obs. of 4 variables:
#>   ..$ keyword:"assistance dog identification tags" ...
#>   ..$ freq   :1 1 ...
#>   ..$ score  :11 ...
#>   ..$ stem   :"assist dog identif tag" ...
#>  $ :'data.frame': 88 obs. of 4 variables:
#>   ..$ keyword:"current dog suitability assessments focus" ...
#>   ..$ freq   :1 1 ...
#>   ..$ score  :21 ...
#>   ..$ stem   :"current dog suitabl assess focu" ...
#> #...With 8 more data frames.