The Rapid Automatic Keyword Extraction (RAKE) algorithm was first described in Rose et al. as a way to quickly extract keywords from documents. The algorithm involves two main steps:
1. Candidate keywords are identified. A candidate keyword is any set of contiguous words (i.e., any n-gram) that doesn’t contain a phrase delimiter or a stop word.¹ A phrase delimiter is a punctuation character that marks the beginning or end of a phrase (e.g., a period or a comma). Splitting up text based on phrase delimiters/stop words is the essential idea behind RAKE. According to the authors:
RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words “and”, “the”, and “of”, or other words with minimal lexical meaning.
In addition to using stop words and phrase delimiters to identify candidate keywords, slowrake() also allows you to use a word’s part-of-speech (POS) to mark it as a potential delimiter. For example, most keywords don’t contain verbs, so you may want to treat verbs as phrase delimiters. You can use slowrake()’s stop_pos parameter to choose which parts-of-speech to exclude from your candidate keywords.
2. Keywords are scored. A keyword’s score (i.e., its degree of “keywordness”) is the sum of its member word scores. For example, the score for the keyword “dog leash” is calculated by adding the score for the word “dog” to the score for the word “leash.” A member word’s score is its degree divided by its frequency, where frequency is the total number of times that the word occurs across all candidate keywords (including keywords that only have one member word, like “dog”), and degree is the sum of the lengths of the candidate keywords that the word appears in (so words that tend to show up in longer keywords get higher degrees). A toy implementation of both steps is sketched below.
See Rose et al. for more details on how RAKE works.
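To make the two steps concrete, here’s a minimal base-R sketch of the idea. It is not slowraker’s implementation: the rake_sketch() function and its tiny stop word list are purely illustrative, there is no stemming or POS handling, and any punctuation character is treated as a phrase delimiter.

# A toy sketch of RAKE (illustrative only -- not how slowraker does it)
rake_sketch <- function(txt, stop_words = c("a", "an", "and", "is", "of", "the")) {
  # Step 1: candidate keywords are maximal runs of words that contain
  # neither a phrase delimiter (here, any punctuation) nor a stop word.
  phrases <- unlist(strsplit(tolower(txt), "[[:punct:]]+"))
  candidates <- unlist(lapply(strsplit(trimws(phrases), "\\s+"), function(ws) {
    runs <- split(ws, cumsum(ws %in% stop_words))  # break each phrase at stop words
    vapply(runs, function(r) paste(r[!r %in% stop_words], collapse = " "), "")
  }))
  candidates <- candidates[candidates != ""]

  # Step 2: score each member word as degree / frequency, then sum the
  # member word scores to get each keyword's score.
  members    <- strsplit(candidates, " ")
  all_words  <- unlist(members)
  freq       <- table(all_words)  # how often each word occurs across all candidates
  degree     <- tapply(rep(lengths(members), lengths(members)), all_words, sum)
  word_score <- as.numeric(degree[names(freq)]) / as.numeric(freq)
  names(word_score) <- names(freq)

  data.frame(
    keyword = candidates,
    score   = vapply(members, function(w) sum(word_score[w]), numeric(1))
  )
}

rake_sketch("The dog leash and the dog collar. A dog is an animal.")

On this toy input, “dog leash” and “dog collar” come out on top because “dog” co-occurs with other words inside multi-word candidates, while the standalone “dog” and “animal” score lower.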
RAKE is unique in that it is completely unsupervised, so it’s relatively quick to get started with. Let’s take a look at a few examples that demonstrate slowrake()’s parameters.
library(slowraker)
txt <- "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."
slowrake(txt)[[1]]
#> keyword freq score stem
#> 1 linear diophantine equations 1 8.5 linear diophantin equat
#> 2 minimal supporting set 1 6.8 minim support set
#> 3 linear constraints 1 4.5 linear constraint
#> 4 natural numbers 1 4.0 natur number
#> 5 nonstrict inequations 1 4.0 nonstrict inequ
#> 6 strict inequations 1 4.0 strict inequ
#> 7 minimal set 1 3.8 minim set
#> 8 mixed types 1 3.3 mix type
#> 9 minimal 1 2.0 minim
#> 10 set 1 1.8 set
#> 11 sets 1 1.8 set
#> 12 types 2 1.3 type
#> 13 algorithms 2 1.0 algorithm
#> 14 compatibility 2 1.0 compat
#> 15 components 1 1.0 compon
#> 16 construction 1 1.0 construct
#> 17 criteria 2 1.0 criteria
#> 18 solutions 3 1.0 solut
#> 19 system 1 1.0 system
#> 20 systems 4 1.0 system
#> 21 upper 1 1.0 upper
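As a sanity check on the scoring rule described above, you can reproduce the top score by hand. In the stemmed candidate keywords, “linear” appears in “linear diophantine equations” (3 words) and “linear constraints” (2 words), while “diophantin” and “equat” appear only in the 3-word keyword:

(3 + 2) / 2 +  # linear: degree 5, frequency 2
  3 / 1 +      # diophantin: degree 3, frequency 1
  3 / 1        # equat: degree 3, frequency 1
#> [1] 8.5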
By default, slowrake() stems words before scoring them, so “set” and “sets” count as the same word. Set stem = FALSE if you’d rather score the raw words:
slowrake(txt, stem = FALSE)[[1]]
#> keyword freq score
#> 1 linear diophantine equations 1 8.5
#> 2 minimal supporting set 1 7.0
#> 3 linear constraints 1 4.5
#> 4 minimal set 1 4.0
#> 5 natural numbers 1 4.0
#> 6 nonstrict inequations 1 4.0
#> 7 strict inequations 1 4.0
#> 8 mixed types 1 3.3
#> 9 minimal 1 2.0
#> 10 set 1 2.0
#> 11 types 2 1.3
#> 12 algorithms 2 1.0
#> 13 compatibility 2 1.0
#> 14 components 1 1.0
#> 15 construction 1 1.0
#> 16 criteria 2 1.0
#> 17 sets 1 1.0
#> 18 solutions 3 1.0
#> 19 system 1 1.0
#> 20 systems 4 1.0
#> 21 upper 1 1.0
Your choice of stop words has a big effect on which keywords are extracted. You can change the stop word list with the stop_words parameter (the default is slowraker::smart_words). For example, to also treat “diophantine” as a stop word:
slowrake(txt, stop_words = c(smart_words, "diophantine"))[[1]]
#> keyword freq score stem
#> 1 minimal supporting set 1 6.8 minim support set
#> 2 natural numbers 1 4.0 natur number
#> 3 nonstrict inequations 1 4.0 nonstrict inequ
#> 4 strict inequations 1 4.0 strict inequ
#> 5 minimal set 1 3.8 minim set
#> 6 linear constraints 1 3.5 linear constraint
#> 7 mixed types 1 3.3 mix type
#> 8 minimal 1 2.0 minim
#> 9 set 1 1.8 set
#> 10 sets 1 1.8 set
#> 11 linear 1 1.5 linear
#> 12 types 2 1.3 type
#> 13 algorithms 2 1.0 algorithm
#> 14 compatibility 2 1.0 compat
#> 15 components 1 1.0 compon
#> 16 construction 1 1.0 construct
#> 17 criteria 2 1.0 criteria
#> 18 equations 1 1.0 equat
#> 19 solutions 3 1.0 solut
#> 20 system 1 1.0 system
#> 21 systems 4 1.0 system
#> 22 upper 1 1.0 upper
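If you’re curious exactly which words are removed by default, the default stop word vector is exported by the package, so you can inspect it directly (output not shown):

# the default stop word list used by slowrake()
length(slowraker::smart_words)
head(slowraker::smart_words, 10)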
Setting stop_pos = NULL tells slowrake() not to use part-of-speech tags at all when selecting candidate keywords, so verb-containing phrases like “minimal generating sets” now survive:
slowrake(txt, stop_pos = NULL)[[1]]
#> keyword freq score stem
#> 1 linear diophantine equations 1 8.5 linear diophantin equat
#> 2 minimal generating sets 1 7.9 minim gener set
#> 3 minimal supporting set 1 7.9 minim support set
#> 4 minimal set 1 4.9 minim set
#> 5 linear constraints 1 4.5 linear constraint
#> 6 natural numbers 1 4.0 natur number
#> 7 nonstrict inequations 1 4.0 nonstrict inequ
#> 8 strict inequations 1 4.0 strict inequ
#> 9 upper bounds 1 4.0 upper bound
#> 10 mixed types 1 3.7 mix type
#> 11 considered types 1 3.2 consid type
#> 12 set 1 2.2 set
#> 13 types 1 1.7 type
#> 14 considered 1 1.5 consid
#> 15 algorithms 2 1.0 algorithm
#> 16 compatibility 2 1.0 compat
#> 17 components 1 1.0 compon
#> 18 constructing 1 1.0 construct
#> 19 construction 1 1.0 construct
#> 20 criteria 2 1.0 criteria
#> 21 solutions 3 1.0 solut
#> 22 solving 1 1.0 solv
#> 23 system 1 1.0 system
#> 24 systems 4 1.0 system
At the other extreme, you can treat every non-noun part-of-speech as a delimiter, which restricts keywords to nouns (in this text, that leaves only single-word keywords):
slowrake(txt, stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)])[[1]]
#> keyword freq score stem
#> 1 algorithms 2 1 algorithm
#> 2 compatibility 2 1 compat
#> 3 components 1 1 compon
#> 4 constraints 1 1 constraint
#> 5 construction 1 1 construct
#> 6 criteria 2 1 criteria
#> 7 equations 1 1 equat
#> 8 inequations 2 1 inequ
#> 9 numbers 1 1 number
#> 10 set 3 1 set
#> 11 sets 1 1 set
#> 12 solutions 3 1 solut
#> 13 system 1 1 system
#> 14 systems 4 1 system
#> 15 types 3 1 type
#> 16 upper 1 1 upper
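The pos_tags data frame used in the call above also ships with slowraker and lists the tags that stop_pos accepts, so you can browse it when building your own exclusion set (output not shown):

# part-of-speech tags that can be passed to stop_pos
head(slowraker::pos_tags)

# for example, the verb tags (they start with "V", just as the noun tags above start with "N")
slowraker::pos_tags$tag[grepl("^V", slowraker::pos_tags$tag)]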
Score isn’t the only way to rank keywords. slowrake() also reports how many times each keyword occurs in the document, so you can sort the results by frequency instead (freq):
res <- slowrake(txt)[[1]]
# keep the keyword, stem, and freq columns, summing freq within each keyword
res2 <- aggregate(freq ~ keyword + stem, data = res, FUN = sum)
res2[order(res2$freq, decreasing = TRUE), ]
#> keyword stem freq
#> 19 systems system 4
#> 16 solutions solut 3
#> 1 algorithms algorithm 2
#> 2 compatibility compat 2
#> 5 criteria criteria 2
#> 20 types type 2
#> 3 components compon 1
#> 4 construction construct 1
#> 6 linear constraints linear constraint 1
#> 7 linear diophantine equations linear diophantin equat 1
#> 8 minimal minim 1
#> 9 minimal set minim set 1
#> 10 minimal supporting set minim support set 1
#> 11 mixed types mix type 1
#> 12 natural numbers natur number 1
#> 13 nonstrict inequations nonstrict inequ 1
#> 14 set set 1
#> 15 sets set 1
#> 17 strict inequations strict inequ 1
#> 18 system system 1
#> 21 upper upper 1
slowrake() is vectorized over documents, so you can pass it a character vector and get back a list of keyword data frames, one per document. For example, using the first ten abstracts in the dog_pubs data set that comes with slowraker:
slowrake(txt = dog_pubs$abstract[1:10])
#>
#> # A rakelist containing 10 data frames:
#> $ :'data.frame': 61 obs. of 4 variables:
#> ..$ keyword:"assistance dog identification tags" ...
#> ..$ freq :1 1 ...
#> ..$ score :11 ...
#> ..$ stem :"assist dog identif tag" ...
#> $ :'data.frame': 88 obs. of 4 variables:
#> ..$ keyword:"current dog suitability assessments focus" ...
#> ..$ freq :1 1 ...
#> ..$ score :21 ...
#> ..$ stem :"current dog suitabl assess focu" ...
#> #...With 8 more data frames.
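A rakelist is just a list of data frames (one per document), so you can stack the per-document results into a single table with base R. This is only a sketch, and the doc_id column name is purely illustrative:

out <- slowrake(txt = dog_pubs$abstract[1:10])

# bind the data frames together, tagging each row with its document index
all_keywords <- do.call(rbind, Map(cbind, doc_id = seq_along(out), out))
head(all_keywords)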
¹ Technically, the original version of RAKE allows some keywords to contain stop words, but slowrake() doesn’t support this.