This vignette presents three examples of using
slowrake()
to find keywords in text. Each application runs
RAKE on a different type of document, including a webpage, patent
abstract(s), and a journal article.
# Load the libraries needed for all three applications
library(slowraker)
library(httr)
library(xml2)
library(patentsview)
library(dplyr)
library(pdftools)
library(stringr)
library(knitr)
# The webpage of interest - slowraker's "Getting started" page
url <- "https://crew102.github.io/slowraker/articles/getting-started.html"
GET(url) %>%
content("text") %>%
read_html() %>%
xml_find_all(".//p") %>%
xml_text() %>%
paste(collapse = " ") %>%
iconv(from = "UTF-8", "ASCII", sub = '"') %>%
slowrake() %>%
.[[1]]
#> keyword freq score stem
#> 1 rapid automatic keyword extraction 1 13.5 rapid automat keyword extract
#> 2 minimal lexical meaning 1 9.0 minim lexic mean
#> 3 member word scores 1 5.4 member word score
#> 4 christopher baker 1 4.0 christoph baker
#> 5 completely unsupervised 1 4.0 complet unsupervis
#> 6 essential idea 1 4.0 essenti idea
#> 7 main steps 1 4.0 main step
#> 8 original version 1 4.0 origin version
#> 9 potential delimiter 1 4.0 potenti delimit
#> 10 punctuation character 1 4.0 punctuat charact
#> 11 standard punctuation 1 4.0 standard punctuat
#> 12 member word 2 3.9 member word
#> 13 phrase delimiter 2 3.8 phrase delimit
#> 14 phrase delimiters 3 3.8 phrase delimit
#> 15 contiguous words 1 3.6 contigu word
#> 16 function words 1 3.6 function word
#> 17 multiple words 1 3.6 multipl word
#> 18 stop word 1 3.6 stop word
#> 19 stop words 3 3.6 stop word
#> 20 candidate keyword 1 3.5 candid keyword
#> 21 candidate keywords 3 3.5 candid keyword
#> 22 keywords frequently 1 3.5 keyword frequent
#> 23 dog leash 1 3.5 dog leash
#> 24 total number 1 3.5 total number
#> 25 treat verbs 1 3.5 treat verb
#> 26 phrase 1 1.8 phrase
#> 27 word 6 1.6 word
#> 28 words 2 1.6 word
#> 29 keyword 3 1.5 keyword
#> 30 keywordness 1 1.5 keyword
#> 31 keywords 5 1.5 keyword
#> 32 dog 1 1.5 dog
#> 33 number 1 1.5 number
#> 34 score 3 1.5 score
#> 35 verbs 1 1.5 verb
#> 36 addition 1 1.0 addit
#> 37 algorithm 2 1.0 algorithm
#> 38 authors 1 1.0 author
#> 39 beginning 1 1.0 begin
#> 40 comma 1 1.0 comma
#> 41 degree 3 1.0 degre
#> 42 details 1 1.0 detail
#> 43 developed 1 1.0 develop
#> 44 documents 1 1.0 document
#> 45 end 1 1.0 end
#> 46 equal 1 1.0 equal
#> 47 examples 1 1.0 exampl
#> 48 frequency 2 1.0 frequenc
#> 49 gram 1 1.0 gram
#> 50 observation 1 1.0 observ
#> 51 parameter 1 1.0 paramet
#> 52 parameters 1 1.0 paramet
#> 53 part 1 1.0 part
#> 54 parts 1 1.0 part
#> 55 period 1 1.0 period
#> 56 pkgdown 1 1.0 pkgdown
#> 57 pos 1 1.0 po
#> 58 quick 1 1.0 quick
#> 59 quickly 1 1.0 quickli
#> 60 rake 6 1.0 rake
#> 61 rarely 1 1.0 rare
#> 62 rose 2 1.0 rose
#> 63 set 1 1.0 set
#> 64 site 1 1.0 site
#> 65 slowrake 4 1.0 slowrak
#> 66 speech 2 1.0 speech
#> 67 sum 1 1.0 sum
#> 68 support 1 1.0 support
#> 69 technically 1 1.0 technic
#> 70 text 1 1.0 text
#> 71 times 2 1.0 time
#> 72 unique 1 1.0 uniqu
# Download data from the PatentsView API for 10 patents with the phrase
# "keyword extraction" in their abstracts
pv_data <- search_pv(
query = qry_funs$text_phrase(patent_abstract = "keyword extraction"),
fields = c("patent_number", "patent_title", "patent_abstract"),
per_page = 10
)
# Look at the data
patents <- pv_data$data$patents
kable(head(patents, n = 2))
patent_number | patent_title | patent_abstract |
---|---|---|
10013574 | Method and apparatus for secure storage and retrieval of encrypted files in public cloud-computing platforms | In one aspect, a computerized Encrypted Drive System (EDS) server useful for keyword extraction and indexing server of includes a computer store containing data, wherein the data. The data includes an unencrypted document file and a computer processor in the EDS server. The computer processor obtains the unencrypted document file from the computer store. The computer processor extracts a keyword information from the unencrypted document file. The keyword information comprises of a set of keywords appearing in the unencrypted document file. The computer processor includes one or more colors from the color-set of each keyword into a document color-index of the unencrypted document file. The computer processor generates a Bloom filter encoding a set of keywords stored in a metadata field and the unencrypted document file, and wherein the Bloom filter is used to represent the set of keywords in the unencrypted document file. |
10049102 | Method and system for providing semantics based technical support | Disclosed is a system for providing semantics based technical support to a user. A keyword extraction module extracts a set of keywords from a plurality of knowledge content and a plurality of bug content. A matrix generation module generates a keyword-content matrix comprising a similarity score computed for each keyword corresponding to each knowledge content. The matrix generation module further decomposes the keyword-content matrix into one or more sub-matrices. A vector coordinates identification module determines a first set of vector coordinates for each knowledge content of the plurality of knowledge content and a second set of vector coordinates for a bug-query received from a user in real-time. A similarity measure module computes a cosine similarity measure of the bug-query with each knowledge content to identify at least one knowledge content relevant to the bug-query. The knowledge identification module further provides the at least one knowledge content to the user. |
rakelist <- slowrake(
patents$patent_abstract,
stop_words = c("method", smart_words), # Consider "method" to be a stop word
stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)] # Consider all non-nouns to be stop words
)
# Create a single data frame with all patents' keywords
out_rake <- rbind_rakelist(rakelist, doc_id = patents$patent_number)
out_rake
#> doc_id keyword freq score
#> 1 10013574 encrypted drive system 1 9.0
#> 2 10013574 document color index 1 6.6
#> 3 10013574 indexing server 1 4.2
#> 4 10013574 document file 7 4.1
#> 5 10013574 bloom filter 2 4.0
#> 6 10013574 computer processor 5 4.0
#> 7 10013574 computer store 2 4.0
#> 8 10013574 metadata field 1 4.0
#> 9 10013574 keyword extraction 1 3.4
#> 10 10013574 keyword information 2 3.4
#> 11 10013574 color set 1 3.2
#> 12 10013574 eds server 1 3.2
#> 13 10013574 colors 1 2.0
#> 14 10013574 server 1 1.7
#> 15 10013574 eds 1 1.5
#> 16 10013574 keyword 1 1.4
#> 17 10013574 keywords 3 1.4
#> 18 10013574 set 3 1.2
#> 19 10013574 aspect 1 1.0
#> 20 10013574 data 3 1.0
#> 21 10049102 similarity measure module 1 7.6
#> 22 10049102 keyword extraction module 1 7.5
#> 23 10049102 knowledge identification module 1 7.4
#> 24 10049102 matrix generation module 1 7.0
#> 25 10049102 identification module 1 5.3
#> 26 10049102 similarity measure 1 4.8
#> 27 10049102 similarity score 1 4.3
#> 28 10049102 matrix generation 1 4.2
#> 29 10049102 knowledge content 7 4.1
#> 30 10049102 bug content 1 4.0
#> 31 10049102 bug query 3 4.0
#> 32 10049102 real time 1 4.0
#> 33 10049102 vector coordinates 2 3.7
#> 34 10049102 matrix 2 1.8
#> 35 10049102 keyword 1 1.7
#> 36 10049102 keywords 1 1.7
#> 37 10049102 vector 1 1.7
#> 38 10049102 matrices 1 1.0
#> 39 10049102 plurality 3 1.0
#> 40 10049102 semantics 1 1.0
#> 41 10049102 set 3 1.0
#> 42 10049102 support 1 1.0
#> 43 10049102 system 1 1.0
#> 44 10049102 user 3 1.0
#> 45 10255908 grasper grasps 1 4.0
#> 46 10255908 search result 1 4.0
#> 47 10255908 speech content 3 4.0
#> 48 10255908 voice data 3 4.0
#> 49 10255908 voice portion 1 4.0
#> 50 10255908 voice recognition 1 4.0
#> 51 10255908 indexing database 2 3.9
#> 52 10255908 indexing information 3 3.9
#> 53 10255908 indexing unit 1 3.9
#> 54 10255908 search word 3 3.8
#> 55 10255908 phoneme lattice 2 3.6
#> 56 10255908 keyword input 1 3.5
#> 57 10255908 phoneme string 1 3.1
#> 58 10255908 indexing 1 1.9
#> 59 10255908 word 1 1.8
#> 60 10255908 phoneme 2 1.6
#> 61 10255908 keyword 1 1.5
#> 62 10255908 string 1 1.5
#> 63 10255908 analysis 1 1.0
#> 64 10255908 comparison 1 1.0
#> 65 10255908 frame 2 1.0
#> 66 10255908 frames 1 1.0
#> 67 10255908 matching 1 1.0
#> 68 10255908 performs 1 1.0
#> 69 10255908 plurality 1 1.0
#> 70 10255908 reference 2 1.0
#> 71 10255908 searcher 2 1.0
#> 72 10255908 stores 1 1.0
#> 73 10255908 system 2 1.0
#> 74 10255908 time 2 1.0
#> 75 10255908 user 2 1.0
#> 76 10298654 keyword extraction 1 3.2
#> 77 10298654 keywords 4 1.2
#> 78 10298654 content 7 1.0
#> 79 10298654 context 1 1.0
#> 80 10298654 definitions 1 1.0
#> 81 10298654 meanings 1 1.0
#> 82 10298654 media 1 1.0
#> 83 10298654 portion 1 1.0
#> 84 10298654 portions 1 1.0
#> 85 10298654 resource 3 1.0
#> 86 10298654 techniques 1 1.0
#> 87 10298654 url 6 1.0
#> 88 10298654 urls 1 1.0
#> 89 10298654 user 1 1.0
#> 90 10298654 webpage 2 1.0
#> 91 10304441 grasper grasps 1 4.0
#> 92 10304441 search result 1 4.0
#> 93 10304441 speech content 3 4.0
#> 94 10304441 voice data 3 4.0
#> 95 10304441 voice portion 1 4.0
#> 96 10304441 voice recognition 1 4.0
#> 97 10304441 indexing database 2 3.9
#> 98 10304441 indexing information 3 3.9
#> 99 10304441 indexing unit 1 3.9
#> 100 10304441 search word 3 3.8
#> 101 10304441 phoneme lattice 2 3.6
#> 102 10304441 keyword input 1 3.5
#> 103 10304441 phoneme string 1 3.1
#> 104 10304441 indexing 1 1.9
#> 105 10304441 word 1 1.8
#> 106 10304441 phoneme 2 1.6
#> 107 10304441 keyword 1 1.5
#> 108 10304441 string 1 1.5
#> 109 10304441 analysis 1 1.0
#> 110 10304441 comparison 1 1.0
#> 111 10304441 frame 2 1.0
#> 112 10304441 frames 1 1.0
#> 113 10304441 matching 1 1.0
#> 114 10304441 performs 1 1.0
#> 115 10304441 plurality 1 1.0
#> 116 10304441 reference 2 1.0
#> 117 10304441 searcher 2 1.0
#> 118 10304441 stores 1 1.0
#> 119 10304441 system 2 1.0
#> 120 10304441 time 2 1.0
#> 121 10304441 user 2 1.0
#> 122 10311974 collecting patient chief complaint 1 13.4
#> 123 10311974 patient chief complaint 2 9.4
#> 124 10311974 phone device internet 1 9.0
#> 125 10311974 text word segmentation 1 9.0
#> 126 10311974 disease examination result 2 7.7
#> 127 10311974 disease examination 1 5.3
#> 128 10311974 diagnosis result 1 4.4
#> 129 10311974 guide result 2 4.2
#> 130 10311974 cloud server 1 4.0
#> 131 10311974 doctor information 1 4.0
#> 132 10311974 keyword extraction 1 4.0
#> 133 10311974 knowledge base 1 4.0
#> 134 10311974 recommendation technologies 1 4.0
#> 135 10311974 guide system 1 3.2
#> 136 10311974 patient 1 2.8
#> 137 10311974 guide 1 1.8
#> 138 10311974 system 1 1.5
#> 139 10311974 combination 1 1.0
#> 140 10311974 environment 1 1.0
#> 141 10311974 health 1 1.0
#> 142 10311974 invention 1 1.0
#> 143 10311974 matching 1 1.0
#> 144 10311974 optimum 1 1.0
#> 145 10311974 query 1 1.0
#> 146 10311974 terminal 2 1.0
#> 147 10311974 thereof 1 1.0
#> 148 10319020 product introduction support device 1 13.7
#> 149 10319020 display control information 1 9.0
#> 150 10319020 display control part 2 9.0
#> 151 10319020 group presentation part 2 8.7
#> 152 10319020 keyword extraction part 3 8.0
#> 153 10319020 word selection part 2 8.0
#> 154 10319020 product group 1 4.3
#> 155 10319020 keyword 2 2.0
#> 156 10319020 keywords 1 2.0
#> 157 10319020 word 2 2.0
#> 158 10319020 products 4 1.7
#> 159 10319020 attribute 1 1.0
#> 160 10319020 candidates 1 1.0
#> 161 10319020 plurality 4 1.0
#> 162 10319020 processing 1 1.0
#> 163 10319020 state 1 1.0
#> 164 10319020 web 1 1.0
#> 165 10387568 keyword extraction process 1 7.8
#> 166 10387568 input document 1 4.0
#> 167 10387568 match rate 1 4.0
#> 168 10387568 network page 1 4.0
#> 169 10387568 search engine 1 4.0
#> 170 10387568 search results 1 4.0
#> 171 10387568 candidate keyword 2 3.8
#> 172 10387568 candidate keywords 4 3.8
#> 173 10387568 keyword score 2 3.8
#> 174 10387568 keyword scores 1 3.8
#> 175 10387568 search terms 1 3.5
#> 176 10387568 keywords 3 1.8
#> 177 10387568 terms 1 1.5
#> 178 10387568 applications 1 1.0
#> 179 10387568 frequency 1 1.0
#> 180 10387568 length 1 1.0
#> 181 10387568 words 1 1.0
#> 182 10394864 cluster combination unit 1 7.4
#> 183 10394864 keyword extraction unit 1 6.9
#> 184 10394864 seed selection unit 1 6.9
#> 185 10394864 topic extraction server 1 6.8
#> 186 10394864 clustering unit 1 4.4
#> 187 10394864 document group 2 4.0
#> 188 10394864 topic group 1 3.8
#> 189 10394864 unit 1 2.4
#> 190 10394864 cluster 1 2.0
#> 191 10394864 server 1 2.0
#> 192 10394864 topic 2 1.8
#> 193 10394864 keyword 3 1.5
#> 194 10394864 seed 3 1.5
#> 195 10394864 noun 3 1.0
#> 196 10394864 sentence 1 1.0
#> 197 10394864 stopword 1 1.0
#> 198 10394864 suitability 1 1.0
#> 199 10394864 text 1 1.0
#> 200 10394864 times 1 1.0
#> 201 10394864 weight 2 1.0
#> 202 10477043 keyword extraction target 2 8.3
#> 203 10477043 target determination unit 3 8.2
#> 204 10477043 keyword extraction unit 2 8.1
#> 205 10477043 text reading unit 1 6.9
#> 206 10477043 image processing unit 3 6.8
#> 207 10477043 target 1 2.7
#> 208 10477043 keyword 1 2.6
#> 209 10477043 process 1 2.5
#> 210 10477043 unit 3 2.5
#> 211 10477043 image 5 1.8
#> 212 10477043 text 4 1.4
#> 213 10477043 apparatus 1 1.0
#> 214 10477043 document 1 1.0
#> stem
#> 1 encrypt drive system
#> 2 document color index
#> 3 index server
#> 4 document file
#> 5 bloom filter
#> 6 comput processor
#> 7 comput store
#> 8 metadata field
#> 9 keyword extract
#> 10 keyword inform
#> 11 color set
#> 12 ed server
#> 13 color
#> 14 server
#> 15 ed
#> 16 keyword
#> 17 keyword
#> 18 set
#> 19 aspect
#> 20 data
#> 21 similar measur modul
#> 22 keyword extract modul
#> 23 knowledg identif modul
#> 24 matrix gener modul
#> 25 identif modul
#> 26 similar measur
#> 27 similar score
#> 28 matrix gener
#> 29 knowledg content
#> 30 bug content
#> 31 bug queri
#> 32 real time
#> 33 vector coordin
#> 34 matrix
#> 35 keyword
#> 36 keyword
#> 37 vector
#> 38 matric
#> 39 plural
#> 40 semant
#> 41 set
#> 42 support
#> 43 system
#> 44 user
#> 45 grasper grasp
#> 46 search result
#> 47 speech content
#> 48 voic data
#> 49 voic portion
#> 50 voic recognit
#> 51 index databas
#> 52 index inform
#> 53 index unit
#> 54 search word
#> 55 phonem lattic
#> 56 keyword input
#> 57 phonem string
#> 58 index
#> 59 word
#> 60 phonem
#> 61 keyword
#> 62 string
#> 63 analysi
#> 64 comparison
#> 65 frame
#> 66 frame
#> 67 match
#> 68 perform
#> 69 plural
#> 70 refer
#> 71 searcher
#> 72 store
#> 73 system
#> 74 time
#> 75 user
#> 76 keyword extract
#> 77 keyword
#> 78 content
#> 79 context
#> 80 definit
#> 81 mean
#> 82 media
#> 83 portion
#> 84 portion
#> 85 resourc
#> 86 techniqu
#> 87 url
#> 88 url
#> 89 user
#> 90 webpag
#> 91 grasper grasp
#> 92 search result
#> 93 speech content
#> 94 voic data
#> 95 voic portion
#> 96 voic recognit
#> 97 index databas
#> 98 index inform
#> 99 index unit
#> 100 search word
#> 101 phonem lattic
#> 102 keyword input
#> 103 phonem string
#> 104 index
#> 105 word
#> 106 phonem
#> 107 keyword
#> 108 string
#> 109 analysi
#> 110 comparison
#> 111 frame
#> 112 frame
#> 113 match
#> 114 perform
#> 115 plural
#> 116 refer
#> 117 searcher
#> 118 store
#> 119 system
#> 120 time
#> 121 user
#> 122 collect patient chief complaint
#> 123 patient chief complaint
#> 124 phone devic internet
#> 125 text word segment
#> 126 diseas examin result
#> 127 diseas examin
#> 128 diagnosi result
#> 129 guid result
#> 130 cloud server
#> 131 doctor inform
#> 132 keyword extract
#> 133 knowledg base
#> 134 recommend technologi
#> 135 guid system
#> 136 patient
#> 137 guid
#> 138 system
#> 139 combin
#> 140 environ
#> 141 health
#> 142 invent
#> 143 match
#> 144 optimum
#> 145 queri
#> 146 termin
#> 147 thereof
#> 148 product introduct support devic
#> 149 displai control inform
#> 150 displai control part
#> 151 group present part
#> 152 keyword extract part
#> 153 word select part
#> 154 product group
#> 155 keyword
#> 156 keyword
#> 157 word
#> 158 product
#> 159 attribut
#> 160 candid
#> 161 plural
#> 162 process
#> 163 state
#> 164 web
#> 165 keyword extract process
#> 166 input document
#> 167 match rate
#> 168 network page
#> 169 search engin
#> 170 search result
#> 171 candid keyword
#> 172 candid keyword
#> 173 keyword score
#> 174 keyword score
#> 175 search term
#> 176 keyword
#> 177 term
#> 178 applic
#> 179 frequenc
#> 180 length
#> 181 word
#> 182 cluster combin unit
#> 183 keyword extract unit
#> 184 seed select unit
#> 185 topic extract server
#> 186 cluster unit
#> 187 document group
#> 188 topic group
#> 189 unit
#> 190 cluster
#> 191 server
#> 192 topic
#> 193 keyword
#> 194 seed
#> 195 noun
#> 196 sentenc
#> 197 stopword
#> 198 suitabl
#> 199 text
#> 200 time
#> 201 weight
#> 202 keyword extract target
#> 203 target determin unit
#> 204 keyword extract unit
#> 205 text read unit
#> 206 imag process unit
#> 207 target
#> 208 keyword
#> 209 process
#> 210 unit
#> 211 imag
#> 212 text
#> 213 apparatu
#> 214 document
out_rake %>%
group_by(doc_id) %>%
arrange(desc(score)) %>%
slice(1) %>%
inner_join(patents, by = c("doc_id" = "patent_number")) %>%
rename(patent_number = doc_id, top_keyword = keyword) %>%
select(matches("number|title|keyword")) %>%
head() %>%
kable()
patent_number | top_keyword | patent_title |
---|---|---|
10013574 | encrypted drive system | Method and apparatus for secure storage and retrieval of encrypted files in public cloud-computing platforms |
10049102 | similarity measure module | Method and system for providing semantics based technical support |
10255908 | grasper grasps | System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content |
10298654 | keyword extraction | Automatic uniform resource locator construction |
10304441 | grasper grasps | System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content |
10311974 | collecting patient chief complaint | Mobile health intelligent medical guide system and method thereof |
# The journal article of interest - Rose et. al (i.e., the RAKE publication)
url <- "http://media.wiley.com/product_data/excerpt/22/04707498/0470749822.pdf"
# Download file and pull out text layer from PDF
destfile <- tempfile()
GET(url, write_disk(destfile))
raw_txt <- pdf_text(destfile)
# Helper function for text removal
sub_all <- function(regex_vec, txt) {
pattern <- paste0(regex_vec, collapse = "|")
gsub(pattern, " ", txt)
}
txt1 <-
paste0(raw_txt, collapse = " ") %>%
gsub("\\r\\n", " ", .) %>%
gsub("[[:space:]]{2,}", " ", .)
# Regex to capture text that we don't want to run RAKE on
remove1 <- "Acknowledgements.*"
remove2 <- "TEXT MINING"
remove3 <- "AUTOMATIC KEYWORD EXTRACTION"
txt2 <- sub_all(c(remove1, remove2, remove3), txt1)
There are some sections of the PDF’s text that we don’t want to run RAKE on, such as the text found in tables. The problem with tables is that they usually don’t contain typical phrase delimiters (e.g., periods and commas). Instead, the cell of the table acts as a sort of delimiter. It can be very difficult to parse a table’s cells out in a PDF document, though, so we’ll just try to identify/remove the tables themselves.1
The tables in this article mostly contain numbers. If we split the article into text chunks using a digit delimiter, it’s likely that most of a table’s chunks will be relatively small in size. We can use this fact to help us identify which text chunks correspond to tables and which correspond to paragraphs.
# Numbers generally appear in paragraphs in two ways in this article: When the authors refer to results in a specific table/figure (e.g., "the sample abstract shown in Figure 1.1"), and when the authors reference another article (e.g., "Andrade and Valencia (1998) base their approach"). Remove these instances so that paragraphs don't get split into small chunks, which would make them hard to tell apart from tables.
remove4 <- "(Table|Figure) [[:digit:].]{1,}"
remove5 <- "\\([12][[:digit:]]{3}\\)"
txt3 <- sub_all(c(remove4, remove5), txt2)
# Split text into chunks based on digit delimiter
txt_chunks <- unlist(strsplit(txt3, "[[:digit:]]"))
# Use number of alpha chars found in a chunk as an indicator of its size
num_alpha <- str_count(txt_chunks, "[[:alpha:]]")
# Use kmeans to distinguish tables from paragraphs
clust <- kmeans(num_alpha, centers = 2)
good_clust <- which(max(clust$centers) == clust$centers)
# Only keep chunks that are thought to be paragraphs
good_chunks <- txt_chunks[clust$cluster == good_clust]
final_txt <- paste(good_chunks, collapse = " ")
keyword | freq | score | stem |
---|---|---|---|
minimal set linear constraints linear constraints natural numbers strict inequations strict inequations nonstrict inequations nonstrict inequations upper bounds upper bounds | 1 | 115 | minim set linear constraint linear constraint natur number strict inequ strict inequ nonstrict inequ nonstrict inequ upper bound upper bound |
algorithms equations numbers nonstrict systems minimal bounds system | 1 | 36 | algorithm equat number nonstrict system minim bound system |
sets linear diophantine equations linear diophantine equations minimal | 1 | 34 | set linear diophantin equat linear diophantin equat minim |
natural criteria upper linear strict sets | 1 | 29 | natur criteria upper linear strict set |
extracted correct keywords keywords stoplist method size total | 1 | 28 | extract correct keyword keyword stoplist method size total |
criteria compatibility system linear diophantine equations | 1 | 22 | criteria compat system linear diophantin equat |
The fact that some of the keywords shown above are very long suggests we missed something in Step 4. It turns out that our method mistook one of the tables (Table 1.1 shown below) for a paragraph. Table 1.1 is somewhat atypical in that it doesn’t contain any numbers, and thus it makes sense that our method missed it.
To clean up the results, let’s apply an ad hoc filter on the keywords. This filter will remove keywords whose long length indicates that a phrase run-on has occurred, and hence the keyword is no good.
# Function to remove keywords that occur only once and have more than max_word_cnt member words
filter.rakelist <- function(x, max_word_cnt = 3) {
structure(
lapply(x, function(r) {
word_cnt <- str_count(r$keyword, " ") + 1
to_filter <- r$freq == 1 & word_cnt > max_word_cnt
r[!to_filter, ]
}),
class = c("rakelist", "list")
)
}
filter <- function(x) UseMethod("filter")
filter(rakelist)[[1]]
#> keyword freq score stem
#> 10 upper bounds 1 16.0 upper bound
#> 11 diophantine inequations constraints 1 14.5 diophantin inequ constraint
#> 12 linear diophantine equations 2 13.7 linear diophantin equat
#> 15 natural numbers criteria 1 12.3 natur number criteria
#> 16 linear constraints 2 11.9 linear constraint
#> 20 nonstrict inequations 3 11.2 nonstrict inequ
#> 21 strict inequations 3 10.8 strict inequ
#> 25 minimal supporting set 1 10.1 minim support set
#> 26 minimal lexical meaning 1 9.6 minim lexic mean
#> 27 basic building block 1 9.0 basic build block
#> 28 greenhouse gas emissions 2 9.0 greenhous ga emiss
#> 30 machine learning algorithms 1 8.8 machin learn algorithm
#> 31 natural numbers 4 8.7 natur number
#> 32 specific grammar conventions 1 8.4 specif grammar convent
#> 33 inspec test set 1 7.8 inspec test set
#> 34 highly specific terminology 1 7.7 highli specif terminologi
#> 35 chi square measure 2 7.7 chi squar measur
#> 36 fine grained measurement 1 7.7 fine grain measur
#> 37 kyoto protocol legally 1 7.7 kyoto protocol legal
#> 38 top ranking words 1 7.7 top rank word
#> 39 natural language 1 7.4 natur languag
#> 40 language independent method 1 7.3 languag independ method
#> 41 stated conven tions 1 7.2 state conven tion
#> 42 shorter kyoto protocol 1 7.2 shorter kyoto protocol
#> 43 minimal set 1 7.1 minim set
#> 44 interior stop words 3 7.0 interior stop word
#> 45 fixed size 1 7.0 fix size
#> 46 keyword extraction method 1 7.0 keyword extract method
#> 47 mon function words 1 6.9 mon function word
#> 48 full text articles 1 6.8 full text articl
#> 49 keyword extraction research 1 6.8 keyword extract research
#> 50 additional analytic methods 1 6.7 addit analyt method
#> 51 content bearing words 1 6.7 content bear word
#> 52 corpus oriented methods 1 6.7 corpu orient method
#> 53 automatic keyword extraction 1 6.7 automat keyword extract
#> 54 document oriented methods 3 6.6 document orient method
#> 55 longer candidate keywords 2 6.5 longer candid keyword
#> 56 strong interest due 1 6.5 strong interest due
#> 57 algorithms minimal 1 6.5 algorithm minim
#> 58 comparable recall scores 1 6.4 compar recal score
#> 59 multi word keywords 1 6.3 multi word keyword
#> 60 top frequent words 1 6.3 top frequent word
#> 61 score candidate keywords 1 6.3 score candid keyword
#> 62 sample technical abstract 1 6.3 sampl technic abstract
#> 63 stoplists leverage manually 1 6.2 stoplist leverag manual
#> 64 keyword exclusivity exc 1 6.2 keyword exclus exc
#> 65 minimal generating 1 6.1 minim gener
#> 66 supervised learning 1 6.0 supervis learn
#> 67 upper 2 6.0 upper
#> 68 systems candidate keywords 1 6.0 system candid keyword
#> 69 corpus oriented statistics 1 6.0 corpu orient statist
#> 70 member keyword scores 1 5.8 member keyword score
#> 71 text analysis methods 1 5.7 text analysi method
#> 72 word frequency distributions 1 5.7 word frequenc distribut
#> 73 typical function words 1 5.6 typic function word
#> 74 compatibility components 1 5.5 compat compon
#> 75 testing set 1 5.5 test set
#> 76 systems criteria 1 5.4 system criteria
#> 77 manual assignment 1 5.3 manual assign
#> 78 keyword adjacency frequency 1 5.3 keyword adjac frequenc
#> 79 linear 2 5.3 linear
#> 80 content bearing 1 5.0 content bear
#> 81 specific languages 1 5.0 specif languag
#> 82 simple set 1 5.0 simpl set
#> 83 equation 1 4.8 equat
#> 84 text analysis applications 1 4.8 text analysi applic
#> 85 stoplist generation stoplists 1 4.8 stoplist gener stoplist
#> 86 keyword extraction 2 4.8 keyword extract
#> 87 keyword extracts 1 4.8 keyword extract
#> 88 sets solutions 1 4.8 set solut
#> 89 analytic methods 1 4.7 analyt method
#> 90 kyoto protocol 2 4.7 kyoto protocol
#> 91 word ranks 1 4.7 word rank
#> 92 carry meaning 1 4.5 carri mean
#> 93 summary features 1 4.5 summari featur
#> 94 total 1 4.5 total
#> 95 valencia base 1 4.5 valencia base
#> 96 inspec training 1 4.3 inspec train
#> 97 comparable recall 1 4.3 compar recal
#> 98 current state 2 4.2 current state
#> 99 candidate keyword 4 4.2 candid keyword
#> 100 candidate keywords 6 4.2 candid keyword
#> 101 methods inherently 1 4.2 method inher
#> 102 shorter keywords 1 4.2 shorter keyword
#> 103 minimal 5 4.1 minim
#> 104 scores higher 3 4.1 score higher
#> 105 sample abstract 4 4.1 sampl abstract
#> 106 technical abstracts 4 4.1 technic abstract
#> 107 stop word 1 4.0 stop word
#> 108 stop words 5 4.0 stop word
#> 109 actual instances 1 4.0 actual instanc
#> 110 auto matically 1 4.0 auto matic
#> 111 compact representation 1 4.0 compact represent
#> 112 configuration parameters 1 4.0 configur paramet
#> 113 digital libraries 1 4.0 digit librari
#> 114 discriminative fea 1 4.0 discrimin fea
#> 115 effi ciency 1 4.0 effi cienci
#> 116 fixed taxonomy 1 4.0 fix taxonomi
#> 117 future articles 1 4.0 futur articl
#> 118 good choices 1 4.0 good choic
#> 119 google search 1 4.0 googl search
#> 120 hand tuned 1 4.0 hand tune
#> 121 highest values 1 4.0 highest valu
#> 122 highly frequent 2 4.0 highli frequent
#> 123 mark dataset 1 4.0 mark dataset
#> 124 mathematical signatures 1 4.0 mathemat signatur
#> 125 news articles 3 4.0 new articl
#> 126 nical rigor 1 4.0 nical rigor
#> 127 notable difference 1 4.0 notabl differ
#> 128 previous work 1 4.0 previou work
#> 129 professional curators 1 4.0 profession curat
#> 130 scientific literature 1 4.0 scientif literatur
#> 131 search tasks 1 4.0 search task
#> 132 single pass 2 4.0 singl pass
#> 133 standard punctuation 1 4.0 standard punctuat
#> 134 syntactic filters 1 4.0 syntact filter
#> 135 web pages 1 4.0 web page
#> 136 number 4 3.9 number
#> 137 specific subjects 1 3.9 specif subject
#> 138 short abstracts 1 3.9 short abstract
#> 139 function words 1 3.9 function word
#> 140 domain independent 1 3.8 domain independ
#> 141 high recall 1 3.8 high recal
#> 142 inspec corpus 1 3.8 inspec corpu
#> 143 text mining 1 3.8 text mine
#> 144 term frequency 8 3.8 term frequenc
#> 145 term selection 1 3.8 term select
#> 146 score keywords 1 3.8 score keyword
#> 147 score words 1 3.8 score word
#> 148 word scores 1 3.8 word score
#> 149 false positives 1 3.8 fals posit
#> 150 input parameters 2 3.8 input paramet
#> 151 ishizuka state 1 3.8 ishizuka state
#> 152 specific domains 1 3.7 specif domain
#> 153 measure degrades 1 3.7 measur degrad
#> 154 analysis tools 1 3.7 analysi tool
#> 155 protein function 1 3.7 protein function
#> 156 statistical methods 1 3.7 statist method
#> 157 keyword anchors 1 3.7 keyword anchor
#> 158 keyword frequency 1 3.7 keyword frequenc
#> 159 potential keywords 1 3.7 potenti keyword
#> 160 content words 5 3.7 content word
#> 161 contiguous words 1 3.7 contigu word
#> 162 individual words 1 3.7 individu word
#> 163 select words 1 3.7 select word
#> 164 selectively words 1 3.7 select word
#> 165 single words 2 3.7 singl word
#> 166 word delimiters 2 3.7 word delimit
#> 167 benchmark evaluation 2 3.7 benchmark evalu
#> 168 climate change 2 3.7 climat chang
#> 169 computationally efficient 1 3.7 computation effici
#> 170 extremely efficient 1 3.7 extrem effici
#> 171 phrase delimiters 3 3.7 phrase delimit
#> 172 abstract text 1 3.7 abstract text
#> 173 typical abstract 1 3.6 typic abstract
#> 174 criteria 1 3.6 criteria
#> 175 diophantine 2 3.6 diophantin
#> 176 abstracts keywords 1 3.6 abstract keyword
#> 177 occurrence graph 1 3.5 occurr graph
#> 178 primary document’ 1 3.5 primari document’
#> 179 representative list 1 3.5 repres list
#> 180 search results 1 3.5 search result
#> 181 tarau report 1 3.5 tarau report
#> 182 uncontrolled key 1 3.5 uncontrol kei
#> 183 frequent terms 1 3.4 frequent term
#> 184 document frequency 2 3.4 document frequenc
#> 185 document increases 1 3.4 document increas
#> 186 individual documents 4 3.4 individu document
#> 187 large documents 1 3.4 larg document
#> 188 page document 1 3.4 page document
#> 189 single document 1 3.4 singl document
#> 190 multiple words 1 3.4 multipl word
#> 191 word positions 1 3.4 word posit
#> 192 essential content 1 3.4 essenti content
#> 193 essential ity 1 3.4 essenti iti
#> 194 higher precision 1 3.4 higher precis
#> 195 highest precision 1 3.4 highest precis
#> 196 index vocabulary 1 3.4 index vocabulari
#> 197 mixed types 1 3.4 mix type
#> 198 perfect precision 1 3.4 perfect precis
#> 199 professional indexer 1 3.4 profession index
#> 200 stoplist generation 1 3.4 stoplist gener
#> 201 worst stoplist 1 3.4 worst stoplist
#> 202 keywords frequently 1 3.3 keyword frequent
#> 203 metric scores 1 3.3 metric score
#> 204 pos tags 2 3.3 po tag
#> 205 vast collections 1 3.3 vast collect
#> 206 wide range 1 3.3 wide rang
#> 207 manual annotations 1 3.3 manual annot
#> 208 text analyses 1 3.3 text analys
#> 209 text stream 1 3.3 text stream
#> 210 keyword adjacency 3 3.3 keyword adjac
#> 211 words adjacent 1 3.3 word adjac
#> 212 positive results 1 3.2 posit result
#> 213 precision recall 1 3.2 precis recal
#> 214 document text 4 3.2 document text
#> 215 compatibility 2 3.2 compat
#> 216 textrank increases 1 3.2 textrank increas
#> 217 word associations 1 3.2 word associ
#> 218 word list 1 3.2 word list
#> 219 noun phrase 1 3.2 noun phrase
#> 220 performance advantage 1 3.2 perform advantag
#> 221 multiple types 1 3.1 multipl type
#> 222 input stoplist 1 3.1 input stoplist
#> 223 extraction 2 3.1 extract
#> 224 measure rake 1 3.0 measur rake
#> 225 information retrieval 1 3.0 inform retriev
#> 226 multiple corpora 1 3.0 multipl corpora
#> 227 present results 1 3.0 present result
#> 228 rake performance 1 3.0 rake perform
#> 229 set 11 3.0 set
#> 230 sets 1 3.0 set
#> 231 keyword edf 1 2.9 keyword edf
#> 232 dynamic collections 1 2.8 dynam collect
#> 233 ideal application 1 2.8 ideal applic
#> 234 stoplist rake 1 2.7 stoplist rake
#> 235 rake effectively 1 2.7 rake effect
#> 236 reference corpus 1 2.6 refer corpu
#> 237 languages 1 2.6 languag
#> 238 independent 1 2.5 independ
#> 239 algorithm 3 2.3 algorithm
#> 240 algorithms 4 2.3 algorithm
#> 241 approach 1 2.3 approach
#> 242 approaches 1 2.3 approach
#> 243 components 1 2.3 compon
#> 244 graph 1 2.3 graph
#> 245 longer 1 2.3 longer
#> 246 functionality 1 2.2 function
#> 247 method 7 2.2 method
#> 248 methods 5 2.2 method
#> 249 protocol 1 2.2 protocol
#> 250 score 2 2.1 score
#> 251 addition 1 2.0 addit
#> 252 articles 1 2.0 articl
#> 253 content 3 2.0 content
#> 254 contexts 2 2.0 context
#> 255 distributions 1 2.0 distribut
#> 256 due 1 2.0 due
#> 257 frequency 2 2.0 frequenc
#> 258 gener 1 2.0 gener
#> 259 general 2 2.0 gener
#> 260 generality 1 2.0 gener
#> 261 member 1 2.0 member
#> 262 research 1 2.0 research
#> 263 tion 1 2.0 tion
#> 264 automatically 6 1.9 automat
#> 265 abstract 2 1.9 abstract
#> 266 abstracts 1 1.9 abstract
#> 267 recall 2 1.8 recal
#> 268 manually 6 1.8 manual
#> 269 solutions 3 1.8 solut
#> 270 text 6 1.8 text
#> 271 texts 1 1.8 text
#> 272 system 8 1.8 system
#> 273 systems 13 1.8 system
#> 274 terms 3 1.8 term
#> 275 inputs 1 1.8 input
#> 276 multiple 1 1.8 multipl
#> 277 position 1 1.8 posit
#> 278 typically 2 1.8 typic
#> 279 measure 5 1.7 measur
#> 280 measurement 1 1.7 measur
#> 281 analysis 4 1.7 analysi
#> 282 keyword 11 1.7 keyword
#> 283 keywords 53 1.7 keyword
#> 284 word 5 1.7 word
#> 285 words 32 1.7 word
#> 286 change 1 1.7 chang
#> 287 efficiency 1 1.7 effici
#> 288 evaluation 1 1.7 evalu
#> 289 frequently 4 1.7 frequent
#> 290 performance 1 1.7 perform
#> 291 phrases 2 1.7 phrase
#> 292 pos 1 1.7 po
#> 293 tags 1 1.7 tag
#> 294 adjacent 5 1.6 adjac
#> 295 advantages 1 1.5 advantag
#> 296 analyses 1 1.5 analys
#> 297 annotation 1 1.5 annot
#> 298 associations 1 1.5 associ
#> 299 document’ 1 1.5 document’
#> 300 dynamically 1 1.5 dynam
#> 301 exclusive 1 1.5 exclus
#> 302 exclusivity 2 1.5 exclus
#> 303 ideally 1 1.5 ideal
#> 304 information 1 1.5 inform
#> 305 interest 3 1.5 interest
#> 306 ishizuka 1 1.5 ishizuka
#> 307 key 1 1.5 kei
#> 308 list 1 1.5 list
#> 309 lists 1 1.5 list
#> 310 nouns 1 1.5 noun
#> 311 presentation 1 1.5 present
#> 312 protein 1 1.5 protein
#> 313 results 3 1.5 result
#> 314 retrieval 1 1.5 retriev
#> 315 statistically 4 1.5 statist
#> 316 streams 1 1.5 stream
#> 317 subjects 1 1.5 subject
#> 318 summary 1 1.5 summari
#> 319 tarau 1 1.5 tarau
#> 320 valencia 1 1.5 valencia
#> 321 corpus 9 1.5 corpu
#> 322 document 9 1.4 document
#> 323 documents 27 1.4 document
#> 324 essential 2 1.4 essenti
#> 325 essentially 1 1.4 essenti
#> 326 indexes 1 1.4 index
#> 327 indexing 2 1.4 index
#> 328 precision 6 1.4 precis
#> 329 types 3 1.4 type
#> 330 stoplist 15 1.4 stoplist
#> 331 stoplists 16 1.4 stoplist
#> 332 application 2 1.3 applic
#> 333 applications 5 1.3 applic
#> 334 collections 4 1.3 collect
#> 335 domain 1 1.3 domain
#> 336 domains 3 1.3 domain
#> 337 effectively 1 1.3 effect
#> 338 effectiveness 1 1.3 effect
#> 339 widely 2 1.3 wide
#> 340 rake 20 1.3 rake
#> 341 corpora 3 1.2 corpora
#> 342 edf 3 1.2 edf
#> 343 metric 1 1.2 metric
#> 344 metrics 2 1.2 metric
#> 345 textrank 4 1.2 textrank
#> 346 occurrence 1 1.2 occurr
#> 347 occurrences 4 1.2 occurr
#> 348 refer 1 1.2 refer
#> 349 reference 1 1.2 refer
#> 350 references 2 1.2 refer
#> 351 referring 1 1.2 refer
#> 352 ability 2 1.0 abil
#> 353 absence 1 1.0 absenc
#> 354 abstract’ 1 1.0 abstract’
#> 355 adaptive 1 1.0 adapt
#> 356 adjectives 1 1.0 adject
#> 357 adjoin 1 1.0 adjoin
#> 358 aid 1 1.0 aid
#> 359 andrade 2 1.0 andrad
#> 360 arbitrarily 1 1.0 arbitrarili
#> 361 array 2 1.0 arrai
#> 362 art 1 1.0 art
#> 363 authors 2 1.0 author
#> 364 axis 1 1.0 axi
#> 365 bench 1 1.0 bench
#> 366 berry 1 1.0 berri
#> 367 bias 1 1.0 bia
#> 368 biases 1 1.0 bias
#> 369 bold 1 1.0 bold
#> 370 broadly 1 1.0 broadli
#> 371 challenge 1 1.0 challeng
#> 372 chunks 1 1.0 chunk
#> 373 circumstances 1 1.0 circumst
#> 374 clear 1 1.0 clear
#> 375 combination 1 1.0 combin
#> 376 common 1 1.0 common
#> 377 comparison 5 1.0 comparison
#> 378 construction 1 1.0 construct
#> 379 contrast 1 1.0 contrast
#> 380 control 1 1.0 control
#> 381 convergence 1 1.0 converg
#> 382 counts 1 1.0 count
#> 383 creation 2 1.0 creation
#> 384 deg 11 1.0 deg
#> 385 degree 2 1.0 degre
#> 386 desire 1 1.0 desir
#> 387 details 1 1.0 detail
#> 388 discrepancies 1 1.0 discrep
#> 389 drawbacks 1 1.0 drawback
#> 390 drop 1 1.0 drop
#> 391 early 1 1.0 earli
#> 392 easily 2 1.0 easili
#> 393 easy 1 1.0 easi
#> 394 engel 1 1.0 engel
#> 395 envision 1 1.0 envis
#> 396 evident 1 1.0 evid
#> 397 evil 1 1.0 evil
#> 398 expectation 1 1.0 expect
#> 399 finally 2 1.0 final
#> 400 form 1 1.0 form
#> 401 fox 1 1.0 fox
#> 402 fox’ 1 1.0 fox’
#> 403 frees 1 1.0 free
#> 404 freq 9 1.0 freq
#> 405 gle 1 1.0 gle
#> 406 grams 1 1.0 gram
#> 407 greater 1 1.0 greater
#> 408 gutwin 1 1.0 gutwin
#> 409 hulth 2 1.0 hulth
#> 410 hyperlinks 1 1.0 hyperlink
#> 411 implementations 1 1.0 implement
#> 412 inaccessible 1 1.0 inaccess
#> 413 introduction 1 1.0 introduct
#> 414 intuition 1 1.0 intuit
#> 415 iterations 1 1.0 iter
#> 416 jones 2 1.0 jone
#> 417 judgment 1 1.0 judgment
#> 418 kee 1 1.0 kee
#> 419 keyphind 1 1.0 keyphind
#> 420 lack 1 1.0 lack
#> 421 limits 1 1.0 limit
#> 422 manner 1 1.0 manner
#> 423 material 1 1.0 materi
#> 424 matsuo 2 1.0 matsuo
#> 425 meaningful 2 1.0 meaning
#> 426 meaningless 1 1.0 meaningless
#> 427 michael 1 1.0 michael
#> 428 mihal 1 1.0 mihal
#> 429 mihalcea 1 1.0 mihalcea
#> 430 miss 1 1.0 miss
#> 431 motivation 1 1.0 motiv
#> 432 observation 1 1.0 observ
#> 433 occur 1 1.0 occur
#> 434 order 4 1.0 order
#> 435 pairs 1 1.0 pair
#> 436 part 1 1.0 part
#> 437 paynter 1 1.0 paynter
#> 438 phrasier 1 1.0 phrasier
#> 439 practice 1 1.0 practic
#> 440 predominantly 1 1.0 predominantli
#> 441 primarily 2 1.0 primarili
#> 442 purposes 2 1.0 purpos
#> 443 queries 1 1.0 queri
#> 444 quickly 1 1.0 quickli
#> 445 rake’ 2 1.0 rake’
#> 446 rarely 1 1.0 rare
#> 447 rate 1 1.0 rate
#> 448 ratio 1 1.0 ratio
#> 449 rdf 4 1.0 rdf
#> 450 reasoning 1 1.0 reason
#> 451 relationship 1 1.0 relationship
#> 452 reliable 1 1.0 reliabl
#> 453 remarkably 1 1.0 remark
#> 454 resources 1 1.0 resourc
#> 455 risk 1 1.0 risk
#> 456 runs 1 1.0 run
#> 457 salton 1 1.0 salton
#> 458 section 1 1.0 section
#> 459 sections 1 1.0 section
#> 460 sentences 1 1.0 sentenc
#> 461 sequence 2 1.0 sequenc
#> 462 sequences 2 1.0 sequenc
#> 463 series 1 1.0 seri
#> 464 share 1 1.0 share
#> 465 significance 1 1.0 signific
#> 466 simplicity 2 1.0 simplic
#> 467 small 1 1.0 small
#> 468 solely 1 1.0 sole
#> 469 speech 1 1.0 speech
#> 470 style 1 1.0 style
#> 471 subset 3 1.0 subset
#> 472 suitable 1 1.0 suitabl
#> 473 sum 2 1.0 sum
#> 474 sys 1 1.0 sy
#> 475 tech 1 1.0 tech
#> 476 techniques 2 1.0 techniqu
#> 477 tems 1 1.0 tem
#> 478 theory 1 1.0 theori
#> 479 time 3 1.0 time
#> 480 times 1 1.0 time
#> 481 title 1 1.0 titl
#> 482 topic 1 1.0 topic
#> 483 tures 1 1.0 ture
#> 484 uninformative 1 1.0 uninform
#> 485 unique 1 1.0 uniqu
#> 486 unsupervised 1 1.0 unsupervis
#> 487 user 1 1.0 user
#> 488 users 2 1.0 user
#> 489 utility 1 1.0 util
#> 490 variability 1 1.0 variabl
#> 491 variety 1 1.0 varieti
#> 492 volume 1 1.0 volum
#> 493 whitney 1 1.0 whitnei
#> 494 window 2 1.0 window