This vignette presents three examples of using slowrake() to find keywords in text. Each example runs RAKE on a different type of document: a webpage, a set of patent abstracts, and a journal article.

Webpage

1. Download the HTML and run RAKE

# Load the libraries needed for all three applications
library(slowraker)
library(httr)
library(xml2)
library(patentsview)
library(dplyr)
library(pdftools)
library(stringr)
library(knitr)

# The webpage of interest - slowraker's "Getting started" page
url <- "https://crew102.github.io/slowraker/articles/getting-started.html"

GET(url) %>% 
  content("text") %>%                                # Extract the response body as text
  read_html() %>%                                    # Parse the HTML
  xml_find_all(".//p") %>%                           # Find the paragraph nodes
  xml_text() %>%                                     # Pull out the paragraphs' text
  paste(collapse = " ") %>%                          # Collapse into one string
  iconv(from = "UTF-8", to = "ASCII", sub = '"') %>% # Replace non-ASCII chars (e.g., curly quotes)
  slowrake() %>%                                     # Run RAKE
  .[[1]]                                             # Extract results for the first (and only) document
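
slowrake() also has a few knobs worth knowing about. For example, you can require longer words or skip stemming. The sketch below is illustrative; word_min_char and stem are slowrake() arguments at the time of writing.

# Save the cleaned page text, then run RAKE with non-default settings
txt <- GET(url) %>%
  content("text") %>%
  read_html() %>%
  xml_find_all(".//p") %>%
  xml_text() %>%
  paste(collapse = " ")

# Keep only words with 4+ characters and skip stemming
slowrake(txt, word_min_char = 4, stem = FALSE)[[1]]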

Patent abstracts

1. Download patent data

# Download data from the PatentsView API for 10 patents with the phrase
# "keyword extraction" in their abstracts
pv_data <- search_pv(
  query = qry_funs$text_phrase(patent_abstract = "keyword extraction"),
  fields = c("patent_number", "patent_title", "patent_abstract"),
  per_page = 10
)
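
If you're curious what qry_funs$text_phrase() actually sends to the API, you can print the query object; it should render as JSON along these lines (illustrative output):

qry_funs$text_phrase(patent_abstract = "keyword extraction")
#> {"_text_phrase":{"patent_abstract":"keyword extraction"}}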

# Look at the data
patents <- pv_data$data$patents
kable(head(patents, n = 2))
| patent_number | patent_title | patent_abstract |
|:--------------|:-------------|:----------------|
| 5689583 | Character recognition apparatus using a keyword | A character recognition unit recognizes a document image to output character candidates; a character correction unit selects a corrected character string which is correct with respect to grammar and vocabulary, from a set of character candidates from the character recognition unit; a keyword extraction unit extracts keywords of a document to be recognized, from the corrected character string; and wherein the character correction unit selects the corrected character string by a use of BUNSETSU evaluation representing correctness with respect to grammar and vocabulary, and an evaluation of the BUNSETSU is increased when the BUNSETSU has the keyword outputted from the keyword extraction unit. |
| 5819261 | Method and apparatus for extracting a keyword from scheduling data using the keyword for searching the schedule data file | This invention provides an information processing method and apparatus, which can automatically set a word, which has already been electronically stored, as a search keyword, and can perform a search operation. For this purpose, in an information search apparatus for searching a data file for desired data, and reading out the desired data, input text data is stored in a data storage area, and when extraction of a search keyword is instructed, a search keyword extraction program automatically extracts a keyword used for search from the text data stored in the data storage area in response to the instruction. A multimedia data file stored in a nonvolatile storage medium is searched based on the extracted keyword. |

2. Run RAKE on the abstracts

rakelist <- slowrake(
  patents$patent_abstract,
  stop_words = c("method", smart_words), # Consider "method" to be a stop word
  stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)] # Consider all non-nouns to be stop words
)
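
pos_tags is a small data frame that ships with slowraker, and the grepl() call above treats a word as a stop word unless its part-of-speech tag starts with "N" (i.e., unless it's a noun). To see which tags survive the filter (illustrative; the tags follow the Penn Treebank convention):

# Tags that are NOT treated as stop words by the call above
pos_tags$tag[grepl("^N", pos_tags$tag)]
#> Expect the Penn Treebank noun tags (e.g., "NN" "NNP" "NNPS" "NNS")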

# Create a single data frame with all patents' keywords
out_rake <- rbind_rakelist(rakelist, doc_id = patents$patent_number)
out_rake
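
out_rake is a single data frame with a doc_id column, so the usual dplyr verbs apply to it. For example, a quick tally of how many keywords were extracted per patent (an illustrative snippet):

out_rake %>% count(doc_id, sort = TRUE)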

3. Show each patent’s top keyword

out_rake %>% 
  group_by(doc_id) %>% 
  arrange(desc(score)) %>%
  slice(1) %>% 
  inner_join(patents, by = c("doc_id" = "patent_number")) %>% 
  rename(patent_number = doc_id, top_keyword = keyword) %>% 
  select(matches("number|title|keyword")) %>%
  head() %>%
  kable()
| patent_number | top_keyword | patent_title |
|:--------------|:------------|:-------------|
| 5689583 | character recognition unit | Character recognition apparatus using a keyword |
| 5819261 | search keyword extraction program | Method and apparatus for extracting a keyword from scheduling data using the keyword for searching the schedule data file |
| 6173251 | character type segmentation point | Keyword extraction apparatus, keyword extraction method, and computer readable recording medium storing keyword extraction program |
| 6243723 | classification item selection apparatus | Document classification apparatus |
| 6334104 | output sound control device | Sound effects affixing system and sound effects affixing method |
| 6470307 | list forms | Method and apparatus for automatically identifying keywords within a document |

Journal article

1. Get PDF text

# The journal article of interest - Rose et al. (i.e., the original RAKE publication)
url <- "http://media.wiley.com/product_data/excerpt/22/04707498/0470749822.pdf"

# Download file and pull out text layer from PDF
destfile <- tempfile()
GET(url, write_disk(destfile))
raw_txt <- pdf_text(destfile)
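
pdf_text() returns a character vector with one string per page, which is why the next step pastes the pages together before cleaning. A couple of quick illustrative checks:

length(raw_txt)            # Number of pages extracted from the PDF
substr(raw_txt[1], 1, 80)  # First 80 characters of page 1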

2. Apply basic text cleaning

# Helper function for text removal
sub_all <- function(regex_vec, txt) {
  pattern <- paste0(regex_vec, collapse = "|")
  gsub(pattern, " ", txt)
}
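
sub_all() collapses the patterns into a single alternation so that one gsub() call removes them all. A toy illustration (hypothetical strings):

sub_all(c("foo", "bar[0-9]+"), "keep foo drop bar42 end")
#> [1] "keep   drop   end"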

# Collapse the pages into one string and normalize whitespace
txt1 <- paste0(raw_txt, collapse = " ") %>% 
  gsub("\\r\\n", " ", .) %>% 
  gsub("[[:space:]]{2,}", " ", .)

# Regex to capture text that we don't want to run RAKE on
remove1 <- "Acknowledgements.*"
remove2 <- "TEXT MINING"
remove3 <- "AUTOMATIC KEYWORD EXTRACTION"

txt2 <- sub_all(c(remove1, remove2, remove3), txt1)

3. Detect and remove tables

There are some sections of the PDF’s text that we don’t want to run RAKE on, such as the text found in tables. The problem with tables is that they usually lack typical phrase delimiters (e.g., periods and commas); instead, the table’s cells act as implicit delimiters. Reliably parsing cells out of a PDF is difficult, though, so we’ll simply try to identify and remove the tables themselves.1
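
To see why this matters, compare RAKE’s output on punctuated prose versus a delimiter-free, table-like string (a toy sketch with made-up strings; stop_pos = NULL turns off part-of-speech filtering so the example is deterministic):

# With punctuation, RAKE has phrase boundaries to work with
slowrake("precision, recall, and keyword frequency.", stop_pos = NULL)[[1]]

# Without delimiters, everything between stop words lumps into one run-on phrase
slowrake("precision recall corpus stoplist keyword frequency", stop_pos = NULL)[[1]]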

The tables in this article mostly contain numbers. If we split the article into text chunks using a digit delimiter, it’s likely that most of a table’s chunks will be relatively small in size. We can use this fact to help us identify which text chunks correspond to tables and which correspond to paragraphs.

# Numbers generally appear in this article's paragraphs in two ways: when the
# authors refer to results in a specific table/figure (e.g., "the sample
# abstract shown in Figure 1.1"), and when the authors reference another
# article (e.g., "Andrade and Valencia (1998) base their approach"). Remove
# these instances so that paragraphs don't get split into small chunks, which
# would make them hard to tell apart from tables.
remove4 <- "(Table|Figure) [[:digit:].]{1,}"
remove5 <- "\\([12][[:digit:]]{3}\\)"
txt3 <- sub_all(c(remove4, remove5), txt2)

# Split text into chunks based on digit delimiter
txt_chunks <- unlist(strsplit(txt3, "[[:digit:]]"))

# Use number of alpha chars found in a chunk as an indicator of its size
num_alpha <- str_count(txt_chunks, "[[:alpha:]]")

# Use kmeans to distinguish tables from paragraphs
clust <- kmeans(num_alpha, centers = 2)
good_clust <- which(max(clust$centers) == clust$centers)

# Only keep chunks that are thought to be paragraphs
good_chunks <- txt_chunks[clust$cluster == good_clust]
final_txt <- paste(good_chunks, collapse = " ")
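
Before trusting the split, it’s worth a quick sanity check on the two clusters (illustrative):

table(clust$cluster)  # How many chunks fell into each cluster
clust$centers         # Mean alpha-character counts; the larger center should be paragraph-like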

4. Run RAKE

rakelist <- slowrake(final_txt)

kable(head(rakelist[[1]]))
| keyword | freq | score | stem |
|:--------|-----:|------:|:-----|
| minimal set linear constraints linear constraints natural numbers strict inequations strict inequations nonstrict inequations nonstrict inequations upper bounds upper bounds | 1 | 114 | minim set linear constraint linear constraint natur number strict inequ strict inequ nonstrict inequ nonstrict inequ upper bound upper bound |
| bounds criteria linear minimal natural nonstrict numbers | 1 | 38 | bound criteria linear minim natur nonstrict number |
| sets linear diophantine equations linear diophantine equations minimal | 1 | 32 | set linear diophantin equat linear diophantin equat minim |
| extracted correct keywords keywords stoplist method size total | 1 | 28 | extract correct keyword keyword stoplist method size total |
| candidate keyword compatibility systems linear constraints | 1 | 22 | candid keyword compat system linear constraint |
| criteria compatibility system linear diophantine equations | 1 | 22 | criteria compat system linear diophantin equat |

5. Filter out bad keywords

The fact that some of the keywords shown above are very long suggests we missed something in Step 4. It turns out that our method mistook one of the article’s tables (Table 1.1) for a paragraph. Table 1.1 is somewhat atypical in that it doesn’t contain any numbers, so it makes sense that our method missed it.

To clean up the results, let’s apply an ad hoc filter to the keywords. This filter removes keywords whose length suggests that a phrase run-on occurred, and hence that the keyword is no good.

# Method to remove keywords that occur only once and have more than
# max_word_cnt member words
filter.rakelist <- function(x, max_word_cnt = 3) {
  structure(
    lapply(x, function(r) {
      word_cnt <- str_count(r$keyword, " ") + 1
      to_filter <- r$freq == 1 & word_cnt > max_word_cnt
      r[!to_filter, ]
    }),
    class = c("rakelist", "list")
  )
}

# Create a generic so that filter() dispatches on the rakelist class. The ...
# lets callers pass max_word_cnt through to the method. Note that this masks
# dplyr::filter().
filter <- function(x, ...) UseMethod("filter")

filter(rakelist)[[1]]
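
Because the generic forwards ..., you can also tune the cutoff (illustrative):

filter(rakelist, max_word_cnt = 5)[[1]]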

  1. There are better solutions for identifying and parsing tables in PDFs than the one I use here (e.g., the tabula Java library and its corresponding R wrapper, tabulizer).