r - Matching a list of phrases to a corpus of documents and returning phrase frequency -

May 15, 2015

I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases might or might not be present in the corpus. I'm looking to find the term frequency of each phrase that is present in the corpus.

An example dataset:

phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
doc1 <- "if you're starting workout, begin slow."
doc2 <- "don't jump in brain initial , try operate several kilometers without need of worked out before."
doc3 <- "it possible end injuring on own , carrying out more damage good."
doc4 <- "instead start brief stroll , gradually boost duration along speed."
doc5 <- "before know you'll working 5 miles without problems."

I am new to text analytics in R and have approached this problem along the lines of Tyler Rinker's solution to "R Text Mining: Counting the number of times a specific word appears in a corpus?".

Here's my approach so far:

library(tm)
library(qdap)

# Combine the documents and clean the text
docs <- c(doc1, doc2, doc3, doc4, doc5)
text <- removeWords(docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)

# Build a tm corpus and count each phrase per document
corp <- Corpus(VectorSource(text))
phrases <- tolower(phrases)
word.freq <- apply_as_df(corp, termco_d, match.string = phrases)

# Write the result to CSV and open it
mcsv_w(word.freq, dir = NULL, open = TRUE, sep = ", ", dataframes = NULL,
       pos = 1, envir = as.environment(pos))

When I export the results to a CSV, it only tells me whether phrase 1 is present in any of the docs or not.

I'm expecting output like the table below (excluding the non-matching phrases):

docs      phrase1     phrase2    phrase3    phrase4    phrase5
1         0           1          2          0          0
2         1           0          0          1          0
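For reference, the expected shape (one row per document, one column per matching phrase) can be sketched in base R without tm or qdap. This is only an illustration under stated assumptions: the helper name `count_phrases` is hypothetical, the toy documents are made up, and matching is literal substring matching on lowercased text with no word-boundary check. With 100k+ phrases and 60k+ documents, this one-pass-per-phrase approach would likely be too slow as written.

```r
# Hypothetical helper (not part of tm or qdap): count literal occurrences
# of each phrase in each document and drop phrases that never match.
count_phrases <- function(docs, phrases) {
  docs <- tolower(docs)
  phrases <- tolower(phrases)
  counts <- sapply(phrases, function(p) {
    # gregexpr() returns -1 when there is no match, so count positive positions only
    vapply(gregexpr(p, docs, fixed = TRUE),
           function(m) sum(m > 0), integer(1))
  })
  # Force a docs x phrases matrix even when there is a single document
  counts <- matrix(counts, nrow = length(docs),
                   dimnames = list(as.character(seq_along(docs)), phrases))
  # Keep only the phrases that occur at least once in the corpus
  counts[, colSums(counts) > 0, drop = FALSE]
}

# Toy data, invented for illustration only
toy_docs <- c("a brief stroll, then another brief stroll",
              "run 5 miles on a cold morning",
              "nothing relevant here")
toy_phrases <- c("brief stroll", "5 miles", "dark night", "cold morning")

result <- count_phrases(toy_docs, toy_phrases)
print(result)
```

Because the match is a plain substring search, "5 miles" would also be counted inside "15 miles"; a word-boundary regular expression would be needed if that matters.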

I tried this approach and can't replicate the problem.

Using:

library(tm)
library(qdap)

docs <- c(doc1, doc2, doc3, doc4, doc5)
text <- removeWords(docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)

corp <- Corpus(VectorSource(text))
phrases <- tolower(phrases)
word.freq <- apply_as_df(corp, termco_d, match.string = phrases)

mcsv_w(word.freq, dir = NULL, open = TRUE, sep = ", ", dataframes = NULL,
       pos = 1, envir = as.environment(pos))

I get the following CSV:

docs  word.count  term(just starting)  term(several kilometers)  term(brief stroll)  term(gradually boost)  term(5 miles)  term(dark night)  term(cold morning)
1     7           1                    0                         0                   0                      0              0                 0
2     12          0                    1                         0                   0                      0              0                 0
3     7           0                    0                         0                   0                      0              0                 0
4     9           0                    0                         1                   1                      0              0                 0
5     7           0                    0                         0                   0                      0              0                 0
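The CSV still carries columns for phrases that never match, while the question asks for those to be excluded. A minimal sketch of that filtering step follows. It works on a hand-built copy of the table, with values taken from the CSV output; whether the object returned by `apply_as_df` is a plain data frame with exactly these column names is an assumption, and the name `matched` is illustrative.

```r
# Hand-built copy of the CSV table (assumption: the real qdap result
# can be treated as a plain data frame with these column names).
word.freq <- data.frame(
  docs = 1:5,
  word.count = c(7, 12, 7, 9, 7),
  `term(just starting)` = c(1, 0, 0, 0, 0),
  `term(several kilometers)` = c(0, 1, 0, 0, 0),
  `term(brief stroll)` = c(0, 0, 0, 1, 0),
  `term(gradually boost)` = c(0, 0, 0, 1, 0),
  `term(5 miles)` = c(0, 0, 0, 0, 0),
  `term(dark night)` = c(0, 0, 0, 0, 0),
  `term(cold morning)` = c(0, 0, 0, 0, 0),
  check.names = FALSE
)

# Identify the phrase columns, then keep only those with a non-zero total
term_cols <- grepl("^term\\(", names(word.freq))
keep <- !term_cols
keep[term_cols] <- colSums(word.freq[, term_cols, drop = FALSE]) > 0

matched <- word.freq[, keep, drop = FALSE]
print(names(matched))
```

The `docs` and `word.count` columns are always kept; only the `term(...)` columns are tested against zero.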
