R - Matching a list of phrases to a corpus of documents and returning phrase frequency


I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases may or may not be present in the corpus. I want to find the term frequency of each phrase that is present in the corpus.

An example dataset:

phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
doc1 <- "if you're starting workout, begin slow."
doc2 <- "don't jump in brain initial , try operate several kilometers without need of worked out before."
doc3 <- "it possible end injuring on own , carrying out more damage good."
doc4 <- "instead start brief stroll , gradually boost duration along speed."
doc5 <- "before know you'll working 5 miles without problems."

I am new to text analytics in R, and have approached this problem along the lines of Tyler Rinker's solution to "R Text Mining: Counting the number of times a specific word appears in a corpus?".

Here's my approach so far:

library(tm)
library(qdap)

docs <- c(doc1, doc2, doc3, doc4, doc5)

# Clean the text: drop stopwords and punctuation, then lowercase
text <- removeWords(docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)

corp <- Corpus(VectorSource(text))
phrases <- tolower(phrases)

# Count each phrase per document, then write the result to CSV
word.freq <- apply_as_df(corp, termco_d, match.string = phrases)
mcsv_w(word.freq, dir = NULL, open = T, sep = ", ", dataframes = NULL,
       pos = 1, envir = as.environment(pos))

When I export the results to CSV, it only tells me whether phrase 1 is present in any of the docs or not.

I'm expecting output like the one below (excluding non-matching phrases):

docs      phrase1     phrase2    phrase3    phrase4    phrase5
1         0           1          2          0          0
2         1           0          0          1          0
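One way to get a documents-by-phrases count table of this shape is plain literal matching in base R. The following is only a minimal sketch, not the tm or qdap approach used above: `count_phrases` is a hypothetical helper name, and the sketch assumes that case-insensitive, non-overlapping substring matches are what should be counted.

```r
# Sample data from the question
phrases <- c("just starting", "several kilometers", "brief stroll",
             "gradually boost", "5 miles", "dark night", "cold morning")
docs <- c(
  "if you're starting workout, begin slow.",
  "don't jump in brain initial , try operate several kilometers without need of worked out before.",
  "it possible end injuring on own , carrying out more damage good.",
  "instead start brief stroll , gradually boost duration along speed.",
  "before know you'll working 5 miles without problems."
)

# Hypothetical helper (not part of tm or qdap): returns a matrix with one
# row per document and one column per phrase that occurs at least once.
count_phrases <- function(docs, phrases) {
  docs <- tolower(docs)
  phrases <- tolower(phrases)
  counts <- vapply(phrases, function(p) {
    # Literal (non-regex) search of one phrase across all documents
    hits <- gregexpr(p, docs, fixed = TRUE)
    # gregexpr() yields -1 when nothing matches, so count positive positions only
    vapply(hits, function(h) sum(h > 0), integer(1))
  }, integer(length(docs)))
  # Force a matrix even when there is a single document
  counts <- matrix(counts, nrow = length(docs),
                   dimnames = list(seq_along(docs), phrases))
  # Exclude phrases that never occur in the corpus
  counts[, colSums(counts) > 0, drop = FALSE]
}

res <- count_phrases(docs, phrases)
print(res)
```

Because the match is a plain substring search, "5 miles" would also be counted inside "15 miles"; word-boundary handling would need a regular expression instead. The sketch builds a dense matrix and is untested at the 100k phrases by 60k documents scale mentioned above, where a sparse representation would likely be needed.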

I tried this approach and can't replicate the issue.

Using:

library(tm)
library(qdap)

docs <- c(doc1, doc2, doc3, doc4, doc5)
text <- removeWords(docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)

corp <- Corpus(VectorSource(text))
phrases <- tolower(phrases)

word.freq <- apply_as_df(corp, termco_d, match.string = phrases)
mcsv_w(word.freq, dir = NULL, open = T, sep = ", ", dataframes = NULL,
       pos = 1, envir = as.environment(pos))

I get the following CSV:

docs  word.count  term(just starting)  term(several kilometers)  term(brief stroll)  term(gradually boost)  term(5 miles)  term(dark night)  term(cold morning)
1     7           1                    0                         0                   0                      0              0                 0
2     12          0                    1                         0                   0                      0              0                 0
3     7           0                    0                         0                   0                      0              0                 0
4     9           0                    0                         1                   1                      0              0                 0
5     7           0                    0                         0                   0                      0              0                 0
