R - Matching a list of phrases to a corpus of documents and returning phrase frequency
I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases may or may not be present in the corpus. I'm looking to find the term frequency of each phrase that is present in the corpus.
An example dataset:
    phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
    doc1 <- "if you're starting workout, begin slow."
    doc2 <- "don't jump in brain initial , try operate several kilometers without need of worked out before."
    doc3 <- "it possible end injuring on own , carrying out more damage good."
    doc4 <- "instead start brief stroll , gradually boost duration along speed."
    doc5 <- "before know you'll working 5 miles without problems."
I am new to text analytics in R and have approached the problem along the lines of Tyler Rinker's solution to "R Text Mining: Counting the number of times a specific word appears in a corpus?".
Here's my approach so far:
    library(tm)
    library(qdap)

    # Clean the documents: drop stopwords and punctuation, then lowercase
    docs <- c(doc1, doc2, doc3, doc4, doc5)
    text <- removeWords(docs, stopwords("english"))
    text <- removePunctuation(text)
    text <- tolower(text)

    # Build the corpus and count phrase matches per document
    corp <- Corpus(VectorSource(text))
    phrases <- tolower(phrases)
    word.freq <- apply_as_df(corp, termco_d, match.string = phrases)

    # Export the result to CSV
    mcsv_w(word.freq, dir = NULL, open = TRUE, sep = ", ", dataframes = NULL, pos = 1, envir = as.environment(pos))
When I export the results to CSV, it only gives me whether phrase 1 is present in any of the docs or not.
I'm expecting output like the one below (excluding non-matching phrases):
    docs  phrase1  phrase2  phrase3  phrase4  phrase5
    1     0        1        2        0        0
    2     1        0        0        1        0
I tried this approach but can't replicate the result.
Using:
    library(tm)
    library(qdap)
    docs <- c(doc1, doc2, doc3, doc4, doc5)
    text <- removeWords(docs, stopwords("english"))
    text <- removePunctuation(text)
    text <- tolower(text)
    corp <- Corpus(VectorSource(text))
    phrases <- tolower(phrases)
    word.freq <- apply_as_df(corp, termco_d, match.string = phrases)
    mcsv_w(word.freq, dir = NULL, open = TRUE, sep = ", ", dataframes = NULL, pos = 1, envir = as.environment(pos))
I get the following CSV:
    docs  word.count  term(just starting)  term(several kilometers)  term(brief stroll)  term(gradually boost)  term(5 miles)  term(dark night)  term(cold morning)
    1     7           1                    0                         0                   0                      0              0                 0
    2     12          0                    1                         0                   0                      0              0                 0
    3     7           0                    0                         0                   0                      0              0                 0
    4     9           0                    0                         1                   1                      0              0                 0
    5     7           0                    0                         0                   0                      0              0                 0
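For comparison, here is a minimal base R sketch of the output I'm after: a document-by-phrase table of counts, with phrases that never match dropped. It is not the qdap/tm method used above. `count_phrases` is a hypothetical helper name, and the three short documents are made up so that one phrase occurs twice. It assumes literal, case-insensitive, non-overlapping substring matching, so "5 miles" would also be counted inside "15 miles".

```r
# Hypothetical helper: count literal occurrences of each phrase in each document.
count_phrases <- function(docs, phrases) {
  docs <- tolower(docs)
  phrases <- tolower(phrases)

  # One row per document, one column per phrase
  counts <- matrix(0L, nrow = length(docs), ncol = length(phrases),
                   dimnames = list(docs = seq_along(docs), phrase = phrases))

  for (j in seq_along(phrases)) {
    # fixed = TRUE treats the phrase as a literal string, not a regex
    hits <- gregexpr(phrases[j], docs, fixed = TRUE)
    # gregexpr() returns -1 when there is no match, so count positive starts only
    counts[, j] <- vapply(hits, function(h) sum(h > 0), integer(1))
  }

  # Keep only the phrases that occur somewhere in the corpus
  counts[, colSums(counts) > 0, drop = FALSE]
}

# Made-up example data (not the documents from the question)
example_docs <- c("a brief stroll, then another brief stroll",
                  "walk 5 miles on a cold morning",
                  "nothing relevant here")
example_phrases <- c("brief stroll", "5 miles", "dark night", "cold morning")

res <- count_phrases(example_docs, example_phrases)
print(res)
```

The result is an ordinary matrix, so it can be written out with `write.csv(res, "phrase_counts.csv")`. With 100k+ phrases and 60k+ documents, a dense matrix and a per-phrase loop may well be too large or too slow, so this only illustrates the counting logic.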