c++ - How to split a sentence with an escaped whitespace?

April 15, 2011

I want to split a sentence using whitespace as the delimiter, except for escaped whitespace. Using boost::split or a regex, how can I split it? If that is not possible, how else can it be done?

example:

std::string sentence = "my dog fluffy\\ cake likes to jump";

result:
my
dog
fluffy\ cake
likes
to
jump

Three implementations:

With Boost Spirit
With Boost Regex
Handwritten parser

With Boost Spirit

Here's how I'd do it with Boost Spirit. It might seem like overkill, but experience teaches me that once you're splitting input text, you will likely require more parsing logic.

Boost Spirit shines when you scale from "just splitting tokens" to a real grammar with production rules.


    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string const sentence = "my dog fluffy\\ cake likes to jump";
        using It = std::string::const_iterator;
        It f = sentence.begin(), l = sentence.end();

        std::vector<std::string> words;

        // Each word is a lexeme: a run of escaped characters or printable
        // non-space characters. Whitespace between words is skipped.
        bool ok = qi::phrase_parse(f, l,
                *qi::lexeme [ +('\\' >> qi::char_ | qi::graph) ], // words
                qi::space - "\\ ", // skipper
                words);

        if (ok) {
            std::cout << "parsed:\n";
            for (auto& w : words)
                std::cout << "\t'" << w << "'\n";
        } else {
            std::cout << "parse failed\n";
        }

        // Report any input the parser did not consume
        if (f != l)
            std::cout << "remaining unparsed: '" << std::string(f,l) << "'\n";
    }

With Boost Regex

This looks succinct, but it

requires linking to boost_regex
uses a "black magic" negative lookbehind assertion: http://www.regular-expressions.info/lookaround.html


    #include <iostream>
    #include <string>
    #include <vector>
    #include <boost/regex.hpp>
    #include <boost/algorithm/string_regex.hpp>

    int main() {
        std::string const sentence = "my dog fluffy\\ cake likes to jump";

        std::vector<std::string> words;
        // Split on any whitespace character that is not preceded by a backslash
        boost::algorithm::split_regex(words, sentence, boost::regex("(?<!\\\\)\\s"), boost::match_default);

        for (auto& w : words)
            std::cout << " '" << w << "'\n";
    }

Using C++11 raw string literals, you can write the regular expression less obscurely: boost::regex(R"((?<!\\)\s)"), meaning "any whitespace not following a backslash".

Handwritten parser

This is somewhat more tedious but, like the Spirit grammar, it is generic and allows nice performance.

However, it doesn't scale as gracefully as the Spirit approach once you start adding complexity to your grammar. An advantage is that you spend less time compiling the code than with the Spirit version.


    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>

    template <typename It, typename Out>
    Out tokens(It f, It l, Out out) {
        std::string accum;
        // Emit the accumulated token, if any, and start a new one
        auto flush = [&] {
            if (!accum.empty()) {
                *out++ = accum;
                accum.resize(0);
            }
        };

        while (f!=l) {
            switch(*f) {
                case '\\':
                    if (++f!=l && *f==' ') {
                        accum += ' ';
                        ++f; // consume the escaped space so it does not end the token
                    } else {
                        accum += '\\';
                    }
                    break;
                case ' ': case '\t': case '\r': case '\n':
                    ++f;
                    flush();
                    break;
                default:
                    accum += *f++;
            }
        }
        flush();
        return out;
    }

    int main() {
        std::string const sentence = "my dog fluffy\\ cake likes to jump";

        std::vector<std::string> words;

        tokens(sentence.begin(), sentence.end(), std::back_inserter(words));

        for (auto& w : words)
            std::cout << "\t'" << w << "'\n";
    }
