unicode - write() in Python/Spyder: 'charmap' codec can't encode character u'\u0142' (UnicodeEncodeError)
I'm having a little trouble with my program (Python 2.7), and after checking other similar questions on this website, I still can't find a solution. I'll show my attempted solutions and thoughts below.
I'm not sure if it matters, but the dataset I'm working with is the Yelp Challenge dataset. I don't plan on submitting this work to the Yelp challenge.
First, I load the JSON file into a pandas DataFrame. The following code is meant to take the text (100k reviews), lowercase it, stem it, join the stemmed text into one line per observation, and write it to a text file:
    reviews = df.text.tolist()
    reviews = [x.lower() for x in reviews]
    revsub = reviews[0:100000]
    lrev = [[stem(word) for word in re.compile("\w+", re.UNICODE).split(sentence)] for sentence in revsub]
    testt = [" ".join(review) for review in lrev]
    f2 = open("yelpreviewsparagragh.txt", "w")
    f2.write("\n".join(str(x) for x in testt))
    f2.close()
This gives the following error:
    f2 = open("yelpreviewsparagragh.txt", "w")
    f2.write("\n".join(str(x) for x in testt))
    f2.close()
    Traceback (most recent call last):
      File "<ipython-input-55-bf9e5d409f4e>", line 2, in <module>
        f2.write("\n".join(str(x) for x in testt))
      File "<ipython-input-55-bf9e5d409f4e>", line 2, in <genexpr>
        f2.write("\n".join(str(x) for x in testt))
      File "c:\users\owner\anaconda\lib\encodings\cp1252.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\u0142' in position 918: character maps to <undefined>
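Solutions attempted:
After some research, I realized that u'\u0142' is a lowercase Latin l in Unicode (ł, the l with stroke). This seems weird, because I've gone into the source code of cp1252.py and looked at decoding_table, and a lowercase Latin l is in there, but as a different character.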
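A quick check makes the mismatch concrete. This is only a sketch, written to run on either Python 2.7 or Python 3; the helper name `can_encode` is made up for illustration and is not part of any library. It reports which of a few encodings can represent the character from the traceback.

```python
import unicodedata

L_STROKE = u"\u0142"  # the character named in the traceback


def can_encode(text, encoding):
    """Hypothetical helper: True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Official Unicode name of the character.
print(unicodedata.name(L_STROKE))

# cp1252 is the Western European code page from the traceback;
# cp1250 (Central European) and UTF-8 are shown for comparison.
for enc in ("cp1252", "cp1250", "utf-8"):
    print("%s: %s" % (enc, can_encode(L_STROKE, enc)))
```

The plain ASCII "l" is in cp1252, but "ł" is a separate code point that cp1252 has no slot for, which is why adding it to the decoding table alone does not make it encodable.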
So, naively, I tried adding u'\u0142' to the decoding table, but that didn't solve the problem. While researching, I also saw that there is a way to 'ignore' or 'replace' characters when these errors arise. So I tried changing the source code again, from:

    """ Python Character Mapping Codec cp1252 generated from 'MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT' with gencodec.py.

    """#"

    import codecs

    ### Codec APIs

    class Codec(codecs.Codec):

        def encode(self,input,errors='strict'):
            return codecs.charmap_encode(input,errors,encoding_table)

        def decode(self,input,errors='strict'):
            return codecs.charmap_decode(input,errors,decoding_table)

to:

    ...
        def encode(self,input,errors='replace'):

or:

    ...
        def encode(self,input,errors='ignore'):

However, neither worked. Is there anything else I can do?
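For reference, the 'replace' and 'ignore' handlers can be requested at the call site, without touching the standard library. The sketch below uses an invented sample string and runs on Python 2.7 or Python 3; it shows what each handler does to the unencodable character, next to a lossless UTF-8 encode.

```python
# Invented sample text containing the problem character.
sample = u"pierogi z \u0142ososiem"

# 'replace' substitutes '?' for anything cp1252 cannot represent.
replaced = sample.encode("cp1252", "replace")

# 'ignore' silently drops the unencodable character.
ignored = sample.encode("cp1252", "ignore")

# UTF-8 can represent every Unicode character, so nothing is lost.
utf8 = sample.encode("utf-8")

print(repr(replaced))
print(repr(ignored))
print(utf8.decode("utf-8") == sample)
```

Both handlers lose information, so they are mainly useful when the output really must be cp1252; otherwise writing UTF-8 keeps the reviews intact.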
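When creating the file with the open function, try passing the encoding as well:

    f2 = open("yelpreviewsparagragh.txt", "w", encoding="utf-8")

Note that the built-in open() only accepts an encoding argument in Python 3. In Python 2.7, use io.open("yelpreviewsparagragh.txt", "w", encoding="utf-8") instead. Also write the unicode strings directly, for example u"\n".join(testt) (assuming the elements of testt are unicode strings), rather than calling str(x) on each one: the str() call inside the generator expression is what triggers the implicit encode shown in the traceback.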
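As a minimal sketch of that approach, assuming the reviews are already unicode strings: the sample data and the file name `reviews_demo.txt` are invented for illustration, and `io.open` is used so the same code runs on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Invented stand-in for the joined, stemmed reviews (unicode strings).
reviews = [u"pierogi z \u0142ososiem", u"plain ascii review"]

# Write to a temporary directory so the example leaves nothing behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "reviews_demo.txt")

# io.open accepts `encoding` on both Python 2.7 and Python 3.
# Join unicode strings directly; no str() conversion is needed.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\n".join(reviews))

# Read the file back to confirm the text survived the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read().split(u"\n")

print(round_trip == reviews)
```

Using a `with` block also closes the file even if the write raises, which the explicit f2.close() call does not guarantee.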