html - UTF-8 encoding problems with R
I am trying to parse Senate statements from the Mexican Senate, but I am having trouble with the UTF-8 encoding of the web pages.
This HTML comes through cleanly:
library(rvest)
senate <- html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")
Here is an example of a bit of the web page:
"continÚa el senador corral jurado: nosotros decimos. entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora. una decisión de pre dictamen lo mejor lo único que va hacer es complicar más las cosas."
As can be seen, both the accents and the "ñ" come through fine.
The issue arises with other HTML pages (on the same domain!). For example:
senate2<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html")
I get:
"-el c. diputado adame alemÃÂn: en consecuencia está discusión la propuesta. y para hablar sobre este asunto, se le concede el uso de la palabra la senadora…….."
On the second page I've tried iconv() and forcing the encoding parameter of html() with encoding="UTF-8", but I keep getting the same results.
I've also checked the web page's encoding using the W3 validator; it seems to be UTF-8 and reports no issues.
Using gsub() does not seem efficient, because the download renders different characters with the same "code":
í - àá - àó - ÃÂ
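The pattern above, where an accented letter turns into "Ã" followed by another character, is what one would expect if UTF-8 bytes were decoded as latin1 somewhere along the way. That diagnosis is an assumption, not something confirmed for this page. A minimal base R sketch of the effect, using a made-up one-character string rather than the Senate data:

```r
# "á" (U+00E1) is stored in UTF-8 as the two bytes C3 A1.
x <- "\u00e1"
bytes <- charToRaw(enc2utf8(x))

# Pretend those same two bytes are latin1 text: each byte becomes
# its own character (C3 -> "Ã", A1 -> "¡").
wrong <- rawToChar(bytes)
Encoding(wrong) <- "latin1"

# Converting the mislabelled string to UTF-8 bakes the mistake in.
shown <- enc2utf8(wrong)
print(shown)         # "Ã¡"
print(nchar(shown))  # 2 characters where there used to be 1
```

Because every accented letter in this range starts with the byte C3, they all begin with the same "Ã" once misread, which would explain why a character-by-character gsub() gets awkward.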
I'm pretty much fresh out of ideas.
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] grDevices utils datasets graphics stats grid methods base

other attached packages:
[1] stringi_0.4-1 magrittr_1.5 selectr_0.2-3 rvest_0.2.0 ggplot2_1.0.0 geosphere_1.3-11 fields_7.1
[8] maps_2.3-9 spam_1.0-1 sp_1.0-17 SOAR_0.99-11 data.table_1.9.4 reshape2_1.4.1 xlsx_0.5.7
[15] xlsxjars_0.6.1 rJava_0.9-6

loaded via a namespace (and not attached):
[1] bitops_1.0-6 chron_2.3-45 colorspace_1.2-4 digest_0.6.8 evaluate_0.5.5 formatR_1.0 gtable_0.1.2
[8] httr_0.6.1 knitr_1.8 lattice_0.20-29 MASS_7.3-35 munsell_0.4.2 plotly_0.5.17 plyr_1.8.1
[15] proto_0.3-10 Rcpp_0.11.3 RCurl_1.95-4.5 RJSONIO_1.3-0 scales_0.2.4 stringr_0.6.2 tools_3.1.2
[22] XML_3.98-1.1
Update: this seems to be the issue:
stri_enc_mark(senate2)
[1] "ASCII" "latin1" "latin1" "ASCII" "ASCII" "latin1" "ASCII" "ASCII" "latin1"
... and so forth. Clearly, the issue is in the latin1 strings:
stri_enc_isutf8(texto2)
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
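The two checks above answer different questions: one reports the encoding each string is declared to have, the other tests whether its bytes form valid UTF-8. A small base R sketch of that distinction, using an invented three-element vector and the base functions Encoding() and validUTF8() in place of the stringi calls (validUTF8() needs R 3.3.0 or later, which is newer than the session shown above):

```r
# Three illustrative strings: plain ASCII, a UTF-8 "á", and a single
# latin1 byte E1 (also "á", but in a different encoding).
x <- c("abc", "\u00e1", "\xe1")
Encoding(x[3]) <- "latin1"

# The declared mark of each string. ASCII strings carry no mark.
marks <- Encoding(x)
print(marks)  # "unknown" "UTF-8" "latin1"

# Whether the raw bytes are valid UTF-8, regardless of the mark.
valid <- validUTF8(x)
print(valid)  # TRUE TRUE FALSE
```

A string marked latin1 whose bytes are not valid UTF-8 is at least consistent with its label; the troublesome case is bytes that are valid UTF-8 but carry a latin1 mark.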
How can I coerce the latin1 strings into correct UTF-8 strings? When they are "translated" by stringi it appears to be doing it wrong, giving me the issues described earlier.
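One possible way to approach that coercion, sketched in base R under a strong assumption: the text was damaged by exactly one round of UTF-8 bytes being read as latin1. The input string below is invented for illustration, and output such as "ÃÂ" may point to more than one round of damage, which this sketch does not handle.

```r
# Hypothetical damaged input: "alemán" whose UTF-8 bytes were read as latin1,
# so "á" shows up as the two characters "Ã¡".
damaged <- "alem\u00c3\u00a1n"

# Step 1: convert the text back to latin1. This turns "Ã" and "¡" back into
# the single bytes C3 and A1, i.e. the original UTF-8 byte sequence.
restored <- iconv(damaged, from = "UTF-8", to = "latin1")

# Step 2: declare that those bytes are UTF-8. No bytes change here,
# only the encoding mark does.
Encoding(restored) <- "UTF-8"

print(restored)         # "alemán"
print(nchar(restored))  # 6
```

On a Windows-1252 locale like the one in the session info, "CP1252" may be a better fit than "latin1" in the iconv() call, since the two differ for bytes 0x80 to 0x9F.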
Encodings are one of the 21st century's worst headaches. Here's a solution for you:
# Set up a remote reading connection, specifying UTF-8 encoding.
addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html"
read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt")

# Read in cycles of 1000 characters. The loop stops once readChar()
# returns nothing, i.e. when html.text no longer grows with i.
html.text <- c()
i <- 0
while (length(html.text) == i) {
  html.text <- append(html.text, readChar(con = read.html.con, nchars = 1000))
  cat(i <- i + 1)
}

# Close the reading connection
close(read.html.con)

# Paste everything back together and, at the same time, convert from UTF-8
# to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly
# meant to drive us insane.
content <- paste0(iconv(html.text, from = "UTF-8", to = "UTF-8"), collapse = "")

# Set up local writing
outpath <- "~/htmlfile.html"

# Create the file connection specifying "UTF-8" encoding, once more
# (although this one makes sense)
write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8")

# Use capture.output to dump everything back into an HTML file.
# Using cat inside it prevents having [1]'s, quotes and such parasites.
capture.output(cat(content), file = write.html.con)

# Close the output connection
close(write.html.con)
Then you're ready to open the newly created file in your favorite browser. You should see it intact, and have it ready to be reopened with the tools of your choosing!
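A related but different technique, sketched here as an alternative rather than as the method above: instead of re-encoding through the connection, read the bytes untouched, mark them as UTF-8, and write them back byte for byte. The example works on a temporary file with invented content so that it runs without network access; the file names and the sample HTML are assumptions.

```r
# Create a small UTF-8 file standing in for the downloaded page.
original_bytes <- charToRaw(enc2utf8("<p>se\u00f1or alem\u00e1n</p>\n"))
tmp_in <- tempfile(fileext = ".html")
writeBin(original_bytes, tmp_in)

# encoding = "UTF-8" here does not convert anything: it only marks the
# strings that come back as UTF-8.
lines <- readLines(tmp_in, encoding = "UTF-8")

# Write the strings back exactly as stored. A binary connection avoids
# line-ending translation, and useBytes = TRUE avoids re-encoding.
tmp_out <- tempfile(fileext = ".html")
out <- file(tmp_out, open = "wb")
writeLines(lines, out, useBytes = TRUE)
close(out)

# Compare the output file with the input, byte for byte.
roundtrip <- readBin(tmp_out, what = "raw", n = file.info(tmp_out)$size)
print(identical(roundtrip, original_bytes))
```

This only helps when the source really is UTF-8; if the bytes were already damaged before they reached R, marking them will not repair them.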