html - UTF-8 encoding problems with R


I'm trying to parse statements from the Mexican Senate, but I'm having trouble with the UTF-8 encoding of the web pages.

This HTML comes through cleanly:

    library(rvest)
    senate <- html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")

Here is an example of a bit of the web page:

"continÚa el senador corral jurado: nosotros decimos. entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora.   una decisión de pre dictamen lo mejor lo único que va hacer es complicar más las cosas." 

As can be seen, both the accents and the "ñ" come through fine.

The issue arises with other HTML pages (on the same domain!). For example:

 senate2<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html") 

I get:

 "-el c. diputado adame alemÃÂn: en consecuencia está discusión la propuesta. y para hablar sobre este asunto, se le concede el uso de la palabra la senadora…….." 

On the second page I've tried iconv() and coercing the encoding parameter of html() with encoding="UTF-8", but I keep getting the same results.

I've checked the web page's encoding using the W3C validator, and it seems to be UTF-8 with no issues.

Using gsub() does not seem efficient, because the encoding turns different characters into the same "code":

    í - à
    á - à
    ó - à
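
One plausible reading of this symptom (an assumption, not something confirmed in the question) is that the page's UTF-8 bytes are being interpreted as latin1. In UTF-8, í, á and ó are each two bytes beginning with 0xC3, and 0xC3 on its own is "Ã" in latin1, which would explain why different letters collapse to the same visible "code". A minimal base-R sketch of that idea, with made-up variable names:

```r
# Accented letters written with \u escapes so the script itself stays ASCII.
letters_es <- c("\u00ed", "\u00e1", "\u00f3")  # i-acute, a-acute, o-acute

# UTF-8 bytes of each letter, as a list of raw vectors.
utf8_bytes <- lapply(enc2utf8(letters_es), charToRaw)

# First byte of every letter; all three are 0xc3 (195 in decimal).
lead_bytes <- vapply(utf8_bytes, function(b) as.integer(b[1]), integer(1))
print(as.hexmode(lead_bytes))

# The character that the single byte 0xc3 stands for when read as latin1.
lead_as_latin1 <- iconv(rawToChar(as.raw(0xc3)), from = "latin1", to = "UTF-8")
```

Because the distinguishing information lives in the second byte, a gsub() table keyed on the first character alone cannot tell the letters apart.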

I'm pretty much fresh out of ideas.

    > sessionInfo()
    R version 3.1.2 (2014-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

    attached base packages:
    [1] grDevices utils     datasets  graphics  stats     grid      methods   base

    other attached packages:
     [1] stringi_0.4-1    magrittr_1.5     selectr_0.2-3    rvest_0.2.0      ggplot2_1.0.0    geosphere_1.3-11 fields_7.1
     [8] maps_2.3-9       spam_1.0-1       sp_1.0-17        SOAR_0.99-11     data.table_1.9.4 reshape2_1.4.1   xlsx_0.5.7
    [15] xlsxjars_0.6.1   rJava_0.9-6

    loaded via namespace (and not attached):
     [1] bitops_1.0-6     chron_2.3-45     colorspace_1.2-4 digest_0.6.8     evaluate_0.5.5   formatR_1.0      gtable_0.1.2
     [8] httr_0.6.1       knitr_1.8        lattice_0.20-29  MASS_7.3-35      munsell_0.4.2    plotly_0.5.17    plyr_1.8.1
    [15] proto_0.3-10     Rcpp_0.11.3      RCurl_1.95-4.5   RJSONIO_1.3-0    scales_0.2.4     stringr_0.6.2    tools_3.1.2
    [22] XML_3.98-1.1

Update: this seems to be the issue:

    stri_enc_mark(senate2)
    [1] "ASCII"  "latin1" "latin1" "ASCII"  "ASCII"  "latin1" "ASCII"  "ASCII"  "latin1"

... and so forth. Clearly, the issue is in the latin1 strings:

    stri_enc_isutf8(texto2)
    [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
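
For readers without stringi, roughly the same two checks can be sketched in base R: Encoding() reports the declared mark, while validUTF8() tests the bytes themselves (it needs R 3.3.0 or later, so it is newer than the session shown above). The sample strings are made up for illustration and are not taken from the Senate page.

```r
# Two made-up strings with the same visible text but different bytes.
utf8_bytes   <- as.raw(c(0x41, 0x6c, 0x65, 0x6d, 0xc3, 0xa1, 0x6e))  # "Aleman" with a-acute, UTF-8
latin1_bytes <- as.raw(c(0x41, 0x6c, 0x65, 0x6d, 0xe1, 0x6e))        # same word, latin1

a <- rawToChar(utf8_bytes)
Encoding(a) <- "UTF-8"
b <- rawToChar(latin1_bytes)
Encoding(b) <- "latin1"
x <- c("ascii only", a, b)

declared <- Encoding(x)   # "unknown" for pure ASCII, otherwise the declared mark
bytes_ok <- validUTF8(x)  # TRUE where the bytes form valid UTF-8
```

A string that is marked latin1 and also fails the byte check really does hold latin1 bytes; one that is marked latin1 but passes the byte check is only mislabelled.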

How can I coerce the latin1 strings into correct UTF-8 strings? When they are "translated" by stringi, the conversion appears to go wrong, giving me the issues described earlier.

Encodings are one of the 21st century's worst headaches. Here's a solution for you:

    # Set up the remote reading connection, specifying UTF-8 as the encoding.
    addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html"
    read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt")

    # Read in cycles of 1000 characters. At end of input readChar() returns
    # nothing, so html.text stops growing and the loop ends.
    html.text <- c()
    i <- 0
    while (length(html.text) == i) {
        html.text <- append(html.text, readChar(con = read.html.con, nchars = 1000))
        cat(i <- i + 1)
    }

    # Close the reading connection.
    close(read.html.con)

    # Paste everything back together and, at the same time, convert from UTF-8
    # to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly
    # meant to drive us insane.
    content <- paste0(iconv(html.text, from = "UTF-8", to = "UTF-8"), collapse = "")

    # Set up local writing.
    outpath <- "~/htmlfile.html"

    # Create the file connection specifying "UTF-8" as the encoding, once more
    # (although this one makes sense).
    write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8")

    # Use capture.output to dump everything into the HTML file.
    # Using cat inside it prevents [1]'s, quotes and other such parasites.
    capture.output(cat(content), file = write.html.con)

    # Close the output connection.
    close(write.html.con)

Then you're ready to open the newly created file in your favorite browser. You should see it intact, and have it ready to be reopened with the tools of your choosing!
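
If the text is already in memory, a lighter-weight repair may be possible. The sketch below is an alternative under stated assumptions, not the method above: it covers (1) strings whose bytes are valid UTF-8 but which are marked latin1, and (2) strings that were converted from latin1 to UTF-8 once too often. The sample word and the helpers remark_utf8() and undo_double() are hypothetical names introduced here.

```r
target <- "Alem\u00e1n"                    # the text we want to end up with
utf8_bytes <- charToRaw(enc2utf8(target))  # 41 6c 65 6d c3 a1 6e

# Case 1: the bytes are right, only the declared encoding is wrong.
mislabelled <- rawToChar(utf8_bytes)
Encoding(mislabelled) <- "latin1"

# Re-declare the encoding; no bytes are changed.
remark_utf8 <- function(s) {
  Encoding(s) <- "UTF-8"
  s
}
fixed1 <- remark_utf8(mislabelled)

# Case 2: the UTF-8 bytes were treated as latin1 and converted again,
# so the a-acute is now stored as two separate characters.
double_encoded <- iconv(mislabelled, from = "latin1", to = "UTF-8")

# Undo one layer of conversion, then declare the result as UTF-8.
undo_double <- function(s) {
  remark_utf8(iconv(s, from = "UTF-8", to = "latin1"))
}
fixed2 <- undo_double(double_encoded)
```

Case 2 only round-trips when every garbled character exists in latin1; iconv() returns NA for elements it cannot convert, so it is worth checking for NA before overwriting the original text.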

