How to recognize and change the encoding of Latin characters in R?

Is there an efficient way to detect the encoding of text downloaded from the internet? I scraped a site (see the code below) and cannot find the correct encoding.

In the page source, the META tag specifies "iso-8859-1" (latin1), but when I use that setting it still does not work.

    library(XML); library(httr)
    url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"
    site_gt = content(GET(url))
    resumos_gt = xpathSApply(site_gt,'//div[@style="display:none;"]', xmlValue)
    resumos_gt[1]

In this result, I get something like: "Estudos Legislativo no Brasil tÃªm concentrado suas pesquisas nos Ã¢mbito federal e estadual". How can I turn tÃªm into têm and Ã¢mbito into âmbito?

I tried everything that came to mind, and nothing worked:

    iconv(resumos_gt[1], from="UTF-8", to = "latin1")
    iconv(resumos_gt[1], from="UTF-8", to = "latin2")
    iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15")
    iconv(resumos_gt[1], from="UTF-8", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="UTF-8", to = "latin2//TRANSLIT")
    iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15//TRANSLIT")


    iconv(resumos_gt[1], from="latin1", to = "UTF-8")
    iconv(resumos_gt[1], from="latin2", to = "UTF-8")
    iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8")
    iconv(resumos_gt[1], from="latin1", to = "UTF-8//TRANSLIT")
    iconv(resumos_gt[1], from="latin2", to = "UTF-8//TRANSLIT")
    iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8//TRANSLIT")

    ####

    iconv(resumos_gt[1], from="latin1", to = "ASCII")
    iconv(resumos_gt[1], from="latin2", to = "ASCII")
    iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")
    iconv(resumos_gt[1], from="latin1", to = "ASCII//TRANSLIT")
    iconv(resumos_gt[1], from="latin2", to = "ASCII//TRANSLIT")
    iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")

    iconv(resumos_gt[1], from="ASCII", to = "latin1")
    iconv(resumos_gt[1], from="ASCII", to = "latin2")
    iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15")
    iconv(resumos_gt[1], from="ASCII", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="ASCII", to = "latin2//TRANSLIT")
    iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15//TRANSLIT")

    ####

    iconv(resumos_gt[1], from="UTF-8", to = "ASCII")
    iconv(resumos_gt[1], from="UTF-8", to = "ASCII//TRANSLIT")

    iconv(resumos_gt[1], from="ASCII", to = "UTF-8")
    iconv(resumos_gt[1], from="ASCII", to = "UTF-8//TRANSLIT")


    ####

    iconv(resumos_gt[1], from="latin1", to = "latin2")
    iconv(resumos_gt[1], from="latin1", to = "iso-8859-15")
    iconv(resumos_gt[1], from="latin1", to = "latin2//TRANSLIT")
    iconv(resumos_gt[1], from="latin1", to = "iso-8859-15//TRANSLIT")

    iconv(resumos_gt[1], from="latin2", to = "latin1")
    iconv(resumos_gt[1], from="latin2", to = "iso-8859-15")
    iconv(resumos_gt[1], from="latin2", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="latin2", to = "iso-8859-15//TRANSLIT")

    iconv(resumos_gt[1], from="iso-8859-15", to = "latin1")
    iconv(resumos_gt[1], from="iso-8859-15", to = "latin2")
    iconv(resumos_gt[1], from="iso-8859-15", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="iso-8859-15", to = "latin2//TRANSLIT")

I'm using R 3.2.5 on Windows 7 (and yes, I have to keep that operating system). Apparently this problem does not occur on Linux, or is easier to solve there.

    
asked by anonymous 01.05.2016 / 19:34

1 answer

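    library(XML); library(httr)

    url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"

    # Download the page and keep the body as plain text instead of a parsed document
    site_gt = GET(url)
    site_gt = content(site_gt, as = "text")

    # Parse the HTML, stating the encoding explicitly
    site_gt <- htmlParse(site_gt, encoding = "UTF-8")

    # Extract the text of the hidden div elements that hold the abstracts
    resumos_gt = xpathSApply(site_gt, '//div[@style="display:none;"]', xmlValue)
    resumos_gt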
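Before hard-coding encoding = "UTF-8", it may help to check whether the downloaded bytes are at least valid UTF-8. The sketch below is only a heuristic in base R: looks_like_utf8 is a hypothetical helper, not part of XML or httr, and the two byte vectors are made-up stand-ins for what content(GET(url), as = "raw") would return. It relies on iconv() returning NA for input that is invalid in the declared source encoding, which may vary slightly between platforms.

```r
# Hypothetical helper (an assumption, not from any package):
# TRUE when the bytes form a valid UTF-8 sequence.
looks_like_utf8 <- function(bytes) {
  txt <- rawToChar(bytes)
  # iconv() yields NA when the input cannot be read as the `from` encoding
  !is.na(iconv(txt, from = "UTF-8", to = "UTF-8"))
}

# "âmbito" encoded as UTF-8 (the escape keeps this script encoding-independent)
utf8_bytes <- charToRaw(enc2utf8("\u00e2mbito"))

# The same word encoded as latin1: a single byte 0xe2 for the accented letter
latin1_bytes <- as.raw(c(0xe2, 0x6d, 0x62, 0x69, 0x74, 0x6f))

looks_like_utf8(utf8_bytes)
looks_like_utf8(latin1_bytes)
```

The first call should report valid UTF-8 and the second should not. Pure ASCII text passes the check too, so a positive result only means that UTF-8 is a plausible guess, not that it is certain.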

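The solution was to first read the contents of the page as text, and only then call htmlParse with encoding = "UTF-8". Apparently the page is actually encoded as UTF-8 even though its META tag declares iso-8859-1, so the parser has to be told the encoding explicitly.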
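To see why the accents break, the effect can be reproduced offline in base R. This is a minimal sketch under the assumption that the garbling comes from UTF-8 bytes being read as latin1; the variable names are made up for illustration. It also shows one way to repair a string that has already been garbled: convert it back to latin1 bytes, then declare those bytes as UTF-8.

```r
# "têm", written with an escape so the script itself is encoding-independent
correct <- "t\u00eam"

# Its UTF-8 bytes: the accented letter takes two bytes (c3 aa)
raw_utf8 <- charToRaw(enc2utf8(correct))

# Misread every byte as one latin1 character, as a parser that
# trusts a wrong charset declaration would do
wrong <- rawToChar(raw_utf8)
Encoding(wrong) <- "latin1"
wrong_utf8 <- enc2utf8(wrong)   # now 4 characters instead of 3

# Repair: go back to the latin1 bytes, then mark them as UTF-8
fixed <- iconv(wrong_utf8, from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"

# Compare bytes rather than printed text, so the check is locale-independent
identical(charToRaw(fixed), charToRaw(correct))
```

Note that iconv(..., from = "UTF-8", to = "latin1") on its own is only the first half of this round trip, since the result is still marked as latin1. Whether that is why the attempts in the question failed on Windows is a guess. Parsing with the right encoding from the start, as in the code above, avoids the repair step entirely.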

    
01.05.2016 / 21:07