Is there an efficient way to detect the encoding of text downloaded from the internet? I scraped a site (see code below) and cannot find the correct encoding.
The META tag in the page source specifies "iso-8859-1" (latin1), but when I specify this encoding, it still does not work:
library(XML); library(httr)
url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"
site_gt = content(GET(url))
resumos_gt = xpathSApply(site_gt,'//div[@style="display:none;"]', xmlValue)
resumos_gt[1]
In this result, I get something like: "Estudos Legislativo no Brasil têm concentrado suas pesquisas nos âmbito federal e estadual". How can "têm" become "têm" and "âmbito" become "âmbito"?
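The garbled output looks like UTF-8 bytes that were read as latin1: "ê" is stored in UTF-8 as the two bytes C3 AA, which latin1 displays as "Ã" and "ª". The following is a minimal sketch of that round trip on a hand-built string, not on the scraped data. It assumes the page really is UTF-8 despite its META tag, and the variable names are made up for illustration.

```r
# UTF-8 bytes of "têm": t = 74, ê = C3 AA, m = 6D
raw_bytes <- as.raw(c(0x74, 0xc3, 0xaa, 0x6d))

# Misreading: treat every byte as one latin1 character.
# The result is the garbled form seen in the question ("têm").
garbled <- iconv(rawToChar(raw_bytes), from = "latin1", to = "UTF-8")

# Repair: map the characters back to single latin1 bytes,
# then declare that those bytes are in fact UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

nchar(garbled)   # one character more than the intended word
nchar(repaired)  # length of the intended word
```

The second step matters: `iconv(..., to = "latin1")` on its own returns the right bytes but marks them as latin1, so they can still print as garbage until `Encoding()` is set.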
I tried everything that came to mind, and nothing worked:
iconv(resumos_gt[1], from="UTF-8", to = "latin1")
iconv(resumos_gt[1], from="UTF-8", to = "latin2")
iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15")
iconv(resumos_gt[1], from="UTF-8", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="UTF-8", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="latin1", to = "UTF-8")
iconv(resumos_gt[1], from="latin2", to = "UTF-8")
iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8")
iconv(resumos_gt[1], from="latin1", to = "UTF-8//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "UTF-8//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8//TRANSLIT")
####
iconv(resumos_gt[1], from="latin1", to = "ASCII")
iconv(resumos_gt[1], from="latin2", to = "ASCII")
iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")
iconv(resumos_gt[1], from="latin1", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "latin1")
iconv(resumos_gt[1], from="ASCII", to = "latin2")
iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15")
iconv(resumos_gt[1], from="ASCII", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15//TRANSLIT")
####
iconv(resumos_gt[1], from="UTF-8", to = "ASCII")
iconv(resumos_gt[1], from="UTF-8", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "UTF-8")
iconv(resumos_gt[1], from="ASCII", to = "UTF-8//TRANSLIT")
####
iconv(resumos_gt[1], from="latin1", to = "latin2")
iconv(resumos_gt[1], from="latin1", to = "iso-8859-15")
iconv(resumos_gt[1], from="latin1", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="latin1", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "latin1")
iconv(resumos_gt[1], from="latin2", to = "iso-8859-15")
iconv(resumos_gt[1], from="latin2", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin1")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin2")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin2//TRANSLIT")
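An alternative to converting the already-parsed string is to decide on the encoding from the raw bytes before parsing. The sketch below is only an illustration: `decode_bytes` is a hypothetical helper, the latin1 fallback is an assumption, and the check relies on `iconv` returning `NA` for input that is not valid UTF-8, which may vary between platforms.

```r
# Hypothetical helper: decode raw bytes as UTF-8 when they form valid UTF-8,
# otherwise assume latin1 (an assumption, reasonable for Portuguese pages).
decode_bytes <- function(bytes) {
  txt <- rawToChar(bytes)
  as_utf8 <- iconv(txt, from = "UTF-8", to = "UTF-8")
  if (!is.na(as_utf8)) {
    return(as_utf8)                          # bytes were valid UTF-8
  }
  iconv(txt, from = "latin1", to = "UTF-8")  # fallback: reinterpret as latin1
}

# The same word encoded two ways
utf8_bytes   <- as.raw(c(0x74, 0xc3, 0xaa, 0x6d))  # "têm" as UTF-8
latin1_bytes <- as.raw(c(0x74, 0xea, 0x6d))        # "têm" as latin1

from_utf8   <- decode_bytes(utf8_bytes)
from_latin1 <- decode_bytes(latin1_bytes)
identical(from_utf8, from_latin1)

# With the packages from the question, the bytes could come from
#   content(GET(url), as = "raw")
# and the decoded text could then be parsed with
#   htmlParse(txt, asText = TRUE, encoding = "UTF-8")
```

Checking the bytes avoids trusting the META tag, which in this case appears to disagree with the actual content.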
I'm using R 3.2.5 on Windows 7 (and yes, I have to keep that operating system). Apparently this problem does not occur on Linux, or is easier to resolve there.