I'd like to scrape a page served over HTTPS using the rvest package. However, the site has security certificate issues. In such cases you need to turn off SSL verification, but I do not know how to do this in rvest. In RCurl and httr it is very simple; I give some examples below.
This is the page I want to scrape:
sucupira = "https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"
This is what I'm trying to do:
library(rvest)
read_html(sucupira) # DOES NOT WORK
## Error in open.connection(x, "rb") :
## Peer certificate cannot be authenticated with given CA certificates
Obviously, just removing the "s" from https does not work:
sucupira2 = "http://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"
read_html(sucupira2) # STILL DOES NOT WORK
In RCurl, a successful attempt would look like this:
library(RCurl)
getURL(sucupira) # DOES NOT WORK
options(RCurlOptions =
          list(capath = system.file("CurlSSL",
                                    "cacert.pem",
                                    package = "RCurl"),
               ssl.verifypeer = FALSE))
getURL(sucupira) # NOW IT WORKS
In httr, it would look like this:
library(httr)
GET(sucupira) # DOES NOT WORK
set_config(config(ssl_verifypeer = 0L))
GET(sucupira) # NOW IT WORKS
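As a side note, here is a minimal sketch of a scoped variant of the httr approach, assuming one would rather not change the global configuration with set_config(). httr's with_config() applies the setting only to the wrapped request. The helper name get_insecure is hypothetical, and turning verification off is only reasonable for a site you already trust.

```r
library(httr)

# Config object that disables peer certificate verification.
insecure <- config(ssl_verifypeer = 0L)

# Hypothetical helper: perform a single GET with verification off,
# leaving the global httr configuration untouched.
get_insecure <- function(url) {
  with_config(insecure, GET(url))
}

# Usage (needs network access):
# resp <- get_insecure(sucupira)
```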
My goal is to learn rvest, so if possible I would rather not use strategies like:
read_html(GET(sucupira)) # the response from httr's GET is
                         # passed to rvest's read_html