How to do web scraping of an https page using rvest?

I'd like to scrape a page served over https using the rvest package. However, the site has security certificate issues. In such cases you need to turn off SSL verification, but I don't know how to do that in rvest. In RCurl and httr it is very simple. I give some examples below.

This is the page I want to scrape:

sucupira = "https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"

This is what I'm trying to do:

library(rvest)
read_html(sucupira) # DOES NOT WORK
 ##  Error in open.connection(x, "rb") : 
 ##  Peer certificate cannot be authenticated with given CA certificates

Obviously, just removing the "s" from https does not work:

sucupira2 = "http://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"

read_html(sucupira2) # STILL DOES NOT WORK

In RCurl, a successful attempt would look like this:

library(RCurl)
getURL(sucupira) # DOES NOT WORK

options(RCurlOptions = 
      list(capath = system.file("CurlSSL", 
                                "cacert.pem", 
                                package = "RCurl"), 
           ssl.verifypeer = FALSE))

getURL(sucupira) # NOW IT WORKS

In httr, it would look like this:

library(httr)
GET(sucupira) # DOES NOT WORK

set_config(config(ssl_verifypeer = 0L))
GET(sucupira) # NOW IT WORKS

My goal is to learn to use rvest, so, if possible, I would rather not use strategies like:

read_html(GET(sucupira)) # the response from httr's GET is
                         # passed to rvest's read_html
    
asked by anonymous 16.12.2015 / 17:00

1 answer

This does not seem to be possible using the rvest package.

Reading the source code, we see that the read_html function is a wrapper around the read_xml function. The source code is available at this link.

The read_xml function dispatches to a method based on the input type, which can be character, raw, or connection.

When we pass a URL to read_xml, it converts it to a connection and then reads it as raw.
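To make the three input types concrete, here is a small offline sketch. It assumes rvest is installed (read_html itself comes from xml2 and is available after loading rvest); the HTML string and the temporary file are made up for illustration.

```r
library(rvest)  # read_html() comes from xml2 and is available via rvest

# Made-up document used only for this illustration
html <- "<html><body><p>hello</p></body></html>"

# 1. character: a literal HTML string (a URL or a file path is also character)
from_chr <- read_html(html)

# 2. raw: the same document as a vector of bytes
from_raw <- read_html(charToRaw(html))

# 3. connection: an unopened connection, which the connection method opens
tmp <- tempfile(fileext = ".html")
writeLines(html, tmp)
from_con <- read_html(file(tmp))

# All three routes should end up with the same parsed paragraph
texts <- vapply(
  list(from_chr, from_raw, from_con),
  function(doc) html_text(html_node(doc, "p")),
  character(1)
)
print(texts)
```

The connection case is the one that matters here, because that is the path a URL takes.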

Below is the connection method of the read_xml function:

read_xml.connection <- function(x, encoding = "", n = 64 * 1024,
                                verbose = FALSE, ..., base_url = "",
                                as_html = FALSE) {
  if (!isOpen(x)) {
    open(x, "rb")
    on.exit(close(x))
  }

  raw <- read_connection_(x, n)
  read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html)
}

Note that it uses the open function from base.

The help page for open says:

  

Note that the https:// URL scheme is not supported by the internal method except on Windows. It is only supported if --internet2 or setInternet2(TRUE) was used (to make use of Windows internal functions), and then only if the certificate is considered to be valid. With that option only, the http://user:pass@site notation for sites requiring authentication is also accepted.

That is, with the internal method https is only supported on Windows, and only if setInternet2(TRUE) is called first. Even then, it would only work if the certificate were valid.

All this to explain that there is no native way, or a simple argument change in rvest, that lets you read https pages like this one.

I believe the best method really is read_html(GET(sucupira)), which you yourself suggested. Or, more elegantly:

GET(sucupira) %>% read_html()
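A slightly fuller sketch of the same idea, assuming httr and rvest are installed. The helper names fetch_html and extract_text are hypothetical (they are not part of either package), and the CSS selector in the usage comment is only a placeholder. Passing config(ssl_verifypeer = 0L) directly to GET limits the relaxed check to that one request, instead of changing it for the whole session as set_config does.

```r
library(httr)
library(rvest)

# Hypothetical helper: download a page with httr, then parse it with rvest.
# verify_ssl = FALSE turns off certificate checking for this request only.
fetch_html <- function(url, verify_ssl = TRUE) {
  resp <- GET(url, config(ssl_verifypeer = as.integer(verify_ssl)))
  stop_for_status(resp)                  # stop on HTTP errors (4xx, 5xx)
  read_html(content(resp, as = "raw"))   # hand the raw bytes to read_html
}

# Hypothetical helper: text of every node matching a CSS selector.
extract_text <- function(page, css) {
  html_text(html_nodes(page, css))
}

# Usage with the URL from the question (needs network access):
# page <- fetch_html(sucupira, verify_ssl = FALSE)
# extract_text(page, "title")
```

From there on, everything is plain rvest: html_nodes, html_text, html_table and so on work on the parsed page as usual.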

If, in the read_xml.connection method, you changed the line open(x, "rb") to url(x, "rb", method = "libcurl"), it is likely to work...
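One way to try the "different connection" idea without editing the method is to build the connection yourself and pass it to read_html. The sketch below assumes the curl package is installed, which is not mentioned above; read_html_via_curl is a hypothetical helper name, and I have not checked it against the site in question.

```r
library(curl)
library(rvest)

# Hypothetical helper: read a page through a libcurl-backed connection.
# verify_ssl = FALSE sets libcurl's ssl_verifypeer option off on the handle.
read_html_via_curl <- function(url, verify_ssl = TRUE) {
  h <- new_handle(ssl_verifypeer = verify_ssl)
  con <- curl(url, handle = h)  # unopened connection
  read_html(con)                # the connection method opens and closes it
}

# Usage with the URL from the question (needs network access):
# page <- read_html_via_curl(sucupira, verify_ssl = FALSE)
```

Because the object passed in is a connection, read_html goes through the same connection method shown earlier, only with libcurl doing the transfer.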

    
16.12.2015 / 20:37