How to handle errors during web scraping?


Hello everyone. During the web scraping process, I began to encounter errors while making the requests. So far I have identified the 4 most common types of errors:

    Error in curl::curl_fetch_memory(url, handle = handle) : 
      Timeout was reached: Recv failure: Connection was reset

    Error: Can only save length 1 node sets

    Error in curl::curl_fetch_memory(url, handle = handle) : 
      Could not resolve host: www.tcm.ba.gov.br

    Error in curl::curl_fetch_memory(url, handle = handle) : 
    Timeout was reached: Operation timed out after 20000 milliseconds with 0 bytes received

The last one is intentionally caused by the timeout(20) function, which checks whether a request is taking more than 20 seconds to complete, which may be an indication of an error during the request.

The question is: how can I develop a script/function in R (using something like tryCatch) to perform the following routine:

-> If an error occurs during the scraping process (such as one of the 4 above), repeat the same request after 60 seconds, with no more than 3 attempts. After 3 attempts, skip to the next "i" in the loop. If that also returns an error, print("Consecutive errors during Web Scraping").

(Plus): -> It would be GREAT if, instead of the print suggested in the last step above, the script sent a warning via email or Telegram to the "Web Scraping Administrator" (in this case, me), informing them that the script had to be interrupted.

NOTE: Consider that the request function (httr::GET) is inside a for loop, as in the simplified example below:

 for (i in link) { httr::GET(i, timeout(20)) }

NOTE: As I am not in the IT field, I am having a hard time understanding how error handling works in R, and therefore how to use the tryCatch function.

Thanks in advance for your help.

    
asked by anonymous 17.12.2017 / 17:34

1 answer


To solve this problem, and other similar ones, you need to handle the errors in the script. There are basically two ways to do this:

1. tryCatch

2. purrr package

USING tryCatch

This is an example from my own practice:

library(rvest)

for (i in 1:nrow(pg_amostra)) {

  # try to open a session on the page; on a warning, wait a little and try again;
  # on an error, return NULL so the loop can carry on
  pagina <- tryCatch({
    html_session(pg_amostra$http[i])
  }, warning = function(w) {
    print('Warning!')
    Sys.sleep(0.3)
    html_session(pg_amostra$http[i])
  }, error = function(e) {
    print('Error')
    NULL
  })

  # only extract the text when the session was actually created
  if (!is.null(pagina)) {
    Sys.sleep(0.3)
    texto <- html_nodes(pagina, css = '.corpoTextoLongo') %>%
      html_text()
  } else {
    texto <- ''
  }

  print(i)

  # store the text of page i in an object called texto1, texto2, ...
  assign(paste("texto", i, sep = ""), texto)
}

Here the pg_amostra table holds information about several pages, and the URLs of those pages are in the http column.
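
Applied to the loop in the question, the same tryCatch idea can implement the retry routine that was asked for (wait 60 seconds, at most 3 attempts, then skip to the next "i"). The code below is only a minimal sketch, assuming the link vector and the httr::GET(i, timeout(20)) call from the question; adjust it to your own case:

library(httr)

for (i in link) {
  resposta <- NULL

  # try the same request up to 3 times, waiting 60 seconds between failed attempts
  for (tentativa in 1:3) {
    resposta <- tryCatch(
      GET(i, timeout(20)),
      error = function(e) NULL   # any request error becomes NULL
    )
    if (!is.null(resposta)) break        # success: stop retrying
    if (tentativa < 3) Sys.sleep(60)     # wait before the next attempt
  }

  # after 3 failed attempts, warn and move on to the next link
  if (is.null(resposta)) {
    print("Consecutive errors during Web Scraping")
    next
  }

  # ... process 'resposta' here ...
}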

SOLUTION WITH purrr

For error handling I tend to always use purrr, because it is simply much easier. In this example, also from my own practice, I create a function that extracts a piece of text from a page.

pega_texto <- function(url) {
  # read the page and extract the text of the nodes of interest
  texto <- read_html(url, encoding = 'iso-8859-1') %>%
    html_nodes(css = '.corpoTextoLongo') %>%
    html_text()
  pb$tick()   # advance a progress bar object ('pb') created outside the function
  return(texto)
}
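
Note that pb$tick() assumes a progress bar object called pb already exists in the session; it is not created inside the function. One way to set it up (an assumption on my part, using the progress package) would be:

library(progress)

# progress bar with one tick per URL that will be read
pb <- progress_bar$new(total = nrow(pg_amostra))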

HOWEVER, if read_html() fails, the entire script may stop. So I'm going to "wrap" the function with purrr's possibly() in order to handle possible errors by simply returning NA.

library(purrr)

pega_texto <- possibly(.f = pega_texto, otherwise = NA_real_, quiet = TRUE)

From now on, if I use this function and some error happens, it will always return NA. Storing the results in a table, I will then have either a text or an NA for each page, and I just need to redo the requests ONLY for the URLs that returned NA.
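
As a minimal illustration of that last step (assuming the pg_amostra table from the first example, with the URLs in the http column), the wrapped function can be applied with purrr and the failures identified like this:

library(purrr)

# apply the wrapped function to every URL; failed pages return NA instead of stopping the run
textos <- map(pg_amostra$http, pega_texto)

# mark which URLs came back as NA and therefore need to be requested again
precisa_refazer <- map_lgl(textos, ~ length(.x) == 1 && is.na(.x))
urls_com_erro <- pg_amostra$http[precisa_refazer]

Only urls_com_erro then needs to be passed through the function again.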

    
01.05.2018 / 11:56