Hello everyone. During a web scraping process, I have started running into errors while making the requests. So far I have identified the 4 most common types of error:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Recv failure: Connection was reset
Error: Can only save length 1 node sets
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: www.tcm.ba.gov.br
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Operation timed out after 20000 milliseconds with 0 bytes received
The last one is intentionally caused by the timeout(20) option, which I use to check whether a request is taking more than 20 seconds to complete, which may be an indication of a problem with the request.
The question is: how can I develop a script/function in R (using something like tryCatch) to perform the following routine:
-> If an error occurs during the scraping process (such as one of the 4 above), repeat the same request after 60 seconds, with no more than 3 attempts. After 3 failed attempts, skip to the next "i" in the loop; if that one also returns an error, print("Consecutive errors during Web Scraping").
(Plus): -> It would be great if, instead of the print suggested in the last step above, the script sent a warning via e-mail or Telegram to the "Web Scraping Administrator" (in this case, me), informing that the script had to be interrupted (I try to sketch what I mean right after this list).
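Just to make the routine concrete, here is a rough sketch of the kind of helper I have in mind (the names get_with_retry and notify_admin are made up by me, and notify_admin is only a placeholder for the e-mail/Telegram warning):

    library(httr)

    # Hypothetical helper: retry a GET up to 3 times, waiting 60 seconds between attempts
    get_with_retry <- function(url, attempts = 3, wait = 60, timeout_sec = 20) {
      for (k in seq_len(attempts)) {
        result <- tryCatch(
          httr::GET(url, httr::timeout(timeout_sec)),
          error = function(e) {
            message("Attempt ", k, " failed for ", url, ": ", conditionMessage(e))
            NULL
          }
        )
        if (!is.null(result)) return(result)  # success: return the response
        if (k < attempts) Sys.sleep(wait)     # wait 60 seconds before retrying
      }
      NULL  # all attempts failed
    }

    # Placeholder only: replace with a real e-mail or Telegram notification
    notify_admin <- function(msg) {
      print(msg)
    }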
NOTE: Consider that the request function (httr::GET) is inside a for loop, as in the simplified example below:
for (i in link) { httr::GET(i, httr::timeout(20)) }
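With a helper like the one sketched above, I imagine the loop could look something like this (again, only an illustration of the behaviour I am asking for; the counter is my attempt to capture the "consecutive errors" condition):

    consecutive_failures <- 0

    for (i in link) {
      resp <- get_with_retry(i)
      if (is.null(resp)) {
        consecutive_failures <- consecutive_failures + 1
        if (consecutive_failures >= 2) {
          notify_admin("Consecutive errors during Web Scraping")
          break  # stop the script and warn the administrator
        }
        next  # skip to the next "i"
      }
      consecutive_failures <- 0
      # ... process resp here ...
    }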
NOTE: As I am not from an IT background, I am having a hard time understanding how error handling works in R, and therefore how to use the tryCatch function.
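For reference, the basic tryCatch pattern I have found so far looks like the snippet below, but I do not see how to combine it with the wait/retry logic described above:

    result <- tryCatch(
      httr::GET("http://www.tcm.ba.gov.br", httr::timeout(20)),  # expression that may fail
      error = function(e) {
        # runs only when the expression above signals an error
        message("Request failed: ", conditionMessage(e))
        NULL
      }
    )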
Thanks in advance for your help.