How to ignore links that do not match the established conditions and continue scraping?

I'd like to know how to ignore links that do not meet the conditions set for titulo, data_hora, and texto, so that the scraping of the site can continue.

The error that occurs when a link does not meet the conditions is: "Error in data.frame(titulo, data_hora, texto) : arguments imply differing number of rows: 1, 0"
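
What seems to happen is that when a page has no node matching one of the XPath queries, xpathSApply returns a zero-length result, and data.frame() cannot combine columns of different lengths. A minimal illustration of the mechanism (toy values, not the actual site):

titulo <- "some title"   # one node matched
texto  <- character(0)   # no node matched on this page
data.frame(titulo, texto)
# Error in data.frame(titulo, texto) :
#   arguments imply differing number of rows: 1, 0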

Below is the script:

# load libraries
library(XML)
library(xlsx)

#url_base <- "http://www.saocarlosagora.com.br/busca/?q=PDT&page=2"

url_base <- "http://www.saocarlosagora.com.br/busca/?q=bolt&page=koxa"

url_base <- gsub("bolt", "PDT", url_base)

# collect the article links from search result pages 1 to 4
links_saocarlos <- c()
for (i in 1:4) {
  url1 <- gsub("koxa", i, url_base)
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//div[@class='item']/a", xmlGetAttr, name = "href")
  links <- paste("http://www.saocarlosagora.com.br/", links, sep = "")
  links_saocarlos <- c(links_saocarlos, links)
}

dados <- data.frame()
for (links in links_saocarlos) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  # extract title, date/time and body text of each article
  titulo <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

# collapse the paragraphs of each article into a single text
agregar <- aggregate(dados$texto, list(dados$titulo, dados$data_hora), paste, collapse = ' ')
    
asked by anonymous 01.09.2016 / 01:57

1 answer

In your case, I think an if is enough, for example, replacing the line where you add to the data frame with:

if (length(titulo) == 1 & length(data_hora) == 1 & length(texto) == 1) {
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

In other words: only add the new row if all of its elements exist.
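
For reference, here is how the guard fits into your original loop (a sketch using the same variable names as in the question):

dados <- data.frame()
for (links in links_saocarlos) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  titulo <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  # only keep pages where each piece was found exactly once
  if (length(titulo) == 1 & length(data_hora) == 1 & length(texto) == 1) {
    dados <- rbind(dados, data.frame(titulo, data_hora, texto))
  }
}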

However, you could do your scraping in a more robust way as follows:

library(plyr)

raspar <- failwith(NULL, function(links){
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  titulo <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  data.frame(titulo, data_hora, texto)
})

dados <- ldply(links_saocarlos, raspar)

The failwith function catches errors without stopping execution. This is very useful when doing web scraping, since connection problems are common and can cause unexpected errors in the code.
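
To illustrate what failwith does, here is a toy sketch (le_pagina and its input are made up for the example, they are not part of the scraper): it wraps a function so that, on error, it prints the message and returns the default value instead of stopping.

library(plyr)

le_pagina <- failwith(NULL, function(x) {
  if (x == "bad") stop("could not read this page")
  toupper(x)
})

le_pagina("ok")    # returns "OK"
le_pagina("bad")   # prints the error and returns NULL, execution continues

In the scraper, when raspar fails for a link it returns NULL, and ldply simply drops those NULL results, so the final data frame contains only the pages that were read successfully.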

In addition, using plyr (the ldply function) has some advantages over your for loop. The main one is that you do not grow the object dynamically, which is usually much faster. Another advantage is that you can pass the argument .progress = "text" to get a progress bar in your code :)

dados <- ldply(links_saocarlos, raspar, .progress = "text")
    
01.09.2016 / 14:30