I'm having trouble doing web scraping on sites that use the POST method. For example, I need to extract all news related to political parties from this site: link.
Below is a script I wrote for a newspaper site that uses the GET method, to show what my goal is.
# load libraries
library(XML)
library(xlsx)
# real URL = http://www.imparcial.com.br/site/page/2?s=%28PSDB%29
url_base <- "http://www.imparcial.com.br/site/page/koxa?s=%28quatro%29"
url_base <- gsub("quatro", "PSD", url_base)
link_imparcial <- c()
# loop over the first four result pages and collect the links to each article
for (i in 1:4) {
  print(i)
  url1 <- gsub("koxa", i, url_base)
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//h1[@class='cat-titulo']/a", xmlGetAttr, name = "href")
  link_imparcial <- c(link_imparcial, links)
}
dados <- data.frame()
# visit each article link and extract title, date/time and body paragraphs
for (links in link_imparcial) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)
  titulo <- xpathSApply(pag1, "//div[@class='titulo']/h1", xmlValue)
  data_hora <- xpathSApply(pag1, "//span[@class='data-post']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@class='conteudo']/p", xmlValue)
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}
# collapse the body paragraphs of each article into a single text field
agregar <-
  aggregate(dados$texto, list(dados$titulo, dados$data_hora), paste, collapse = ' ')
# set working directory (forward slashes, or escaped backslashes, are needed on Windows)
setwd("C:/Users/8601314/Documents")
# save to xlsx
write.xlsx(agregar, "PSDB.xlsx", col.names = TRUE, row.names = FALSE)
If it is not possible to solve my problem, I would appreciate pointers to where I can find programming examples using the POST method.
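To make clearer what I am looking for, here is a rough sketch of what I imagine a POST request would look like with the httr package, though I am not sure this is right. The URL and the form field names (s, pagina) are placeholders I made up; the real ones would have to be read from the target site's search form (for example with the browser's developer tools).

# minimal sketch, assuming the site exposes a search form submitted via POST;
# the URL and the field names below are placeholders, not the real ones
library(httr)
library(XML)
resposta <- POST("http://www.example.com/busca",        # hypothetical form action URL
                 body = list(s = "PSDB", pagina = "1"),  # hypothetical form fields
                 encode = "form")                        # send as a regular HTML form
pag <- htmlParse(content(resposta, as = "text"), asText = TRUE)
# from here on, the page could be queried with xpathSApply() exactly as in the GET script above
links <- xpathSApply(pag, "//h1[@class='cat-titulo']/a", xmlGetAttr, name = "href")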