How to do web scraping of a site that uses the POST method?


I'm having trouble doing web scraping on sites that use the POST method. For example, I need to extract all news related to political parties from this site: link .

Below is a script I wrote for a newspaper site that uses the GET method, to show what my goal is with this script.

# load libraries
library(XML)
library(xlsx)

# real URL: http://www.imparcial.com.br/site/page/2?s=%28PSDB%29
url_base <- "http://www.imparcial.com.br/site/page/koxa?s=%28quatro%29"

url_base <- gsub("quatro", "PSDB", url_base)  # search term (matches the reference URL and the output file below)

link_imparcial <- c()

# collect article links from the first 4 result pages
for (i in 1:4) {
  print(i)
  url1 <- gsub("koxa", i, url_base)  # insert the page number
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//h1[@class='cat-titulo']/a", xmlGetAttr, name = "href")
  link_imparcial <- c(link_imparcial, links)
}
dados <- data.frame()

# visit each link and extract title, date/time and body paragraphs
for (links in link_imparcial) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)
  titulo <- xpathSApply(pag1, "//div[@class='titulo']/h1", xmlValue)
  data_hora <- xpathSApply(pag1, "//span[@class='data-post']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@class='conteudo']/p", xmlValue)
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}


# collapse the paragraphs of each article into a single text
agregar <-
  aggregate(dados$texto, list(dados$titulo, dados$data_hora), paste, collapse = ' ')

# set working directory (backslashes must be escaped in R strings)
setwd("C:\\Users\\8601314\\Documents")

# save as xlsx
write.xlsx(agregar, "PSDB.xlsx", col.names = TRUE, row.names = FALSE)

If it is not possible to solve my problem, I would appreciate pointers to where I can find examples of web scraping with the POST method.

    
asked by anonymous 31.05.2016 / 18:18

1 answer


In this case you can do so using the httr package:

library(httr)
library(rvest)
library(purrr)
library(stringr)

url <- "http://www.diariodemarilia.com.br/resultado/"
# "Busca" is the name of the search form field; "PT" is the search term
res <- POST(url, body = list("Busca" = "PT"))

You can then extract the data in the usual way or by using rvest:

noticias <- content(res, as = "text", encoding = "latin1") %>%
  read_html() %>%
  html_nodes("td")


# extract titles
noticias %>%
  html_nodes("strong") %>%
  html_text()
# extract links
noticias %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  keep(~str_detect(.x, fixed("/noticia/")))
# extract dates
noticias %>%
  html_nodes("em") %>%
  html_text()
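
If you want to put these pieces together into a single data frame, like dados in the question, here is a minimal sketch. It assumes the three vectors line up one to one (one title, one /noticia/ link and one date per result), which is worth verifying on the real page:

titulos <- noticias %>% html_nodes("strong") %>% html_text()
links   <- noticias %>% html_nodes("a") %>% html_attr("href") %>%
  keep(~str_detect(.x, fixed("/noticia/")))
datas   <- noticias %>% html_nodes("em") %>% html_text()

# data.frame() will raise an error if the lengths differ,
# which is a useful sanity check on the assumption above
resultado <- data.frame(titulo = titulos, data = datas, link = links,
                        stringsAsFactors = FALSE)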

The idea when extracting information from a site that receives POST forms is to find out exactly what information the browser sends to the server.

I always open the site in Chrome, press F12 to open the developer tools, and go to the Network tab.

Then I submit the form normally through the website, go back to the Network tab, and click the first item in the list; in this case it is /resultado/.

Now, look at the Form Data section in the image below: this is the information you need to send to the server using the body parameter of httr's POST function.
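
In general the recipe is the same for any site: copy each field listed under Form Data into the body list. A minimal sketch of that pattern, using the "Busca" field from the example above (encode = "form" makes httr send a URL-encoded form, which is what a browser does by default; httr's own default is multipart):

library(httr)

# one entry in body per field observed under Form Data
res <- POST("http://www.diariodemarilia.com.br/resultado/",
            body = list("Busca" = "PT"),
            encode = "form")   # submit as a URL-encoded HTML form

stop_for_status(res)  # stop with an error if the server returned an HTTP error
pagina <- content(res, as = "text", encoding = "latin1")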

    
answered 31.05.2016 / 19:16