Web scraping: how to change the value of a drop-down menu on a site using R?

I want to create a script in R to read an HTML table. Doing this from a static page with the rvest package is easy; the problem is that I have to change the value of two drop-down menus on the page.

The site is the one in the code below. Note that above the chart there are two drop-downs: one for the state ( ctl00$cphConteudo$CpDDLEstado ) and another for the agricultural product ( ctl00$cphConteudo$CpDDLProduto ).

I tried the following code with no success:

library(rvest)
url <- "http://www.agrolink.com.br/cotacoes/historico/rs/leite-1l"
pgsession <- html_session(url)               ## create session
pgform    <- html_form(pgsession)[[1]]       ## pull form from session
filled_form <- set_values(pgform,
                          'ctl00$cphConteudo$CpDDLEstado' = "9826", #bahia
                          'ctl00$cphConteudo$CpDDLProduto' = "17") # algodão

submit_form(pgsession,filled_form)

The code returns a link to a blank page.

    
asked by anonymous 21.06.2016 / 05:34

1 answer

This site handles POST requests in a very annoying way, but it has the advantage of accepting GET requests as well. For GET, it uses URLs of the form:

http://www.agrolink.com.br/cotacoes/historico/#ESTADO/#NOME_PRODUTO

Testing a few pages, I saw that #ESTADO is always the state abbreviation in lowercase. For the product name, everything that is not alphanumeric is replaced by - .

You could then convert the product names with something like:

library(stringr)
produtos <- c("Banana Prata Anã Primeira Atacado Cx 20Kg",
              "Cebola Amarela (Ipa) Produtor 1Kg",
              "Açúcar VHP Sc 50Kg"
              )

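# Replace punctuation and whitespace with "-", lowercase, then strip accents
produtos <- produtos %>% str_replace_all("[:punct:]", "-") %>%
  str_replace_all("[:space:]", "-") %>%
  tolower() %>%
  iconv(to = "ASCII//TRANSLIT")  # transliteration results can vary by platform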
produtos
[1] "banana-prata-ana-primeira-atacado-cx-20kg" "cebola-amarela--ipa--produtor-1kg"        
[3] "acucar-vhp-sc-50kg"
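
The same conversion can also be wrapped in a small function using only base R. This is just a sketch: the name slug_produto is made up for illustration, and the rule it applies (punctuation and whitespace become - , lowercase, accents dropped) is only what was inferred above from testing a few URLs, not something the site documents.

```r
# Hypothetical helper (not part of any package): turn a product name
# into the URL form inferred above.
slug_produto <- function(nome) {
  # each punctuation or whitespace character becomes "-"
  slug <- gsub("[[:punct:][:space:]]", "-", nome)
  slug <- tolower(slug)
  # drop accents; the exact transliteration can vary by platform
  iconv(enc2utf8(slug), from = "UTF-8", to = "ASCII//TRANSLIT")
}

slug_produto("Cebola Amarela (Ipa) Produtor 1Kg")
```

Because the function is vectorized, it can be applied directly to the whole produtos vector before building the URLs.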

You can then request each page in a loop over the vectors of states and products:

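library(rvest)

estados <- c("sp", "mg")
tabelas <- list()  # one table per state/product pair
for(estado in estados){
  for(produto in produtos){
    url <- sprintf("http://www.agrolink.com.br/cotacoes/historico/%s/%s", estado, produto)
    tabela <- read_html(url) %>%
      html_nodes("#ctl00_cphConteudo_gvHistorico") %>%
      html_table()
    # keep each result instead of overwriting it on the next iteration
    tabelas[[paste(estado, produto, sep = "/")]] <- tabela[[1]]
  }
}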

Of course, this way you still need to build a list of product names and a list of state abbreviations, but I believe it is the easiest approach.
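
One way to organize those two lists is to keep the state abbreviations and the product slugs in two vectors and let expand.grid() produce every combination. A minimal sketch: montar_urls is a hypothetical name, the URL pattern is the one described above, and the example values are only illustrative (not every product necessarily exists for every state).

```r
# Hypothetical helper: one row per state/product pair, with the URL to request.
montar_urls <- function(estados, produtos) {
  combos <- expand.grid(estado = estados, produto = produtos,
                        stringsAsFactors = FALSE)
  combos$url <- sprintf("http://www.agrolink.com.br/cotacoes/historico/%s/%s",
                        combos$estado, combos$produto)
  combos
}

urls <- montar_urls(c("sp", "mg"), c("leite-1l", "acucar-vhp-sc-50kg"))
urls$url
```

Each entry of urls$url can then be passed to read_html() as in the loop above, ideally inside tryCatch() so that a combination missing from the site does not stop the whole run.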

    
21.06.2016 / 14:47