Web scraping with R

6

I'm trying to scrape the following site: link

I want to go through all the categories and build a data frame with the names of all the companies.

If you click on the name of any company, you get data such as:

  • Trade name (nome fantasia)
  • Legal company name (razão social)
  • Date opened
  • Company status
  • Legal nature
  • Address

In addition to the names, I would like to know how to get this information too.

I tried to use rvest but I did not succeed.

Any ideas?

    
asked by anonymous 21.01.2016 / 13:28

3 answers

6

My access was blocked while I was working on it, but my code ended up as shown below. It includes some explanations of how each part works. I used only current Hadley Wickham packages, including rvest, which you wanted to use.

Unfortunately the scraper is not very useful because of the captcha and the blocking. The site allows you to request a quote here . The code below can be reused for other scrapers. I recommend that you add better error handling, for example using tryCatch or dplyr::failwith .

library(rvest)
library(httr)
library(tidyr)
library(dplyr)
library(stringr)

#' Has captcha?
#'
#' Checks whether a response contains a captcha.
#'
#' @param r result of a request (package \code{\link{httr}}).
#'
#' @return \code{TRUE} if there is a captcha and \code{FALSE} otherwise.
tem_captcha <- function(r) {
  res <- r %>%
    content('text') %>%
    read_html() %>%
    html_nodes(xpath = '//form[@action="/verificarCaptcha/confirmar"]') %>%
    length()
  res > 0
}

#' Checks whether a response is the site's "Acesso bloqueado" (access blocked) page.
bloqueado <- function(r) {
  r %>%
    content('text') %>%
    str_detect('Acesso bloqueado')
}

#' Download categories
#'
#' Downloads the categories starting from the initial link
#' e.g.: http://empresasdobrasil.com/empresas/alta-floresta-mt/
#'
#' @param link URL of the city.
#'
#' @return \code{data.frame}
baixar_categorias <- function(link) {
  r <- GET(link)
  if (r$status_code != 200) return(data.frame(result = 'erro'))
  if (tem_captcha(r)) return(data.frame(result = 'captcha'))
  if (bloqueado(r)) return(data.frame(result = 'bloqueado'))
  u_base <- 'http://empresasdobrasil.com'
  r %>%
    content('text') %>%
    read_html() %>%
    html_nodes('.container a.linhas') %>% {
      data.frame(tipo = html_text(.),
                 link_categoria = paste0(u_base, html_attr(., 'href')),
                 stringsAsFactors = FALSE)
    } %>%
    mutate(result = 'OK')
}

#' Download companies
#'
#' Downloads the companies starting from the link of a category
#' e.g.: http://empresasdobrasil.com/empresas/alta-floresta-mt/hoteis
#'
#' @param link URL of the category
#'
#' @return \code{data.frame}
baixar_empresas <- function(link) {
  r <- GET(link, write_disk('arq.html', overwrite = TRUE))
  if (r$status_code != 200) return(data.frame(result = 'erro'))
  if (tem_captcha(r)) return(data.frame(result = 'captcha'))
  if (bloqueado(r)) return(data.frame(result = 'bloqueado'))
  u_base <- 'http://empresasdobrasil.com'
  r %>%
    content('text') %>%
    read_html() %>%
    html_node('table') %>% {
      tab <- html_table(.) %>%
        setNames('nome_razao') %>%
        separate(nome_razao, c('nome_fantasia', 'razao_social'),
                 sep = ' - ', extra = 'merge', fill = 'left')
      links <- html_nodes(., 'a') %>%
        html_attr('href')
      tab$link_empresa <- paste0(u_base, links)
      tab
    } %>%
    mutate(result = 'OK')
}

#' Download the details of one company
#'
#' Downloads the fields of a single company page: the h4 elements
#' become the titles and the h5 elements become the values.
#'
#' @param link URL of the company (column link_empresa from baixar_empresas)
#'
#' @return \code{data.frame}
baixar_empresa <- function(link) {
  r <- GET(link)
  if (r$status_code != 200) return(data.frame(result = 'erro'))
  if (tem_captcha(r)) return(data.frame(result = 'captcha'))
  if (bloqueado(r)) return(data.frame(result = 'bloqueado'))
  r %>%
    content('text') %>%
    read_html() %>% {
      data.frame(titulo = html_text(html_nodes(., 'h4')),
                 texto = html_text(html_nodes(., 'h5')),
                 stringsAsFactors = FALSE)
    } %>%
    mutate(result = 'OK')
}

# The example city is now a default value instead of overwriting the argument
baixar_tudo <- function(link = 'http://empresasdobrasil.com/empresas/alta-floresta-mt/') {
  d <- link %>%
    baixar_categorias() %>%
    group_by(tipo, link_categoria) %>%
    do(baixar_empresas(.$link_categoria)) %>%
    ungroup() %>%
    group_by(tipo, link_categoria, nome_fantasia,
             razao_social, link_empresa) %>%
    do(baixar_empresa(.$link_empresa))
  d
}
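
The error handling recommended in this answer could look something like the sketch below, which uses only base R's tryCatch. The wrapper com_tratamento and the failing function falha are hypothetical names made up for the illustration; they are not part of the code above. The idea is that a network error becomes a one-row data frame with result = "erro", the same convention the functions above use for bad status codes.

```r
# Hypothetical helper (not in the original answer): wraps a scraping
# function so that any error becomes a one-row data frame instead of
# stopping the whole run.
com_tratamento <- function(f) {
  function(...) {
    tryCatch(
      f(...),
      error = function(e) {
        data.frame(result = "erro",
                   mensagem = conditionMessage(e),
                   stringsAsFactors = FALSE)
      }
    )
  }
}

# Made-up function that always fails, standing in for a request that times out
falha <- function(link) stop("timeout")

falha_segura <- com_tratamento(falha)
res <- falha_segura("http://example.com")
print(res)
```

Under the same assumption, com_tratamento(baixar_empresa) would give a version of baixar_empresa that returns an error row when the request itself fails.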
    
22.01.2016 / 00:42
4

You only need the XML library to scrape the data.

The code below worked to capture the info of the first companies. However, I was then blocked for excessive access. If you get past this barrier, the code works.

First, I captured the links of all categories. Next, in each category, the links of each company. Finally, a for loop scrapes the page of each company, extracts the data in the h5 tags and inserts it as a new row in an initially empty data frame.

library(XML)

url <- "http://empresasdobrasil.com/empresas/alta-floresta-mt/"
page_source <- xmlRoot(htmlParse(readLines(url)))
# Links to every category listed on the city page
links_categorias <- xpathSApply(page_source, "//a[@class = 'linhas']", xmlGetAttr, name = "href")

url_parcial <- "http://empresasdobrasil.com/"
# Links to every company, gathered category by category
links_empresas <- c()
for (i in seq_along(links_categorias)) {
  print(i)  # progress counter
  url <- paste0(url_parcial, links_categorias[i])
  page_source <- xmlRoot(htmlParse(readLines(url)))
  links <- xpathSApply(page_source, "//td/a[@href]", xmlGetAttr, name = "href")
  links_empresas <- c(links_empresas, links)
}

# One row per company, built from the text of its h5 tags
dados <- data.frame()
for (i in seq_along(links_empresas)) {
  print(i)  # progress counter
  url <- paste0(url_parcial, links_empresas[i])
  page_source <- xmlRoot(htmlParse(readLines(url)))
  info_empresa <- xpathSApply(page_source, "//h5", xmlValue)
  dados <- rbind(dados, info_empresa)
}
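
Calling rbind inside the loop builds a new data frame on every iteration. A possible alternative, sketched here with made-up values in place of the scraped h5 texts, is to keep each row in a list and bind them all once at the end. The names paginas and linhas are assumptions for the illustration only.

```r
# Made-up stand-ins for the h5 values scraped from each company page
paginas <- list(
  c("Hotel A", "Hotel A Ltda", "Ativa"),
  c("Hotel B", "Hotel B ME", "Baixada")
)

# Pre-allocate one slot per company and fill it inside the loop
linhas <- vector("list", length(paginas))
for (i in seq_along(paginas)) {
  info_empresa <- paginas[[i]]
  # t() turns the character vector into a single row
  linhas[[i]] <- as.data.frame(t(info_empresa), stringsAsFactors = FALSE)
}

# Bind all rows in a single call
dados <- do.call(rbind, linhas)
print(dados)
```

This sketch assumes every page yields the same number of h5 values; pages with a different count would need to be padded or handled separately.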
    
21.01.2016 / 23:45
-1

RSelenium is a good idea, although you do not need to load the pages in a browser: the XML package functions (htmlParse, getNodeSet, xmlValue and xmlGetAttr) are enough.

1 - collect all links from the sectors;

2 - collect company links (needs a loop, with links from the previous step)

3 - collect company data (loop with links from previous step)
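
As a rough illustration of step 1 with the four functions named above, the sketch below parses a small made-up HTML string instead of a downloaded page. The markup and the hrefs are assumptions; only the class name linhas comes from the other answers in this thread.

```r
library(XML)

# Made-up HTML standing in for a downloaded city page
html <- '<html><body>
  <a class="linhas" href="/empresas/cidade/hoteis">Hoteis</a>
  <a class="linhas" href="/empresas/cidade/bares">Bares</a>
</body></html>'

# Parse the text and select the sector links by class
doc <- htmlParse(html, asText = TRUE)
nos <- getNodeSet(doc, "//a[@class='linhas']")

# xmlValue gives the link text, xmlGetAttr the href attribute
setores <- data.frame(
  nome = sapply(nos, xmlValue),
  link = sapply(nos, xmlGetAttr, "href"),
  stringsAsFactors = FALSE
)
print(setores)
```

Steps 2 and 3 would repeat the same pattern in a loop over setores$link, changing only the XPath expression.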

    
22.01.2016 / 00:18