How to do web scraping on Google Scholar?


Using Mozilla Firefox, could anyone tell me how to do web scraping on Google Scholar? Where should I start?

asked by anonymous 27.10.2014 / 14:11

1 answer


I'm going to publish here my web scraping code for Google Scholar, driven by keywords. With this code I was able to extract information such as title, authors, abstract, and number of citations from the Google Scholar results page. The code is based on information from the following R programmers: Kay Cichini, Gabor Pozsgai, and Rogério Barbosa.

library(RSelenium)
library(XML)  # needed for htmlParse() and xpathSApply() below
library(xlsx)

checkForServer() # downloads a Selenium server (only needs to be done once)
startServer()    # keep this window open

Make sure Mozilla Firefox is installed! Let's open a connection to it:

firefox_con <- remoteDriver(remoteServerAddr = "localhost", 
                            port = 4444, 
                            browserName = "firefox"
)

Opening Firefox (navigation will take place in it):

firefox_con$open() # keep this window open

Performing the scraping

# Search-URL pattern; "+key+word" is a placeholder for your own search terms
# (the loop further down builds its own URL for each results page)
url <- paste("http://scholar.google.com/scholar?q=", "+key+word", "&num=1&as_sdt=1&as_vis=1", 
             sep = "")
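For illustration, here is what that `paste()` call produces once the placeholder is replaced by a real term (the keyword "data+mining" below is purely a hypothetical example; spaces in a query become "+"):

```r
# Illustration only: building the query URL for a hypothetical search term.
term <- "data+mining"
demo_url <- paste("http://scholar.google.com/scholar?q=", term,
                  "&num=1&as_sdt=1&as_vis=1", sep = "")
demo_url
# "http://scholar.google.com/scholar?q=data+mining&num=1&as_sdt=1&as_vis=1"
```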

firefox_con$navigate("http://scholar.google.com.br")
busca <- firefox_con$findElement(using = "css selector", value = "#gs_hp_tsi") # the search box
busca$sendKeysToElement(list("key word", key = "enter")) # type your terms and press Enter

pages.max <- 10 # number of result pages to scrape

scraper_internal <- function(x) {
  # x is the parsed HTML root of a results page
  tit <- xpathSApply(x, "//h3[@class='gs_rt']", xmlValue)     # titles
  aut <- xpathSApply(x, "//div[@class='gs_a']", xmlValue)     # authors / venue / year
  abst <- xpathSApply(x, "//div[@class='gs_rs']", xmlValue)   # abstract snippets
  others <- xpathSApply(x, "//div[@class='gs_fl']", xmlValue) # links line, includes "Cited by N"
  data.frame(TITLE = tit, AUTHORS = aut, ABSTRACT = abst, CITED = others)
}
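Note that the CITED column holds the entire links line (e.g. "Cited by 123 Related articles All 5 versions"), not just the number. A minimal sketch for pulling the count out with a regular expression, assuming the English-language interface (the helper name `extract_cited` is my own, not part of the original code):

```r
# Hypothetical helper: extract the numeric citation count from the
# "gs_fl" links line; returns NA when the entry has no "Cited by" link.
extract_cited <- function(links_line) {
  m <- regmatches(links_line, regexpr("Cited by [0-9]+", links_line))
  if (length(m) == 0) NA else as.numeric(sub("Cited by ", "", m))
}

extract_cited("Cited by 123 Related articles All 5 versions") # 123
extract_cited("Related articles All 2 versions")              # NA
```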

for (i in seq(0, (pages.max - 1) * 10, 10)) { # Scholar pages start at 0, 10, 20, ...
  baseURL <- paste("http://scholar.google.com/scholar?start=", i, "&q=", "+key+word",
                   "&hl=en&lr=lang_en&num=10&as_sdt=1&as_vis=1",
                   sep = "")
  firefox_con$navigate(baseURL)
  pagina <- xmlRoot(htmlParse(
    unlist(firefox_con$getPageSource()), encoding = "UTF-8"
  ))
  result <- scraper_internal(pagina)
  write.xlsx(result, "C:/KEYWORD.xlsx", 
             sheetName = paste("keyword", i), row.names = TRUE, col.names = TRUE, append = TRUE)
}
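If you prefer one table instead of one worksheet per page, the per-page data frames can be collected and bound together before a single write.xlsx call. A toy sketch of the binding step (the data frames below are stand-ins for `scraper_internal()` results, not real output):

```r
# Sketch: accumulating per-page data frames into one table.
page1 <- data.frame(TITLE = c("A", "B"), CITED = c(10, 3))
page2 <- data.frame(TITLE = c("C"), CITED = c(7))
all_pages <- do.call(rbind, list(page1, page2))
nrow(all_pages) # 3 rows, one per result across both pages
```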
28.10.2014 / 19:54