Using Mozilla Firefox, could anyone tell me how to do scraping on Google Scholar? Where should I start?
I'm going to post my web-scraping code for Google Scholar here; it searches by keywords. With this code I was able to extract the following information from the Google Scholar results page: title, authors, summary and number of citations. The code is based on work by the following R programmers: Kay Cichini, Gabor Pozsgai and Rogério Barbosa.
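library(RSelenium)
library(XML) # provides htmlParse(), xmlRoot(), xpathSApply() and xmlValue() used below
library(xlsx)
# Note: newer RSelenium releases mark checkForServer()/startServer() as defunct in favour of rsDriver()
checkForServer() # download the Selenium server (only needs to be done once)
startServer() # keep this window open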
firefox_con <- remoteDriver(remoteServerAddr = "localhost",
port = 4444,
browserName = "firefox"
)
firefox_con$open() # keep this window open
url <- paste("http://scholar.google.com/scholar?q=", "+key+word", "&num=1&as_sdt=1&as_vis=1",
sep = "")
firefox_con$navigate("http://scholar.google.com.br")
busca <- firefox_con$findElement(using = "css selector", value = "#gs_hp_tsi")
Keyword <- busca$sendKeysToElement(list("key word", key="enter"))
pages.max <- 10
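# Extract title, authors, abstract snippet and footer links (including "Cited by N")
# from an already parsed results page; x is the root node returned by xmlRoot().
scraper_internal <- function(x) {
  tit <- xpathSApply(x, "//h3[@class='gs_rt']", xmlValue)
  aut <- xpathSApply(x, "//div[@class='gs_a']", xmlValue)
  abst <- xpathSApply(x, "//div[@class='gs_rs']", xmlValue)
  others <- xpathSApply(x, "//div[@class='gs_fl']", xmlValue)
  # data.frame() needs all four vectors to have the same length,
  # so this assumes every result on the page has every field
  data.frame(TITLE = tit, AUTHORS = aut, ABSTRACT = abst, CITED = others,
             stringsAsFactors = FALSE)
}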
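# "start" is the zero-based offset of the first result on the page: 0, 10, 20, ...
for (i in seq(0, (pages.max - 1) * 10, 10)) {
  baseURL <- paste("http://scholar.google.com/scholar?start=", i, "&q=", "+key+word",
                   "&hl=en&lr=lang_en&num=10&as_sdt=1&as_vis=1",
                   sep = "")
  firefox_con$navigate(baseURL)
  # parse the HTML rendered by the browser rather than downloading the page again
  pagina <- xmlRoot(htmlParse(
    unlist(firefox_con$getPageSource())
  ))
  result <- scraper_internal(pagina)
  # one worksheet per results page, all appended to the same workbook
  write.xlsx(result, "C:/KEYWORD.xlsx",
             sheetName = paste("keyword", i), row.names = TRUE, col.names = TRUE, append = TRUE)
}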
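The CITED column written above holds the whole footer text of each result, not just the number. A minimal sketch of how the citation count could be pulled out of that text is shown below. The helper name `extract_cited_by` is hypothetical (it is not part of the script above), and it assumes the English interface (`hl=en`), where the footer contains the phrase "Cited by N".

```r
# Hypothetical helper: pull the number out of footer text such as
# "Cited by 42 Related articles All 5 versions".
# Assumes the English wording "Cited by"; returns NA where it is absent.
extract_cited_by <- function(footer_text) {
  hit <- regexpr("Cited by [0-9]+", footer_text)
  counts <- rep(NA_integer_, length(footer_text))
  # regmatches() returns only the matched substrings, in order,
  # so they are assigned back to the positions that matched
  counts[hit > 0] <- as.integer(sub("Cited by ", "", regmatches(footer_text, hit)))
  counts
}

# Example with made-up footer strings (not real Scholar data)
example_footers <- c("Cited by 42 Related articles All 5 versions",
                     "Related articles All 2 versions")
example_counts <- extract_cited_by(example_footers)
```

Inside the loop this could be applied as `result$CITED_N <- extract_cited_by(result$CITED)` before the call to `write.xlsx()`, where `CITED_N` is just an illustrative column name.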