I have a suggestion using the RSelenium and XML packages. RSelenium drives a real web browser (here, Firefox) and lets you navigate pages programmatically from R. This is a big advantage for pages with a lot of complex markup and JavaScript, though it is not the simplest solution. Someone may well post an example here using the rvest package.
...
Let's go then:
Installing the dependencies:
#install.packages("devtools")
#install.packages("RCurl",dep=T)
#install.packages("XML",dep=T)
#install.packages("RJSONIO",dep=T)
#library(devtools)
#install_github("ropensci/RSelenium")
Now we load the RSelenium package
library(RSelenium)
Next we download a Java/Selenium server to control Firefox. It is a program that runs alongside R and acts as a "translation" layer between R and the browser.
checkForServer() # downloads a Selenium server (only needs to be done once)
startServer() # keep this window open
Make sure Mozilla Firefox is installed! Let's open it:
firefox_con <- remoteDriver(remoteServerAddr = "localhost",
                            port = 4444,
                            browserName = "firefox")
Open Firefox (navigation will happen in this window):
firefox_con$open() # keep this window open
# Define the page of interest
url <- "http://www.sciencedirect.com"
We navigate to the page of interest in firefox
firefox_con$navigate(url)
Then we enter the search term ("Biology") in the text box and press ENTER to run the search:
busca <- firefox_con$findElement(using = "css selector", "#qs_all")
busca$sendKeysToElement(list("Biology", key="enter"))
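The results page is rendered with JavaScript, so it can help to give it a moment to finish loading before grabbing the source. A minimal sketch, assuming a short pause plus a title check is enough (the 2-second pause and 10-attempt cap are arbitrary choices, not part of the original answer):

```r
# Wait briefly, then poll the page title until the results have loaded
Sys.sleep(2)
for (attempt in 1:10) {
  if (grepl("Biology", unlist(firefox_con$getTitle()))) break
  Sys.sleep(1)
}
```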
Now the XML package does the rest:
library(XML)

# Extract the page source
pagina <- xmlRoot(
  htmlParse(
    unlist(firefox_con$getPageSource())
  ))
# Extract the links to the PDFs (some of them may require paid access...)
pdf_links <- xpathSApply(pagina, '//span[@class="pdfIconSmall"]/..', xmlGetAttr, "href")
links_incompletos <- grep("^/", pdf_links)
pdf_links[links_incompletos] <- paste0(url,pdf_links[links_incompletos])
# Your links
pdf_links
# Links that work (free access)
pdf_gratis <- pdf_links[grep("article",pdf_links)]
# DOI (the DOI will be the name of the saved file)
# Note: the character positions below are hard-coded for ScienceDirect's URL format
DOI <- substr(pdf_gratis, 50, 66)
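The fixed character positions in `substr()` are fragile: if ScienceDirect changes its URL layout, the file names will be garbled. A hedged alternative is to pull the identifier out of each URL with a regular expression, assuming the free links contain a `/pii/` segment as they did when this was written:

```r
# Extract the identifier after "/pii/" to use as the file name
DOI <- sub(".*/pii/([^/?]+).*", "\\1", pdf_gratis)
```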
# Downloading
### setwd... set a working directory...
for (i in seq_along(pdf_gratis)) {
  download.file(pdf_gratis[i],
                paste0(DOI[i], ".pdf"),
                mode = "wb")
}
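One caveat: `download.file()` throws an error on any link you don't actually have access to, which stops the loop. A sketch that skips failures instead, using base R's `try()` (this error handling is my addition, not part of the original answer):

```r
for (i in seq_along(pdf_gratis)) {
  resultado <- try(download.file(pdf_gratis[i],
                                 paste0(DOI[i], ".pdf"),
                                 mode = "wb"),
                   silent = TRUE)
  if (inherits(resultado, "try-error")) {
    message("Download failed, skipping: ", pdf_gratis[i])
  }
}
```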
I hope this helps.