Web scraping to collect scientific articles on ScienceDirect


I'm trying to use R to select articles from ScienceDirect pages using keywords. Last week I was able to extract PDFs from a page using the page-source information. The code I used was as follows:

base.url = "http"
doc.html <- htmlParse(base.url)
doc.links <- xpathSApply(doc.html, "//a/@href")
pdf.url <- doc.links[grep("http:/", doc.links)]
dat<-as.data.frame(pdf.url)
colnames(dat)<-"url"
dat$pdf<-unlist(lapply(dat$url, FUN = function(x) strsplit(x, "/")[[1]][3]))
lapply(dat$pdf, function(x)
download.file(paste("http//pdf/", x, sep=""), 
paste(download.folder, x, sep=""), mode = "wb", cacheOK=TRUE))

Does anyone have any suggestions on how I can do the same for ScienceDirect?

    
asked by anonymous 13.10.2014 / 17:44

1 answer


I have a suggestion using the RSelenium and XML packages. RSelenium controls a web browser (in this case, Firefox) and lets you navigate automatically from the command line. This is very useful for pages with a lot of complex code and JavaScript, although it is not the simplest solution. Perhaps someone can post an example here using the rvest package...
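
For reference, on a page that serves plain static HTML, a minimal rvest sketch could look like the following. The URL and the PDF filter below are placeholders of mine, and this approach will not handle ScienceDirect's JavaScript-driven search results, which is exactly why RSelenium is used here:

    # Sketch only: works for static pages; URL and filter are placeholders
    library(rvest)
    pagina <- read_html("http://www.example.com")
    links  <- html_attr(html_nodes(pagina, "a"), "href")
    pdfs   <- links[grepl("\\.pdf$", links)]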

Let's go then:

Installing ...

    #install.packages("devtools")
    #install.packages("RCurl",dep=T)
    #install.packages("XML",dep=T)
    #install.packages("RJSONIO",dep=T)

    #library(devtools)
    #install_github("ropensci/RSelenium")

Now we load the RSelenium and XML packages (XML will be used later to parse the page source):

    library(RSelenium)
    library(XML)

Next we install a Java/Selenium server to control Firefox. It is a program that runs alongside R and works as a "translation" layer between R and the browser.

    checkForServer() # downloads a Selenium server (only needs to be done once)
    startServer()    # keep this window open

Make sure Mozilla Firefox is installed! Let's set up the connection to it:

    firefox_con <- remoteDriver(remoteServerAddr = "localhost", 
                                port = 4444, 
                                browserName = "firefox"
    )

Opening Firefox (the navigation will take place in it):

    firefox_con$open() # keep this window open

    # Defining the page of interest
    url <- "http://www.sciencedirect.com"

We navigate to the page of interest in Firefox:

    firefox_con$navigate(url)

And we type the search term ("Biology") into the search box, then press ENTER to run the search:

    busca <- firefox_con$findElement(using = "css selector", "#qs_all")
    busca$sendKeysToElement(list("Biology", key="enter"))
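
Depending on the connection, the results may take a few seconds to render. A crude but simple precaution (my addition, not part of the original flow) is to pause before reading the page source:

    # give the results page a few seconds to load before grabbing its source
    Sys.sleep(5)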

Now the rest is done with the XML package:

    # Extracting the page source
    pagina <- xmlRoot(
                    htmlParse(
                            unlist(firefox_con$getPageSource())
                    ))

    # Extracting the links to the PDFs (some of them may require paid access...)
    pdf_links <- xpathSApply(pagina, '//span[@class="pdfIconSmall"]/..', xmlGetAttr, "href")
    links_incompletos <- grep("^/", pdf_links)
    pdf_links[links_incompletos] <- paste0(url, pdf_links[links_incompletos])

    # Your links
    pdf_links

    # links that work (the free ones)
    pdf_gratis <- pdf_links[grep("article", pdf_links)]

    # DOI (the DOI will be the name of each saved file)
    # note: assumes a fixed URL structure, so the DOI sits at characters 50-66
    DOI <- substr(pdf_gratis, 50, 66)

    # Downloading
    ### setwd... set a working directory first...

    for(i in seq_along(pdf_gratis)){
            download.file(pdf_gratis[i],
                          paste0(DOI[i], ".pdf"),
                          mode = "wb")
    }
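
Since some of the links may point to paywalled PDFs, a single failed request will stop the loop above. A variant using tryCatch (my own sketch, same file-naming scheme) skips the failures instead:

    # Sketch: skip links that fail (e.g. paywalled PDFs) instead of stopping the loop
    for(i in seq_along(pdf_gratis)){
            tryCatch(
                    download.file(pdf_gratis[i],
                                  paste0(DOI[i], ".pdf"),
                                  mode = "wb"),
                    error = function(e) message("Download failed: ", pdf_gratis[i])
            )
    }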

I hope this helps.

    
13.10.2014 / 17:52