Introduction
Since 1999, Brazilian researchers have had a website where they can post information about their academic careers. These pages are known as Currículos Lattes (Lattes CVs). I want to download a few thousand of these CVs and, together with some collaborators, write an article based on them.
This link goes to the CV of researcher Suzana Carvalho Herculano Houzel. Notice that clicking the link redirects the browser to a page with a captcha. This is my first problem: how do I get past it? I tried two different approaches, one in Python and one in R.
Python
Apparently there is a well-known Python module called scriptLattes. In theory, it can download a number of Lattes CVs, given a list of curriculum IDs (for example, the ID of the CV I linked above is 4706332670277273).
However, the module has not been updated since 2015, and Lattes has since added a captcha to its pages. I think this breaks the module: when I tried to run one of its examples on my Ubuntu machine, I got the following result:
$ ./scriptLattes.py ./exemplo/teste-01.config
[ROTULO] [Sem rotulo]
[LENDO REGISTRO LATTES: 1o. DA LISTA]
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>
The command only stopped after I canceled it manually with Ctrl+C. I suspect the problem is precisely the captcha, which was introduced after the last version of the module was released.
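At the very least, the hang can be turned into a fast failure. Here is a minimal sketch (assuming Python 3; this is not scriptLattes itself, and the URL is a deliberately unresolvable placeholder) showing how an explicit timeout makes each request fail quickly instead of blocking until Ctrl+C:

```python
import urllib.request
import urllib.error

def try_fetch(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError):
        # Covers both "Connection timed out" and "Name or service not known"
        return None

# A host that cannot resolve fails immediately instead of hanging:
print(try_fetch("http://nonexistent.invalid/"))  # None
```

This does not solve the captcha, of course, but it would at least let a batch script skip dead requests and keep going.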
I have some experience with web scraping in Python. I know the scrapy and beautifulsoup modules, but I am not an expert in either.
R
R has a package called GetLattesData. However, the following notice is posted in its repository:
** ATTENTION: The package is not working as of 2017-11-26. The Lattes website, where the xml files were available, is offline. **
Indeed, the server with the XML files has been down since last November and has never come back. I tested the package today and it still does not work.
I have found other R packages that work with Lattes, such as CochoLattes. The problem is that they require downloading the data manually, entering the captchas one by one.
I have experience with web scraping in R using the rvest package.
Escavador
The site Escavador scrapes the Lattes CVs itself. I contacted the site's team, and the data is not freely available; they sell access to their API through a credit system. I am against paying for information that is public, but if nothing else works, I may have to resort to this.
Conclusion
Note that my problem is not organizing and scraping the data itself; it comes before that: how do I access the pages with the researchers' CVs in the first place? I have experience with data scraping, but I have never faced a captcha problem like this.
Also, I do not have a list of the IDs of all the CVs I want. Each CV has two unique IDs; in the case of the CV above, they are 4706332670277273 and K4727050Y3, each accessed through a different URL. Although the IDs are different, both URLs show the same content.
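For concreteness, the two IDs seem to map to two URL patterns (the short lattes.cnpq.br form and the buscatextual form). I am taking both patterns as assumptions inferred from the links above, not as a documented API; a sketch of building them:

```python
# Sketch: build the two URL forms for the same CV from its two IDs.
# Both URL patterns are assumptions inferred from observed links,
# not an official, documented interface.
def lattes_urls(numeric_id, k_id):
    return {
        "short": f"http://lattes.cnpq.br/{numeric_id}",
        "buscatextual": f"http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id={k_id}",
    }

urls = lattes_urls("4706332670277273", "K4727050Y3")
print(urls["short"])  # http://lattes.cnpq.br/4706332670277273
```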
What should I do in this case? I believe getting the list of CVs I want is not hard: this link lists the addresses of more than 5 million Lattes CVs. I could crawl and scrape it to collect the IDs I need.
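Such a crawl could collect the 16-digit numeric IDs with a simple regular expression over each fetched page. A sketch (the sample HTML below is invented for illustration; a real crawl would fetch the listing pages first):

```python
import re

# 16-digit Lattes IDs embedded in links like http://lattes.cnpq.br/4706332670277273
LATTES_ID_RE = re.compile(r"lattes\.cnpq\.br/(\d{16})")

def extract_ids(html):
    """Return the unique 16-digit Lattes IDs found in a chunk of HTML, in order."""
    seen = []
    for match in LATTES_ID_RE.finditer(html):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

# Invented sample page for illustration:
sample = ('<a href="http://lattes.cnpq.br/4706332670277273">CV</a> '
          '<a href="http://lattes.cnpq.br/4706332670277273">duplicate link</a>')
print(extract_ids(sample))  # ['4706332670277273']
```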
So, my problem is downloading the CV data (i.e., pages like the one linked above) automatically, without having to enter the captcha. How can I do this? R or Python, either is fine.