Extract information from lattes


Introduction

Since 1999, Brazilian researchers have had a website where they can post information about their academic careers. These resumes are known as Currículos Lattes. I want to download a few thousand of them and, together with some collaborators, write an article based on the data.

This link goes to the resume of researcher Suzana Carvalho Herculano Houzel. Notice that, on clicking the link, the browser is redirected to a page with a captcha. That is my first problem: how do I get past it? I tried two different approaches, one in Python and one in R.

python

Apparently there is a well-known Python module called scriptLattes. In theory, it can download any number of Lattes resumes, provided it is given a list of curriculum IDs (for example, the ID of the resume linked above is 4706332670277273).

However, the module has not been updated since 2015, and Lattes has since added a captcha to its pages. I suspect this breaks the module, because when I ran one of its examples on my Ubuntu machine I got the following result:

$ ./scriptLattes.py ./exemplo/teste-01.config
[ROTULO]  [Sem rotulo]

[LENDO REGISTRO LATTES: 1o. DA LISTA]
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>

The command only stopped after I cancelled it manually with Ctrl + C. I imagine the problem is precisely the captcha introduced after the last version of the module was published.

I have some experience with web scraping in Python. I know the scrapy and beautifulsoup modules, but I am not an expert in either.

R

R has a package called GetLattesData. However, the following notice is posted in its repository:

**ATTENTION: The package has not been working since 2017-11-26. The Lattes website where the XML files were available is offline.**

Indeed, the server with the XML files has been down since November of last year and has never come back. I tested the package today and it still does not work.

I have found other R packages that work with Lattes, such as CochoLattes. The problem is that they require downloading the data manually, entering the captchas one by one.

I have experience with web scraping in R using the rvest package.

Escavador

The site Escavador scrapes the Lattes resumes itself. I contacted their team and the data is not freely available; however, they sell access to their API through a credit system. I am against paying for information that is public, but if nothing else works I may have to resort to this.

Conclusion

Note that my problem is not the organization and scraping of the data itself; it comes before that: how do I access the pages with the researchers' resumes in the first place? I have experience with data scraping, but I have never faced a captcha problem like this.

Also, I do not have a list of the IDs of all the resumes I want. Each resume has two unique IDs. In the case of this one, the IDs are 4706332670277273 and K4727050Y3, each accessed through a different URL:

Although the IDs differ, both pages have the same content.

What can I do in this case? I believe getting the list of resumes I want is not hard: this link points to an index with more than 5 million Lattes resumes, which I could crawl and scrape to obtain the IDs I need.
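To illustrate the ID-collection step, here is a minimal sketch in Python. The `lattes.cnpq.br/<16 digits>` link pattern is my assumption based on the numeric ID above, not a confirmed description of the index pages:

```python
import re

# Assumed link pattern: resume URLs embedding a 16-digit numeric ID.
ID_PATTERN = re.compile(r"lattes\.cnpq\.br/(\d{16})")

def extract_ids(html):
    """Collect unique 16-digit curriculum IDs from a crawled HTML page."""
    seen = []
    for match in ID_PATTERN.finditer(html):
        cid = match.group(1)
        if cid not in seen:
            seen.append(cid)
    return seen

# Demonstration with an inline HTML fragment (no network access needed):
sample = (
    '<a href="http://lattes.cnpq.br/4706332670277273">Resume A</a>'
    '<a href="http://lattes.cnpq.br/4706332670277273">Resume A (duplicate)</a>'
)
print(extract_ids(sample))  # ['4706332670277273']
```

The regex would need adjusting once the real structure of the index pages is known.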

So, my problem is to download the resume data (i.e., pages like the one linked) automatically, without having to enter the captcha by hand. How can I do this? R or Python, either is fine.

asked by anonymous 18.04.2018 / 17:32

1 answer


I do not know the specifics of the Lattes captcha system, but I will try to give a broad answer.

In general, the ideal is to scrape the HTML alone with requests and BeautifulSoup, as you mentioned (or with my new favorite library for this, requests-html). This approach is preferable because it consumes little processing power and little bandwidth: it only makes HTTP requests and parses HTML, without loading images, scripts, and so on.
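A minimal sketch of that plain-HTML approach, with the parsing kept separate from the fetching (useful here, since a live request would just return the captcha page). The `h2.nome` selector is an assumption for illustration, not the real Lattes markup:

```python
from bs4 import BeautifulSoup

def parse_name(html):
    """Extract the researcher's name from a resume page.
    The <h2 class="nome"> selector is hypothetical."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h2", class_="nome")
    return tag.get_text(strip=True) if tag else ""

# Demonstration with an inline fragment instead of a live request:
sample = '<html><body><h2 class="nome">Fulana de Tal</h2></body></html>'
print(parse_name(sample))  # Fulana de Tal
```

Keeping parsers as pure functions of the HTML also lets you test them offline against saved pages.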

Unfortunately, captchas are designed precisely to prevent this kind of scraping, and they are effective. Getting around them requires a bit more machinery. selenium is a browser driver: it gives you a "zombie" browser and an API to control it programmatically (click this button, go to this address, and so on).
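A sketch of that selenium approach. The `lattes.cnpq.br/<ID>` URL pattern is an assumption based on the ID quoted in the question, and running this for real requires selenium plus a matching driver (e.g. geckodriver), which is why the import is deferred:

```python
def build_resume_url(curriculum_id):
    """Build the public resume URL for a 16-digit Lattes ID (assumed pattern)."""
    return "http://lattes.cnpq.br/" + curriculum_id

def fetch_with_browser(curriculum_id):
    """Open the resume in a real browser so the captcha page can be handled.
    Requires selenium and a browser driver installed."""
    from selenium import webdriver  # deferred: optional dependency
    driver = webdriver.Firefox()
    try:
        driver.get(build_resume_url(curriculum_id))
        # At this point the captcha page may appear; the whole window can be
        # captured with driver.save_screenshot("page.png") for the next step.
        return driver.page_source
    finally:
        driver.quit()

print(build_resume_url("4706332670277273"))
```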

Even so, it does not get through captchas by itself. The usual solution is to know where the captcha appears on the page, take a screenshot of the browser in that area, and then either apply a computer-vision/OCR algorithm, if the captcha is weak, or use a captcha-solving service (you send the image to the service and receive back the text it contains).
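The "screenshot of that area" step is just arithmetic on the element's position as selenium reports it (`element.location` is a dict with `x`/`y`, `element.size` has `width`/`height`). A small sketch of computing the crop box, which could then be fed to something like PIL's `Image.crop`:

```python
def captcha_crop_box(location, size):
    """Given element.location ({'x', 'y'}) and element.size
    ({'width', 'height'}) as selenium reports them, return the
    (left, top, right, bottom) box to crop from a full-page screenshot."""
    left = location["x"]
    top = location["y"]
    return (left, top, left + size["width"], top + size["height"])

# Example: a captcha image at (100, 200) measuring 300x80 pixels
print(captcha_crop_box({"x": 100, "y": 200}, {"width": 300, "height": 80}))
# (100, 200, 400, 280)
```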

These options are obviously not ideal: running a full browser consumes far more resources, both on your machine and on the Lattes server, because it loads images, CSS, and scripts; and captcha-solving services, though cheap, are not free. It is worth analyzing the site to see whether there is any way around the captcha.

18.04.2018 / 18:37