Protect web pages from automated access


How can I protect my web pages from being accessed in an automated way?

  • By search engine bots like Googlebot (I think the basic form was the meta tag with noindex and nofollow).
  • By headless browsers (browsers without a graphical interface that respond to commands via the command line and/or scripts, and can access thousands of pages in batch).
  • By hand-made scripts (usually in PHP, of which I have a little knowledge) that can access thousands of pages using common functions such as file_get_html or file_get_contents .

Note: in the last two cases it is possible to set the HTTP user_agent field so that the script/headless browser passes itself off as a common browser such as Firefox.

Note 2: Related question: What does this anti-theft code in JavaScript do?

asked by anonymous 20.05.2015 / 13:57

1 answer


Blocking access by search engine bots is very different from the other cases. The former respect the rules you create, while the others try to circumvent any rule... it is a game of cat and mouse.

Because restricting access by official search engines is trivial and extensively documented, I am going to focus on methods that make life harder for the other, unregulated web crawlers.

Do not use sequential URLs

Watch out for pages served in a format like www.site.com/dados.php?id=100. Writing a script that downloads a batch of data from such a site is as easy as this one-line command in a UNIX terminal: curl -O "www.site.com/dados.php?id=[100-1000]" .
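One way to avoid this is to reference records by an opaque, random token instead of the numeric primary key. A minimal sketch in PHP (the table and column names are hypothetical; random_bytes requires PHP 7, on older versions openssl_random_pseudo_bytes can be used instead):

    <?php
    // Generate an opaque public token for each record instead of exposing
    // id=100, id=101, ... Store it with the record and use it in the URL,
    // e.g. dados.php?token=<32 hex chars>.
    function gerarToken(): string {
        return bin2hex(random_bytes(16)); // 32 hex characters, not enumerable
    }

    // Look the record up by its token (PDO; names are hypothetical).
    function buscarDado(PDO $pdo, string $token) {
        $stmt = $pdo->prepare('SELECT * FROM dados WHERE token = :token');
        $stmt->execute([':token' => $token]);
        return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
    }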

Load content via AJAX

This prevents simple scripts (whether in Bash, PHP, Python, etc.) from reaching the content, since they have no JavaScript interpreter (some do not even have an HTML parser): they just download the page over HTTP. It is even part of SEO best practice to avoid pages that rely heavily on AJAX, since it is hard for Google to index them correctly.

But be careful: an AJAX endpoint that returns JSON ready to be parsed can actually make the crawler's job easier. You should implement CSRF tokens so that the JSON/XML is served only to clients that have already loaded the main page; otherwise you will be helping rather than hindering them.
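A minimal sketch of that check in PHP, assuming PHP sessions and a custom request header (the file and header names here are hypothetical):

    <?php
    // dados_ajax.php - JSON endpoint protected by a CSRF token.
    // The main page creates the token once per session and embeds it in the
    // HTML, e.g.: $_SESSION['csrf_token'] = bin2hex(random_bytes(16));
    // the AJAX call then sends it back in an X-CSRF-Token header.
    session_start();

    $enviado = isset($_SERVER['HTTP_X_CSRF_TOKEN']) ? $_SERVER['HTTP_X_CSRF_TOKEN'] : '';
    if (empty($_SESSION['csrf_token']) || !hash_equals($_SESSION['csrf_token'], $enviado)) {
        http_response_code(403); // main page was never loaded: refuse the JSON
        exit;
    }

    header('Content-Type: application/json');
    echo json_encode(['conteudo' => 'dados restritos']);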

However, nothing prevents someone minimally determined from using a headless browser such as PhantomJS , which is able to interpret JavaScript and load the entire page.

Captcha

An image with distorted letters on a psychedelic background will weed out many crawlers. Still, this most popular method of telling humans apart is not infallible.

There are OCRs capable of reading captchas, but they are laborious to program and specific to each captcha-generating mechanism. Slight changes to the captcha algorithm may require a lot of work to update the OCR, which, depending on how often you make such changes, can make the whole procedure unfeasible for the attacker.

There are also services that specialize in solving captchas, such as DBC and DeCaptcher . They charge a few bucks per thousand captchas solved. Their advantage is that they can break any captcha, even the old version of Google's reCAPTCHA, considered unbreakable for some time. That is because there is no robot trying to pass as human: these services employ cheap labor in low-wage countries, with people typing in the letters 24/7.
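If you adopt a third-party captcha such as Google's reCAPTCHA, the server-side check is a single call to its verification endpoint. A minimal PHP sketch (the secret key is a placeholder you receive when registering the site):

    <?php
    // Verify the answer the reCAPTCHA widget posted as 'g-recaptcha-response'.
    $segredo  = 'SUA-CHAVE-SECRETA'; // placeholder
    $resposta = isset($_POST['g-recaptcha-response']) ? $_POST['g-recaptcha-response'] : '';

    $contexto = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query([
            'secret'   => $segredo,
            'response' => $resposta,
            'remoteip' => $_SERVER['REMOTE_ADDR'],
        ]),
    ]]);
    $retorno   = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $contexto);
    $resultado = json_decode($retorno, true);

    if (empty($resultado['success'])) {
        http_response_code(403);
        exit('Captcha inválido.'); // probably a bot
    }
    // ...otherwise serve the protected content

Note that the verification lives on the server; keeping the secret key there is what prevents a bot from validating itself.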

IP Blocking

This mechanism is fundamental. It is a science in itself, aimed at separating the wheat from the chaff, that is, the robot from the human, through patterns of "behavior". I recommend the excellent Coding Horror article that covers the theory of the subject with great analogies.

On your site, this can be implemented through middleware, if you are using a framework, or "by hand", using fail2ban .
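For the middleware route, here is a minimal rate-limiting sketch in PHP, counting requests per IP in a short window (the limit, window and file-based storage are arbitrary choices for illustration; fail2ban, or a shared store such as Redis behind real middleware, is more robust in production):

    <?php
    // Allow at most $limite requests per $janela seconds for each IP.
    function excedeuLimite($ip, $limite = 60, $janela = 60) {
        $arquivo = sys_get_temp_dir() . '/rl_' . md5($ip);
        $agora   = time();

        $dados = array('inicio' => $agora, 'contagem' => 0);
        if (is_file($arquivo)) {
            $lido = json_decode(file_get_contents($arquivo), true);
            if (is_array($lido) && ($agora - $lido['inicio']) < $janela) {
                $dados = $lido; // still inside the current window
            }
        }

        $dados['contagem']++;
        file_put_contents($arquivo, json_encode($dados), LOCK_EX);

        return $dados['contagem'] > $limite;
    }

    if (excedeuLimite($_SERVER['REMOTE_ADDR'])) {
        http_response_code(429); // Too Many Requests
        exit('Muitas requisições; tente novamente mais tarde.');
    }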

This method can be circumvented if the bot uses proxies. But in that case, the cost factor is going to be higher for the attacker, as his engine will burn the IPs he hired to use.

By combining these methods, you will keep out a great many crawlers.

But as I have shown, it is impossible to know for certain whether a request was made by a robot or by a human. Even with all these measures in place, in the end everything depends on the cost-benefit calculation made by the bot's author (cost both in money and in effort). That is why anyone who does not want to be scraped by bots needs someone monitoring access, watching for abuse and reinventing blocking techniques as the bots learn to circumvent the old ones. A game of cat and mouse.

22.05.2015 / 01:12