How to detect a bot


I'm helping a friend develop a paid-visit system, similar to Grana Network or Grana Social. We will pay for real visits to a page, and as we know there are malicious people who will try to cheat the system to profit from fake views, for example by using a fictitious user or a bot (such as HitLeap, https://hitleap.com/).

I need to know how to tell a real view apart from a view made by a bot. I already tried a solution based on HTTP_USER_AGENT but got nothing useful; I also compared it against real views and found nothing I could use.
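Roughly what I tried was something like the sketch below (illustrative only; the substrings in the list are assumptions, not a vetted bot list):

<?php
// Naive user-agent check: look for bot-like substrings.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$suspect = array('bot', 'crawler', 'spider', 'hitleap');
$isBot = false;
foreach ($suspect as $needle) {
    if (stripos($ua, $needle) !== false) {
        $isBot = true;
        break;
    }
}
var_dump($isBot); // a bot that sends a browser-like UA passes this test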

What would be the best way to protect against this kind of abuse? Something like YouTube already does, distinguishing real hits from non-real ones.

Thanks in advance ...

Note: I already know how to spot common crawlers, so please don't show me articles about googlebot.

asked by anonymous 30.05.2016 / 21:29

2 answers


I think the only truly effective way is to use a Captcha; other approaches are easy to circumvent.
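A minimal sketch of that idea, assuming a session-based arithmetic challenge (a real deployment would use an established service such as reCAPTCHA; the field names and the "view counted" message are placeholders):

<?php
session_start();

// Serve a trivial challenge and only count the view once it is answered.
if (isset($_POST['answer'])) {
    $expected = isset($_SESSION['captcha_sum']) ? $_SESSION['captcha_sum'] : -1;
    if ((int) $_POST['answer'] === $expected) {
        echo 'View counted.';   // placeholder for your counting logic
    } else {
        echo 'Wrong answer, view not counted.';
    }
    exit;
}

$a = random_int(1, 9); // random_int requires PHP 7+
$b = random_int(1, 9);
$_SESSION['captcha_sum'] = $a + $b;

echo "<form method='post'>
        How much is {$a} + {$b}?
        <input name='answer'>
        <button>Send</button>
      </form>";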

There are good ways to estimate the number of real visitors; one example is Stack Overflow's own view count. But even that kind of method can be circumvented with distributed bots or proxies.
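A sketch of that kind of estimate, assuming at most one counted view per IP per 15-minute window (the interval, the APCu store, and the key names are all assumptions, and the proxy problem above still applies):

<?php
// Count at most one view per IP within a TTL window.
// Requires the APCu extension.
function countView($pageId)
{
    $ip  = $_SERVER['REMOTE_ADDR'];
    $key = "view:{$pageId}:{$ip}";

    // apcu_add only succeeds if the key does not exist yet,
    // so each IP is counted once per 15 minutes.
    if (apcu_add($key, 1, 15 * 60)) {
        apcu_inc("total:{$pageId}"); // hypothetical total counter
    }
}

countView('pagina-exemplo');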

30.05.2016 / 21:57

One way to do this is by creating rules in .htaccess that block known bot user agents; the drawback is that you need to obtain and maintain a reasonably complete list of these agents:

RewriteEngine on
# Match known crawler user agents (case-insensitive, OR-chained)
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Avoid a redirect loop on the landing page itself
RewriteCond %{REQUEST_URI} !\/sem_crawler.htm
RewriteRule .* http://seusite.com.br/sem_crawler.htm [L]

Another way is by making use of PHP:

<?php
class CrawlerDetect
{
    // list of known robots
    private $agentsInvalids = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        // ...
    );

    // list of valid browsers
    private $agentsValids = array(
        'Mozilla' => 'Mozilla',
        'Chrome'  => 'Chrome',
        'Safari'  => 'Safari',
        'Opera'   => 'Opera',
        // ...
    );

    private $isCrawler;

    public function __construct($userAgent)
    {
        /* here you choose how you prefer;
        I believe testing a single list is enough */
        // flag as a crawler if the UA matches a known robot
        // or matches no known browser
        $this->isCrawler = $this->matchesAny($this->agentsInvalids, $userAgent)
            || !$this->matchesAny($this->agentsValids, $userAgent);
    }

    public function isCrawler()
    {
        return $this->isCrawler;
    }

    // true if any substring from the list occurs in the user agent
    private function matchesAny(array $agents, $userAgent)
    {
        foreach ($agents as $agent) {
            if (stripos($userAgent, $agent) !== false) {
                return true;
            }
        }
        return false;
    }
}

// check the visitor's user agent
$crawler = new CrawlerDetect($_SERVER['HTTP_USER_AGENT']);

// if it looks like a robot, reject the hit
if ($crawler->isCrawler()) {
  echo "invalid access!";
} else {
  echo "valid access!";
}
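Keep in mind that the User-Agent header is entirely client-controlled, so any list-based check can be defeated by a bot that simply claims to be a browser. A quick way to see this (a sketch; the URL is a placeholder):

<?php
// The client chooses its own User-Agent, so a bot can impersonate
// a browser and pass the CrawlerDetect check above.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0) Chrome/50.0 Safari/537.36\r\n",
    ),
));
echo file_get_contents('http://seusite.com.br/', false, $context);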

On sites that catalog user agents you can find complete lists of browsers and crawlers to keep these arrays up to date.

30.05.2016 / 22:56