How to monitor a URL for changes?


I have a system that needs to compare multiple values between multiple sites. These values are read from an XML file provided by each site in question.

The problem is that the usual URL reading via cURL works for a single site; in my case, however, there are numerous sites.

After getting the information, I need to compare it and that's the problem.

It's getting slower every time a new site is added. I'm currently doing cron jobs + cURL in PHP.

    
asked by anonymous 06.04.2018 / 19:55

1 answer


URL tracking cannot be done without querying the URL.

Unless the site notifies you that there has been a change, you will only know by consulting it.

I'll explain some ideas on how to design this.

First, let's consider that you have separate resources, which are:

  • Controller of the sites that will be monitored;
  • Site Crawler;
  • Comparator.

Consider the following: the Controller does the work of knowing which sites need to be queried, when, and to whom it should pass the job.

The Controller will be in the Crontab; however, it will not do the crawling itself, it will pass that responsibility to the Crawler. That way, you can have multiple queries running at the same time.

The Comparator is independent and can be fired in whatever way you prefer, so it does not interfere with anything else.

I separated the resources like this so that nothing becomes rigid and overly dependent, and so you can easily split them across other servers if the project grows.

A starting point:

Consider this to be the Site Controller:

    $sites = ['site1', 'site2', 'site3'];

    foreach ($sites as $site) {
        // Here you hand the site to be queried over to the crawler.
        // You could do this in a method inside this same file, but that would not allow parallel execution.
        // For this to work, create a PHP script that does the crawling and call it here without waiting for it to return. E.g.:

        // Pass the site as a CLI argument (a query string does not work on the command line)
        // and redirect the output so shell_exec() returns immediately.
        shell_exec('php crawleador.php ' . escapeshellarg($site) . ' > /dev/null 2>&1 &');

        // This way the foreach finishes very quickly and the spawned crawlers do their work on their own.
    }
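
To put the Controller in the crontab, the entry could look like the sketch below, assuming a hypothetical path of /var/www/monitor/controller.php and a one-minute polling interval:

    # Run the Controller every minute; it only dispatches the crawlers and exits quickly.
    * * * * * php /var/www/monitor/controller.php >> /var/log/monitor.log 2>&1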
    

Then you work on your crawler:

    // The site arrives as a CLI argument, passed by the Controller above.
    $url = $argv[1];
    // Here you implement the logic of your analyzer and store that information somewhere (MySQL?).
    // This file runs because the previous one launched it. Since there are multiple links, you will have several sessions running independently.
    // end
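
A minimal sketch of what that analyzer could look like, assuming the XML exposes the values you compare as child elements; the element handling and the saveValues() helper are hypothetical:

    $url = $argv[1];

    // Fetch the XML of this one site with plain cURL.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body === false) {
        exit(1); // request failed, nothing to analyze
    }

    // Parse the XML and pull out the values to be compared.
    $xml = simplexml_load_string($body);
    if ($xml === false) {
        exit(1); // not valid XML
    }

    $values = [];
    foreach ($xml->children() as $child) {
        $values[$child->getName()] = (string) $child;
    }

    // Store the values (and/or a hash of $body) somewhere the Comparator can read them.
    // saveValues() is a hypothetical helper: it could write to MySQL, Redis, a file, etc.
    // saveValues($url, $values, md5($body));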
    

Another alternative would be to implement some multi-threaded or parallel feature in your PHP; it would probably perform even better.
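
One way to get that parallelism without real threads is cURL's multi interface, which runs several requests at once from a single script; a minimal sketch, with hypothetical URLs:

    $sites = ['https://site1.example/data.xml', 'https://site2.example/data.xml'];

    $multi = curl_multi_init();
    $handles = [];

    foreach ($sites as $site) {
        $ch = curl_init($site);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($multi, $ch);
        $handles[$site] = $ch;
    }

    // Drive all the requests in parallel until every one has finished.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi);
        }
    } while ($running && $status === CURLM_OK);

    foreach ($handles as $site => $ch) {
        $xml = curl_multi_getcontent($ch);
        // ... hand $xml to the analyzer/comparator, or store its hash ...
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }

    curl_multi_close($multi);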

If what you need is just to detect whether something changed, one idea that can simplify your parser is to use hashes. For example:

If you compute the md5sum of a file twice, the result is the same.

If the file changes, the md5 will be different.

I suggest you make the comparisons with this hash; perhaps even storing only the hash of each query can make your database leaner and the analysis process faster.
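
A minimal sketch of that comparison, assuming the previous hash is kept in a MySQL table accessed via PDO; the table name, columns and connection details are hypothetical:

    // Hypothetical table: site_hashes(site VARCHAR(255) PRIMARY KEY, hash CHAR(32)).
    $pdo = new PDO('mysql:host=localhost;dbname=monitor', 'user', 'pass');

    function hasChanged(PDO $pdo, string $site, string $xml): bool
    {
        $newHash = md5($xml);

        // Look up the hash stored by the previous run, if any.
        $stmt = $pdo->prepare('SELECT hash FROM site_hashes WHERE site = ?');
        $stmt->execute([$site]);
        $oldHash = $stmt->fetchColumn();

        if ($oldHash === $newHash) {
            return false; // nothing changed, no need to run the full comparison
        }

        // Record the new hash (insert or update) and report that the content changed.
        $upsert = $pdo->prepare(
            'INSERT INTO site_hashes (site, hash) VALUES (?, ?)
             ON DUPLICATE KEY UPDATE hash = VALUES(hash)'
        );
        $upsert->execute([$site, $newHash]);

        return true;
    }

    // Usage inside the crawler: only run the heavy comparison when something actually changed.
    // if (hasChanged($pdo, $url, $body)) { ... }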

        
answered 06.04.2018 / 20:14