Content crawler for external pages with PHP [closed]


I was given the task of creating a script that captures the price, image, and description of products from sites specified by the application administrator. The structure of each of these sites is different, and the script needs to scan every page under a product category or under the subpages of a given address (e.g. /shirts/, /shirts/black, /shirts/blue). My first thought was to use DOMXPath + cURL from PHP to search for the product-related areas of each page, but that does not seem like the right approach.
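For example, this is the kind of thing I tried (the URL and the XPath selector here are just placeholders; every site would need its own):

<?php
// Fetch a category page with cURL (placeholder URL)
$ch = curl_init("https://example.com/shirts/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML, tolerating the malformed markup of real-world pages
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Query for product areas (hypothetical selector: any element whose
// class attribute contains "price")
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[contains(@class, "price")]') as $node) {
    echo trim($node->textContent), "\n";
}
?>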

Could you tell me where to start and what to use to build something like this?

    
asked by anonymous 06.03.2014 / 19:38

1 answer


What you want to build here is a web crawler.

There is a PHP library for creating web crawlers: PHPCrawl (link).

Below is the example from the project's site:

<?php

// Maximum running time for the crawler, in seconds
set_time_limit(10000);

// Include the main class
include("libs/PHPCrawler.class.php");

// Extend the main class and override the handleDocumentInfo() method
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Pick the line break for the output ("\n" in CLI mode, "<br />" otherwise)
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP status code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print whether the document's content was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // The page content is available in $DocInfo->source

    echo $lb;

    flush();
  }
}

// Create an instance of your class, configure the crawler's behavior
// and start the process.

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only crawl documents with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore images
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store cookies
$crawler->enableCookieHandling(true);

// Download at most 1 megabyte from the site (no need to download everything)
$crawler->setTrafficLimit(1000 * 1024);

// If everything is OK, just call the go() method
$crawler->go();

// To print a report of the process, use the method below
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
?>
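To connect this to your original goal: inside handleDocumentInfo() the full HTML of each crawled page is available in $DocInfo->source, so you can run DOMXPath over it to extract prices, images, and descriptions. A minimal sketch, assuming a hypothetical "price" class on the target site (each site will need its own selectors):

<?php
include("libs/PHPCrawler.class.php");

class ProductCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Skip pages whose content was not received
    if (!$DocInfo->received) return;

    // Parse the HTML that PHPCrawl already downloaded
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($DocInfo->source);
    libxml_clear_errors();

    // Hypothetical selector; adjust per site
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[contains(@class, "price")]') as $node) {
      echo $DocInfo->url.": ".trim($node->textContent)."\n";
    }
  }
}

$crawler = new ProductCrawler();
$crawler->setURL("www.example.com/shirts/");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();
?>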
    
answered 06.03.2014 / 21:03