Crawler fails when http_status_code is different from 200


I'm building a mini crawler in PHP, using the "PHPCrawl" library for the crawling and the "simple_html_dom_parser" library for the HTML parsing. The problem: simple_html_dom cannot parse the page when http_status_code (a variable that comes from PHPCrawl) is anything other than 200, and the script stops with: Fatal error: Call to a member function find() on boolean in C:\xampp\htdocs\PHP\Crawler\modules\admin\controllers\Crawler.php on line 14

PHP code:

<?php
/* Connection settings */
set_time_limit(10000);

require_once '../../../library/PHPCrawl_083/libs/PHPCrawler.class.php';
require_once '../../../library/Simple_HTML_DOM/simple_html_dom.php';

//Extend the Class and Override the handleDocumentInfo() Method
class Crawler extends PHPCrawler{
    function handleDocumentInfo($DocInfo){
        echo '*******************************'.'<br />';
        //Print Page Title
        $html = str_get_html($DocInfo->content);
        $title = $html->find('title');
        echo $title[0]->plaintext.'<br />';

        //Print the URL and the HTTP-status-Code 
        echo 'Page requested: '.$DocInfo->url.' ('.$DocInfo->http_status_code.')'.'<br />';

        //Print the referring URL
        echo 'Referer-page: '.$DocInfo->referer_url.'<br />';
        echo '*******************************'.'<br />';

        //Print whether the content of the document was received or not
        if($DocInfo->received == true){
            echo "Content received: ".$DocInfo->bytes_received." bytes".'<br />';
        }
        else{
            echo "Content not received".'<br />';
        }
        echo '<br /><br />';
    }
}

$crawler = new Crawler();

//URL to crawl 
$crawler->setURL("http://php.net/docs.php");

//Only receive content of files with content-type "text/html" 
$crawler->addContentTypeReceiveRule("#text/html#"); 

//Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); 

//Store and send cookie-data like a browser does 
$crawler->enableCookieHandling(true); 

//Set the traffic-limit to 1 MB (in bytes,
//for testing we don't want to "suck" the whole site)
$crawler->setTrafficLimit(1000 * 1024); 

//Set a depth limit 
$crawler->setCrawlingDepthLimit(2);

//Crawler will search for links only in href attributes
$crawler->setLinkExtractionTags(array("href"));

//Crawler will search for links only inside <tags>
$crawler->enableAggressiveLinkSearch(false);

//Set the timeout for establishing a connection
$crawler->setConnectionTimeout(60);

//Set the timeout for the server to send data
$crawler->setStreamTimeout(60);

//Start the Crawl process
$crawler->go();

// At the end, after the process is finished, we print a short 
// report (see method getProcessReport() for more information) 
$report = $crawler->getProcessReport();

echo "Summary:".'<br />'; 
echo "Links followed: ".$report->links_followed.'<br />'; 
echo "Documents received: ".$report->files_received.'<br />'; 
echo "Bytes received: ".$report->bytes_received." bytes".'<br />'; 
echo "Process runtime: ".$report->process_runtime." sec".'<br />';
?>

Part of the output printed in the browser:

*******************************
PHP: Context options and parameters - Manual 
Page requested: http://php.net/manual/en/context.php (200)
Referer-page: http://php.net
*******************************
Content received: 20056 bytes


Summary:
Links followed: 27
Documents received: 23
Bytes received: 1034007 bytes
Process runtime: 69.525975942612 sec
    
asked by anonymous 10.02.2015 / 14:54

1 answer


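As stated in the comment, all that was needed was a conditional on the desired status code, so that the page is parsed only when http_status_code is 200. For other responses str_get_html() can return false (for example, when the content is empty), and calling find() on false is what triggers the fatal error.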

function handleDocumentInfo($DocInfo){
    if ($DocInfo->http_status_code == 200){
        echo '*******************************'.'<br />';
        //Print Page Title
        $html = str_get_html($DocInfo->content);
        $title = $html->find('title');
        echo $title[0]->plaintext.'<br />';

        //Print the URL and the HTTP-status-Code 
        echo 'Page requested: '.$DocInfo->url.' ('.$DocInfo->http_status_code.')'.'<br />';

        //Print the referring URL
        echo 'Referer-page: '.$DocInfo->referer_url.'<br />';
        echo '*******************************'.'<br />';

        //Print whether the content of the document was received or not
        if($DocInfo->received == true){
            echo "Content received: ".$DocInfo->bytes_received." bytes".'<br />';
        }
        else{
            echo "Content not received".'<br />';
        }
        echo '<br /><br />';
    }
}
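
A slightly more defensive variation, offered only as a sketch: instead of relying on the status code alone, it checks the parser's return value, so a 200 response with empty content is also skipped. The function name extractTitle() and its $parse parameter are hypothetical and belong to neither PHPCrawl nor simple_html_dom. The sketch assumes the parser returns false on failure, as the error message in the question suggests str_get_html() does.

```php
<?php
/**
 * Hypothetical helper (not part of PHPCrawl or simple_html_dom).
 * Returns the page title, or null when the content cannot be parsed.
 *
 * $parse is any callable that behaves like str_get_html(): it returns
 * an object with a find() method on success and false on failure.
 */
function extractTitle($content, callable $parse)
{
    // Nothing to parse: no body received, or content is not a string.
    if (!is_string($content) || $content === '') {
        return null;
    }

    $html = $parse($content);

    // Assumption: the parser signals failure with false (or null).
    if ($html === false || $html === null) {
        return null;
    }

    // find() without an index is expected to return an array of nodes.
    $titles = $html->find('title');
    if (empty($titles)) {
        return null;
    }

    return trim($titles[0]->plaintext);
}
```

Inside handleDocumentInfo() this could be called as extractTitle($DocInfo->content, 'str_get_html'), printing the title only when the result is not null. The status code can then still be printed for every page, including the ones that were not parsed.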
    
20.05.2015 / 01:14