Error fetching external site content

0

I have code that fetches content from an external site (G1). It works, but it brings the page's CSS along with it, so I cannot style the content to match my own site. While searching I found another approach, but it gives me the error below and I do not know how to fix it.

Notice: Undefined offset: 2 in C:\xampp\htdocs\ruralrio\blog.php on line 15

My code is:

<?php

$url_base = "http://g1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-mcdonalds.html";
$texto = preg_replace("/((\r\n|\t)+|\s{2,})/", "",
file_get_contents($url_base));

preg_match('/<title>(.*)<\/title>/i', stripslashes($texto), $titulo);
preg_match('/<h1 class="entry-title">(.*)<\/h1>/i', stripslashes($texto), $titulomateria);
preg_match('/<h2>(.*)<\/h2>/i', stripslashes($texto), $titulomateria2);
preg_match('/<div class="materia-conteudo entry-content" id="materia-letra">(.*)<\/div>/i', stripslashes($texto), $titulomateria3);

echo strip_tags($titulo[1]) . "<br /><br />";
echo strip_tags($titulomateria[1]) . "<br /><br />";
echo strip_tags($titulomateria2[1]) . "<br /><br />";
echo strip_tags($titulomateria3[2]) . "<br /><br />";

?>
    
asked by anonymous 03.07.2015 / 20:51

1 answer

1

Possible issues:

  • file_get_contents may not be allowed to access external URLs. To fix this:

    Edit php.ini and change allow_url_fopen=0 to allow_url_fopen=1.
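
    To confirm the setting before fetching, the value can be read at runtime. This is a minimal sketch, assuming a hypothetical helper named url_fopen_enabled (it is not part of PHP or of the original answer):

```php
<?php
// Hypothetical helper: reports whether file_get_contents() may open http:// URLs.
function url_fopen_enabled()
{
    // ini_get() returns the setting as a string ("1", "On", "0", "", ...) or false.
    // FILTER_VALIDATE_BOOLEAN maps "1", "on", "true" and "yes" to true.
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if (!url_fopen_enabled()) {
    echo 'allow_url_fopen is disabled; enable it in php.ini or use cURL instead.';
}
```

    allow_url_fopen can only be changed in php.ini or the server configuration, not with ini_set at runtime, so a check like this can only report the current state.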

  • The remote server may require a User-Agent header, so pass file_get_contents a stream context that sets one, something like:

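    $headers = array(
        'Accept-language: pt-br',
        // Fall back to a generic value when no User-Agent is available (e.g. CLI)
        'User-Agent: ' . (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0')
    );
    
    $opts = array(
        'http'=>array(
            'method' => 'GET',
            // HTTP header lines are separated by CRLF
            'header' => implode("\r\n", $headers)
        )
    );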
    
    $context = stream_context_create($opts);
    
    $texto = file_get_contents('http://g1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-mcdonalds.html', false, $context);
    
  • Instead of using preg_match, try parsing the page with DOM, for example:

    $doc = new DOMDocument();
    
    // Suppress libxml warnings and remember the previous setting
    $libxml_previous_state = libxml_use_internal_errors(true);
    
    // Parse the HTML string
    $doc->loadHTML($texto);
    
    // Clear the collected errors
    libxml_clear_errors();
    
    // Restore the previous setting
    libxml_use_internal_errors($libxml_previous_state);
    


    Then use methods such as getElementsByTagName and getElementById, or DOMXPath (which makes queries easier).
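
    As a small illustration of those three lookups, here is a sketch that runs on a made-up HTML string instead of the real page. The markup and the variable names are assumptions for the example:

```php
<?php
// Made-up sample markup standing in for the downloaded page.
$html = '<html><head><title>Sample title</title></head><body>'
      . '<h1 class="entry-title">Headline</h1>'
      . '<div id="materia-letra"><p>First paragraph.</p><p>Second paragraph.</p></div>'
      . '</body></html>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementsByTagName returns a DOMNodeList; item(0) is the first match.
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;

// DOMXPath can filter on attributes such as class.
$xpath = new DOMXPath($doc);
$headline = $xpath->query('//h1[contains(@class, "entry-title")]')->item(0)->nodeValue;

// getElementById returns null when the id is missing, so check before using it.
$body = $doc->getElementById('materia-letra');
$paragraphs = array();
if ($body !== null) {
    foreach ($body->getElementsByTagName('p') as $p) {
        $paragraphs[] = trim($p->nodeValue);
    }
}

echo $title, "\n", $headline, "\n", implode("\n", $paragraphs), "\n";
```

    Because nodeValue holds only the text of a node, the output carries none of the original page's CSS or markup and can be wrapped in your own HTML.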

  • The final code should look something like:

    <?php
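    $headers = array(
        'Accept-language: pt-br',
        // Fall back to a generic value when no User-Agent is available (e.g. CLI)
        'User-Agent: ' . (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0')
    );
    
    $opts = array(
        'http'=>array(
            'method' => 'GET',
            // HTTP header lines are separated by CRLF
            'header' => implode("\r\n", $headers)
        )
    );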
    
    $context = stream_context_create($opts);
    
    $texto = file_get_contents('http://g1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-mcdonalds.html', false, $context);
    
    $doc = new DOMDocument();
    
    // modify state
    $libxml_previous_state = libxml_use_internal_errors(true);
    
    // parse
    $doc->loadHTML($texto);
    
    // clear the collected errors
    libxml_clear_errors();
    
    // restore
    libxml_use_internal_errors($libxml_previous_state);
    
    $tmp = $doc->getElementsByTagName('title');
    
    foreach ($tmp as $value) {
        echo 'Titulo:', $value->nodeValue, '<br>';
    }
    
    $xpath = new DOMXPath($doc);
    
    $tmp = $xpath->query('//h1[contains(@class,"entry")]');
    
    foreach ($tmp as $value) {
        echo 'h1.entry:', $value->nodeValue, '<br>';
    }
    
    $tmp = $doc->getElementsByTagName('h2');
    
    foreach ($tmp as $value) {
        echo 'h2:', $value->nodeValue, '<br>';
    }
    
    $materia = $doc->getElementById('materia-letra');
    
    // getElementById returns null when the id is not found
    if ($materia !== null) {
        foreach ($materia->getElementsByTagName('div') as $value) {
            echo '#materia-letra:', $value->nodeValue, '<br>';
        }
    }
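
  • The notice itself most likely comes from the last echo in the question, which is line 15 of the posted script. Each pattern there has a single capture group, so preg_match fills only indexes 0 and 1 and $titulomateria3[2] does not exist. Below is a minimal sketch, assuming a made-up HTML snippet and a hypothetical helper named first_group, that checks the match before reading it:

```php
<?php
// Hypothetical helper: returns the first capture group, or null when there is no match.
function first_group($pattern, $subject)
{
    // preg_match returns 1 on a match, 0 on no match and false on error.
    if (preg_match($pattern, $subject, $matches) === 1 && isset($matches[1])) {
        return $matches[1];
    }
    return null;
}

// Made-up sample standing in for the downloaded page.
$sample  = '<div class="materia-conteudo entry-content" id="materia-letra">Body text</div>';
$pattern = '/<div class="materia-conteudo entry-content" id="materia-letra">(.*)<\/div>/i';

preg_match($pattern, $sample, $m);
// $m holds the full match at index 0 and the single group at index 1; there is no index 2.
echo count($m), "\n";

$text = first_group($pattern, $sample);
echo $text === null ? 'not found' : strip_tags($text), "\n";
```

    With this kind of check a page that lacks the expected markup yields "not found" rather than a notice.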
    
        
    03.07.2015 / 21:13