capture information from sites [closed]

3

How do Buscapé and other price-comparison sites get information from other sites? Is it through cURL, or through an XML file that the store sites make available?

    
asked by anonymous 04.05.2015 / 16:07

3 answers

6

There are several ways and techniques to capture information from other sites. The name given to this technique is parsing (many programmers here loosely say "parse the site", which is not quite accurate). If a site offers XML for searching, the work of the site's engineers drops significantly, because the XML already contains properly formatted tags, making the job faster in PHP, since simplexml_load_file is very fast and easy to use.
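A minimal sketch of that call, with a hypothetical XML feed of products (the URL and tag names are illustrative only):

    // Hypothetical URL and tag names; simplexml_load_file returns a SimpleXMLElement
    $xml = simplexml_load_file('http://example.com/produtos.xml');

    foreach ($xml->produto as $produto) {
        echo $produto->nome . ' - ' . $produto->preco . '<br />';
    }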

But if the site does not offer such a file, the solution can be crawling to collect the links, or cURL itself. cURL only serves to send the request (via POST or GET) and retrieve the HTML from the remote server; to actually extract data from that HTML you can use DOMDocument, which is what I use most, together with DOMXPath, a companion DOM class for querying the HTML. There is also the Simple HTML DOM Parser library.

Here's an example I just wrote to show you, capturing data from G1:

    // Suppress libxml warnings, since real-world HTML is rarely perfectly valid
    libxml_use_internal_errors(true);
    libxml_clear_errors();

    // Forward the visitor's IP so the request looks like an ordinary browser hit
    $header = "X-Forwarded-For: {$_SERVER['REMOTE_ADDR']}";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://g1.globo.com/bemestar/noticia/2011/03/medica-orienta-sobre-o-que-fazer-em-caso-de-dor-de-ouvido-e-como-evita-la.html");
    curl_setopt($ch, CURLOPT_REFERER, "http://g1.globo.com");
    curl_setopt($ch, CURLOPT_HTTPHEADER, array($header));
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML instead of echoing it
    $html = curl_exec($ch);
    curl_close($ch);

    // Parse the downloaded HTML and query it with XPath
    $DOM = new DOMDocument();
    $DOM->loadHTML($html);
    $xpath = new DOMXPath($DOM);
    $titulo = $xpath->query('//input[@name="materia_titulo"]/@value')->item(0);
    $letra = $xpath->query('//div[@id="materia-letra"]')->item(0);

    echo "Título da matéria: " . $titulo->nodeValue . "<p>" . "Conteúdo da matéria: " . $letra->nodeValue;
    
11.05.2015 / 21:42
5

It depends on the site; there is no single generic approach.

You can get information from:

  • sitemaps
  • information feeds (JSON, for example; see the sketch after this list)
  • APIs
  • crawling through pages and site links
  • other mechanisms ...
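As a minimal illustration of the feed approach, a JSON feed can be consumed with PHP's native functions. The URL and field names below are hypothetical:

    // Hypothetical feed URL and field names, just to illustrate the idea
    $json = file_get_contents('http://example.com/produtos.json');
    $dados = json_decode($json, true);

    foreach ($dados as $produto) {
        echo $produto['nome'] . ' - ' . $produto['preco'] . '<br />';
    }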
04.05.2015 / 16:14
3

There are several alternatives for fetching content from a site:

  • Parsing the site: literally download the HTML and inspect the DOM elements of the page. A PHP library for this purpose is Simple HTML DOM Parser; see the sketch right below:
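A minimal sketch of this approach, assuming the library's simple_html_dom.php file is included (the URL below is hypothetical):

    include 'simple_html_dom.php';

    // file_get_html() downloads the page and returns a DOM-like object
    $html = file_get_html('http://example.com/');

    // find() accepts CSS-style selectors; here, every link on the page
    foreach ($html->find('a') as $link) {
        echo $link->href . '<br />';
    }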
  • Parsing the XML provided by the site: you can use PHP's native functions for this; see the example below:

    // $feedLink holds the URL of the site's RSS/XML feed
    $feed = simplexml_load_file($feedLink, 'SimpleXMLElement', LIBXML_NOCDATA);

    $limit = 10; // stop after this many items
    $count = 0;

    foreach ($feed->channel->item as $item) {
        if ($count == $limit) {
            break;
        }
        echo $item->link . '<br />';
        echo $item->title . '<br />';
        echo $item->description . '<br />';
        echo $item->pubDate . '<br />';
        echo '<br />------------------<br /><br />';
        $count++;
    }
    
  • Crawling: follow the links of a site; it is used in conjunction with a parser (which extracts the information from each page). A PHP library for this purpose is PHPCrawl. The general idea, hand-rolled, looks like the sketch below:
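A minimal hand-rolled illustration of the crawling idea (plain DOMDocument rather than PHPCrawl's API; the start URL is hypothetical), collecting a page's links so a parser can then visit each one:

    // Hypothetical starting point; a real crawler would also track visited URLs
    $start = 'http://example.com/';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML(file_get_contents($start));

    // Collect every href on the page; these become the crawl queue
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    print_r($links);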

04.05.2015 / 16:30