How do Buscapé and other sites get information from sites? is it through the curl or an xml that the store sites make available?
There are several techniques for getting information from other sites; the general name for this is parsing (many programmers here mistakenly call it "parsing the site"). If the site offers an XML feed, the work of extracting data drops significantly, because the XML already contains correctly formatted tags, which makes the job fast in PHP: simplexml_load_file is very quick and easy to use.
But if the site does not offer such a file, the solution can be crawling to collect the links, or cURL itself. cURL only serves to send a request (for example via POST or GET) and fetch the raw HTML from the remote server. To actually extract data from that HTML, you can use DOMDocument, which is what I use most, together with DOMXPath, a companion class of the DOM extension for querying HTML. There is also Simple HTML DOM Parser.
Here's an example I just made to show you, capturing G1 data:
libxml_use_internal_errors(true);
libxml_clear_errors();
$header = "X-Forwarded-For: {$_SERVER['REMOTE_ADDR']}";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://g1.globo.com/bemestar/noticia/2011/03/medica-orienta-sobre-o-que-fazer-em-caso-de-dor-de-ouvido-e-como-evita-la.html");
curl_setopt($ch, CURLOPT_REFERER, "http://g1.globo.com");
curl_setopt($ch, CURLOPT_HTTPHEADER, array($header));
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$xpath = new DOMXPath($DOM);
$titulo = $xpath->query('//input[@name="materia_titulo"]/@value')->item(0);
$letra = $xpath->query('//div[@id="materia-letra"]')->item(0);
echo "Article title: " . $titulo->nodeValue . "<p>" . "Article content: " . $letra->nodeValue;
It depends on the site; there is no generic solution. There are several alternatives for fetching content from a site:
Parsing the XML provided by the site: you can use PHP's native functions for this, see the example below:
// $feedLink holds the URL of the site's RSS/XML feed
$count = 0;          // items printed so far
$limit = 5;          // maximum number of items to print
$feed = simplexml_load_file($feedLink, 'SimpleXMLElement', LIBXML_NOCDATA);
foreach($feed->channel->item AS $item){
if($count == $limit){
break;
}
echo $item->link . '<br />';
echo $item->title . '<br />';
echo $item->description . '<br />';
echo $item->pubDate . '<br />';
echo '<br />------------------<br /><br />';
$count++;
}
Crawling: a crawler follows the links on a site and is used in conjunction with a parser (which extracts the information from each page). A PHP library for this purpose is PHPCrawl.
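As a rough illustration of that combination, here is a minimal sketch using PHPCrawl: you extend the PHPCrawler class and override handleDocumentInfo(), which is called once per page fetched. The include path, start URL, and page limit are placeholders; the library itself must be downloaded separately.

```php
<?php
// Assumes the PHPCrawl library has been downloaded; path is a placeholder.
require_once 'libs/PHPCrawler.class.php';

class MyCrawler extends PHPCrawler
{
    // Called by PHPCrawl for every document it receives.
    public function handleDocumentInfo($DocInfo)
    {
        // $DocInfo->url is the page address, $DocInfo->source its HTML;
        // here you would hand the HTML to DOMDocument/DOMXPath for parsing.
        echo $DocInfo->url . "\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://example.com");              // placeholder start URL
$crawler->addContentTypeReceiveRule("#text/html#");  // only fetch HTML pages
$crawler->setPageLimit(10);                          // stop after 10 pages
$crawler->go();
```

The crawler discovers the links for you; the actual data extraction still happens inside handleDocumentInfo(), with whichever parser you prefer.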