How to make a "link format" (a system that reads the content of other websites)? [closed]

2

I would like to integrate a Facebook-like system for reading external links into my project.

For example, when someone posts a link such as "www.un-site-qualquer.com" on my site, I would like to get a result like the picture below.

    
asked by anonymous 23.01.2017 / 08:58

1 answer

4

You can use cURL to fetch the page and then DOMDocument (or a regex) to extract the page data.

Facebook uses Open Graph markup. Since many websites support it, you can read that data as well.

  

I'm using http://g1.globo.com/rj/sul-do-rio-costa-verde/noticia/2017/01/acidente-com-teori-zavascki-aviao-comeca-ser-retirado-do-mar.html as the example, which is the latest news item on Globo.com at the time of writing.

From this page you can extract the og:image, og:title and og:description meta tags. In addition, most websites are expected to have the standard description meta tag and a title element.

For example, based on an answer to another question:

// Fetch the page HTML
$ch = curl_init('http://g1.globo.com/rj/sul-do-rio-costa-verde/noticia/2017/01/acidente-com-teori-zavascki-aviao-comeca-ser-retirado-do-mar.html');
curl_setopt_array($ch, [
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYHOST => 2,
    CURLOPT_SSL_VERIFYPEER => true,
    CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_TIMEOUT => 5,
    CURLOPT_MAXREDIRS => 2
]);
$html = curl_exec($ch);
curl_close($ch);

// curl_exec returns false on failure (timeout, rejected certificate, ...)
if ($html === false) {
    exit('Could not fetch the page.');
}

// Start the DOM and XPath; real-world HTML is often malformed,
// so collect the parser warnings instead of printing them
libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($html);
libxml_clear_errors();
$XPath = new DOMXPath($DOM);

// Properties to look for
$propriedades = ['description', 'title', 'type', 'image'];
$conteudo = [];

// Check each item of the array:
foreach ($propriedades as $propriedade) {

    $Meta = $XPath->query('//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade);

    // If an element was found, store its value
    if($Meta->length !== 0){
        $conteudo[$propriedade] = $Meta->item(0)->getAttribute('content') !== '' ? $Meta->item(0)->getAttribute('content') : $Meta->item(0)->nodeValue;
    }
}

var_dump($conteudo);

Result:

array(4) {
  ["description"]=>
  string(134) "Serviço de remoção aconteceu no início da noite deste domingo (22).
Retirada foi feita por empresa contratada pelo Grupo Emiliano."
  ["title"]=>
  string(73) "Acidente com Teori Zavascki: Avião que caiu em Paraty é retirado do mar"
  ["type"]=>
  string(7) "article"
  ["image"]=>
  string(122) "http://s2.glbimg.com/IAaOKflQpOoOSoi7pGNjkmirtjI=/1200x630/filters:max_age(3600)/s02.video.glbimg.com/deo/vi/65/44/5594465"
}

With this information you can assemble the HTML as you wish.
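As an illustration of that last step, here is a minimal sketch of how the extracted array could be turned into a preview card. The function name renderLinkPreview, the link-preview class and the markup itself are assumptions made for this example, not part of the answer above. It also assumes $url has already been validated as an http or https address.

```php
<?php
// Hypothetical helper: builds a small HTML preview card from the extracted metadata.
// Assumes $url was already validated as an http(s) URL.
function renderLinkPreview(array $conteudo, string $url): string
{
    // Escape every value: the metadata comes from a third-party page.
    $e = static function (string $valor): string {
        return htmlspecialchars($valor, ENT_QUOTES, 'UTF-8');
    };

    $html = '<a class="link-preview" href="' . $e($url) . '">';

    // The image is optional, so only emit the tag when one was found.
    if (!empty($conteudo['image'])) {
        $html .= '<img src="' . $e($conteudo['image']) . '" alt="">';
    }

    // Fall back to the URL itself when the page has no title.
    $html .= '<strong>' . $e($conteudo['title'] ?? $url) . '</strong>';

    if (!empty($conteudo['description'])) {
        $html .= '<span>' . $e($conteudo['description']) . '</span>';
    }

    return $html . '</a>';
}
```

Escaping matters here because the title and description are controlled by the remote site, so printing them unescaped would let that site inject markup into your page.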

Explanations:

  

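cURL:

CURLOPT_FOLLOWLOCATION makes cURL follow the Location: header when the server sends one. CURLOPT_RETURNTRANSFER is required so that curl_exec returns the response body instead of printing it. Note that in the code above CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER are enabled, so certificates are verified; they would have to be turned off to fetch a page from a server that has, for example, a self-signed certificate, at the cost of security. CURLOPT_PROTOCOLS and CURLOPT_REDIR_PROTOCOLS restrict the request and any redirect to HTTP and HTTPS. CURLOPT_TIMEOUT and CURLOPT_MAXREDIRS set the timeout (in seconds) and the maximum number of redirects.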
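Because curl_exec returns false when the request fails (timeout, rejected certificate, too many redirects, disallowed protocol), it can help to wrap the request so that the failure is explicit. The sketch below reuses the same options; the function name fetchHtml and the choice to return null on failure are assumptions made for this example.

```php
<?php
// Hypothetical wrapper around the request: returns the HTML, or null on failure.
function fetchHtml(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYHOST => 2,
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_TIMEOUT => 5,
        CURLOPT_MAXREDIRS => 2,
    ]);

    $html = curl_exec($ch);
    // A non-zero error number means the transfer did not complete.
    $erro = curl_errno($ch);

    if ($html === false || $erro !== 0) {
        return null;
    }

    return $html;
}
```

A caller would then check for null before parsing, for example: if (($html = fetchHtml($url)) === null) { /* show the plain link instead */ }.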

  

XPath:

XPath is used to fetch the information with this query:

//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade

This query matches all of the cases below:

<head>
<description>Value</description>
<meta name="description" content="Value" />
<meta property="og:description" content="Value" />
</head>

To check whether the query matched anything, the code uses:

$Meta->length !== 0

Since the value can be in the content attribute (the last two cases) or inside the element itself (the first case), the code uses:

$conteudo[$propriedade] = $Meta->item(0)->getAttribute('content') !== '' ? $Meta->item(0)->getAttribute('content') : $Meta->item(0)->nodeValue;

Strictly speaking, this does not check whether the content attribute exists; it checks whether the attribute holds any data. If it is empty, the text value of the element is used instead.
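The lookup and fallback described above can also be wrapped in a function that works on an HTML string, which makes it easy to try without a network request. This is only a sketch: the function name extractMeta is an assumption for this example, and the property names are interpolated directly into the XPath query, so they must come from your own code and never from user input.

```php
<?php
// Hypothetical helper: extracts the given properties from an HTML string.
// $propriedades must be trusted names; they are placed directly into the XPath query.
function extractMeta(string $html, array $propriedades): array
{
    if ($html === '') {
        return [];
    }

    // Collect parser warnings instead of printing them, then restore the old setting.
    $anterior = libxml_use_internal_errors(true);
    $DOM = new DOMDocument;
    $DOM->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($anterior);

    $XPath = new DOMXPath($DOM);
    $conteudo = [];

    foreach ($propriedades as $propriedade) {
        $Meta = $XPath->query('//head//meta[(@property="og:' . $propriedade . '") or (@name="' . $propriedade . '")] | //head//' . $propriedade);

        // Nothing matched: leave this property out of the result.
        if ($Meta === false || $Meta->length === 0) {
            continue;
        }

        // Prefer the content attribute; fall back to the element's text.
        $no = $Meta->item(0);
        $valor = $no->getAttribute('content');
        $conteudo[$propriedade] = $valor !== '' ? $valor : trim($no->nodeValue);
    }

    return $conteudo;
}
```

When several elements match the same property, item(0) is the first one in document order, so a page that declares both a title element and an og:title tag yields whichever appears first in the head.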

    
23.01.2017 / 09:48