Regular expressions in PHP: how to extract part of a text? [closed]

I'm having trouble extracting part of the text from a Wikipedia page. I can get the title like this:

$content =
     file_get_contents("https://pt.wikipedia.org/wiki/Conserva%C3%A7%C3%A3o_da_natureza");

preg_match("/<title>(.*?)<\/title>/", $content, $title);

What I cannot do is get the content from <div id="content" class="mw-body" role="main"> up to <span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>

I do not understand why it does not work; I have already tried several different approaches.

    
asked by anonymous 26.09.2016 / 01:23

1 answer

Wouldn't it be better to use DOMDocument?

In my humble opinion, when a feature already exists to solve a problem, it should be the one you choose. Using regular expressions for a case like yours is going to be a lot of work.

So I recommend using DOMDocument, which is meant to represent an HTML or XML document.
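To illustrate why the regular expression is fragile, here is a minimal sketch with a made-up HTML string (an assumption, not the real Wikipedia page) in which the <title> tag spans several lines. The pattern from the question finds nothing, because "." does not match newlines without the /s modifier, while DOMDocument still reads the title:

```php
<?php
// Hypothetical markup: the <title> content is spread over several lines.
$html = "<html><head><title>\n  Sample page\n</title></head><body></body></html>";

// Same pattern as in the question; "." stops at the first newline.
$matched = preg_match('/<title>(.*?)<\/title>/', $html, $m);

$doc = new DOMDocument();

// Collect parser warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$titleTag = $doc->getElementsByTagName('title')->item(0);
$title = $titleTag ? trim($titleTag->nodeValue) : null;

var_dump($matched); // number of regex matches
var_dump($title);   // title as read by the DOM parser
```

The regex could be patched with /s for this particular case, but attributes on the tag or unusual casing would each need another patch, which is the extra work mentioned above.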

Here is an example of how it could be done:

$content =
 file_get_contents("https://pt.wikipedia.org/wiki/Conserva%C3%A7%C3%A3o_da_natureza");

$doc = new DOMDocument();

@$doc->loadHTML($content);


$titleTag = $doc->getElementsByTagName('title')->item(0);

// Gets the page title

$title = $titleTag ? $titleTag->nodeValue : null;

// Gets the value of div#content, but only as text

$body = $doc->getElementById('content')->nodeValue;

Note that the nodeValue property returns only the text, so all tags inside #content are removed.

If you need the text together with the tags, use the saveXML method:

$bodyWithTags = $doc->saveXML($doc->getElementById('content'));
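For the range the question asks about (from div#content up to the "Ver também" heading), one possible approach is an XPath query that keeps only the text nodes located before that heading. This is a minimal sketch, not a tested solution for the live page: the function name content_before_heading and the inline sample HTML are assumptions made for illustration, and the real Wikipedia markup may differ.

```php
<?php
// Simplified sample standing in for the downloaded page (assumption).
$html = '<html><head><title>Sample</title></head><body>'
      . '<div id="content" class="mw-body" role="main">'
      . '<p>First paragraph.</p><p>Second paragraph.</p>'
      . '<h2><span class="mw-headline" id="Ver_tamb.C3.A9m">Ver tamb&eacute;m</span></h2>'
      . '<p>Related links.</p>'
      . '</div></body></html>';

/**
 * Hypothetical helper: returns the text inside the element $containerId
 * that appears before the element $headingId, or null if nothing is found.
 * The ids are assumed not to contain double quotes.
 */
function content_before_heading(string $html, string $containerId, string $headingId): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text nodes under the container that have the heading somewhere after them.
    $query = sprintf(
        '//*[@id="%s"]//text()[following::*[@id="%s"]]',
        $containerId,
        $headingId
    );

    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    return implode("\n", $parts);
}

echo content_before_heading($html, 'content', 'Ver_tamb.C3.A9m');
```

With the real page you would pass the result of file_get_contents as the first argument. The helper returns null when the heading id is not present, so it is worth checking the result before using it.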

Update

If you want a reusable way to get only the page title, you can create a function:

/**
 * Gets the title from the <title> tag of a URL
 *
 * @param string $url
 * @return string|null
 */
function url_get_title($url) {

    $content = file_get_contents($url);

    // file_get_contents returns false when the request fails
    if ($content === false) {
        return null;
    }

    $doc = new DOMDocument();

    @$doc->loadHTML($content);

    $titleTag = $doc->getElementsByTagName('title')->item(0);

    if ($titleTag) {
        return $titleTag->nodeValue;
    }

    return null;
}

So, whenever you want the page title, you just call it:

url_get_title('http://www.google.com.br'); // string (Google)

NOTE: Whenever you use file_get_contents to fetch the content of a URL, remember that you must always include the URL's scheme (http or https). If you do not, PHP will try to open a local file path instead. Even when the request targets your own domain, the scheme is required.
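If the URLs come from user input, a small check before calling file_get_contents can catch a missing scheme early. The sketch below is only an illustration; the helper name url_has_http_scheme is an assumption, and it relies on the built-in parse_url function:

```php
<?php
/**
 * Hypothetical helper: tells whether a URL starts with an http or https scheme.
 */
function url_has_http_scheme(string $url): bool
{
    // parse_url returns null when the component is absent and false on malformed input.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

$withScheme    = url_has_http_scheme('https://pt.wikipedia.org/wiki/Wikipedia');
$withoutScheme = url_has_http_scheme('www.google.com.br');

var_dump($withScheme, $withoutScheme);
```

When the check fails, you can either reject the input or prepend a scheme yourself before fetching.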

    
26.09.2016 / 16:59