Wouldn't it be better to use DOMDocument?
In my humble opinion, when a feature already exists to solve a problem, it should be the one you choose. Using regular expressions for cases like yours would be a lot of work.
So I recommend using DOMDocument, which is designed to represent an HTML or XML document.
Here is an example of how it could be done:
$content = file_get_contents("https://pt.wikipedia.org/wiki/Conserva%C3%A7%C3%A3o_da_natureza");

$doc = new DOMDocument();

// The @ suppresses the warnings libxml emits for malformed HTML
@$doc->loadHTML($content);

$titleTag = $doc->getElementsByTagName('title')->item(0);

// Gets the page title
$title = $titleTag ? $titleTag->nodeValue : null;

// Gets the value of div#content, but text only
$contentTag = $doc->getElementById('content');
$body = $contentTag ? $contentTag->nodeValue : null;
Note that the nodeValue property returns only the text, stripping all tags present inside #content.
If you need the text along with the tags, use the saveXML method instead:
$bodyWithTags = $doc->saveXML($doc->getElementById('content'));
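To see the difference between the two, here is a minimal, self-contained sketch (the sample HTML is made up for illustration):

$doc = new DOMDocument();
@$doc->loadHTML('<div id="content"><p>Hello <b>world</b></p></div>');
$div = $doc->getElementById('content');

echo $div->nodeValue;     // Hello world (text only, tags stripped)
echo $doc->saveXML($div); // <div id="content"><p>Hello <b>world</b></p></div>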
Update
If you want a reusable way to get only the page title, you can create a function:
/**
 * Gets the title from the <title> tag of a URL
 *
 * @param string $url
 * @return string|null
 */
function url_get_title($url) {
    $content = @file_get_contents($url);
    // If the request fails, there is no title to extract
    if ($content === false) {
        return null;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($content);
    $titleTag = $doc->getElementsByTagName('title')->item(0);
    if ($titleTag) {
        return $titleTag->nodeValue;
    }
    return null;
}
Then, whenever you want to get a page's title, you just call:
url_get_title('http://www.google.com.br'); // string (Google)
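Since the function returns null when the request fails or the page has no <title> tag, it is worth checking the result before using it:

$title = url_get_title('http://www.google.com.br');
if ($title !== null) {
    echo $title;
} else {
    echo 'Could not retrieve the title';
}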
NOTE: Whenever you use file_get_contents to fetch the content of a URL, remember that you must always include the URL's scheme (http:// or https://). If you don't, PHP will try to open it as a local file path. Even when the request targets your own domain, the scheme is required.
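If the URLs you receive may come without a scheme, a small helper can normalize them before calling url_get_title. The name url_with_scheme below is hypothetical; this is just a minimal sketch:

// Hypothetical helper: prepends a default scheme when the URL has none
function url_with_scheme($url, $scheme = 'https') {
    if (!preg_match('#^https?://#i', $url)) {
        return $scheme . '://' . $url;
    }
    return $url;
}

url_get_title(url_with_scheme('www.google.com.br')); // requests https://www.google.com.br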