I have some texts in HTML whose elements may carry specific style attributes. I would like to write a method that removes certain tags together with their contents, since they are titles ... and specific images that must be filtered out of the HTML. Below is an example snippet of the HTML content:
<span style="font-size:18px">
<span style="font-family:helvetica-light"><span style="color:rgb(140, 190, 207)">Cultura</span></span></span></p>
<p> </p>
<h2 style="text-align:center"><span style="font-size:42px"><span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">O CIRCO CHEGOU!</span></strong></span></span></h2>
<p style="margin-left:80px; margin-right:80px; text-align:center"><span style="color:rgb(71, 71, 71); font-family:helveticaneue-light; font-size:30px">Cirque du Soleil apresenta espetáculo “Amaluna” em São Paulo e no Rio de Janeiro, na sexta passagem da maior companhia circense do mundo pelo Brasil</span></p>
<p style="text-align:center"> </p>
<p style="text-align:center"><span style="font-size:22px"><span style="font-family:helveticaneue"><span style="color:#8cbecf"><em>Por Melissa Schröder - Edição de André Schröder</em><br />
25/09/2017</span></span></span></p>
For example, I would like to remove the title, identifying it by its specific attributes, as in the example below:
<span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">
I have a method that almost does this:
public function htmlToTextTags($Document) {
    if (preg_match('/<img(.+)? style=\".+?height:(4\d|5\d|6\d|7\d)(%|px);.+?\"[^>]*>/', $Document, $matches)) {
        if (count($matches)) {
            $Document = str_replace($matches[0], "", $Document);
        }
    }
    $Rules = array (
        '@<script[^>]*?>.*?<\/script>@si',
        '@<style[^>]*?>.*?<\/style>@si',
        '@<h2 style=\"text-align:center\">.*?<\/h2>@si',
        '@<span style=\"color\:rgb(\(71\, 71\, 71\)); font-family:helveticaneue-light.*?\"?>*.?<\/span>@si',
        '@<span.*?><span style=\"color\:#8cbecf\">.*?</span></span>@si',
        '@<p style=\"text-align:center\"><img.*?></p>@si',
        '@([\r\n])[\s]+@',
        '@&(quot|#34);@i',
        '@&(amp|#38);@i',
        '@&(lt|#60);@i',
        '@&(gt|#62);@i',
        '@&(nbsp|#160);@i',
        '@<div style=\"transform: rotate(\(\-90deg\)); \-webkit\-transform: rotate(\(\-90deg\))\;.+\"?>*.?<\/div>@'
    );
    $Replace = array (
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '"',
        '&',
        '<',
        '>',
        ' ',
        ''
    );
    return html_entity_decode(utf8_decode((preg_replace($Rules, $Replace, $Document))));
}
However, it does not work for all cases: the content is always different, so I keep having to create new rules.
My question is whether there is a better and more efficient way of doing this. Can anyone tell me?
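One possible alternative to a growing list of regular expressions would be to parse the HTML and remove the unwanted nodes through the DOM. The following is only a minimal sketch under that assumption, using PHP's built-in DOMDocument and DOMXPath classes. The function name removeNodesByXPath and the wrapper id fragment-root are hypothetical, and the XPath rules would have to be adapted to the real markup.

```php
<?php
/**
 * Hypothetical helper (not part of the original method): removes every node
 * matched by the given XPath expressions, together with its contents.
 */
function removeNodesByXPath(string $html, array $xpathRules): string
{
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The XML prolog is a common trick to make libxml read the input as UTF-8;
    // the wrapper div lets us recover just the fragment afterwards.
    $dom->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpathRules as $rule) {
        $matches = $xpath->query($rule);
        if ($matches === false) {
            continue; // skip expressions that fail to evaluate
        }
        // Copy the node list to an array before removing anything from the tree.
        foreach (iterator_to_array($matches) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of the wrapper, not the whole document.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    if ($root === null) {
        return '';
    }
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

With this sketch, a rule such as '//h2[contains(@style, "text-align:center")]' would drop the centered title, and '//span[contains(@style, "color:#8cbecf")]' would drop the byline span. Note that contains() compares the style attribute as plain text, so differences in spacing or property order in the real content would still need their own rules.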