Regular expression to remove certain content


I have some HTML texts that may carry specific style attributes. I would like to write a method that removes those tags together with their contents, since they are titles and specific images that must be filtered out of the HTML. Below is an example snippet of the HTML content:

<span style="font-size:18px">
<span style="font-family:helvetica-light"><span style="color:rgb(140, 190, 207)">Cultura</span></span></span></p>
<p>&nbsp;</p>
<h2 style="text-align:center"><span style="font-size:42px"><span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">O CIRCO CHEGOU!</span></strong></span></span></h2>
<p style="margin-left:80px; margin-right:80px; text-align:center"><span style="color:rgb(71, 71, 71); font-family:helveticaneue-light; font-size:30px">Cirque du Soleil apresenta espet&aacute;culo &ldquo;Amaluna&rdquo; em S&atilde;o Paulo e no Rio de Janeiro, na sexta passagem da maior companhia circense do mundo pelo Brasil</span></p>
<p style="text-align:center">&nbsp;</p>
<p style="text-align:center"><span style="font-size:22px"><span style="font-family:helveticaneue"><span style="color:#8cbecf"><em>Por Melissa Schr&ouml;der -&nbsp;Edi&ccedil;&atilde;o de Andr&eacute; Schr&ouml;der</em><br />
    25/09/2017</span></span></span></p>

For example, I would like to remove the title by matching on its specific attributes, as in the example below:

<span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">

I have a method that almost does this:

    public function htmlToTextTags($Document) {
        // Strip an <img> whose inline style sets a height between 40 and 79 (% or px).
        if (preg_match('/<img(.+)? style=\".+?height:(4\d|5\d|6\d|7\d)(%|px);.+?\"[^>]*>/', $Document, $matches)) {
            $Document = str_replace($matches[0], "", $Document);
        }
        // Each pattern in $Rules is paired by position with an entry in $Replace.
        $Rules = array (
            '@<script[^>]*?>.*?<\/script>@si',
            '@<style[^>]*?>.*?<\/style>@si',
            '@<h2 style=\"text-align:center\">.*?<\/h2>@si',
            '@<span style=\"color\:rgb(\(71\, 71\, 71\)); font-family:helveticaneue-light.*?\"?>.*?<\/span>@si',
            '@<span.*?><span style=\"color\:#8cbecf\">.*?</span></span>@si',
            '@<p style=\"text-align:center\"><img.*?></p>@si',
            '@([\r\n])[\s]+@',
            '@&(quot|#34);@i',
            '@&(amp|#38);@i',
            '@&(lt|#60);@i',
            '@&(gt|#62);@i',
            '@&(nbsp|#160);@i',
            '@<div style=\"transform: rotate(\(\-90deg\)); \-webkit\-transform: rotate(\(\-90deg\))\;.+\"?>.*?<\/div>@'
        );
        $Replace = array (
            '',
            '',
            '',
            '',
            '',
            '',
            '',
            '"',
            '&',
            '<',
            '>',
            ' ',
            ''
        );
        // Apply every rule, convert UTF-8 to ISO-8859-1, then decode the remaining entities.
        return html_entity_decode(utf8_decode(preg_replace($Rules, $Replace, $Document)));
    }

However, it does not work in every case: the content is always different, so I have to keep creating new rules.

My question is whether there is a better, more efficient way of doing this. Can anyone tell me?
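
One direction that might be worth comparing against the regex rules is to parse the HTML and remove nodes by their style attribute. The following is only a minimal sketch, assuming PHP's bundled DOM extension is available. The function name `removeByStyle`, the `$styleFragments` parameter, and the `strip-wrapper` id are hypothetical names introduced for illustration, and the sketch assumes the style fragments contain no double quotes.

```php
<?php
// Sketch only: removeByStyle() and $styleFragments are hypothetical names,
// not part of the original method.
function removeByStyle(string $html, array $styleFragments): string
{
    // Collect parser warnings internally instead of emitting them on sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The XML prolog hints UTF-8 to libxml; the wrapper div gives a known container.
    $dom->loadHTML('<?xml encoding="UTF-8"><div id="strip-wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($styleFragments as $fragment) {
        // Assumption: $fragment has no double quote (XPath 1.0 cannot escape it).
        $query = '//*[contains(@style, "' . $fragment . '")]';
        // Copy the node list first so that removals do not disturb the iteration.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node); // drops the tag and its contents
            }
        }
    }

    // Serialize only what is left inside the wrapper.
    $wrapper = $xpath->query('//div[@id="strip-wrapper"]')->item(0);
    $result = '';
    foreach ($wrapper->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

$html = '<h2 style="text-align:center"><span style="font-size:42px">O CIRCO CHEGOU!</span></h2>'
      . '<p>Body text</p>';
echo removeByStyle($html, ['text-align:center']);
```

Rules are still needed with this approach, but each one is a plain string such as `font-family:helveticaneue-light` rather than a full pattern, and nested tags no longer have to be matched by hand. The fragments must be written exactly as they appear in the markup, including spacing.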

    
asked by anonymous 22.01.2018 / 13:12

0 answers