Regular expression to remove certain content

I have some texts in HTML whose elements may carry specific style attributes. I would like to write a method that removes those tags together with their contents, since they are titles and specific images that must be filtered out of the HTML. Below is an example snippet of the HTML content:

<span style="font-size:18px">
<span style="font-family:helvetica-light"><span style="color:rgb(140, 190, 207)">Cultura</span></span></span></p>
<p>&nbsp;</p>
<h2 style="text-align:center"><span style="font-size:42px"><span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">O CIRCO CHEGOU!</span></strong></span></span></h2>
<p style="margin-left:80px; margin-right:80px; text-align:center"><span style="color:rgb(71, 71, 71); font-family:helveticaneue-light; font-size:30px">Cirque du Soleil apresenta espet&aacute;culo &ldquo;Amaluna&rdquo; em S&atilde;o Paulo e no Rio de Janeiro, na sexta passagem da maior companhia circense do mundo pelo Brasil</span></p>
<p style="text-align:center">&nbsp;</p>
<p style="text-align:center"><span style="font-size:22px"><span style="font-family:helveticaneue"><span style="color:#8cbecf"><em>Por Melissa Schr&ouml;der -&nbsp;Edi&ccedil;&atilde;o de Andr&eacute; Schr&ouml;der</em><br />
    25/09/2017</span></span></span></p>

I would like to remove the title, for example, by matching on its specific attributes, as in the example below:

<span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">

I have a method that almost does this:

public function htmlToTextTags($Document) {
    // Remove the first matched <img> whose inline style sets a height of 40-79 (% or px).
    if (preg_match('/<img(.+)? style=\".+?height:(4\d|5\d|6\d|7\d)(%|px);.+?\"[^>]*>/', $Document, $matches)) {
        $Document = str_replace($matches[0], "", $Document);
    }
    // Patterns to strip or translate; each entry pairs with the same index in $Replace.
    $Rules = array (
        '@<script[^>]*?>.*?<\/script>@si',
        '@<style[^>]*?>.*?<\/style>@si',
        '@<h2 style=\"text-align:center\">.*?<\/h2>@si',
        '@<span style=\"color\:rgb(\(71\, 71\, 71\)); font-family:helveticaneue-light.*?\"?>.*?<\/span>@si',
        '@<span.*?><span style=\"color\:#8cbecf\">.*?</span></span>@si',
        '@<p style=\"text-align:center\"><img.*?></p>@si',
        '@([\r\n])[\s]+@',
        '@&(quot|#34);@i',
        '@&(amp|#38);@i',
        '@&(lt|#60);@i',
        '@&(gt|#62);@i',
        '@&(nbsp|#160);@i',
        '@<div style=\"transform: rotate(\(\-90deg\)); \-webkit\-transform: rotate(\(\-90deg\))\;.+\"?>.*?<\/div>@'
    );
    $Replace = array (
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '"',
        '&',
        '<',
        '>',
        ' ',
        ''
    );
    return html_entity_decode(utf8_decode((preg_replace($Rules, $Replace, $Document))));
}

However, it does not work in all cases: the content is always different, so I have to keep creating new rules.

My question is whether there is a better, more efficient way of doing this. Can anyone tell me?
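
For illustration only, one possible alternative to a growing list of regexes is to parse the HTML and drop elements according to what their style attribute contains. The following is a minimal sketch, assuming PHP's DOM extension (DOMDocument and DOMXPath) is available. The function name removeByStyle, the wrapper id fragment-root, and the style fragments passed in are hypothetical names chosen for this example, not part of the method above.

```php
<?php
// Hypothetical helper (not from the original method): removes every element
// whose style attribute contains one of the given fragments, with its contents.
function removeByStyle(string $html, array $styleFragments): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally; partial or malformed HTML is expected here.
    $previous = libxml_use_internal_errors(true);

    // Wrap the fragment in a known root element and hint the parser that it is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Snapshot all elements that have a style attribute before modifying the tree.
    $styled = iterator_to_array($xpath->query('//*[@style]'));

    foreach ($styled as $node) {
        $style = $node->getAttribute('style');
        foreach ($styleFragments as $fragment) {
            if (strpos($style, $fragment) !== false && $node->parentNode !== null) {
                // Removing the node also removes everything nested inside it.
                $node->parentNode->removeChild($node);
                break;
            }
        }
    }

    // Serialize only the children of the wrapper, not the implied <html>/<body>.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    if ($root === null) {
        return $html; // Wrapper not found; return the input unchanged.
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: drop anything styled with the title colour from the snippet above.
$clean = removeByStyle(
    '<h2 style="text-align:center"><span style="color:rgb(140, 190, 207)">O CIRCO CHEGOU!</span></h2><p>Body text</p>',
    ['color:rgb(140, 190, 207)']
);
```

This matches style fragments as plain substrings, so it depends on the exact spacing used in the attribute (for example "rgb(140, 190, 207)" versus "rgb(140,190,207)"). New cases would then mean adding a fragment to the list rather than writing another regex, though whether that covers every document is something I cannot confirm.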

php regex filtro

asked by anonymous 22.01.2018 / 13:12
