How can I select a complete XML/HTML tag with a regular expression even when it contains nested tags of the same name?

4

I'm trying to process a string in JavaScript using a regular expression (regex):

Given the input um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>. , I'd like to capture each complete <b> tag, with all of its content up to its matching closing </b> . The expected result is <b>negrito<b>negrito interno</b>externo</b> and <b>negrito</b> .

However, I can't account for a tag that contains another tag of the same name. The best I've managed is the result below, which ignores that possibility: the first match is <b>negrito<b>negrito interno</b> instead of <b>negrito<b>negrito interno</b>externo</b> :

var entrada = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var regex = /<(b)>.*?<\/\1>/g;

// clear the DOM before printing
document.body.innerHTML = "";

entrada.replace(regex, function(match) {
  console.log(match);
  // print to the DOM
  document.body.appendChild(document.createTextNode(match));
  document.body.appendChild(document.createElement("br"));
  return match;
});

body {
  white-space: pre;
  font-family: monospace;
}

My knowledge of regex is limited, and I've hit its limit here. So I'm hoping for a valuable tip from a regex expert, or a "Forget it, this isn't possible with regex =(".

Edit 2: Expected solution:

What I'm looking for, and don't know how to write, is something that counts the occurrences of the opening tag and skips the corresponding closing tags until it reaches the closing tag that pairs with the first opening tag.
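
The counting idea described above can be sketched without a single all-in-one regex: a small regex only finds the opening and closing tags, and a counter tracks the nesting depth. This is a minimal sketch, not a full HTML parser. The function name findOutermostTags is hypothetical, and it assumes well-formed markup, a tag name with no regex metacharacters, no ">" inside attribute values, and no occurrences of the tag inside comments or scripts.

```javascript
// Hypothetical helper: returns every outermost <tagName>...</tagName> span in `html`.
// Assumptions: well-formed markup, no '>' inside attribute values,
// and `tagName` contains no regex metacharacters.
function findOutermostTags(html, tagName) {
  // Matches an opening tag (with optional attributes) or a closing tag.
  // The lookahead prevents <b from also matching <body or <br.
  var token = new RegExp('<(/?)' + tagName + '(?=[\\s/>])[^>]*>', 'gi');
  var results = [];
  var depth = 0;   // how many tags of this name are currently open
  var start = -1;  // index where the current outermost tag begins
  var m;

  while ((m = token.exec(html)) !== null) {
    if (m[1] === '') {
      // Opening tag: remember the position only for the outermost one.
      if (depth === 0) start = m.index;
      depth++;
    } else if (depth > 0) {
      // Closing tag: when the counter returns to zero, the pair is complete.
      depth--;
      if (depth === 0) {
        results.push(html.slice(start, token.lastIndex));
      }
    }
  }
  return results;
}

var entrada = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var encontrados = findOutermostTags(entrada, 'b');
console.log(encontrados);
// → [ '<b>negrito<b>negrito interno</b>externo</b>', '<b>negrito</b>' ]
```

Because the depth is kept in a variable rather than in the pattern, this handles any nesting level, which a plain JavaScript regex cannot do on its own.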

  

If anything is unclear, please comment!

Edit 1: My real case for a better understanding of the problem:

  

This real example is only meant to show the context in which I'm using the function, and why I can't do this via jQuery or any other parser on the browser DOM. I need to keep the DOM intact so that the CSS is applied correctly; only after converting it to inline styles can I remove what was there just so the browser would render correctly, and then get my expected template as the result.

$(function() {
  $('#btnGenerateHtmlMail').click(function(ev) {
    var $report = $('#report');
    convertCssToInlineStyle($report);
    var reportHtml = $report.html();
    reportHtml = reportHtml
      /* remove class attribute */
      .replace(/class=('|").*?\1/g, "")
      /* remove id attribute */
      .replace(/id=('|").*?\1/g, "")
      /* remove comments html */
      .replace(/<!--.*?-->/g, "")
      /* remove tab, enter and whitespace */
      .replace(/\s\s+/g, ' ')
// ----->>>   // this is my problem case; in this example it causes no trouble because there are no identical tags inside the tr, but I know this would be a bug, and I want to fix it to make the tool generic
      .replace(/<(tr) data-remove="true".*?>.*?<\/\1>/g, function replacer(match) {
        console.log(match);
        return match.match(/{{.*?}}/g);
      });
    $('#result').text(reportHtml);
  });
});


/* Methods not relevant to the problem */

function getCssDeclared($elem) {
  var sheets = document.styleSheets,
    o = {};
  for (var i in sheets) {
    var rules = sheets[i].rules || sheets[i].cssRules;
    for (var r in rules) {
      if ($elem.is(rules[r].selectorText)) {
        o = $.extend(o, css2json(rules[r].style), css2json($elem.attr('style')));
      }
    }
  }
  return o;
}

function css2json(css) {
  var s = {};
  if (!css)
    return s;
  if (css instanceof CSSStyleDeclaration) {
    for (var i in css) {
      if ((css[i]).toLowerCase) {
        s[(css[i]).toLowerCase()] = (css[css[i]]);
      }
    }
  } else if (typeof css == "string") {
    css = css.split("; ");
    for (var i in css) {
      var l = css[i].split(": ");
      s[l[0].toLowerCase()] = (l[1]);
    }
  }
  return s;
}

function convertCssToInlineStyle($root) {
  $root.each(function() {
    var $item = $(this);

    var style = getCssDeclared($item);
    $item.css(style);

    // recurse into the children
    convertCssToInlineStyle($item.children());
  });
}
table {
	border-collapse: collapse;
	border-spacing: 0;
	-webkit-box-sizing: border-box;
	-moz-box-sizing: border-box;
	box-sizing: border-box;
	width: 100%;
}

table td, table th {
	padding: 8px;
	padding-top: 3px;
	padding-bottom: 3px;
	line-height: 1.428571429;
	border: 1px solid #ddd;
}

table > tfoot {
	font-weight: bold;
	text-align: center;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js"></script>
<div id="report">
  <table>
    <thead>
      <tr data-remove="true">
        <th>{{theadContent}}</th>
      </tr>
    </thead>
    <tbody>
      <tr data-remove="true">
        <th>{{tbodyContent}}</th>
      </tr>
    </tbody>
    <tfoot>
      <tr data-remove="true">
        <th>{{tfootContent}}</th>
      </tr>
    </tfoot>
  </table>
</div>
<div id="tools">
  <button id="btnGenerateHtmlMail">
    Generate HTML E-mail
  </button>
  <div contenteditable="true" id="result" style="width: 99%;resize: none;border: 1px solid #ccc;padding: 0.5%;"></div>
</div>
  

Note: In this real example the problem does not occur, because there are no identical tags nested inside the tr, but I know this would be a bug, and I want to fix it to make the tool generic.

    
asked by anonymous 12.06.2015 / 14:48

5 answers

3

As stated by @ctgPi:

  

HTML is not a regular language and therefore cannot be parsed by a regular expression.

It is therefore necessary to write functions to perform the HTML processing.

Here is some sample code to work from (it still uses a regular expression, but only on the content of each selected element).

// String with your HTML
var string = '<table><thead><tr data-remove="true"><th></th><th><th>{{theadContent}}</th></th><th></th></tr></thead><tbody><tr data-remove="true"><th>{{tbodyContent}}</th></tr></tbody><tfoot><tr data-remove="true"><th>{{tfootContent}}</th></tr></tfoot></table>';

// Convert the string into a jQuery object
var $element = $(string);

// Iterate over the matching roots, performing the necessary replacements
$('*[data-remove=true]', $element).each(function(index) {
  $(this).replaceWith($(this).html().replace(/.*?(\{\{[^\}]*\}\}).*/, '$1'));
});

// Convert the jQuery object back into a string
var string_processada = $element.get(0).outerHTML;

// Print to the screen
$('body').text(string_processada);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js"></script>

It cuts the DOM branches whose root has the attribute data-remove with the value true, leaving only the part enclosed in "{{" and "}}".

It may contain bugs that I have not seen.

    
12.06.2015 / 21:10
3

The problem is that you are trying to process a language that is not regular (HTML) with regular expressions. The solution is to write a recursive function that does the cleaning, something like:

var attributeWhiteList = ['style'];  // attributes you want to keep
var elementWhiteList = ['#text', 'TABLE', 'THEAD', 'TBODY', 'TFOOT', 'TR', 'TH', 'TD', 'P', 'B', 'DIV'];  // elements you want to keep

function cleanHTMLForEMail(node) {
    if (node.nodeName === '#text') {
        // here you can edit node.textContent to strip whitespace
        return node;
    }

    // list the attributes
    var attributeNames = [];
    for (var i = 0; i < node.attributes.length; i++) {
        attributeNames.push(node.attributes[i].name);
    }

    // remove every attribute that is not on the whitelist
    for (var i = 0; i < attributeNames.length; i++) {
        if (attributeWhiteList.indexOf(attributeNames[i]) === -1) {
            node.removeAttribute(attributeNames[i]);
        }
    }

    // list the children
    var children = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        children.push(node.childNodes[i]);
    }

    // remove every child that is not on the whitelist
    // and clean the ones that are
    for (var i = 0; i < children.length; i++) {
        if (elementWhiteList.indexOf(children[i].nodeName) === -1) {
            node.removeChild(children[i]);
        } else if (children[i].nodeName === 'TR' && children[i].dataset.remove === 'true') {
            node.removeChild(children[i]);
        } else {
            node.replaceChild(cleanHTMLForEMail(children[i]), children[i]);
        }
    }

    return node;
}

Edit: JSFiddle, fixing small implementation errors and working on the example you wanted to solve.

    
12.06.2015 / 16:17
1

Hello,

I'm not sure I understand your problem. We usually use regular expressions only for data validation.

Regular expressions: introduction

link

When I need to change, add, or remove an HTML element, I usually use jQuery selectors.

link

    
12.06.2015 / 15:07
1

You can (and in my opinion should) use the browser's own parser:

var buffer = document.createElement('div');
buffer.innerHTML = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
console.log(buffer.querySelectorAll('b'));

If you only want the b elements at the first level, you can create two div elements, one inside the other, and search only for div > b .

(from what I've tested, this seems to be immune to XSS as long as you drop the resulting node and do not insert it directly into the document)

    
12.06.2015 / 15:08
1

Basically, this regex works for your problem. It will likely fail on some samples, but for your case it works.

\<(minhaTag)(?: .*?)?\>(?:[^\<]|\<(.*?)\>[^\<]*\<\/\2\>)*\<\/\1\>

Regex Tested Here

Replace minhaTag with the desired tag name. This regex matches the outermost element of the specified tag together with its contents. The element can contain attributes.
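
As a rough illustration, the pattern can be tried in JavaScript like this. This is only a sketch under stated assumptions: the helper name buildOuterTagRegex is hypothetical, the closing parts of the pattern are assumed to be the backreferences \2 (inner tag) and \1 (outer tag), and the unnecessary escapes before < and > have been dropped.

```javascript
// Hypothetical helper that builds the pattern for a given tag name.
// Assumption: the closing tags in the pattern are backreferences,
// \2 for the inner element and \1 for the outer one.
function buildOuterTagRegex(tagName) {
  return new RegExp(
    '<(' + tagName + ')(?: .*?)?>(?:[^<]|<(.*?)>[^<]*<\\/\\2>)*<\\/\\1>',
    'g'
  );
}

var sample = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var found = sample.match(buildOuterTagRegex('b'));
console.log(found);
// → [ '<b>negrito<b>negrito interno</b>externo</b>', '<b>negrito</b>' ]
```

The inner alternative only allows child elements that contain plain text, so this covers one level of nesting; deeper nesting is presumably one of the samples on which it fails.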

Tips:

  • Be careful with the *? and * operators; study the difference between them.

  • Remember to make the . class match \n (newline) by using the single-line modifier ( s ), if that modifier is supported.

  • Use the global modifier ( g ) if you want all of the outermost tags of the specified name in the sample (see the link above).
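
The tips above can be seen in a small JavaScript sketch. The sample strings and variable names are made up for illustration, and the s flag assumes an engine that supports ES2018 regular expressions.

```javascript
var text = '<b>um</b> e <b>dois</b>';

// Greedy *: runs to the LAST </b> in the string.
var greedy = text.match(/<b>.*<\/b>/)[0];
// Lazy *?: stops at the FIRST </b> it can reach.
var lazy = text.match(/<b>.*?<\/b>/)[0];

// Without the s flag, . does not match a newline.
var multiline = '<b>linha 1\nlinha 2</b>';
var withoutS = /<b>.*?<\/b>/.test(multiline);
var withS = /<b>.*?<\/b>/s.test(multiline);

// With the g flag, match() returns every match instead of only the first.
var all = text.match(/<b>.*?<\/b>/g);

console.log(greedy);   // <b>um</b> e <b>dois</b>
console.log(lazy);     // <b>um</b>
console.log(withoutS); // false
console.log(withS);    // true
console.log(all);      // [ '<b>um</b>', '<b>dois</b>' ]
```

Where the s flag is not available, the character class [\s\S] is a common substitute for a dot that also matches newlines.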

12.06.2015 / 15:51