How can I select a complete XML/HTML tag with a regular expression, even when it contains identical tags nested inside it?


I'm trying to process a string in JavaScript using a regular expression (regex):

With this input: um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>. I'd like to get each complete <b> tag, with all its content up to its matching closing </b>. The expected result is: <b>negrito<b>negrito interno</b>externo</b> and <b>negrito</b>.

But I can't account for a tag containing another identical tag inside it. The best I managed is the result below, which ignores that possibility: the first match is <b>negrito<b>negrito interno</b> instead of <b>negrito<b>negrito interno</b>externo</b>.

var entrada = 'um <b data-remove>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var regex = /<(b)>.*?<\/\1>/g;

// clear the DOM before printing
document.body.innerHTML = "";

entrada.replace(regex, function(match) {
  // print each match to the DOM as plain text
  document.body.appendChild(document.createTextNode(match + "\n"));
  return match;
});

body {
  white-space: pre;
  font-family: monospace;
}

My knowledge of regex is limited, and I have hit its limit here. So I'm hoping for a valuable tip from a regex expert, or a "Forget it, that's not possible with regex =(".

Edit 2: Expected solution:

What I'm looking for, and don't know how to write, is something that counts the opening tags it encounters and skips the same number of closing tags, stopping only at the closing tag that pairs with the first opening tag.
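The counting idea described above can be sketched without a full parser. This is only a minimal sketch, not an HTML parser: the function name findOutermostTags is hypothetical, and it assumes that tagName is a plain tag name and that no attribute value, comment or script contains a > character or tag-like text.

```javascript
// Hypothetical helper (not from the original post): returns every outermost
// <tagName>...</tagName> element found in `html`, using a depth counter.
function findOutermostTags(html, tagName) {
  // Matches an opening or closing tag of the given name, attributes allowed.
  // The lookahead keeps "b" from matching "<br>" or "<body>".
  var tagPattern = new RegExp('<(/?)' + tagName + '(?=[\\s>])[^>]*>', 'gi');
  var results = [];
  var depth = 0;   // how many matching tags are currently open
  var start = -1;  // index where the current outermost tag began
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    if (m[1] === '') {
      // opening tag: remember where the outermost one starts
      if (depth === 0) start = m.index;
      depth++;
    } else if (depth > 0) {
      // closing tag: when the counter returns to zero, the pair is complete
      depth--;
      if (depth === 0) results.push(html.slice(start, tagPattern.lastIndex));
    }
  }
  return results;
}

var entrada = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var found = findOutermostTags(entrada, 'b');
console.log(found);
// → [ '<b>negrito<b>negrito interno</b>externo</b>', '<b>negrito</b>' ]
```

The regex here only locates individual tags; the nesting itself is tracked by the counter, which is the part a single regular expression cannot do in JavaScript.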


If there is any doubt please comment!

Edit 1: My real case for a better understanding of the problem:


This real example only demonstrates the context in which I use the function in question, and why I can't do this with jQuery or any other parser on the browser DOM. I need to keep the DOM intact so that the CSS is applied correctly; only after converting the styles to inline styles can I remove what was there just for the browser to render correctly, and then obtain my expected template.

$(function() {
  $('#btnGenerateHtmlMail').click(function(ev) {
    var $report = $('#report');
    var reportHtml = $report.html();
    reportHtml = reportHtml
      /* remove class attributes */
      .replace(/class=('|").*?\1/g, "")
      /* remove id attributes */
      .replace(/id=('|").*?\1/g, "")
      /* remove HTML comments */
      .replace(/<!--.*?-->/g, "")
      /* collapse tabs, newlines and runs of whitespace */
      .replace(/\s\s+/g, ' ')
      // ----->>> this is my problem case: in this example nothing breaks because there are no identical tags inside the tr, but I know this is a bug I want to fix to make the tool generic
      .replace(/<(tr) data-remove="true".*?>.*?<\/\1>/g, function replacer(match) {
        return match.match(/{{.*?}}/g);
      });
  });
});

/* Methods irrelevant to the problem */

function getCssDeclared($elem) {
  var sheets = document.styleSheets,
    o = {};
  for (var i in sheets) {
    var rules = sheets[i].rules || sheets[i].cssRules;
    for (var r in rules) {
      // collect the declarations of every rule whose selector matches the element
      if ($elem.is(rules[r].selectorText)) {
        o = $.extend(o, css2json(rules[r].style), css2json($elem.attr('style')));
      }
    }
  }
  return o;
}

function css2json(css) {
  var s = {};
  if (!css)
    return s;
  if (css instanceof CSSStyleDeclaration) {
    for (var i in css) {
      if ((css[i]).toLowerCase) {
        s[(css[i]).toLowerCase()] = (css[css[i]]);
      }
    }
  } else if (typeof css == "string") {
    css = css.split("; ");
    for (var i in css) {
      var l = css[i].split(": ");
      s[l[0].toLowerCase()] = (l[1]);
    }
  }
  return s;
}

function convertCssToInlineStyle($root) {
  $root.each(function() {
    var $item = $(this);

    var style = getCssDeclared($item);

    // recursive call on the children
  });
}

table {
	border-collapse: collapse;
	border-spacing: 0;
	-webkit-box-sizing: border-box;
	-moz-box-sizing: border-box;
	box-sizing: border-box;
	width: 100%;
}

table td, table th {
	padding: 8px;
	padding-top: 3px;
	padding-bottom: 3px;
	line-height: 1.428571429;
	border: 1px solid #ddd;
}

table > tfoot {
	font-weight: bold;
	text-align: center;
}

<script src=""></script>
<div id="report">
      <tr data-remove="true">
      <tr data-remove="true">
      <tr data-remove="true">
<div id="tools">
  <button id="btnGenerateHtmlMail">
    Gerar HTML E-mail
  <div contenteditable="true" id="result" style="width: 99%;resize: none;border: 1px solid #ccc;padding: 0.5%;"></div>

Note: In this (real) example the problem does not occur because there are no identical tags nested inside the tr, but I know this is a bug I want to fix to make the tool generic.
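For the data-remove case, counting opening and closing tr tags instead of relying on the lazy .*? keeps a nested tr from ending the match too early. The following is a minimal sketch under stated assumptions: collapseMarkedElements and its marker parameter are hypothetical names, the marker is looked up as plain text inside the opening tag, and attribute values are assumed not to contain > characters.

```javascript
// Hypothetical helper: replaces each outermost <tagName ...> element whose
// opening tag contains `marker` with the {{...}} placeholders found inside it.
function collapseMarkedElements(html, tagName, marker) {
  var tagPattern = new RegExp('<(/?)' + tagName + '(?=[\\s>])[^>]*>', 'gi');
  var output = '';
  var cursor = 0;  // end of the last chunk already copied to output
  var depth = 0;   // nesting depth inside a marked element (0 = outside)
  var start = -1;  // index where the current marked element began
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    var isClosing = m[1] === '/';
    if (depth === 0) {
      // outside a marked element: only a marked opening tag is interesting
      if (!isClosing && m[0].indexOf(marker) !== -1) {
        start = m.index;
        depth = 1;
      }
    } else if (!isClosing) {
      depth++; // nested element with the same tag name
    } else if (--depth === 0) {
      // the closing tag that pairs with the marked opening tag
      var element = html.slice(start, tagPattern.lastIndex);
      var placeholders = element.match(/\{\{.*?\}\}/g) || [];
      output += html.slice(cursor, start) + placeholders.join('');
      cursor = tagPattern.lastIndex;
    }
  }
  return output + html.slice(cursor);
}

var sample = '<tbody><tr data-remove="true"><td><table><tr><td>{{row}}</td></tr></table></td></tr></tbody>';
var collapsed = collapseMarkedElements(sample, 'tr', 'data-remove="true"');
console.log(collapsed);
// → <tbody>{{row}}</tbody>
```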

asked by anonymous 12.06.2015 / 14:48

5 answers


As stated by @ctgPi:


HTML is not a regular language and therefore cannot be parsed by a regular expression.

It is therefore necessary to write functions to process the HTML.

Here is some sample code to start from (it does use regular expressions).

// String with your HTML
var string = '<table><thead><tr data-remove="true"><th></th><th><th>{{theadContent}}</th></th><th></th></tr></thead><tbody><tr data-remove="true"><th>{{tbodyContent}}</th></tr></tbody><tfoot><tr data-remove="true"><th>{{tfootContent}}</th></tr></tfoot></table>';

// Convert the string into a jQuery object
var $element = $(string);

// Iterate over the matching roots, performing the required replacements
$('*[data-remove=true]', $element).each(function(index) {
  $(this).replaceWith($(this).html().replace(/.*?(\{\{[^\}]*\}\}).*/, '$1'));
});

// Convert the jQuery object back into a string
var string_processada = $element.get(0).outerHTML;

// Print to the screen
console.log(string_processada);
<script src=""></script>

It cuts DOM branches whose root has the attribute data-remove with value true, leaving only the part enclosed in "{{" and "}}".

It may contain bugs that I have not seen.

12.06.2015 / 21:10

The problem is that you are trying to process a language that is not regular (HTML) with regular expressions. The solution is to write a recursive function that does the cleaning, something like:

var attributeWhiteList = ['style'];  // attributes you want to keep
var elementWhiteList = ['#text', 'TABLE', 'THEAD', 'TBODY', 'TFOOT', 'TR', 'TH', 'TD', 'P', 'B', 'DIV'];  // elements you want to keep

function cleanHTMLForEMail(node) {
    if (node.nodeName === '#text') {
        // here you can edit node.textContent to strip whitespace
        return node;
    }

    // list the attributes
    var attributeNames = [];
    for (var i = 0; i < node.attributes.length; i++) {
        attributeNames.push(node.attributes[i].name);
    }

    // remove every attribute outside the whitelist
    for (var i = 0; i < attributeNames.length; i++) {
        if (attributeWhiteList.indexOf(attributeNames[i]) === -1) {
            node.removeAttribute(attributeNames[i]);
        }
    }

    // list the children (copied first, because the loop below modifies childNodes)
    var children = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        children.push(node.childNodes[i]);
    }

    // remove every child outside the whitelist
    // and clean the ones inside it
    for (var i = 0; i < children.length; i++) {
        if (elementWhiteList.indexOf(children[i].nodeName) === -1) {
            node.removeChild(children[i]);
        } else if (children[i].nodeName === 'TR' && children[i].dataset.remove === 'true') {
            // assumption, mirroring the question: keep only the row's {{...}} placeholders
            var placeholders = children[i].textContent.match(/\{\{.*?\}\}/g) || [];
            node.replaceChild(document.createTextNode(placeholders.join('')), children[i]);
        } else {
            node.replaceChild(cleanHTMLForEMail(children[i]), children[i]);
        }
    }

    return node;
}

Edit: JSFiddle, fixing small implementation errors and working on the example you wanted to solve.

12.06.2015 / 16:17


I'm not sure I understand your problem. We usually use regular expressions only for data validation.

Regular expressions: introduction


When I need to change, add or remove an HTML element, I usually use jQuery selectors.


12.06.2015 / 15:07

You can (and in my opinion should) use the browser's own parser:

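var buffer = document.createElement('div');
buffer.innerHTML = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
// assumed continuation: collect every <b> element, nested ones included
var bolds = buffer.querySelectorAll('b');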

If you only want the b elements on the first level, you can create two div elements, one inside the other, and search only for div > b.

(From what I've tested, this seems to be immune to XSS as long as you discard the resulting node and do not insert it directly into the document.)

12.06.2015 / 15:08

Basically, this regex works for your problem. It should fail on some samples, but for your problem it works.

\<(minhaTag)(?: .*?)?\>(?:[^\<]|\<(.*?)\>[^\<]*\<\/\2\>)*\<\/\1\>

Regex Tested Here

Replace minhaTag with the desired tag name. The regex matches the outermost element of the specified tag together with its contents. The element may have attributes.


  • Mind the *? and * operators; study their differences.

  • Remember to make the . class match \n (newline) via the single-line modifier (s), if that modifier is supported.

  • Use the global modifier (g) if you want all of the outermost tags in the sample (see the link above).
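
To try the pattern in JavaScript, the sketch below builds it for a given tag name and runs it over the question's sample. It assumes the closing tags are matched with backreferences (\2 for the inner tag, \1 for the outer one); buildTagRegex is a hypothetical helper name, and the tag name is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: builds the pattern for a given tag name.
// The escapes before < and > are dropped because JavaScript does not need them.
function buildTagRegex(tagName) {
  return new RegExp(
    '<(' + tagName + ')(?: .*?)?>(?:[^<]|<(.*?)>[^<]*<\\/\\2>)*<\\/\\1>',
    'g'
  );
}

var entrada = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var matches = entrada.match(buildTagRegex('b'));
console.log(matches);
// → [ '<b>negrito<b>negrito interno</b>externo</b>', '<b>negrito</b>' ]

// A sample where it fails: with two levels of nesting the pattern
// no longer matches the whole outer element.
var deep = '<b>a<b>b<b>c</b></b></b>';
var deepMatches = deep.match(buildTagRegex('b'));
console.log(deepMatches);
```

The pattern allows only one level of nested tags inside the outer element, which is why deeper nesting needs a counter or a real parser.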

12.06.2015 / 15:51