How to ignore escaped characters in a regular expression rule?

5

I want to do the following with a regex (regular expression). For example, in JavaScript:

var str = '[abc\[0123\]] [efg\[987\]h] [olá \[mundo\]!] [foo [baz]]';
str.match(/\[(.*?)\]/g);
  

Output: ["[abc[0123]", "[efg[987]", "[olá [mundo]", "[foo [baz]"]

Or

var str = '{abc\{0123\}} {efg\{987\}h} {olá \{mundo\}!} {foo {baz}}';
str.match(/\{(.*?)\}/g);
  

Output: ["{abc{0123}", "{efg{987}", "{olá {mundo}", "{foo {baz}"]

But I would like the first one to treat escaped brackets as ordinary characters and, where an unescaped bracket is nested as in [foo [baz]], to take only the innermost [baz], like this:

 ["[abc[0123]]", "[efg[987]h]", "[olá [mundo]!]", "[baz]"]

And the second one should return this:

 ["{abc{0123}}", "{efg{987}h}", "{olá {mundo}!}", "{baz}"]

My initial goal is to learn, but I also intend to use this for things like a structure similar to CSS selectors. For example:

  input[name=\[0\]], input[name=foo\[baz\]\[bar\]]

It should return this:

  [0], [1]

Or a URL map that I want to create:

  /{nome}/{foo\{bar}/{baz\{foo\}}/

And it should return this:

 {nome}, {foo{bar}, {baz{foo}}

What I want is to ignore the escaped characters. How can I do this? You can provide an example in any language; the regex itself is what matters most.

    
asked by anonymous 09.05.2016 / 18:32

1 answer

6

You need to make the matched content consume the backslash and the character that follows it as a single unit:

\\.|.

That is, it matches a backslash followed by any character (two characters in total); only if the current character is not a backslash does it match a single character.
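A small sketch of that alternation on its own, runnable in any JavaScript engine. The sample string is made up for illustration. Because the backslash alternative is tried first, a backslash always takes the next character with it:

```javascript
// Split a string into units: either an escaped pair (backslash + next
// character) or a single ordinary character.
// In the string literal, \\ produces one real backslash.
const sample = 'a\\[b\\\\c';            // actual text: a\[b\\c
const tokens = sample.match(/\\.|./g);

// tokens: a | \[ | b | \\ | c
console.log(tokens.join(' | '));
```

Note that the second token is the escaped bracket as a whole, so a later part of the pattern never sees that bracket as a delimiter.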

For the last example (where you want only the innermost bracket), you can achieve this in this particular case (but not in general, since balanced parentheses / brackets / braces do not form a regular language) by requiring that the matched content contain no opening bracket unless it is escaped:

\\.|[^\[]

The complete regex would look like this:

\[((?:\\.|[^\[])*?)\]
\{((?:\\.|[^{])*?)\}

Example:

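// Doubled backslashes in the literal produce single backslashes in the string.
var str = '[abc\\[0123\\]] [efg\\[987\\]h] [olá \\[mundo\\]!] [foo [baz]]';
// Escaped pair, or any character other than an opening bracket, up to the closing bracket.
var regex = /\[((?:\\.|[^\[])*?)\]/g;

document.getElementById("saida").innerHTML += "<pre>" + str.match(regex) + "</pre><br/>";

// Same idea for braces.
var str = '{abc\\{0123\\}} {efg\\{987\\}h} {olá \\{mundo\\}!} {foo {baz}}';
var regex = /\{((?:\\.|[^{])*?)\}/g;

document.getElementById("saida").innerHTML += "<pre>" + str.match(regex) + "</pre><br/>";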
<div id="saida"></div>
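The same two patterns can be tried on the other inputs from the question (the URL map and the selector list) without a browser. This is only a sketch: the variable names are illustrative, and String.raw is used here so the backslashes stay exactly as typed instead of being doubled.

```javascript
// Patterns from the answer: an escaped pair, or any character that is not an
// opening delimiter, repeated lazily up to the closing delimiter.
const brackets = /\[((?:\\.|[^\[])*?)\]/g;
const braces = /\{((?:\\.|[^{])*?)\}/g;

// String.raw keeps every backslash in the template as a literal backslash.
const route = String.raw`/{nome}/{foo\{bar}/{baz\{foo\}}/`;
const selector = String.raw`input[name=\[0\]], input[name=foo\[baz\]\[bar\]]`;
const nested = '[a [b [c]]]';

const routeParts = route.match(braces);
// {nome} | {foo\{bar} | {baz\{foo\}}
const selectorParts = selector.match(brackets);
// [name=\[0\]] | [name=foo\[baz\]\[bar\]]
const nestedParts = nested.match(brackets);
// [c]   (only the innermost pair; outer levels are not matched)

console.log(routeParts.join(' | '));
console.log(selectorParts.join(' | '));
console.log(nestedParts.join(' | '));
```

The last input shows the limitation mentioned above: with several levels of unescaped nesting, only the innermost pair is returned.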

Notes:

  • In the example, I had to write two \ in the string literal: JavaScript treats a single backslash as the start of a string escape and drops it, so the string would otherwise contain no backslash at all.

  • The output includes the backslashes; if you want to remove them, you have to post-process the result of match, for example by applying replace to each matched string:

    str.match(regex).map(function (m) { return m.replace(/\\([\[\]{}])/g, "$1"); });
    
  • The ?: is there so that the inner parentheses do not become a capturing group. If you are not using groups, it can be omitted.
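The last two notes can be combined in one place. The sketch below uses a hypothetical helper, innerTexts, which reads the outer capturing group (the content between the delimiters) and strips the escaping backslashes. It assumes square brackets as delimiters.

```javascript
// Hypothetical helper: return the inner text of each bracketed item,
// with the escaping backslashes removed.
function innerTexts(text) {
  const regex = /\[((?:\\.|[^\[])*?)\]/g;
  const results = [];
  let m;
  // With the g flag, exec() advances through the string one match at a time.
  while ((m = regex.exec(text)) !== null) {
    // m[1] is the outer capturing group; the inner (?:...) captures nothing.
    results.push(m[1].replace(/\\([\[\]{}])/g, '$1'));
  }
  return results;
}

const items = innerTexts('[abc\\[0123\\]] [efg\\[987\\]h] [foo [baz]]');
// items: abc[0123] | efg[987]h | baz
console.log(items.join(' | '));
```

If the inner group were written without ?:, it would become group 2 and m[1] would still hold the full inner content, so omitting it here changes only the number of groups.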

answered 09.05.2016 / 20:29