Capturing strings in square brackets

3

I need to capture bracketed substrings within a string. I found a solution that does not fully solve my problem: \[(.*?)\]

I use it like this:

Matcher mat = Pattern.compile("\\[(.*?)\\]").matcher(stringlToVerify);

if (mat.find()) {
   // Do what I want
}

So, if I run the regex against: 'ol[a' + 'm]undo'

It captures: [a' + 'm]

But in that case it should not capture anything: the brackets sit inside two string literals that are being concatenated, so the match makes no sense.
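
To make the problem concrete, here is a minimal sketch that runs the pattern from the question against the concatenation example. The class name NaiveBracketDemo and the helper firstBracket are made up for illustration; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveBracketDemo {
    // The pattern from the question; backslashes are doubled inside a Java string literal.
    private static final Pattern NAIVE = Pattern.compile("\\[(.*?)\\]");

    // Hypothetical helper: returns the first bracketed match (brackets included), or null.
    static String firstBracket(String input) {
        Matcher mat = NAIVE.matcher(input);
        return mat.find() ? mat.group() : null;
    }

    public static void main(String[] args) {
        // The pattern knows nothing about quotes, so it pairs a "[" in one
        // string literal with a "]" in the next one.
        System.out.println(firstBracket("'ol[a' + 'm]undo'")); // prints [a' + 'm]
        System.out.println(firstBracket("1 + [aa]"));          // prints [aa]
    }
}
```

The lazy quantifier only limits how far the match extends; it does nothing to tell quoted text apart from unquoted text.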

Example of what I need:

Input                     Capture

1 + [aa]                  [aa]
[bb] + 2                  [bb]
'a' + [cc]                [cc]
['ola' + 'mundo']         ['ola' + 'mundo']
'[a' + 'b]'
'[' + ']'
[]                        []   (or nothing, that also works)
'Ola [world] legal'
Oi ['[aa]'] ola           '[aa]'

In the last case, it is fine if this cannot be done simply: I already wrote a method that removes all strings in single quotation marks.
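
The method mentioned above is not shown in the question, so the following is only a guess at what such an approach might look like: strip every single-quoted string first, then apply the simple pattern to what is left. The names StripQuotedDemo, stripQuoted and capture are hypothetical, and escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripQuotedDemo {
    // Assumption: a quoted string is a pair of single quotes with no quote in between.
    private static final Pattern QUOTED = Pattern.compile("'[^']*'");
    private static final Pattern BRACKETS = Pattern.compile("\\[(.*?)\\]");

    // Removes every single-quoted string from the input.
    static String stripQuoted(String input) {
        return QUOTED.matcher(input).replaceAll("");
    }

    // Applies the simple bracket pattern to the stripped text.
    static List<String> capture(String input) {
        List<String> found = new ArrayList<>();
        Matcher mat = BRACKETS.matcher(stripQuoted(input));
        while (mat.find()) {
            found.add(mat.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(capture("'[a' + 'b]'"));       // prints []
        System.out.println(capture("Oi ['[aa]'] ola"));   // prints [[]]
        System.out.println(capture("['ola' + 'mundo']")); // prints [[ + ]]
    }
}
```

The last call shows the cost of this shortcut: quoted text inside a legitimate bracket pair is removed as well, so the capture no longer matches the original input.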

    
asked by anonymous 12.01.2017 / 17:02

2 answers

2

Regular expression:

\G[^\[']*(?:'[^']*'[^\['*]*)*(\[[^]']*(?:'[^']*'[^]']*)*\])


Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketCapture {
    public static void main(String[] args) {
        final String regex = "\\G"                     // Start of the text or end of the previous match
                           + "[^\\[']*"                // Text without "[" or single quotes
                           + "(?:'[^']*'[^\\['*]*)*"   // Optional: quoted text + text without "[" or "'"
                           + "(\\["                    // Group 1: opening bracket
                           +     "[^]']*"              //        + text without "]" or "'"
                           +     "(?:'[^']*'[^]']*)*"  //        + quoted text + text without "]" or "'"
                           + "\\])";                   //        + closing bracket
        final Pattern pat = Pattern.compile(regex);
        Matcher mat;

        final String[] entrada = {
            "1 + [aa]",
            "[bb] + 2",
            "'a' + [cc]",
            "['ola' + 'mundo']",
            "'[a' + 'b]'",
            "'[' + ']'",
            "[]",
            "'Ola [world] legal'",
            "Oi ['[aa]'] ola"
        };

        // Loop over each input string
        for (String stringlToVerify : entrada) {
            mat = pat.matcher(stringlToVerify);
            System.out.println("\nInput: " + stringlToVerify);

            if (mat.find()) {
                do { // Loop over each bracketed text that matched
                    System.out.println("Capture: " + mat.group(1));
                } while (mat.find());
            } else {
                System.out.println("No brackets outside quotes");
            }
        }
    }
}

Result:

Input: 1 + [aa]
Capture: [aa]

Input: [bb] + 2
Capture: [bb]

Input: 'a' + [cc]
Capture: [cc]

Input: ['ola' + 'mundo']
Capture: ['ola' + 'mundo']

Input: '[a' + 'b]'
No brackets outside quotes

Input: '[' + ']'
No brackets outside quotes

Input: []
Capture: []

Input: 'Ola [world] legal'
No brackets outside quotes

Input: Oi ['[aa]'] ola
Capture: ['[aa]']

You can test here: link
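
The sample inputs each contain at most one bracket group outside quotes. As a small sketch of how repeated find() calls behave when there are several, the following collects every capture into a list. The class AllBracketsDemo, the method captureAll and the test string are illustrative assumptions; the regex is copied from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllBracketsDemo {
    // Same regex as in the answer, written as a single Java string literal.
    private static final Pattern PAT = Pattern.compile(
            "\\G[^\\[']*(?:'[^']*'[^\\['*]*)*(\\[[^]']*(?:'[^']*'[^]']*)*\\])");

    // Collects group 1 of every consecutive match.
    static List<String> captureAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher mat = PAT.matcher(input);
        while (mat.find()) {
            found.add(mat.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The quoted 'x[y]' is skipped; the two unquoted groups are captured.
        System.out.println(captureAll("[a] + 'x[y]' + [b]")); // prints [[a], [b]]
        System.out.println(captureAll("'[a' + 'b]'"));        // prints []
    }
}
```

Each find() resumes exactly where the previous match ended, which is what lets the second call skip over the quoted text before reaching [b].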


Description:

\G[^\[']*(?:'[^']*'[^\['*]*)*(\[[^]']*(?:'[^']*'[^]']*)*\])
  • \G - Anchor (a zero-width assertion) that matches at the beginning of the string or at the end of the previous match.

    This is the most important construct in this regex. It ensures that every match attempt begins only where the engine stopped on the previous match. Thus, a match cannot begin in the middle of the text, which avoids capturing, for example:

    '....   [a' + 'b]  .....'
            ^       ^
            |- Here-|
    
  • [^\[']* - Character class that matches all characters that are not an opening bracket or a single quote.

  • (?:'[^']*'[^\['*]*)* - A group repeated 0 or more times, matching:

    • '[^']*' - text in single quotes
    • [^\['*]* - followed by more characters that are not an opening bracket or a single quote (as written, this class also excludes a literal *).
      

    This construct uses a technique known as "unrolling the loop".

    With this, we can match all the characters of the string that come before the brackets.

  • (\[[^]']*(?:'[^']*'[^]']*)*\]) - Capturing group, which allows referencing the matched text (using Matcher#group(int)), containing:

    • \[ - opening bracket

    • [^]']* - more characters that are not a closing bracket or a single quote

    • (?:'[^']*'[^]']*)* - quoted text followed by more characters that are not a closing bracket or a single quote, repeated 0 or more times

    • \] - closing bracket.
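
To see why \G matters, the sketch below compiles the same pattern with and without the anchor and compares the first capture on one of the question's inputs. The names GAnchorDemo, WITH_G, WITHOUT_G and firstCapture are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GAnchorDemo {
    // The regex from the answer, minus the leading \G.
    private static final String BODY =
            "[^\\[']*(?:'[^']*'[^\\['*]*)*(\\[[^]']*(?:'[^']*'[^]']*)*\\])";

    static final Pattern WITH_G = Pattern.compile("\\G" + BODY);
    static final Pattern WITHOUT_G = Pattern.compile(BODY);

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstCapture(Pattern pat, String input) {
        Matcher mat = pat.matcher(input);
        return mat.find() ? mat.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "'[a' + 'b]'";
        // Anchored: the only allowed starting point is index 0, and no match exists from there.
        System.out.println(firstCapture(WITH_G, input));    // prints null
        // Unanchored: the engine retries from index 1, in the middle of a quoted string.
        System.out.println(firstCapture(WITHOUT_G, input)); // prints [a' + 'b]
    }
}
```

Without the anchor, a failed attempt at index 0 is simply retried one character later, which puts the engine inside the quoted text and defeats the quote-skipping logic.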

15.01.2017 / 10:20
1

The regex below captures letters, digits, or "_" enclosed in brackets. If you need a more restrictive version, just change \w+ to [a-z]+, for example.

\[(\w+)\]

I have an example that you can check: link
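
As a rough sketch of how this shorter pattern behaves on some of the question's inputs (the class WordBracketDemo, the method captureWords and the first test string are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBracketDemo {
    // \w+ requires at least one letter, digit or underscore between the brackets.
    private static final Pattern WORD = Pattern.compile("\\[(\\w+)\\]");

    // Collects the text between the brackets (group 1) for every match.
    static List<String> captureWords(String input) {
        List<String> found = new ArrayList<>();
        Matcher mat = WORD.matcher(input);
        while (mat.find()) {
            found.add(mat.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(captureWords("1 + [aa] + [b_2]"));    // prints [aa, b_2]
        System.out.println(captureWords("['ola' + 'mundo']"));   // prints []
        System.out.println(captureWords("'Ola [world] legal'")); // prints [world]
    }
}
```

The pattern rejects the concatenation cases because quotes and spaces are not word characters, but it does not track quoting, so a bracketed word inside a quoted string is still captured, and bracket contents that include quotes are not.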

    
12.01.2017 / 17:30