Different regex strategies to get the same result


I have the following input:

Detalhamento de Serviços nº: 999-99999-9999

I need to capture the number in a group. For this I would use:

Detalhamento de Serviços nº: (\d+-\d+-\d+)

But I cannot rely on the string nº: always being present (note: I mean the literal text nº: itself, not the number that follows it). So I would have 2 options:

 1. Detalhamento de Serviços.+(\d+-\d+-\d+) 
 2. Detalhamento de Serviços[\D]+(\d+-\d+-\d+)

Both regexes seem to return the same result. My question is:

What is the difference between using the "any character" class and the "non-digit" class in this case? What is the best practice and why? Which one performs better and why?

    
asked by anonymous 18.05.2018 / 20:05

1 answer


Actually, the two regexes you indicated do not return the same result. I tested them on JDK 1.7.0_80, and you can also see them working (differently) here and here.

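I created a very simple method to test a regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prints the first capturing group if the regex matches anywhere in the input.
public void testregex(String input, String regex) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}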

Then I tested the same input using the two regexes (note that inside a Java string literal the backslash must be escaped, so \ is written as \\ ):

String input = "Detalhamento de Serviços nº: 999-99999-9999";
testregex(input, "Detalhamento de Serviços.+(\\d+-\\d+-\\d+)");
testregex(input, "Detalhamento de Serviços\\D+(\\d+-\\d+-\\d+)");

The result was:

9-99999-9999
999-99999-9999

This happens because the quantifiers + and * are "greedy": they try to consume as many characters as possible. In the first case, .+ also swallows the first two 9 digits, because the remainder of the string ( 9-99999-9999 ) still satisfies the last part of the regex ( \d+-\d+-\d+ ).

In the second case, the first two 9 digits are not consumed, because \D never matches a digit.
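
To make the difference visible, here is a minimal sketch that wraps the separator part of each regex in its own capturing group, so you can see exactly what it consumed. The class name GreedyPrefixDemo and the helper split are made up for this illustration and are not part of the test above. The non-ASCII letters are written as Unicode escapes only so that the file compiles under any source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPrefixDemo {
    // Hypothetical helper: returns {text consumed by the separator, captured number},
    // or null if the regex does not match at all.
    static String[] split(String input, String separator) {
        // "Servi\u00e7os" is the word from the question, with the c-cedilla escaped.
        Pattern p = Pattern.compile(
                "Detalhamento de Servi\u00e7os(" + separator + ")(\\d+-\\d+-\\d+)");
        Matcher m = p.matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String input = "Detalhamento de Servi\u00e7os n\u00ba: 999-99999-9999";

        // Greedy dot: the separator group ends with "99", the number is 9-99999-9999.
        String[] greedy = split(input, ".+");
        System.out.println("[" + greedy[0] + "] [" + greedy[1] + "]");

        // Non-digit class: the separator stops before the first digit,
        // so the number is 999-99999-9999.
        String[] nonDigit = split(input, "\\D+");
        System.out.println("[" + nonDigit[0] + "] [" + nonDigit[1] + "]");
    }
}
```

With .+ the engine first runs to the end of the input and then backtracks one character at a time, stopping at the first position where the number pattern fits, which is why only a single leading 9 is left for the group.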

So, some possible solutions are:

  • Use \D : this guarantees that, even though the quantifier is greedy, it will not consume a digit by mistake
  • Put ? right after the + quantifier, which turns off the "greedy" behavior (the quantifier becomes lazy). The regex looks like this: Detalhamento de Serviços.+?(\d+-\d+-\d+) - note the use of .+? to remove the "greed"
  • Fix the number of digits, using {} . For example, if the digit counts are always "3-5-4", you can use Detalhamento de Serviços.+?(\d{3}-\d{5}-\d{4}) . If the number of digits varies, use the {min,max} syntax. For example, if there is a minimum of 2 digits and a maximum of 3, use {2,3} (and combine it with the lazy quantifier, or with \D , to be safe). Adapt according to your needs.
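
The options above can be compared side by side with a small sketch like the following. The class name RegexFixesDemo, the helper firstGroup and the constant names are assumptions made for this example. The last pattern is included only to show why {min,max} should be paired with a lazy quantifier or \D . Non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFixesDemo {
    static final String PREFIX = "Detalhamento de Servi\u00e7os";
    static final String INPUT = PREFIX + " n\u00ba: 999-99999-9999";

    // The three suggested fixes.
    static final String NON_DIGIT = PREFIX + "\\D+(\\d+-\\d+-\\d+)";
    static final String LAZY = PREFIX + ".+?(\\d+-\\d+-\\d+)";
    static final String FIXED_COUNTS = PREFIX + ".+?(\\d{3}-\\d{5}-\\d{4})";

    // Counter-example: a {2,3} range behind a greedy dot can still lose a digit.
    static final String GREEDY_RANGE = PREFIX + ".+(\\d{2,3}-\\d{5}-\\d{4})";

    // Hypothetical helper: group 1 of the first match, or null when nothing matches.
    static String firstGroup(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup(INPUT, NON_DIGIT));    // 999-99999-9999
        System.out.println(firstGroup(INPUT, LAZY));         // 999-99999-9999
        System.out.println(firstGroup(INPUT, FIXED_COUNTS)); // 999-99999-9999
        System.out.println(firstGroup(INPUT, GREEDY_RANGE)); // 99-99999-9999
    }
}
```

In the last case the greedy .+ takes one of the three leading digits, because two digits are still enough to satisfy {2,3} .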
18.05.2018 / 21:35