Using REGEX in PHP to capture any number that is not within single quotes

4

I have been studying regex for some time and now I have a problem: capturing all numbers, including decimals, that are not within single quotation marks.

I'm creating a sort of viewer for PHP code to learn how to use regex better. I have the following regex working; it returns all the decimal numbers in a given string:

preg_match_all('/(\d+\.\d+)/', $text, $matches, PREG_SET_ORDER, 0);

What I would like to do is return not only decimals but all numbers that are not enclosed in single quotation marks. Any idea how I could do this? Thanks for any help, because I'm totally in the dark: I've tried different combinations of regex and none worked. I always test on regex101.com.

NOTE: I can already return the numbers that are INSIDE the quotes, just not the ones that are outside them:

preg_match_all('/(\'(\d+)\')/', $text, $matches, PREG_SET_ORDER, 0);
    
asked by anonymous 19.12.2018 / 17:09

2 answers

6

Although it is possible to write a regex (possibly quite complicated) involving lookaheads and lookbehinds, I find it easier to use a little "trick" with capture groups.

Basically, if you have a string like this:

$texto = "123 abc '456' def789'112' ghi";

As far as I understand, you only want to catch 123 and 789, since they are the numbers that are not enclosed in single quotation marks ('). So you could use an expression like this:

preg_match_all("/\'\d+\'|(\d+)/", $texto, $matches);

This regex uses alternation (|) to say you want one thing or the other. These "things" are:

  • a number in single quotation marks: '\d+' , or
  • a number (without the quotation marks) inside parentheses, to form a capture group: (\d+)

Note that the single quotes in the pattern are escaped with \ because the pattern is written inside a PHP string (in a double-quoted string, as here, the backslash is optional but harmless).

    With this, a match of the regex falls into one of two cases:

    • If the number is in single quotation marks, it matches the first alternative
    • otherwise, it matches the second alternative

    In the first case the capture group is not filled; in the second case it is.

    So, to get the numbers that are not enclosed in single quotation marks, just check whether the capture group was filled. To get the array in a format that makes this easier to check, we can use the PREG_SET_ORDER option:

    $texto = "123 abc '456' def789'112' ghi";
    preg_match_all("/\'\d+\'|(\d+)/", $texto, $matches, PREG_SET_ORDER, 0);
    var_dump($matches);
    

    This code produces the following output:

    array(4) {
      [0]=>
      array(2) {
        [0]=>
        string(3) "123"
        [1]=>
        string(3) "123"
      }
      [1]=>
      array(1) {
        [0]=>
        string(5) "'456'"
      }
      [2]=>
      array(2) {
        [0]=>
        string(3) "789"
        [1]=>
        string(3) "789"
      }
      [3]=>
      array(1) {
        [0]=>
        string(5) "'112'"
      }
    }
    

    Notice that for the matches that fall in the second case (the number is not in single quotation marks), the array has 2 elements. The first is the whole match and the second is the capture group (in this case they are the same, but depending on the expression they may differ).

    When the number is enclosed in quotation marks, its array has only one element, because in that case the capture group is not filled.

    Then just loop over the matches array and check which of the inner arrays have the capture group set (i.e. just check whether the size is greater than 1):

    foreach ($matches as $m) {
        if (count($m) > 1) { // capture group filled (the number is not in quotes)
            echo $m[1]. "\n";
        }
    }
    

    The output of this foreach is:

    123
    789
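
As a variation on the loop above, here is a minimal sketch that skips PREG_SET_ORDER. With the default PREG_PATTERN_ORDER, $matches[1] holds one entry per match, with an empty string wherever the capture group did not take part, so the list can be filtered directly. The variable name $outside is only illustrative.

```php
<?php
$texto = "123 abc '456' def789'112' ghi";

// Default flag (PREG_PATTERN_ORDER): $matches[1] lists group 1 for every match,
// with "" for the matches where the quoted alternative was taken.
preg_match_all("/'\d+'|(\d+)/", $texto, $matches);

// Drop the empty entries. An explicit comparison is used so that a literal "0"
// (which PHP treats as falsy) would not be discarded by accident.
$outside = array_values(array_filter($matches[1], function ($s) {
    return $s !== '';
}));

print_r($outside);
```

Whether this reads better than the foreach is a matter of taste; the regex is the same in both cases.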
    

    If you want numbers with decimal places, just change \d+ to \d+\.\d+ (inside the PHP string it is written the same way, since \d and \. are not string escape sequences) or to any other expression you are using to capture numbers.

    If the decimal places are optional, for example, you can use \d+(?:\.\d+)? . It's not the specific focus of the question, but validating numbers can become tricky, since it all depends on which cases you want to consider.
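
Putting the pieces together, a possible sketch of the same technique with optional decimals. The function name numbersOutsideQuotes and the sample string are assumptions made for illustration, not part of the question.

```php
<?php
// Hypothetical helper: returns integers and decimals not wrapped in single quotes.
function numbersOutsideQuotes(string $text): array
{
    $number  = '\d+(?:\.\d+)?';          // integer with an optional decimal part
    $pattern = "/'$number'|($number)/";  // quoted number, or captured bare number

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        if (isset($m[1])) {  // group 1 is only present for unquoted numbers
            $result[] = $m[1];
        }
    }
    return $result;
}

print_r(numbersOutsideQuotes("12.5 abc '45.6' def 7 '8' 9.25"));
```

Building the pattern from a single $number fragment keeps the quoted and unquoted alternatives in sync if the definition of a "number" changes later.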

    As reminded by @fernandosavio in

    19.12.2018 / 17:52
    1

    I've created a REGEX that I believe meets your needs:

    (?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)
    

    Test input:

    12,2 12.1021 14 '51' '1' '23323' 12
    

    The only rule for it to work is that the numbers are separated by spaces.
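
To illustrate that restriction, here is a small sketch that runs the same pattern on a string where a number is glued to a letter. The sample string is borrowed from the first answer and $glued is just an illustrative name.

```php
<?php
$pattern = '/(?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)/';

// 789 directly follows "def", so the lookbehind (whitespace or start of string)
// rejects it even though it is not inside quotes.
$glued = "123 abc '456' def789'112' ghi";
preg_match_all($pattern, $glued, $matches);

print_r($matches[0]);
```

Only numbers delimited by whitespace (or by the string boundaries) are returned, so 789 is not captured here.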

    Explanation by @GuilhermeNascimento:

    (?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)
      ^       ^            ^       ^
      .       .            .       ................ must be the end of the string or be followed by whitespace
      .       .            ............... matches plain integers
      .       .
      .       ................. matches numbers that contain (. or ,) followed by more digits
      .
      ............... positive lookbehind (preceded by whitespace) or the start of the string
    

    The numbers that will be captured are:

    12,2 
    12.1021 
    14
    12
    

    It would look like this:

    $string = "12,2 12.1021 14 '51' '1' '23323' 12";
    
    preg_match_all("/(?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)/", $string, $output_array);
    
    print_r($output_array);
    


        
    19.12.2018 / 18:09