Regex - Operator "." - Capture of meta character


Situation

I'm searching with a regex for a specific word, inválido, but by preference I decided to use the pattern inv.lido. I knew the word was in the test string, but nothing was returned.

Tests

In the code below, vr() is shorthand for var_dump() and pr() is shorthand for print_r().

$string = 'até, atenção, Hipótese, você, português, café, órgão';

vr(preg_match('~at.~', $string, $match));
pr($match);
vr(preg_match('~aten..o~', $string, $match));
pr($match);
vr(preg_match('~Hip.tese~', $string, $match));
pr($match);
vr(preg_match('~voc.~', $string, $match));
pr($match);
vr(preg_match('~portugu.s~', $string, $match));
pr($match);
vr(preg_match('~caf.~', $string, $match));
pr($match);
vr(preg_match('~.rg.o~', $string, $match));
pr($match);

Output

int(1)
Array([0] => at�)

int(0)
Array()

int(0)
Array()

int(1)
Array([0] => voc�)

int(0)
Array()

int(1)
Array([0] => caf�)

int(0)
Array()

Question

As you can see, the regex did not match most of the words. Even for the ones it did capture, I do not know what the last character is, because neither utf8_decode nor utf8_encode returns the correct character.

From the little I know of C and of binary, I suppose it has to do with these characters taking up more than 8 bits. However, they are present in the ASCII table, and as far as I know regex follows the ASCII table.

Why did this happen?

    
asked by anonymous 20.08.2015 / 15:29

2 answers


PHP regular expressions do not support Unicode by default, unless you use the u flag:

preg_match('~aten..o~', $string, $match);
print_r($match);
  

Array ( )

preg_match('/aten..o/u', $string, $match);
print_r($match);
  

Array ( [0] => atenção )

Example on ideone.
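As a rough sketch of the fix applied to the whole test from the question, the loop below runs each of the original patterns with the u modifier added. It assumes the script file is saved as UTF-8; the $patterns and $results names are only illustrative and do not come from the question.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literals below are UTF-8 strings.
$string = 'até, atenção, Hipótese, você, português, café, órgão';

// The question's patterns with the u modifier added,
// keyed by the word each one is expected to find.
$patterns = [
    'até'       => '~at.~u',
    'atenção'   => '~aten..o~u',
    'Hipótese'  => '~Hip.tese~u',
    'você'      => '~voc.~u',
    'português' => '~portugu.s~u',
    'café'      => '~caf.~u',
    'órgão'     => '~.rg.o~u',
];

$results = [];
foreach ($patterns as $expected => $pattern) {
    // With /u, "." consumes one whole UTF-8 character instead of one byte
    preg_match($pattern, $string, $match);
    $results[$expected] = $match[0] ?? null; // null if the pattern did not match
}

print_r($results);
```

With the modifier in place, every entry in $results should be equal to its own key.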

As for the results you are getting (e.g. at�), the reason is that accented characters are usually represented by more than one byte, for example in the UTF-8 encoding. Take the pattern:

at.

Without the u flag, it matches 3 bytes: the first is a, the second is t, and the third is the first byte of é. Since that byte on its own is not a valid ASCII character (nor a complete UTF-8 sequence), it cannot be displayed properly and shows up as �. Now take the pattern:

aten..o

When applied to the word atenção, it matches the first . with the first byte of ç and the second . with the second byte of ç. It then tries to match o against the first byte of ã, which it cannot, so the match fails.

With the u flag enabled, the engine matches whole characters (and not just bytes), so the first dot matches ç, the second dot matches ã, and the result is correct as expected.
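The byte-level behaviour described above can be checked directly. This is only a small sketch: it assumes PHP 7 or later (for the "\u{...}" escapes, used here so that the strings are UTF-8 regardless of how the file is saved) and the mbstring extension for mb_strlen. The variable names are illustrative.

```php
<?php
// "\u{00E9}" is é, "\u{00E7}" is ç, "\u{00E3}" is ã, each encoded as 2 bytes in UTF-8
$word = "at\u{00E9}";                  // "até"
$long = "aten\u{00E7}\u{00E3}o";       // "atenção"

$byteCount = strlen($word);              // counts bytes
$charCount = mb_strlen($word, 'UTF-8');  // counts characters
$hexOfE    = bin2hex("\u{00E9}");        // the two bytes that encode é

// Without /u the dot consumes a single byte, so the match ends in half a character
preg_match('~at.~', $word, $byteMatch);
// With /u the dot consumes the whole character
preg_match('~at.~u', $word, $charMatch);

// Two dots consume only the two bytes of ç, so "o" is compared with the first byte of ã
$withoutU = preg_match('~aten..o~', $long);
$withU    = preg_match('~aten..o~u', $long);

var_dump($byteCount, $charCount, $hexOfE);
var_dump(bin2hex($byteMatch[0]), bin2hex($charMatch[0]));
var_dump($withoutU, $withU);
```

The hex dump of $byteMatch[0] stops after the first byte of é, which is the incomplete sequence that gets displayed as �.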

    
20.08.2015 / 16:10

As already mentioned by @mgibsonbr, PHP's preg functions do not support Unicode in regular expressions by default.

In addition to the solution already presented, you can use the regular expression functions from the Multibyte String extension.

Example:

$str = 'inválido';

var_dump(mb_ereg_match('inv.lido', $str)); // bool(true)

Note:

According to this answer on SOEN, the mb_ereg_* functions are not marked as deprecated, so it is fine to use them.
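A slightly fuller sketch of the Multibyte String approach, applied to the test string from the question. It assumes the mbstring extension is loaded and the file is saved as UTF-8. The explicit mb_regex_encoding('UTF-8') call is an addition not present in the answer, included because the default regex encoding can differ between installations.

```php
<?php
// Assumption: mbstring is available and this file is saved as UTF-8.
mb_regex_encoding('UTF-8'); // make the mb_ereg_* functions read input as UTF-8

$string = 'até, atenção, Hipótese, você, português, café, órgão';

// mb_ereg_match() only tests whether the pattern matches at the start of the string
$startsWith = mb_ereg_match('at.', $string);

// mb_ereg() searches anywhere in the string and fills $regs with the matched text.
// Its return value is a bool in PHP 8 and a byte length (or false) in older versions.
$found = mb_ereg('aten..o', $string, $regs);

var_dump($startsWith);
var_dump($found !== false);
print_r($regs);
```

Note that the mb_ereg_* functions take the pattern without delimiters, unlike the preg functions.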

    
20.08.2015 / 17:16