Why should we use functions that start with mb_?

11

Sometimes problems arise in PHP with certain string functions because of the string's encoding.

An example is strlen.

$a = strlen('str');

$b = strlen('stré');

var_dump($a, $b); // Prints int(3) and int(5)

See in IDEONE

As we can see, $b is reported as having 5 characters, not 4.

I know from experience that the fix is to use mb_strlen, one of PHP's multibyte functions.

Example:

var_dump(mb_strlen('stré', 'utf-8')); // Prints int(4)
  • What exactly does this multibyte mean?

  • Since UTF-8 is very common here in Brazil, should we always use the mb_ functions instead of the regular string functions?

asked by anonymous 05.08.2015 / 17:06

1 answer

8

The PHP functions whose names start with "mb_" belong to the MBString extension.

MB stands for "Multibyte", i.e. functions for manipulating multibyte strings.

Encodings such as UTF-8 are multibyte. See the official documentation for the list of supported encodings: link
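To make "multibyte" concrete, the sketch below dumps the raw bytes of the string from the question. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$word = 'stré';

// bin2hex() shows the raw bytes: "é" is stored as the two bytes c3 a9.
$hex = bin2hex($word);
echo $hex . PHP_EOL; // 737472c3a9

// strlen() counts bytes; mb_strlen() counts characters.
$byteCount = strlen($word);
$charCount = mb_strlen($word, 'UTF-8');
echo $byteCount . ' bytes, ' . $charCount . ' characters' . PHP_EOL;
```

The extra byte in "é" is exactly the difference between the two counts.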

Practical example

<?php
date_default_timezone_set('Asia/Tokyo');

ini_set('error_reporting', E_ALL);
error_reporting(E_ALL);
ini_set('log_errors',TRUE);
ini_set('html_errors',FALSE);
ini_set('display_errors',TRUE);

define( 'CHARSET',   'UTF-8' );

ini_set( 'default_charset', CHARSET );

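// These mbstring ini settings are deprecated as of PHP 5.6.
// version_compare() is used because PHP_VERSION is a string such as "5.10.1".
if( version_compare( PHP_VERSION, '5.6.0', '<' ) ){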
    ini_set( 'mbstring.http_output', CHARSET );
    ini_set( 'mbstring.internal_encoding', CHARSET );
}

header( 'Content-Type: text/html; charset=' . CHARSET );

/*
Returns 6
The "heart" character occupies 3 bytes.
If you want to count bytes, strlen() is the right choice.
*/
echo strlen('I♥NY') . PHP_EOL . '<br />';

/*
Returns 4
If you want to count characters, use the equivalent MBString function.
*/
echo mb_strlen('I♥NY');


/*
Note that even Latin characters can be multibyte
(here "ç" and "ã" take 2 bytes each in UTF-8)
*/
echo strlen('ação') . PHP_EOL . '<br />';
echo mb_strlen('ação');
?>
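The script above relies on ini settings to tell MBString which encoding to use. As a hedged alternative, the encoding can be passed explicitly as the second argument, as the question does. The sketch assumes a UTF-8 source file; the ISO-8859-1 call is there only to illustrate that the result depends on the declared encoding.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$text = 'I♥NY';

// Declared as UTF-8: the three bytes of the heart count as one character.
$asUtf8 = mb_strlen($text, 'UTF-8');

// Declared as a single-byte encoding: every byte is treated as a character,
// so the result matches strlen().
$asLatin1 = mb_strlen($text, 'ISO-8859-1');

echo $asUtf8 . PHP_EOL;   // 4
echo $asLatin1 . PHP_EOL; // 6
```

Passing the encoding on every call makes the code independent of server configuration, at the cost of some repetition.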

A less common term for multibyte encodings is "variable-width encoding".

link
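"Variable-width" means that the number of bytes per character is not fixed. The following sketch, which assumes a UTF-8 source file and uses an arbitrary sample string, prints how many bytes each character occupies.

```php
<?php
// Assumes this file is saved as UTF-8. preg_split() with the /u modifier
// splits on character boundaries rather than byte boundaries.
$sample = 'aé♥😀';
$chars  = preg_split('//u', $sample, -1, PREG_SPLIT_NO_EMPTY);

$widths = [];
foreach ($chars as $char) {
    // strlen() reports how many bytes this single character occupies.
    $widths[] = strlen($char);
    echo $char . ' => ' . strlen($char) . ' byte(s)' . PHP_EOL;
}
// $widths now holds 1, 2, 3 and 4
```

In UTF-8 a character takes between 1 and 4 bytes, which is why byte-oriented functions cannot be used to count characters.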

Additional note

The mbstring functions are not always necessary, for example when a string is known to contain no multibyte characters.

Example:

echo strlen('123') . PHP_EOL . '<br />';
echo mb_strlen('123');

As the example shows, it is unnecessary in this case. However, we can dig deeper with another numeric example.

echo strlen('１２３') . PHP_EOL . '<br />';
echo mb_strlen('１２３');

In this example the characters are digits, yet they are multibyte: these are full-width digits, which take 3 bytes each in UTF-8.
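One way to know whether a string contains no multibyte characters is to test it before choosing a function. This is a minimal sketch, assuming UTF-8 input; `isAsciiOnly` is a hypothetical helper, not part of PHP or MBString.

```php
<?php
// Hypothetical helper: true when every byte is in the 7-bit ASCII range,
// in which case strlen() and mb_strlen() give the same answer.
function isAsciiOnly($text)
{
    return preg_match('/\A[\x00-\x7F]*\z/', $text) === 1;
}

// Assumes this file is saved as UTF-8.
$halfWidth = '123';    // ASCII digits, 1 byte each
$fullWidth = '１２３'; // full-width digits, 3 bytes each in UTF-8

var_dump(isAsciiOnly($halfWidth)); // bool(true)
var_dump(isAsciiOnly($fullWidth)); // bool(false)
```

When the check fails, or when the input comes from users and cannot be checked in advance, the mb_ functions are the safer default.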

Many well-built systems "think" they are internationalized, but the vast majority are never tested against the real world, as if "global" simply meant the Americas and Europe.

More than 60% of the planet (Arabs, Greeks, Russians, Indians, Asians) uses multibyte characters, and each language has its own peculiarities, such as the multibyte digits from the Japanese character table in the example above.

Therefore, it is recommended to use the MBString functions if you want to build a system that offers the greatest possible compatibility with the various existing encodings.
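The same byte-versus-character difference affects more than strlen. As an illustration, and assuming a UTF-8 source file with the mbstring extension loaded, the sketch below compares substr with mb_substr and shows mb_strtoupper.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$word = 'ação';

// substr() cuts after 2 bytes, which splits the 2-byte "ç" in half
// and leaves an invalid UTF-8 sequence.
$byteCut = substr($word, 0, 2);

// mb_substr() cuts after 2 characters.
$charCut = mb_substr($word, 0, 2, 'UTF-8');

// mb_strtoupper() also handles the accented letters.
$upper = mb_strtoupper($word, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
echo $charCut . PHP_EOL;                        // aç
echo $upper . PHP_EOL;                          // AÇÃO
```

A string truncated in the middle of a character can later show up as a replacement symbol in the browser or be rejected by a database, so this is a practical concern and not only a counting issue.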

Another important note: UTF-8 is not the encoding used for every language, and the MBString functions are not limited to UTF-8.

For example, Chinese text is commonly handled with the Big5 encoding. UTF-16 and UTF-32 are also used.

However, UTF-8 is also used fairly safely even for Chinese text, since the Chinese themselves "rarely" use all of the ideograms; there are more than 60 thousand of them.
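To illustrate that the MBString functions work with encodings other than UTF-8, the sketch below converts the same two Chinese characters to Big5, UTF-16 and UTF-32 and compares the sizes. It assumes a UTF-8 source file and an mbstring build that includes the encodings named in the code; the sample text is arbitrary.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$utf8 = '中文';

// Convert the same two characters to other encodings MBString supports.
$big5  = mb_convert_encoding($utf8, 'BIG-5', 'UTF-8');
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
$utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');

// Byte counts differ per encoding...
echo strlen($utf8) . PHP_EOL;  // 6
echo strlen($big5) . PHP_EOL;  // 4
echo strlen($utf16) . PHP_EOL; // 4
echo strlen($utf32) . PHP_EOL; // 8

// ...but the character count stays the same when the encoding is declared.
$big5Chars  = mb_strlen($big5, 'BIG-5');
$utf16Chars = mb_strlen($utf16, 'UTF-16BE');
echo $big5Chars . PHP_EOL;     // 2
echo $utf16Chars . PHP_EOL;    // 2
```

The encoding argument is what lets the same function count characters correctly in each representation.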

    
07.08.2015 / 19:31