I understand how "mb_strlen" works, but I did not understand an example:
<?php mb_strlen($string, '8bit'); ?>
What would this "8-bit" be?
8bit is one of the character encodings supported by the Multibyte String functions (mb_[function]).
This parameter tells the Multibyte functions which encoding to assume when interpreting the bytes of the string.
For example, if you run the code below you will get the following outputs:
<?php
$string = 'ὼ'; // Any special character
echo strlen($string); // Output: 3
echo mb_strlen($string, '8bit'); // Output: 3
echo mb_strlen($string, 'UTF-8'); // Output: 1 - CORRECT!
In conclusion, strlen() works fine for characters from the ASCII table, and the 8bit encoding returns a count that is incorrect relative to UTF-8. The UTF-8 (Unicode) standard is the most efficient and is the one recommended by W3.org.
To find out the default encoding set in your project, you can run:
<?php
echo mb_internal_encoding(); // Here it returned: UTF-8
Or, to set the internal encoding to UTF-8:
<?php
mb_internal_encoding('UTF-8');
Here you can see the list of supported encodings.
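The list can also be inspected from code. This is only a minimal sketch, assuming the mbstring extension is loaded; it uses mb_list_encodings() to check that both encodings discussed here are available:

```php
<?php
// Assumes the mbstring extension is loaded.
$encodings = mb_list_encodings();

// Strict comparison, since encoding names are plain strings.
var_dump(in_array('8bit', $encodings, true));
var_dump(in_array('UTF-8', $encodings, true));
```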
The second parameter is the character encoding that you are using. Most likely you'll want this parameter set to UTF-8.
If you'd like to understand the function better, I suggest you take a look at the mb_strlen entry in the PHP manual.
Summary: strlen is not reliable, but using mb_strlen(..., '8bit') is not always possible.
The question is interesting because 8bit is not very common, as stated in the other answers. But I think the answer from @Paul Imon is misleading in several cases. There is nothing wrong with mb_strlen('ὼ', '8bit') returning 3. It simply ignores the encoding used, and that result is correct for 8bit.
Imagine, for example, that you have the following two bytes:
0xDF 0xBF
11011111 10111111
These are just two arbitrary bytes, which may or may not have been generated uniformly at random. If you are interested in bytes, the encoding matters little. UTF-8 has a kind of "signaling" for the following bytes: the first byte indicates how many bytes the sequence has, so the whole sequence can be treated as a single character.
In UTF-8, a single-byte character is always ASCII (0xxxxxxx). A two-byte sequence starts with (110xxxxx), and every byte that is not the first one must be (10xxxxxx).
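The signaling described above can be sketched in a few lines. The function name utf8SequenceLength is hypothetical, and the check looks only at the bit pattern of the first byte; it does not validate the continuation bytes or reject overlong forms:

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with $byte, or 0 if $byte cannot start a sequence.
function utf8SequenceLength(int $byte): int
{
    if (($byte & 0x80) === 0x00) { return 1; } // 0xxxxxxx: ASCII
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx (continuation byte) or 11111xxx (never a lead byte)
}

echo utf8SequenceLength(0xDF), "\n"; // 2: 11011111 announces a two-byte sequence
echo utf8SequenceLength(0xBF), "\n"; // 0: 10111111 is a continuation byte
```

Applied to the bytes from the example, 0xDF announces two bytes and 0xBF is the continuation byte that completes the sequence.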
These bytes decode to the code point U+07FF, which had NO assigned character at the time of writing. Try:
echo "\xDF\xBF"; //= ߿
But its encoding indicates a two-byte sequence, so run:
echo mb_strlen("\xDF\xBF", 'UTF-8'); //= 1
It returns 1, even though the character does not even exist. However, the same bytes do form a character in UTF-16LE, where they represent 뿟:
echo iconv('UTF-16LE', 'UTF-8', "\xDF\xBF"); //= 뿟
However, using mb_strlen("\xDF\xBF", '8bit') will result in 2; after all, there are 2 bytes. I believe "wrong" is not the word that best describes it, because all of these results are correct, depending on where you apply them, of course.
The 8bit encoding treats each byte individually, regardless of encoding. It counts each byte as one unit, in the simplest possible way, and it can even handle values outside of ASCII, such as 0xFF.
mb_strlen(..., '8bit') should be used to prevent problems with the mbstring.func_overload setting, which has only recently been deprecated (as of PHP 7.2; it was removed in PHP 8.0). This problem does not apply if you do not have the Multibyte String extension installed.
This is where the answer from @Paul Imon is wrong again. A native language feature, set in php.ini, changes the behavior of strlen entirely:
mbstring.func_overload = 2
Test:
echo strlen("\xDF\xBF"); //= 1
See, the behavior of strlen() is no longer the same as mb_strlen(..., '8bit') if you use mbstring.func_overload.
In summary, if you want to deal with bytes:
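$texto = "\xDF\xBF";
// Use mb_strlen only when mbstring is overloading the string functions
if (extension_loaded('mbstring') && defined('MB_OVERLOAD_STRING') && ((int) ini_get('mbstring.func_overload') & MB_OVERLOAD_STRING)) {
    echo mb_strlen($texto, '8bit'); // always counts bytes
} else {
    echo strlen($texto); // not overloaded, so it counts bytes
}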
This will use strlen by default, but if the overload is being used, then we use mb_strlen(..., '8bit') to ensure that we will not use the modified strlen. Remember that not everyone has mbstring installed, so using mb_strlen by default is not always possible. If you are sure that mbstring is installed, you can use only mb_strlen(..., '8bit'). ;)
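If the check is needed in more than one place, it can be wrapped in a function. This is only a sketch of the same logic; the name byteLength is hypothetical and not part of PHP or mbstring:

```php
<?php
// Hypothetical helper: returns the number of bytes in $text,
// whether or not mbstring is overloading strlen().
function byteLength(string $text): int
{
    // MB_OVERLOAD_STRING may be undefined (no mbstring, or PHP 8+),
    // so defined() is checked before the constant is read.
    $overloaded = extension_loaded('mbstring')
        && defined('MB_OVERLOAD_STRING')
        && ((int) ini_get('mbstring.func_overload') & MB_OVERLOAD_STRING);

    return $overloaded ? mb_strlen($text, '8bit') : strlen($text);
}

echo byteLength("\xDF\xBF"); // 2
```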