Accent characters are considered as two characters [duplicate]

Question

1

Characters with an accent are counted as two (I assume that is the number of bytes). How can I fix this?

$t = "á";
if(strlen($t) == 1){
    echo "ONE CHARACTER";
}
if(strlen($t) == 2){
    echo "TWO CHARACTERS";
}
if(strlen($t) == 3){
    echo "THREE CHARACTERS";
}

Another problem I'm facing is that $string{0} cannot handle accented characters.

$text = "á25";

echo $text{0}."<br>"; // returns �
echo $text{1}."<br>"; // returns �
echo $text{2}."<br>"; // returns 2
echo $text{3}."<br>"; // returns 5

And when I set the encoding to ISO-8859-1, I get:

$text = "á25";

echo $text{0}."<br>"; // returns Ã
echo $text{1}."<br>"; // returns ¡
echo $text{2}."<br>"; // returns 2
echo $text{3}."<br>"; // returns 5

php

asked by anonymous 30.07.2017 / 01:10

3 answers

1

You're correct, strlen() returns the number of bytes. To get the number of characters, use mb_strlen() or iconv_strlen():

$t = "à";
print strlen($t); // 2
print mb_strlen($t); // 1
print iconv_strlen($t); // 1

30.07.2017 / 01:56

0

The strlen() function works fine for single-byte encodings such as ISO-8859-1, where every character, accented or not, occupies exactly one byte. strlen() does not count characters; it counts bytes.

When the text contains accented characters in a multibyte encoding such as UTF-8, use mb_strlen(), which returns the number of characters.

The mb_strlen() function accepts an optional second parameter, the encoding.

$t = "á";
$tam = mb_strlen($t, 'UTF-8'); // character count, not byte count
echo $tam; // result: 1

Accented letters can be a problem in other contexts too; see the article "use accents in an SMS message".

30.07.2017 / 02:47

score 4 · Accepted Answer

As you noticed yourself, it depends on the encoding. UTF-8, which is the most common, uses from 1 byte (7 usable bits) to 4 bytes (21 usable bits) per character. ASCII uses only 7 bits, i.e. the most significant bit is always zero (0xxxxxxx) to fill out the byte.
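To make the 1-to-4-byte range concrete, here is a small sketch that prints the byte length and hex bytes of one sample character per length. The sample characters are my own choice, not from the question, and the "\u{...}" escapes assume PHP 7 or later; they always yield UTF-8 bytes no matter how the script file is saved.

```php
<?php
// One sample code point for each possible UTF-8 length (illustrative choices).
$samples = [
    'U+0061'  => "a",          // ASCII letter
    'U+00E1'  => "\u{E1}",     // a with acute accent
    'U+20AC'  => "\u{20AC}",   // euro sign
    'U+1F600' => "\u{1F600}",  // an emoji outside the Basic Multilingual Plane
];

$byteCounts = [];
foreach ($samples as $codePoint => $char) {
    // strlen() counts bytes, which is exactly what we want to inspect here.
    $byteCounts[$codePoint] = strlen($char);
    echo $codePoint, ': ', $byteCounts[$codePoint], ' byte(s), hex ', bin2hex($char), "\n";
}
```

The accented letter lands in the 2-byte group, which is why strlen() reports 2 in the question.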

Accented characters lie outside ASCII; they do not exist in it. That is why other encodings exist to support them. UTF-8 uses more than one byte for these characters, while ISO-8859-1, also known as Latin-1, still uses a single byte but makes use of all 8 bits.
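The difference between the two encodings can be checked directly. The sketch below assumes the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escape). It also reproduces the "Ã¡" output from the question, which is what appears when the two UTF-8 bytes of á are read as if each were an ISO-8859-1 character.

```php
<?php
$utf8 = "\u{E1}"; // á encoded as UTF-8: bytes C3 A1

// The same character converted to ISO-8859-1 is the single byte E1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo strlen($utf8), "\n";   // 2
echo strlen($latin1), "\n"; // 1

// Misreading: treat the UTF-8 bytes as ISO-8859-1 and convert them to UTF-8.
// Each original byte becomes its own character (U+00C3 and U+00A1).
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo mb_strlen($misread, 'UTF-8'), "\n"; // 2
```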

When you write á, you have to know which encoding it is in. In most cases that will be UTF-8, where á takes 2 bytes.

One solution is to use:

mb_strlen('á', 'UTF-8');
// = 1

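It's important to pass the second parameter explicitly. Without it the result depends on the configured internal encoding, and on older PHP versions the behavior could also be changed by mbstring.func_overload (deprecated in PHP 7.2 and removed in PHP 8.0).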
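A quick way to see why the encoding argument matters is to count the same bytes under two encodings. This is only a sketch and assumes the mbstring extension and PHP 7 or later.

```php
<?php
$t = "\u{E1}"; // á as UTF-8 (two bytes)

// Interpreted as UTF-8, the two bytes form one character.
$asUtf8 = mb_strlen($t, 'UTF-8');

// Interpreted as ISO-8859-1, every byte is its own character.
$asLatin1 = mb_strlen($t, 'ISO-8859-1');

echo $asUtf8, "\n";   // 1
echo $asLatin1, "\n"; // 2

// Without the second argument, mb_strlen() falls back to this setting:
echo mb_internal_encoding(), "\n";
```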

If you want to extract a substring, you can use:

mb_substr('á25', 0, 1, 'UTF-8');
// = á
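The same function can stand in for the $text{0} indexing from the question, which works on bytes (and the curly-brace form was removed in PHP 8.0). Below is a minimal sketch; charAt() is a hypothetical helper name of my own, not a built-in, and mbstring is assumed to be available.

```php
<?php
// Hypothetical helper: returns the character (not the byte) at a given index.
function charAt(string $text, int $index): string
{
    return mb_substr($text, $index, 1, 'UTF-8');
}

$text = "\u{E1}25"; // "á25" as UTF-8
$length = mb_strlen($text, 'UTF-8');

// Walk the string character by character instead of byte by byte.
for ($i = 0; $i < $length; $i++) {
    echo charAt($text, $i), "<br>";
}
```

Calling mb_substr() in a loop is simple but rescans the string each time, so for long strings splitting once into an array is likely the better choice.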

If you want to create an array with multi-byte values:

preg_split('//u', 'á25', -1, PREG_SPLIT_NO_EMPTY); // -1 = no limit; passing null here is deprecated since PHP 8.1
// = array(3) { [0]=> string(2) "á" [1]=> string(1) "2" [2]=> string(1) "5" }
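On PHP 7.4 or later there is also mb_str_split(), which does the same job without a regular expression. This is an alternative I am adding, not part of the original answer, and it assumes the mbstring extension is loaded.

```php
<?php
$text = "\u{E1}25"; // "á25" as UTF-8

// Split into chunks of one character each, interpreting the bytes as UTF-8.
$chars = mb_str_split($text, 1, 'UTF-8');

echo count($chars), "\n"; // 3

// Each element is a whole character, so numeric indexing is safe.
echo $chars[0], "<br>";
echo $chars[1], "<br>";
echo $chars[2], "<br>";
```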