str_split does not work well with strings containing UTF-8 characters?

Question


8

I want to iterate over a string with foreach. To do that, I learned that I should use the str_split function, which splits the string into an array of individual characters. However, it does not work as expected with strings that contain accented characters (multibyte UTF-8 characters), for example.

Example:

str_split('coração da programação');

The result for this is:

Array
(
    [0] => c
    [1] => o
    [2] => r
    [3] => a
    [4] => �
    [5] => �
    [6] => �
    [7] => �
    [8] => o
    [9] =>  
    [10] => d
    [11] => a
    [12] =>  
    [13] => p
    [14] => r
    [15] => o
    [16] => g
    [17] => r
    [18] => a
    [19] => m
    [20] => a
    [21] => �
    [22] => �
    [23] => �
    [24] => �
    [25] => o
)

How do I split a string the same way str_split does, but keeping the UTF-8 characters intact?
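
For context, the mangled entries appear because str_split() works on bytes, while ç and ã each take two bytes in UTF-8. A minimal sketch of that mismatch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded:

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$char = 'ç';

// strlen() counts bytes; mb_strlen() counts characters.
$byteCount = strlen($char);
$charCount = mb_strlen($char, 'UTF-8');

// str_split() cuts on byte boundaries, so the character is torn in two.
$fragments = str_split($char);

// Hex dump of the raw bytes that make up the character.
$hex = bin2hex($char);

echo $byteCount, ' byte(s), ', $charCount, ' character(s), hex: ', $hex, PHP_EOL;
```

Each fragment on its own is not a valid UTF-8 sequence, which is why it is shown as a replacement symbol.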

php string

asked by anonymous 01.02.2016 / 19:50

5 answers

9

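As already mentioned, most of PHP's standard string functions do not support multibyte strings; for those cases there are the multibyte string (mbstring) functions. More specifically for your question, mb_split may help, although it splits on a regular-expression delimiter; since PHP 7.4, mb_str_split splits a string into characters directly.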
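
As a rough illustration of the mbstring route, here is a sketch that assumes PHP 7.4 or later (where mb_str_split() is available) and an enabled mbstring extension. It is one possible way to get a foreach-friendly array, not the only one:

```php
<?php
// Assumption: PHP 7.4+ with the mbstring extension enabled.
$str = 'coração da programação';

// mb_str_split() is the multibyte counterpart of str_split():
// the second argument is the chunk length in characters, the third the encoding.
$chars = mb_str_split($str, 1, 'UTF-8');

foreach ($chars as $char) {
    echo $char, PHP_EOL;
}
```

On older PHP versions this function does not exist, so one of the regex-based approaches in the other answers would be needed instead.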

01.02.2016 / 20:09

2

PHP's standard string functions are not Unicode-aware, but you can handle such strings with a regular expression.

preg_split('//u', 'coração da programação');

u is the modifier that makes the pattern and the subject string be treated as UTF-8.
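
To tie this back to the foreach in the question, here is a small sketch. It assumes the input is valid UTF-8; the variable names are only illustrative. Note that the empty pattern also matches at the very start and end of the string, so the PREG_SPLIT_NO_EMPTY flag is used to drop the two empty pieces:

```php
<?php
$str = 'coração da programação';

// Without flags, the result starts and ends with an empty string.
$raw = preg_split('//u', $str);

// PREG_SPLIT_NO_EMPTY removes those empty pieces.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $char) {
    echo $char, PHP_EOL;
}
```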

01.02.2016 / 20:03

1

You can do this by using the preg_split() function.

A regular expression that provides greater compatibility is /(?<!^)(?!$)/u:

   $str = 'coração da programação';
   preg_split("/(?<!^)(?!$)/u", $str);

Below I show where the approaches in the other answers are flawed or unreliable.

Testing the regular expressions proposed in the other answers with the string 日本語:

   $str = '日本語';
   /*
   This is the regular expression that gives the most reliable result
   */
   print_r(preg_split("/(?<!^)(?!$)/u", $str));
   /**
   Returns:

   Array
   (
       [0] => 日
       [1] => 本
       [2] => 語
   )
   */

   /*
   This expression appears in one of the answers (currently marked as accepted), where it is used with preg_match_all
   */
   print_r(preg_split("/./u", $str));
   /*
   With preg_split, every character is treated as a delimiter, so only empty strings are returned

   Array
   (
       [0] => 
       [1] => 
       [2] => 
       [3] => 
   )
   */

   print_r(preg_split("//u", $str));
   /*
   This one does separate the characters, but it returns empty entries at the beginning and at the end.

   Array
   (
       [0] => 
       [1] => 日
       [2] => 本
       [3] => 語
       [4] => 
   )

   If you want to use the expression "//u" without the empty entries, you need to pass some extra parameters:
   */
   print_r(preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY));
   /**
   Returns:

   Array
   (
       [0] => 日
       [1] => 本
       [2] => 語
   )
   */

Optionally, to control the number of characters per piece:

$str = '日本語';

$l = 1;
print_r(preg_split('/(.{'.$l.'})/us', $str, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE));
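
To make the effect of $l easier to see, the call above can be wrapped in a small function. This is only a sketch: split_chunks is a hypothetical name, and the length check is an addition so that nothing unexpected ends up inside the pattern:

```php
<?php
// Hypothetical helper (not part of PHP): wraps the preg_split() call shown above.
function split_chunks(string $str, int $length): array
{
    if ($length < 1) {
        throw new InvalidArgumentException('Chunk length must be at least 1.');
    }

    // Each run of $length characters is captured and returned as a piece
    // (PREG_SPLIT_DELIM_CAPTURE); a shorter remainder stays as the last piece.
    $pattern = '/(.{' . $length . '})/us';

    return preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
}

print_r(split_chunks('日本語', 2));
```

With a length of 2, the three-character string comes back as one two-character piece followed by the remaining single character.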

Finally, a simple routine that just walks through each character of the string and populates an array:

$str = '日本語';
$arr = []; // initialize the result array before appending to it
$l = mb_strlen($str);
for ($i = 0; $i < $l; $i++) {
    $arr[] = mb_substr($str, $i, 1);
}
print_r($arr);
// Depending on the case, this can perform better than all the others.
// It is just a matter of knowing how and when to use the language's features.

Note: The examples above assume an environment where the character set is correctly configured.
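
As a hedged illustration of what "correctly configured" can mean in practice, the sketch below checks whether a string is valid UTF-8 before splitting it. It assumes the mbstring extension is available, and the ISO-8859-1 sample bytes are made up for the example:

```php
<?php
// Assumption: mbstring is loaded; the sample strings are illustrative only.
$valid   = 'ação';
$invalid = "a\xE7\xE3o"; // the same word encoded as ISO-8859-1, not valid UTF-8

$validOk   = mb_check_encoding($valid, 'UTF-8');
$invalidOk = mb_check_encoding($invalid, 'UTF-8');

// With the u modifier, preg_split() fails on invalid UTF-8 and returns false.
$result = preg_split('//u', $invalid, -1, PREG_SPLIT_NO_EMPTY);
$error  = preg_last_error();

// Converting from the real source encoding makes the string splittable.
$fixed = mb_convert_encoding($invalid, 'UTF-8', 'ISO-8859-1');
$chars = preg_split('//u', $fixed, -1, PREG_SPLIT_NO_EMPTY);

print_r($chars);
```

The conversion step only helps when the actual source encoding is known; guessing it wrong produces different garbage rather than an error.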

02.02.2016 / 11:56

-1

From php.net

<?php
function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
?>

01.02.2016 / 19:56


score 7 · Accepted Answer

Since some PHP functions do not support multibyte characters, the solution, surprisingly enough, is a regex O.o, because the PCRE library does support them.

You can use the dot metacharacter ( . ) to break the string into an array and get the same result as str_split(); just remember that you need the PCRE u modifier.

$str = 'ação';
preg_match_all('/./u', $str, $arr);

echo "<pre>";
print_r($arr);

Output:

Array
(
    [0] => Array
        (
            [0] => a
            [1] => ç
            [2] => ã
            [3] => o
        )

)
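
Since preg_match_all() fills a nested array, the characters themselves are in index 0 of the result. A minimal sketch of using that in a foreach, assuming valid UTF-8 input (the variable names are illustrative):

```php
<?php
$str = 'ação';

// preg_match_all() returns the number of matches and fills $matches;
// $matches[0] holds the full-pattern matches, one entry per character.
$count = preg_match_all('/./u', $str, $matches);
$chars = $matches[0];

foreach ($chars as $i => $char) {
    echo $i, ' => ', $char, PHP_EOL;
}
```

By default the dot does not match a newline, so for strings that may contain line breaks the s modifier ('/./us') would be needed as well.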