Problems with the str_pad function and accented characters

7

I'm using the str_pad function to pad a string with 0 characters up to a length of 10.

It works perfectly with unaccented text, for example:

echo str_pad("dda", 10, "0", STR_PAD_LEFT);

It prints 0000000dda.

The problem occurs when the string contains an accented character, e.g.:

echo str_pad("ddã", 10, "0", STR_PAD_LEFT);

Instead of printing 0000000ddã it prints 000000ddã, meaning one 0 is lost. Does anyone know how to solve this?

    
asked by anonymous 12.09.2016 / 22:21

2 answers

7

The problem is that str_pad works on bytes: it assumes that each character occupies exactly one byte. When the string contains characters encoded in more than one byte (such as ã, which takes two bytes in UTF-8), the function overestimates the string's length and adds too little padding.
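
To make the arithmetic concrete, here is a minimal sketch of what happens with the string from the question. It assumes the source file and the string are UTF-8 encoded and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$input  = "ddã";
$target = 10;

$bytes = strlen($input);             // 4: "d" + "d" + two bytes for "ã"
$chars = mb_strlen($input, 'UTF-8'); // 3: three characters

// str_pad works out the amount of padding from the byte length.
$padded = str_pad($input, $target, "0", STR_PAD_LEFT);

echo $target - $bytes, "\n";            // 6 zeros are actually added
echo $target - $chars, "\n";            // 7 zeros were expected
echo $padded, "\n";                     // 000000ddã
echo strlen($padded), "\n";             // 10 (bytes)
echo mb_strlen($padded, 'UTF-8'), "\n"; // 9 (characters)
```

The result is exactly 10 bytes long, which is why str_pad considers the job done even though only 9 characters are visible.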

There is a question about this on the English Stack Overflow, with four answers. Judging by the comments, two of them (including the accepted one) have problems, while the other two appear to be correct, although I have not tested them yet. All of them work by defining a new function that can handle multibyte characters.

Here's Wes's solution:

function mb_str_pad($str, $pad_len, $pad_str = ' ', $dir = STR_PAD_RIGHT, $encoding = NULL)
{
    $encoding = $encoding === NULL ? mb_internal_encoding() : $encoding;
    $padBefore = $dir === STR_PAD_BOTH || $dir === STR_PAD_LEFT;
    $padAfter = $dir === STR_PAD_BOTH || $dir === STR_PAD_RIGHT;
    $pad_len -= mb_strlen($str, $encoding);
    $targetLen = $padBefore && $padAfter ? $pad_len / 2 : $pad_len;
    $strToRepeatLen = mb_strlen($pad_str, $encoding);
    $repeatTimes = ceil($targetLen / $strToRepeatLen);
    $repeatedString = str_repeat($pad_str, max(0, $repeatTimes)); // safe if used with valid unicode sequences (any charset)
    $before = $padBefore ? mb_substr($repeatedString, 0, floor($targetLen), $encoding) : '';
    $after = $padAfter ? mb_substr($repeatedString, 0, ceil($targetLen), $encoding) : '';
    return $before . $str . $after;
}

Here's Ja͢ck's solution:

function mb_str_pad($input, $pad_length, $pad_string = ' ', $pad_type = STR_PAD_RIGHT, $encoding = 'UTF-8')
{
    $input_length = mb_strlen($input, $encoding);
    $pad_string_length = mb_strlen($pad_string, $encoding);

    if ($pad_length <= 0 || ($pad_length - $input_length) <= 0) {
        return $input;
    }

    $num_pad_chars = $pad_length - $input_length;

    switch ($pad_type) {
        case STR_PAD_RIGHT:
            $left_pad = 0;
            $right_pad = $num_pad_chars;
            break;

        case STR_PAD_LEFT:
            $left_pad = $num_pad_chars;
            $right_pad = 0;
            break;

        case STR_PAD_BOTH:
            $left_pad = floor($num_pad_chars / 2);
            $right_pad = $num_pad_chars - $left_pad;
            break;
    }

    $result = '';
    for ($i = 0; $i < $left_pad; ++$i) {
        $result .= mb_substr($pad_string, $i % $pad_string_length, 1, $encoding);
    }
    $result .= $input;
    for ($i = 0; $i < $right_pad; ++$i) {
        $result .= mb_substr($pad_string, $i % $pad_string_length, 1, $encoding);
    }

    return $result;
}
    
12.09.2016 / 22:32
6

This happens because ã is a multibyte character:

echo strlen("a"); // 1
echo strlen("ã"); // 2

The str_pad function counts ã as two bytes rather than as a single multibyte character. To get around this, use mb_strlen to measure the string's length, so that ã is counted as one character:

echo mb_strlen("a"); // 1
echo mb_strlen("ã"); // 1

You can implement mb_str_pad this way (credits):

function mb_str_pad( $input, $pad_length, $pad_string = ' ', $pad_type = STR_PAD_RIGHT, $encoding = "UTF-8") {
    $diff = strlen( $input ) - mb_strlen($input, $encoding);
    return str_pad( $input, $pad_length + $diff, $pad_string, $pad_type );
}

Use it like this:

echo mb_str_pad("ddã", 10, "0", STR_PAD_LEFT); // 0000000ddã
echo str_pad("ddã", 10, "0", STR_PAD_LEFT);    // 000000ddã

See DEMO
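
One caveat worth checking before relying on the byte-difference approach: it corrects the target length, but str_pad still inserts the padding byte by byte. The sketch below assumes UTF-8 input and the mbstring extension, and uses a hypothetical helper name (pad_by_byte_diff) so that it does not clash with any mb_str_pad already defined. It suggests that the trick is fine for a single-byte pad string such as "0", but can produce invalid UTF-8 when the pad string itself is multibyte.

```php
<?php
// Hypothetical helper name; same byte-difference idea as the function above.
function pad_by_byte_diff($input, $length, $pad = ' ', $type = STR_PAD_RIGHT, $encoding = 'UTF-8')
{
    // Extra bytes used by multibyte characters in $input.
    $diff = strlen($input) - mb_strlen($input, $encoding);
    return str_pad($input, $length + $diff, $pad, $type);
}

// Single-byte pad string: the result has the expected 10 characters.
$ok = pad_by_byte_diff("ddã", 10, "0", STR_PAD_LEFT);
echo $ok, "\n";                     // 0000000ddã
echo mb_strlen($ok, 'UTF-8'), "\n"; // 10

// Multibyte pad string: str_pad fills 7 bytes with a 2-byte "é",
// so the last "é" is cut in half and the result is not valid UTF-8.
$broken = pad_by_byte_diff("ddã", 10, "é", STR_PAD_LEFT);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```

If the pad string may contain multibyte characters, an implementation that builds the padding with mb_substr, like the ones in the other answer, is the safer choice.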

    
12.09.2016 / 22:44