Problem when using substr in text with PHP

12

When using substr in a variable with text, it is returning a special character " " could anyone help me?

I'm using the following code:

$excerpt = get_the_content();
$excerpt = strip_shortcodes($excerpt);
$excerpt = strip_tags($excerpt);
$the_str = substr($excerpt, 0, 335);
echo $the_str . '...'; 
    
asked by anonymous 06.03.2014 / 14:24

3 answers

16

Your string is probably encoded as UTF-8, which is desirable, because it lets you represent a huge range of special characters. In UTF-8, certain characters, including all accented characters, occupy more than one byte. However, the substr function assumes that each character occupies exactly one byte. What is happening is that substr is cutting a character in the middle, keeping only its first byte. When the browser displays the output of substr, that lone byte is treated as an invalid character.

The solution is to use the mb_substr function, which is designed to handle multibyte characters:

$the_str = mb_substr($excerpt, 0, 335);
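
To make the difference concrete, here is a minimal sketch, assuming PHP 7 or later with the mbstring extension loaded. The sample word and the variable names are only illustrative, and the "\u{...}" escapes are used so that the example does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical sample word "acao" with cedilla and tilde:
// "\u{e7}" is 'ç' and "\u{e3}" is 'ã', each two bytes long in UTF-8.
$text = "a\u{e7}\u{e3}o";

// substr() counts bytes: 2 bytes = 'a' plus only the first byte of 'ç'.
$cut_bytes = substr($text, 0, 2);

// mb_substr() counts characters when the encoding is given explicitly.
$cut_chars = mb_substr($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($cut_bytes, 'UTF-8')); // bool(false): broken sequence
var_dump(mb_check_encoding($cut_chars, 'UTF-8')); // bool(true): 'a' and 'ç' intact
```

Passing 'UTF-8' as the fourth argument keeps the call independent of whatever internal encoding happens to be configured.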
    
06.03.2014 / 14:48
10

As this other question reminds us, in PHP it is not enough to use the correct function, which, as @bfavaretto rightly suggested, is mb_substr() instead of substr(): we also need to configure PHP correctly so that the multibyte functions do not cause surprises.

The settings I suggest, to always use for Portuguese, are:

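// Locale names depend on the operating system; adjust if this one is not installed.
setlocale(LC_ALL, 'pt_BR.UTF-8');
// Default encoding for the mb_* string functions.
mb_internal_encoding('UTF-8');
// Encoding for the mb_ereg_* regular-expression functions.
mb_regex_encoding('UTF-8');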
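
As a rough sketch of what these settings change (assuming PHP 7+ with mbstring; the sample word and variable names are hypothetical), the mb_* functions pick up the internal encoding when no encoding argument is passed, while setlocale() reports failure by returning false when none of the requested locales is installed:

```php
<?php
// Try a few common spellings of the locale name; which ones exist is system-dependent.
$locale = setlocale(LC_ALL, 'pt_BR.UTF-8', 'pt_BR.utf8', 'pt_BR');
if ($locale === false) {
    // None of the candidates is installed; locale-aware functions keep their old behavior.
    echo "pt_BR locale not available\n";
}

mb_internal_encoding('UTF-8');

// "\u{e7}" is 'ç' and "\u{e3}" is 'ã' (two bytes each in UTF-8).
$word = "a\u{e7}\u{e3}o";

$byte_count = strlen($word);    // 6: strlen() always counts bytes
$char_count = mb_strlen($word); // 4: uses the internal encoding set above

echo $byte_count, ' bytes, ', $char_count, " characters\n";
```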

Use UTF-8 and Compatible Functions for Everything!

ISO Latin 1 (formally ISO-8859-1) was retired years ago, and the W3C has been suggesting the use of UTF-8 (see RFC 3629) in all of its recommendations.

Similarly, for Brazilian sites, the e-PING recommendation is the UTF-8 charset. The "de facto standard", the most popular choice among reasonably serious and up-to-date Portuguese-language sites, is likewise UTF-8. If you check the large Brazilian or even Portuguese portals, you will soon see in the HTML header that the adopted standard is UTF-8 (e.g. <meta http-equiv="Content-Type"../> in the source code of UOL).

Historical legacy

Anyone who works with PHP deals with two historical legacies that still cause some confusion, so I think it is worth recalling them:

  • ISO-Latin-1 was for a long time, in Brazil and in Portugal, the "official standard" for HTML pages, TXT files, XML, SGML, etc.
    That is natural, because UTF-8 came after ISO Latin, and Unicode deliberately keeps its characters, with unchanged code points, as the Latin-1 Supplement block.
    PS: Microsoft, since Windows 3.x, to isolate its users from any standardization initiative, always forced its own "Microsoft ISO Latin" (known as "Windows-1252"), and even today some Brazilian programmers and web designers publish HTML with this charset. It is an insult to international standards and to the user.

  • PHP tried to get rid of the nuisance of duplicate string functions - the mb_* library for variable-length (multibyte) charsets such as UTF-8, and the plain functions for fixed 8-bit ISO charsets - with the proposed PHP 6, but never succeeded (although languages such as Python had done this long before). To this day that causes inconvenience for Portuguese-language programmers (we are spending time here on this very question!).

    Where else does UTF-8 have "gotchas"?

    Regular expressions

    Once again, the multitude of ways to do the same thing in PHP causes some confusion. I have done a lot of work with regular expressions and am fully convinced that the best library (the most powerful, and the one accepted as the default in other languages) is PCRE (Perl Compatible Regular Expressions). I have never needed the multibyte "mb_ereg_*" functions. The preg_* family does the job. Just keep an eye on two details:

    • Use the /u modifier when an accented or special character appears in the regular expression itself.
    • Your PHP script must be saved as UTF-8 for a regular expression written in UTF-8 to be interpreted correctly.
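
The effect of the /u modifier can be sketched as follows, assuming PHP 7+ with PCRE built with Unicode support (the usual case). The sample word and variable names are hypothetical, and the "\u{...}" escapes stand in for accented characters typed directly into a UTF-8 script.

```php
<?php
// "\u{e7}" is 'ç' and "\u{e3}" is 'ã'; the word is 4 characters but 6 bytes.
$word = "a\u{e7}\u{e3}o";

// Without /u the dot matches one byte at a time.
$as_bytes = preg_match_all('/./', $word);   // 6

// With /u both pattern and subject are treated as UTF-8.
$as_chars = preg_match_all('/./u', $word);  // 4

// Accented characters inside the pattern itself: /u makes each one a single unit.
$accented = preg_match_all("/[\u{e7}\u{e3}]/u", $word); // 2

echo "$as_bytes $as_chars $accented\n";
```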

    Word Count

    The str_word_count() function, like so many in PHP, has some flaws in the "general case" of UTF-8. See discussion here.
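
One possible workaround, shown here only as a sketch and not as what the linked discussion recommends, is to count runs of Unicode letters with preg_match_all and the /u modifier. The function name count_words_utf8 is hypothetical, and treating a "word" as a run of letters and combining marks is an assumption: hyphens and apostrophes split words here, which differs from how str_word_count() treats them.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters (\p{L}) and combining marks (\p{M}).
function count_words_utf8(string $text): int
{
    // preg_match_all() returns the number of matches, or false on error
    // (for example, when $text is not valid UTF-8); false is cast to 0.
    return (int) preg_match_all('/[\p{L}\p{M}]+/u', $text);
}

// "ação é bom" written with escapes so the file encoding does not matter.
$sentence = "a\u{e7}\u{e3}o \u{e9} bom";
echo count_words_utf8($sentence), "\n"; // 3
```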

    Your PHP scripts... are they UTF-8?

    Another common problem is your own PHP script, which must also be saved as UTF-8 (!). Check this in a serious, reliable editor (never Windows Notepad!) such as SublimeText or Textpad.

    The same goes for databases, XML files, etc. Everything needs to be in the same charset, and that is easy: just always configure everything with the "universal standard", which is UTF-8.
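
Besides checking in the editor, a quick programmatic check is possible. The sketch below assumes the mbstring extension; the helper name is_utf8_file is hypothetical. Note that a pure-ASCII file also passes, since ASCII is a subset of UTF-8, so a true result only means that the file contains no invalid byte sequences.

```php
<?php
// Hypothetical helper: true when the file's bytes form valid UTF-8.
function is_utf8_file(string $path): bool
{
    $bytes = file_get_contents($path);
    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}

// Check the script that is currently running.
var_dump(is_utf8_file(__FILE__));
```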

        
    09.03.2014 / 14:29
    3

    This strange character appears automatically when the character set that a character belongs to is not recognized. To solve the problem, you need to convert your string to UTF-8, the universal character encoding.

    Try using utf8_encode($sua_string);

    For more details link

    Or try:

    $string= mb_convert_encoding(utf8_encode($sua_string), 'ISO-8859-1', 'UTF-8');
    
        
    06.03.2014 / 14:29