This other question is a reminder that in PHP it is not enough to use the correct function, which, as @bfavaretto rightly suggested, is mb_substr() instead of substr(): we also need to configure PHP correctly for multibyte strings, so that they do not cause surprises.
The setup I suggest, to always use with Portuguese, is
setlocale(LC_ALL,'pt_BR.UTF8');
mb_internal_encoding('UTF8');
mb_regex_encoding('UTF8');
Use UTF-8 and Compatible Functions for Everything!
ISO Latin 1 (formally ISO-8859-1) was retired years ago; the W3C has been suggesting the use of UTF-8 (see RFC 3629) in all its recommendations.
Similarly, for Brazilian sites, the e-PING recommendation is the UTF-8 charset ...
The "de facto standard", the most popular one among the minimally serious and up-to-date Portuguese-language sites: likewise, it is UTF-8. If you check the large Brazilian portals, or even the Portuguese ones, you will soon see in the HTML header that the adopted standard is UTF-8 (e.g. <meta http-equiv="Content-Type"../> in the source code of UOL).
Historical legacy
Anyone who works with PHP deals with two historical legacies that still cause some confusion, so I think it is important to recall them:
ISO Latin 1 was for a long time, in Brazil and in Portugal, the "official standard" for HTML pages, TXT files, XML, SGML, etc.
That is natural, because UTF-8 came after ISO Latin 1, and Unicode rightly keeps its characters, at unchanged code points, in the Latin-1 Supplement block.
PS: since Windows 3.x Microsoft, to isolate its users from any standardization initiative, has always pushed its own "Microsoft ISO Latin" (known as "Windows-1252"), and even today some Brazilian programmers and web designers publish HTML with this charset. It is an insult to international standards and to the user.
PHP tried to overcome this annoyance of duplicated string functions (an mb_* library for variable-length (multibyte) UTF-8 charsets, and another for the fixed 8-bit ISO charsets) with the proposed PHP6, but never succeeded (although languages such as Python had done this long before). To this day this causes inconvenience for Portuguese-language programmers (we are wasting time here with this question!).
Where else does UTF-8 have "gotchas"?
Regular expressions
Once again, the multitude of options for doing the same thing in PHP causes some confusion. I have done a lot of work with regular expressions and I am fully convinced that the best library (the most powerful, and accepted as the default in other languages) is PCRE (Perl Compatible Regular Expressions). I have never had to use the multibyte "mb_ereg_*" functions. The preg_* family does the job. Just keep an eye on two details:
- Use the /u modifier when an accented or special character appears in the regular expression itself.
- Your PHP script file needs to be saved in UTF-8 for the regular expression written in it to be read as UTF-8.
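A small sketch of what the /u modifier changes, assuming PHP 7 or later with the default PCRE Unicode support. The subject string is built with escapes so that the result does not depend on how the file is saved.

```php
<?php
// "ação" via escapes (PHP 7+).
$word = "a\u{00E7}\u{00E3}o";

// Without /u, "." matches a single byte, so the two-byte "ç" is split.
preg_match('/^a(.)/', $word, $byteMatch);

// With /u, "." matches one whole UTF-8 character.
preg_match('/^a(.)/u', $word, $charMatch);

$byteMatchLength = strlen($byteMatch[1]); // 1: only the first byte of "ç"
$charMatchLength = strlen($charMatch[1]); // 2: the complete "ç"

// With /u, \p{L} treats accented letters as letters.
$letterCount = preg_match_all('/\p{L}/u', $word); // 4
```

Note that with /u the preg_* functions fail (returning false) when the subject is not valid UTF-8, which is one more reason to keep everything in the same charset.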
Word Count
The str_word_count() function, like so many in PHP, has some flaws for the "general case" of UTF-8... See the discussion here.
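One possible workaround is to count words with preg_match_all() and the /u modifier. This is only a sketch: utf8_word_count() is a hypothetical helper, not a PHP function, and it assumes that a "word" is a run of letters and combining marks, optionally joined by hyphens.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): counts words in a UTF-8 string.
 * Assumption: a word is a run of letters/combining marks, with optional
 * internal hyphens (as in "guarda-chuva").
 */
function utf8_word_count(string $text): int
{
    $count = preg_match_all('/[\p{L}\p{M}]+(?:-[\p{L}\p{M}]+)*/u', $text);

    // preg_match_all() returns false on error, e.g. invalid UTF-8 input.
    return $count === false ? 0 : $count;
}

// "A ação é rápida", written with escapes (PHP 7+).
$phrase = "A a\u{00E7}\u{00E3}o \u{00E9} r\u{00E1}pida";
$words = utf8_word_count($phrase); // 4
```

Digits and apostrophes are deliberately left out of this definition; adjust the pattern if your texts need them.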
Your PHP scripts... are they UTF-8?
Another common problem is your own PHP script, which must also be in UTF-8 (!). Use a serious and reliable editor (never Windows Notepad!), such as Sublime Text or TextPad.
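If you are unsure how a script was saved, a quick check is possible from PHP itself. The sketch below assumes the mbstring extension; file_is_utf8() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: true when the file's bytes form valid UTF-8.
 * Pure ASCII files also pass, since ASCII is a subset of UTF-8.
 */
function file_is_utf8(string $path): bool
{
    $bytes = file_get_contents($path);

    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}
```

For example, file_is_utf8(__FILE__) checks the script that is currently running. A file saved as ISO-8859-1 or Windows-1252 with accented characters will normally fail this check.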
The same goes for databases, XML files, etc. Everything needs to be in the same charset, and that is easy: just always configure everything with the "universal standard", which is UTF-8.
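When some legacy data is still in ISO-8859-1, it can be converted once to UTF-8 with mbstring. A minimal sketch, assuming you already know for certain that the source bytes are ISO-8859-1 (converting data that is already UTF-8 would corrupt it):

```php
<?php
// Assumption: these bytes are known to be ISO-8859-1 (e.g. an old export).
$legacy = "a\xE7\xE3o"; // "ação" in ISO-8859-1: 4 bytes

// Convert from the known source charset to UTF-8.
$utf8 = mb_convert_encoding($legacy, 'UTF-8', 'ISO-8859-1');

$isValid    = mb_check_encoding($utf8, 'UTF-8'); // true
$byteLength = strlen($utf8);                     // 6: "ç" and "ã" now take 2 bytes each
```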