What are the differences between utf8 and utf8mb4?

12

When importing my MySQL database into a Windows server after creating it on a local server (XAMPP), I could not import the script I had exported from the database. So I decided to run the table scripts one table at a time, and I noticed that only one part of the script gave an error:

ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

After removing every instance of this line from the exported script, I was able to upload the database to the Windows server. But some problems are occurring: some pages of the website show symbols in place of accented characters, and there are other problems that may or may not be due to the absence of that line.

I would like to understand the difference(s) between utf8 and utf8mb4, to see whether this could be causing the website problems.

    
asked by anonymous 22.04.2016 / 21:14

1 answer

15

Programming languages originally supported only the ASCII encoding, which defines 128 symbols. This encoding is excellent for English, producing very compact text in which each letter takes only one byte. With the growth of the internet and an increasingly globalized world, problems quickly arose; people in Brazil, for example, could not use accents in their words. That is when initiatives began to create an encoding that would gather all the symbols used around the world.

Because ASCII defines only 128 symbols, the first bit of every byte is zero in this encoding. The UTF-8 standard took advantage of this and made its first 128 symbols exactly the same as ASCII. When a character outside that range is needed, UTF-8 sets the first bit to 1, and the leading bits of the first byte tell whether the character takes 2, 3, or 4 bytes. Therefore a program that uses UTF-8 is fully compatible with any ASCII text.
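
To make the byte layout concrete, here is a small Python sketch. The four sample characters are my own choice for illustration and do not come from the question; the script just prints how many bytes each one takes in UTF-8 and what those bytes look like in binary.

```python
# Sample characters picked for illustration only (not from the question).
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F63A",  # U+1F63A, a cat emoji
]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print every byte in binary so the leading bits are visible.
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

The ASCII letter comes out as a single byte starting with 0, while the first byte of the others starts with 110, 1110, or 11110, announcing 2, 3, or 4 bytes.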

The problem is that MySQL did not fully adhere to the UTF-8 standard. It implemented only characters of up to 3 bytes and left out the rest. What MySQL calls utf8 is not really UTF-8, just a subset of it. To fix this, starting with version 5.5 MySQL implemented the complete standard, from 1 to 4 bytes, and since the name utf8 was already taken, it called the new implementation utf8mb4. In short: MySQL's utf8 is not UTF-8, while utf8mb4 follows the UTF-8 standard completely.
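
If you want to know whether a given text would fit in MySQL's 3-byte utf8, a rough check can be done outside the database. The sketch below is only an approximation of the rule described above; the function name `fits_in_mysql_utf8` is hypothetical and is not part of MySQL or of any library.

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: it only mirrors the "up to 3 bytes" rule,
    it does not talk to MySQL.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# "coracao" with cedilla and tilde: accents take only 2 bytes each.
print(fits_in_mysql_utf8("cora\u00e7\u00e3o"))
# A cat emoji takes 4 bytes, so it needs utf8mb4.
print(fits_in_mysql_utf8("gato \U0001F63A"))
```

Accented Portuguese text passes the check; only characters that take 4 bytes, such as emoji, fail it.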

Even so, utf8 and utf8mb4 are highly compatible: every character that utf8 can store is encoded identically in utf8mb4. If you switch from one to the other you will probably not see a difference. Unless, of course, people start using animals as letters; then they will be upset when symbols like # û & ý appear instead of kittens. Even if you use every accent in existence, there will be no problem!

The catch is that MySQL's default encoding is latin1, essentially ISO 8859-1, which covers Western European characters and works perfectly well for Portuguese. When you stopped declaring utf8mb4, MySQL fell back to that encoding. Since your application is probably using UTF-8, and the two encodings represent accented characters differently but ASCII identically, the error shows up only in the accents.
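
The effect can be reproduced without MySQL at all. The Python sketch below uses an example word of my own; it encodes the word as UTF-8 and then reads the bytes back as Latin-1, which is roughly what happens when the application and the table disagree about the encoding.

```python
original = "acentua\u00e7\u00e3o"  # "acentuacao" with cedilla and tilde

# The application sends UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but they are interpreted as Latin-1 (ISO 8859-1).
misread = utf8_bytes.decode("latin-1")
print(misread)  # acentuaÃ§Ã£o

# The damage is reversible as long as the bytes were not altered.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The ASCII letters survive untouched, while each accented letter turns into two symbols, which matches the kind of symptom described in the question.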

Maybe this part of the script failed because the MySQL version on the server does not support utf8mb4. If that is the case, just use utf8 instead; the accents will be stored the same way.
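
If the server really does lack utf8mb4, the swap can be done on the exported script instead of deleting the line. This is a minimal sketch under that assumption; the function name `downgrade_charset` is hypothetical, and it presumes that the string "utf8mb4" appears in the script only in charset and collation names.

```python
def downgrade_charset(sql: str) -> str:
    """Replace utf8mb4 with utf8 in a dump (hypothetical helper).

    This also turns utf8mb4_unicode_ci into utf8_unicode_ci,
    because the collation name starts with the charset name.
    """
    return sql.replace("utf8mb4", "utf8")


line = ("ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 "
        "COLLATE=utf8mb4_unicode_ci")
print(downgrade_charset(line))
```

Check the result before importing it, since a blind text replacement would also change any table data that happens to contain that string.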

    
07.05.2016 / 03:28