Use UTF-8 or Latin1?

-2

I started a new project and, when creating the database (MySQL), I set CHARSET=utf8 without thinking twice. The application will support Portuguese and English, and users are expected to write only in these two languages.

In one specific module, users can write up a procedure, that is, a relatively long text composed in a WYSIWYG HTML editor. Users format their text and I store the resulting HTML in the database. For this column I chose VARCHAR(65535) to make the most of the space available in the database.

Of course, MySQL reported that the maximum VARCHAR length I can get is 21845 because of UTF-8 (up to 3 bytes per character).

Question: Is it still worthwhile to use Latin1, which guarantees that each character takes only 1 byte? Or is that obsolete, and is it better to go with UTF-8?

    
asked by anonymous 24.04.2018 / 18:02

2 answers

4

There is no single correct answer when choosing an encoding. The choice must be made according to your needs, which is why databases support so many of them.

If there is no chance of your system receiving special characters directly, as in the case you describe where the content is always HTML, you can replace every special character up front with its numeric character reference (i.e. &#nnnn;, where nnnn is the Unicode code point). In that case you probably do not need to store the data in UTF-8. You can even keep the whole database in UTF-8 and give only that HTML column a different character set and collation.
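A minimal Python sketch of that replacement step, assuming the HTML passes through application code before it is saved. The function name to_ascii_html and the sample string are illustrative only; the conversion relies on the standard "xmlcharrefreplace" error handler of str.encode.

```python
import html


def to_ascii_html(fragment: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" turns each unencodable character into &#nnnn;
    return fragment.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "<p>Manutenção preventiva</p>"  # hypothetical editor output
escaped = to_ascii_html(sample)
print(escaped)  # <p>Manuten&#231;&#227;o preventiva</p>

# Every character is now plain ASCII, so a 1-byte-per-character
# column can hold the text without loss.
print(escaped.isascii())  # True

# A browser decodes the references when rendering; html.unescape
# does the same in Python for this sample.
print(html.unescape(escaped) == sample)  # True
```

Note the trade-off: each escaped character costs six or more bytes (for example &#231;), against two bytes for the same letter in UTF-8, so this only saves space when accented characters are rare.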

However, you often have no control over how the HTML is written to the column, and you have no filter to convert the cases where the user pastes in arbitrary special characters, and so on. If that is your situation, the best strategy is to use Unicode.

Another question is whether to use a VARCHAR or a TEXT column for this kind of information. Each type has advantages and disadvantages, especially if you intend to filter or sort on that content. TEXT columns can also be indexed, but you must choose a prefix length that limits how many characters are compared. MySQL also offers FULLTEXT search, which can be applied to both column types.

If it is just a matter of storing and retrieving the data, I would recommend a TEXT column, so that you do not have to worry about size limits when you have no control over the user's input.

Another point is that worrying about whether a character takes 1 byte or 2 bytes no longer makes much sense today, given the cost per byte of disk storage. It matters only if you have a very large volume of data that must be replicated across multiple instances and your hosting provider's storage is expensive.

If this is your primary concern and you are unsure whether the content will use Unicode, choose UTF-8. It will simplify your database scripts, the conversions when you read the data in your program, and the display on HTML pages.
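To put the storage concern in numbers, here is a rough Python sketch comparing the encoded size of a Portuguese sentence in both encodings. It uses Python's "latin-1" codec as a stand-in for MySQL's latin1, which is an approximation (the two agree on Portuguese letters); the helper names and the sample text are made up for illustration.

```python
def encoded_sizes(text):
    """Return the character count and the byte count in each encoding."""
    return {
        "chars": len(text),
        "latin1": len(text.encode("latin-1")),
        "utf8": len(text.encode("utf-8")),
    }


def fits_latin1(text):
    """Report whether the text can be stored in Latin1 without loss."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


sample = "Descrição do procedimento de manutenção"
sizes = encoded_sizes(sample)
# Only the accented letters cost an extra byte in UTF-8.
print(sizes, "extra bytes in UTF-8:", sizes["utf8"] - sizes["latin1"])

# Pasted content such as an emoji has no Latin1 representation at all.
print(fits_latin1(sample), fits_latin1("Concluído 😀"))
```

For mostly unaccented text the difference is a few percent, while anything outside Latin1 simply cannot be stored in it.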

    
24.04.2018 / 19:11
0

In MySQL you have the MEDIUMTEXT (up to 16 MB) and LONGTEXT (up to 4 GB) types, so there is no need to worry about limits imposed by the encoding. Standardize on UTF-8 and be happy :)
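As a quick sanity check, the arithmetic behind these limits can be sketched in Python. The byte limits are MySQL's documented maximums for each type; the worst_case_chars helper is only illustrative and assumes 3 bytes per character, as with CHARSET=utf8 (pass 4 for utf8mb4).

```python
# Maximum size in bytes of each MySQL TEXT type.
TEXT_LIMITS_BYTES = {
    "TINYTEXT": 2**8 - 1,
    "TEXT": 2**16 - 1,
    "MEDIUMTEXT": 2**24 - 1,
    "LONGTEXT": 2**32 - 1,
}


def worst_case_chars(limit_bytes, max_bytes_per_char=3):
    """Characters guaranteed to fit if every one needs the maximum width."""
    return limit_bytes // max_bytes_per_char


for name, limit in TEXT_LIMITS_BYTES.items():
    print(f"{name}: {limit} bytes, at least {worst_case_chars(limit)} characters")

# The same division explains the VARCHAR ceiling from the question.
print(worst_case_chars(65535))  # 21845
```

Even in the worst case, MEDIUMTEXT leaves room for more than five million characters, far beyond what a WYSIWYG editor is likely to produce.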

    
24.04.2018 / 18:29