There is no single correct answer when choosing an encoding. The choice must be made according to your needs, which is why databases support so many of them.
If your system has no chance of receiving raw special characters, as in the case you describe where the content will always be HTML, you can replace all special characters up front with their numeric character references (i.e. &#nnnn; where nnnn is the Unicode code point). In that case you probably do not need to store the data in UTF-8. You can even keep the entire database in a UTF-8 character set and give only that HTML field a different character set and collation.
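As a rough illustration of the replacement described above, here is a minimal Python sketch. The function name to_numeric_references and the sample HTML are made up for this example; it relies on Python's built-in "xmlcharrefreplace" error handler, which turns each non-ASCII character into a decimal &#nnnn; reference.

```python
import html


def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#nnnn; reference."""
    # "xmlcharrefreplace" substitutes each character that ASCII cannot
    # represent with its numeric character reference.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical input, as a user might paste it into the HTML field.
snippet = "<p>café ação</p>"
ascii_only = to_numeric_references(snippet)

print(ascii_only)                 # <p>caf&#233; a&#231;&#227;o</p>
print(ascii_only.isascii())       # True
print(html.unescape(ascii_only))  # <p>café ação</p>
```

The result contains only ASCII, so it would fit in a single-byte column, and a browser (or html.unescape) renders it back to the original characters. Note that this only works if every write path actually goes through such a filter.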
However, you often have no control over how the HTML will be written to the field, and you have no filter to convert special characters when a user pastes them in. If that is the case, the best strategy is to use Unicode.
Another question is whether to use a varchar field or a text field to store this kind of information. Each field type has its advantages and disadvantages, especially if you intend to filter or sort on that content. Text fields can also be indexed, but you have to choose a prefix length that limits how many characters are used for comparison. MySQL also has full-text search functions that can be applied to both types of fields.
If it is just a matter of storing and retrieving the data, I would recommend a field of type text, so that you do not have to worry about size limits when you have no control over the user input.
Another point is that worrying about whether a field takes 1 byte or 2 bytes per character no longer makes much sense, given today's cost per byte of disk storage. It only matters if you have a very large amount of data that must be replicated across multiple instances and storage at your provider is expensive.
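To put numbers on the per-character cost, the following Python sketch measures how many bytes a few sample characters take when encoded. The helper name utf8_size and the sample characters are chosen only for illustration; the sizes are those of the encoded strings themselves, not of any particular database's on-disk format.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# ASCII, accented Latin letter, euro sign, emoji.
for sample in ["a", "é", "€", "😀"]:
    print(repr(sample), utf8_size(sample))

# A single-byte encoding such as Latin-1 stores "é" in one byte,
# but cannot represent the emoji at all.
print(len("é".encode("latin-1")))
```

Plain ASCII markup costs the same in UTF-8 as in a single-byte encoding; only the non-ASCII characters grow, so mostly-ASCII HTML pays very little for being stored as UTF-8.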
If this is your primary concern and you are unsure whether the content will contain Unicode characters, choose UTF-8. It will simplify your database scripts, the conversions when you read the data in your program, and the display on HTML pages.