Does the utf-8 encoding declaration allow accents?

8

If we put

# encoding: utf-8

on the first line of a Python program, can we use accented characters throughout the code?

    
asked by anonymous 05.06.2015 / 03:01

2 answers

15

In reality, it all depends on the configuration of your text editor. Most text editors save files in UTF-8 by default (and those that do not, at least for Western European languages, usually save in ISO-8859-1).

Why does it matter?

To sum up a very complicated story, which began back in the days of the first telegraphs: the character codes for the Latin alphabet were standardized quite early (ASCII was first published in 1963), but the "special" characters (cedillas, accents) were "standardized" by each country (or group of countries) separately. Western Europe (and, therefore, the Portuguese-speaking and Spanish-speaking countries, among others) converged on the ISO 8859-1 standard.

The problem is that this standard does not contain the Greek, Cyrillic, and other alphabets, so it is impossible to have a document in this standard that, for example, mixes passages in Portuguese and Greek (and the situation gets even worse once you include Japanese, Chinese, and so on).

The invention of Unicode

To unify these encodings and allow multilingual texts (and to avoid ambiguity when exchanging texts between computers that use different encodings), Unicode was invented. Its goal is to assign a distinct code to every character of every language in the world.

Unicode texts can be encoded in many different ways. Internally, .NET and Java use UTF-16; Python 3 (since version 3.3) chooses between Latin-1, UCS-2, and UCS-4, depending on the characters in the text being processed.

Still, UTF-8 is the most popular encoding for text files (e.g. Python source files).
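To make the difference between a Unicode code and its encodings concrete, here is a minimal sketch for Python 3. The sample characters and the names samples and sizes are chosen only for illustration; the "-le" codec variants are used so that no byte-order mark is added to the output.

```python
# Each character has a single Unicode code point, but the number of bytes
# it occupies depends on the encoding chosen.
samples = ["a", "ã", "半"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {}
for char in samples:
    # Byte length of this character in each of the three encodings.
    sizes[char] = tuple(len(char.encode(enc)) for enc in encodings)
    print("U+{:04X}".format(ord(char)), char, sizes[char])
```

In UTF-8, plain ASCII letters stay at one byte each, which is part of why it is a comfortable choice for source files.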

Why this line is required

Because a byte can only have 256 distinct values, and the languages of the world together have far more than 256 characters, UTF-8 must use more than one byte to represent some characters. In general, accented characters like those in the word "bênção" take 2 bytes in UTF-8 (as opposed to only 1 in ISO 8859-1):

             b |    ê   |  n |    ç   |    ã   |  o
ISO 8859-1: 62 |   EA   | 6E |   E7   |   E3   | 6F
     UTF-8: 62 | C3  AA | 6E | C3  A7 | C3  A3 | 6F

This becomes a problem when you read text written in one encoding as if it were in another: if the text was written as UTF-8 but read as ISO 8859-1, it appears as "bÃªnÃ§Ã£o"; the opposite appears as "b�n��o" (or, in the case of Python, causes a UnicodeDecodeError).
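Both failure modes can be reproduced in a few lines. This is a minimal sketch for Python 3; the variable names are illustrative only.

```python
word = "bênção"

utf8_bytes = word.encode("utf-8")         # 9 bytes: accents take 2 bytes each
latin1_bytes = word.encode("iso-8859-1")  # 6 bytes: one byte per character

# UTF-8 bytes read as ISO 8859-1: every byte is accepted, but the text is wrong.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)

# ISO 8859-1 bytes read as UTF-8: the byte sequence is invalid, so Python refuses.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", each invalid byte becomes U+FFFD instead of raising.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced)
```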

Python 2, as a special case, looks for this line and uses it to determine the file encoding. In the absence of this line, Python falls back to a more conservative mode and accepts only ASCII characters (no accents), raising an error if it encounters any "weird" characters (the details of this mechanism are described in <

In short

If you want to use accents in your Python 2 file, put one of the following three lines at the top of your files:

# encoding: utf-8
# encoding: iso-8859-1
# encoding: cp1252

In roughly decreasing order of probability, these are the encodings that your editor probably uses.
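One way to see what the declaration does is to ask Python which encoding it would pick for a given source file. The sketch below runs on Python 3 and uses tokenize.detect_encoding from the standard library; the helper name declared_encoding and the sample sources are made up for illustration. Keep in mind that Python 3 falls back to UTF-8 when no declaration is present, whereas Python 2 falls back to ASCII.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read these source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# The same program text, saved by two hypothetical editors.
utf8_source = "# encoding: utf-8\nname = 'maçã'\n".encode("utf-8")
latin1_source = "# encoding: iso-8859-1\nname = 'maçã'\n".encode("iso-8859-1")
undeclared_source = b"name = 'abc'\n"

for source in (utf8_source, latin1_source, undeclared_source):
    encoding = declared_encoding(source)
    # Decoding with the declared encoding recovers the intended text.
    print(encoding, repr(source.decode(encoding)))
```

The declaration only tells Python how to read the bytes; it does not change how the editor saved them, so the two have to agree.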

You can also migrate to Python 3, where the code below is perfectly legal ...

fmoreira@saucer tmp $ cat encoding.py 
π = 3.14159265359
半径 = 2.5
área = π * 半径 ** 2
print('مساحة = {}'.format(área))

fmoreira@saucer tmp $ python3 encoding.py 
مساحة = 19.6349540849375

... but I obviously do not recommend this technique.

    
05.06.2015 / 03:51
6

The encoding declaration line

#encoding: utf-8

allows the Python parser to understand the accents in the source code. That is, using accented characters is no longer a "syntax error" in Python 2. Other encodings, used by default on Windows, are more limited than utf-8 and allow only 256 distinct characters, so it is important to include that line and to configure your editor to use utf-8.

But this is not enough to use accented characters freely in a Python 2.x program. A big change, implemented in the mid-2000s and still unnoticed by many people, is that TEXT data in Python 2 has to be of type unicode, not str. In Python 3, the "str" type already has an internal Unicode representation.

The biggest difference between the two is that in a byte string (the plain str of Python 2) each element of the sequence is one byte, whereas in text (the unicode type) each element of the sequence always corresponds to one character.

Try the following experiment (it can be done in the terminal, if it is set to utf-8):

>>> a = "maçã"
>>> for letra in a: print letra,
... 
m a � � � �
>>> a = u"maçã"
>>> for letra in a: print letra,
... 
m a ç ã
>>> 
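For readers on Python 3, the same contrast can be seen by comparing str with bytes explicitly. This is a minimal sketch; the variable names are illustrative only.

```python
text = "maçã"               # str: a sequence of characters
raw = text.encode("utf-8")  # bytes: a sequence of byte values

# Four characters, but six bytes, because ç and ã take two bytes each in UTF-8.
print(len(text), list(text))
print(len(raw), list(raw))
```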

What happens is that the standard latin1 encoding always uses one byte per character, so there you would not notice this. But you will still have a problem if you try to convert an accented string to uppercase, even with that kind of encoding. For example:

>>> a = "maçã".upper()
>>> print a
MAçã
>>> a = u"maçã".upper()
>>> print a
MAÇÃ
>>> 
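The same effect can be reproduced on Python 3 by calling upper() on bytes instead of str. This is a minimal sketch with illustrative variable names: bytes.upper() only knows about ASCII letters, so the bytes that make up the accented characters are left untouched.

```python
raw = "maçã".encode("utf-8")

upper_bytes = raw.upper()      # only the ASCII letters m and a are changed
upper_text = "maçã".upper()    # every character is uppercased

print(upper_bytes.decode("utf-8"))
print(upper_text)
```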

The recommendation is to understand well what Unicode and encodings are (see the linked article), to always use utf-8 in your programs, and to always use the technique called the "unicode sandwich":

When you read text from a source external to your program (a file, user input, a database, a sensor), it arrives as bytes in some encoding. So:

  • decode this text to unicode (with the "decode" method)
  • work with the text in your Python program
  • encode it back into the encoding used by the data output (terminal, file, database, printer, etc.) with the "encode" method.
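The three steps above can be sketched as a single function. This is only an illustration for Python 3: the function name shout and its parameters are hypothetical, and the "work" done in the middle is just an uppercase conversion.

```python
def shout(raw_input, input_encoding, output_encoding):
    """Decode on the way in, work on text, encode on the way out."""
    text = raw_input.decode(input_encoding)  # bytes -> text at the boundary
    text = text.upper()                      # all processing happens on text
    return text.encode(output_encoding)      # text -> bytes at the boundary


# Input arrives as ISO 8859-1 bytes; the output device expects UTF-8.
incoming = "maçã".encode("iso-8859-1")
outgoing = shout(incoming, "iso-8859-1", "utf-8")
print(outgoing.decode("utf-8"))
```

Keeping the decode and encode calls at the edges means the rest of the program never has to know which encodings the outside world uses.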

Python 3 and some libraries (even ones used in Python 2) already perform the encoding/decoding step transparently for you. But it is still vital to understand what is happening.
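As one example of that transparency, the built-in open() in Python 3 accepts an encoding argument and handles both slices of the sandwich. This is a minimal sketch; the file name fruta.txt and the temporary directory are arbitrary choices for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fruta.txt")

# Text mode: Python encodes the str to ISO 8859-1 bytes when writing.
with open(path, "w", encoding="iso-8859-1") as handle:
    handle.write("maçã")

# Binary mode shows what actually ended up on disk.
with open(path, "rb") as handle:
    raw = handle.read()

# Text mode again: Python decodes the bytes back to str when reading.
with open(path, "r", encoding="iso-8859-1") as handle:
    text = handle.read()

print(raw, text)
```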

    
05.06.2015 / 14:37