Why is 'ç' encoded as %C3%A7 in a URL, and not %E7?

3

When I URL-encoded the 'ç' character for the query string (the part of the URL where the parameters go), I got:

  

%C3%A7

The % sign introduces one byte written in hexadecimal, so why do almost all non-ASCII characters (including 'ç') need 2 such bytes?

How can %C3%A7 represent the 'ç' character? Couldn't 'ç' be written with just the single byte %E7 (231)?

To clarify: I want to know how the 'ç' character is encoded, that is, how it becomes %C3%A7.

    
asked by anonymous 07.10.2016 / 15:49

2 answers

6

RFC 3986 does not specify which encoding should be used for non-ASCII characters.

URL encoding uses a % followed by a pair of hexadecimal digits, which is equivalent to 8 bits (one byte). In principle, many non-ASCII characters could be represented within a single byte. What makes that impractical is that many languages have their own 8-bit encoding standard, so the same byte value means different characters in different places. What's more, in languages like Chinese, many characters do not fit into 8 bits at all.

Therefore UTF-8, which is specified in RFC 3629, was adopted as the standard encoding for non-ASCII characters in URLs.
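To make the byte counts concrete, here is a minimal Python sketch comparing a legacy single-byte encoding (Latin-1, chosen here only as an example) with UTF-8. The variable names are illustrative and not part of any specification.

```python
# Compare how many bytes a character needs in a legacy 8-bit encoding
# versus UTF-8. Latin-1 is just one example of an 8-bit encoding.
cedilla = "ç"
latin1_bytes = cedilla.encode("latin-1")  # one byte: E7
utf8_bytes = cedilla.encode("utf-8")      # two bytes: C3 A7

print(latin1_bytes.hex())  # e7
print(utf8_bytes.hex())    # c3a7

# A Chinese character cannot be represented in Latin-1 at all,
# but UTF-8 handles it by using more bytes.
han = "中"
try:
    han.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

han_utf8 = han.encode("utf-8")
print(fits_in_latin1, len(han_utf8))  # False 3
```

So %E7 is what 'ç' would look like if the URL were encoded as Latin-1, while %C3%A7 is the same character encoded as UTF-8.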

It is important to understand that within the ASCII group there are reserved and unreserved characters.

In the non-reserved character table, we have

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~

These are the reserved ones:

! * ' ( ) ; : @ & = + $ , / ? % # [ ]

Note that ~ is not reserved. It can still be percent-encoded, but the recommendation is not to encode it.
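As a rough illustration of the difference between the two groups, the sketch below uses `quote` from Python's standard `urllib.parse` module. The sample strings are arbitrary, and `safe=""` is passed so that no extra character (by default "/") is exempted from encoding.

```python
from urllib.parse import quote

# Arbitrary samples from each group, chosen only for illustration.
unreserved = "AZaz09-_."
reserved = "&=?/#"

# safe="" tells quote() not to leave any additional characters untouched.
encoded_unreserved = quote(unreserved, safe="")
encoded_reserved = quote(reserved, safe="")

print(encoded_unreserved)  # AZaz09-_.
print(encoded_reserved)    # %26%3D%3F%2F%23
```

Unreserved characters pass through unchanged, while each reserved character becomes a single %HH triplet because it fits in one ASCII byte.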

What happens in the cedilla (ç) example you posted? Since ç is not ASCII, it is treated as UTF-8, following the RFC 3629 recommendation mentioned above.

That alone explains why it is encoded in UTF-8 and ends up as 2 hexadecimal pairs.

The "ç" is encoded in UTF-8 as 2 bytes, C3 (hex) and A7 (hex), which are written as "%C3" and "%A7" respectively, giving the pattern %HH%HH. The leading byte C3 marks the start of a two-byte UTF-8 sequence, and A7 is its continuation byte.
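The result can be checked with Python's standard `urllib.parse` module. This is only a sketch; the `latin-1` call is included as an assumption about where the %E7 in the question would come from.

```python
from urllib.parse import quote, unquote

# quote() uses UTF-8 by default, so 'ç' becomes two %HH triplets.
utf8_form = quote("ç")
# With a single-byte encoding such as Latin-1, the same character is one triplet.
latin1_form = quote("ç", encoding="latin-1")

print(utf8_form)    # %C3%A7
print(latin1_form)  # %E7

# unquote() also assumes UTF-8 by default and restores the character.
decoded = unquote("%C3%A7")
print(decoded)      # ç
```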

Browsers only display the decoded form. Many protocols also transmit raw UTF-8 without the %HH formatting, whether the character takes 1 or 2 bytes.

* byte != bit
* URL encoding != HTML entities

As a side note, browsers have supported multibyte characters in URLs for a number of years.

    
07.10.2016 / 19:51
5

The string %C3%A7 is the UTF-8 encoding of the 'ç' character, percent-encoded for use in URLs.

Reference:
link
link

Another interesting page:
link

Online tools:
link
link

Official definition.
link

Transforming E7 into C3A7

E7: 11 100111
    ^^ ^^^^^^

110x xxxx | 10xx xxxx
1100 0011 | 1010 0111 --> C3A7
       ^^     ^^ ^^^^
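The bit shuffling in the diagram above can be sketched in Python. The helper name `encode_two_byte_utf8` is hypothetical and only covers the two-byte range of UTF-8 (code points U+0080 to U+07FF).

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    # Lead byte: 110xxxxx, carrying the upper bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: 10xxxxxx, carrying the lower 6 bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


encoded = encode_two_byte_utf8(0xE7)  # 0xE7 is the code point of 'ç'
print(encoded.hex().upper())          # C3A7
```

The result matches what the built-in `"ç".encode("utf-8")` produces.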
    
07.10.2016 / 15:54