RFC 3986 does not specify which encoding should be used for non-ASCII characters.
URL percent-encoding uses a pair of hexadecimal digits (%HH), which represents 8 bits, i.e. one byte.
In principle, all non-ASCII characters could be represented within that one byte. What made this impractical is that many languages have their own standards for representing their 8-bit characters. What's more, in languages like Chinese, many characters do not fit into 8 bits at all.
Therefore, the RFC 3629 specification was adopted, which standardized the representation of non-ASCII characters using UTF-8 encoding.
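A quick sketch of this in Python's standard library: `urllib.parse.quote` percent-encodes non-ASCII characters as their UTF-8 bytes, so a Chinese character that needs three bytes in UTF-8 becomes three %HH pairs.

```python
from urllib.parse import quote

# "中" (U+4E2D) does not fit in one byte; UTF-8 encodes it as 3 bytes.
print("中".encode("utf-8"))  # b'\xe4\xb8\xad'

# Percent-encoding emits one %HH pair per UTF-8 byte.
print(quote("中"))  # %E4%B8%AD
```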
It is important to understand that within the ASCII set there are reserved and unreserved characters.
These are the unreserved characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~
These are the reserved ones:
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
Note that ~ is not reserved; it may be percent-encoded, but the recommendation is not to encode it.
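As a sketch of the two groups above, again using `urllib.parse.quote` (note that Python only stopped encoding `~` in version 3.7, in line with the recommendation):

```python
from urllib.parse import quote

# Unreserved characters pass through percent-encoding untouched.
print(quote("AZaz09-_.~"))  # AZaz09-_.~

# Reserved characters are escaped when they carry no delimiter role
# (safe="" forces even "/" to be encoded).
print(quote("a&b=c?", safe=""))  # a%26b%3Dc%3F
```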
What happens with the cedilla ç in the example you posted?
Since ç is not ASCII, it is treated as UTF-8, per the RFC 3629 recommendation mentioned above.
That in itself already explains why it is encoded in UTF-8 and represented as two hexadecimal pairs.
Per RFC 3629, "ç" is encoded in UTF-8 as 2 bytes, C3 (hex) and A7 (hex), which appear in percent-encoded form as %C3 and %A7 respectively, following the %HH%HH pattern. The lead byte C3 is what marks the sequence as a two-byte UTF-8 character.
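The round trip for ç can be checked directly with `urllib.parse.quote` and `unquote` (which decodes %HH pairs as UTF-8 by default):

```python
from urllib.parse import quote, unquote

# "ç" (U+00E7) is two UTF-8 bytes: C3 A7.
assert "ç".encode("utf-8") == b"\xc3\xa7"

# Percent-encoded, that becomes the %HH%HH pattern...
print(quote("ç"))        # %C3%A7

# ...and decoding the pairs as UTF-8 recovers the character.
print(unquote("%C3%A7"))  # ç
```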
Browsers display only the decoded form. And many protocols transmit raw UTF-8 without formatting it into %HH pairs, whether the character takes 1 pair or 2.
* byte != bit
* URL encoding != HTML entities
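That last distinction can be illustrated with standard-library tools: URL encoding produces %HH byte pairs, while an HTML numeric entity references the code point instead (a minimal sketch, using `xmlcharrefreplace` as one way to produce numeric entities):

```python
from urllib.parse import quote

# URL encoding: UTF-8 bytes as %HH pairs.
url_form = quote("ç")                                   # "%C3%A7"

# HTML numeric entity: the code point (U+00E7 = 231), not the bytes.
html_form = "ç".encode("ascii", "xmlcharrefreplace")    # b"&#231;"

print(url_form, html_form)
```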
Out of curiosity: browsers have supported multibyte characters in URLs for a number of years now.