The char type in C, and consequently in C++, does not have a good name. I actually think it should have been called byte, because that is what it is; its use as a character is just a detail.
Contrary to popular belief, C is a weakly typed language. It is statically typed, but weak. People often do not quite understand these terms. C can interpret the same data as if it were of a type or form different from the one originally intended. This can be seen in this code:
char a = 'a';
printf("%c\n", a); // prints: a
printf("%d\n", a); // prints: 97 (the ASCII value of 'a')
See it running on ideone. I also put it on GitHub for future reference.
The same data can be displayed as a number or as a character. Some C functions let you interpret the data as a character, but in general you have to say explicitly that this is what you want. That is what %c is for: it indicates that the data must be treated as a character. By itself, a char is treated as a number.
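Since a char is just a number, you can even do arithmetic on it directly. A minimal sketch (the values assume ASCII, where the letters are contiguous):

#include <iostream>
using namespace std;

int main() {
    char letter = 'a';
    char next = letter + 1;       // plain arithmetic on the numeric value (97 + 1)
    cout << next << "\n";         // prints: b
    cout << letter - 'A' << "\n"; // prints: 32, the ASCII distance from 'a' to 'A'
}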
Any character encoding that fits in 1 byte can be stored in a char. When C was created, there was essentially only ASCII (at least as far as it mattered).
Then more complete encodings appeared, using the whole byte to represent more characters. Things got complicated, and charset pages were created. To "simplify" and allow even more characters, the multi-byte character was created. At that point it was no longer possible to use char as the type for storing a character, since it is guaranteed to have exactly 1 byte.
Nothing prevents you from using a string of chars and declaring that it represents a single character, but that would be your own solution, and only your functions would know what to do with it. Third-party C libraries, including operating systems, would not know how to handle it, so nobody does this. Many people do not understand that C is a language for working with things at a raw level: you can do whatever you want, however you want. Stepping outside the standard is your problem.
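To see what a multi-byte character looks like when stored in chars, here is a minimal sketch (it assumes the source and execution charset are UTF-8, where 'ã' occupies 2 bytes):

#include <iostream>
#include <cstring>
using namespace std;

int main() {
    const char *s = "ã";       // one character on screen, two bytes in memory
    cout << strlen(s) << "\n"; // prints: 2, because strlen counts bytes, not characters
}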
When we need multi-byte characters we usually use wchar_t. It can have a variable size depending on the implementation; the specification leaves this open. In some cases we use char16_t and char32_t, whose sizes are guaranteed by the specification. This is standardized.
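Those guarantees can be checked at compile time. A minimal sketch (char16_t and char32_t exist as of C++11; it assumes the usual 8-bit byte):

// sizes guaranteed by the specification, expressed in 8-bit bytes
static_assert(sizeof(char16_t) == 2, "char16_t always has 16 bits");
static_assert(sizeof(char32_t) == 4, "char32_t always has 32 bits");

int main() {}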
Let's run this code to better understand:
char a = 'a';
char b = 'ã';                 // narrowing: only one byte of the multi-byte value fits
wchar_t c = 'a';
wchar_t d = 'ã';
cout << sizeof(char) << "\n"; // 1, guaranteed by the specification
cout << sizeof(a) << "\n";    // 1
cout << sizeof('a') << "\n";  // 1 in C++ (in C this literal would be an int)
cout << sizeof(b) << "\n";    // 1
cout << sizeof(c) << "\n";    // 4 on this compiler
cout << sizeof(d) << "\n";    // 4 on this compiler
cout << sizeof('ã') << "\n";  // 4: the literal no longer fits in a char
See it running on ideone. I also put it on GitHub for future reference.
Did you notice that the accent does not make the data occupy more bytes? That b, declared as a char, has only one byte even with the accent? And that c has 4 bytes even though it holds a character that fits in ASCII? The size is determined by the type of the data or of the variable. Where I explicitly said it is a char, it used 1 byte. Where the compiler could infer that a char was sufficient, it used 1 byte; where I explicitly said it is a wchar_t, it occupied 4 bytes. Where it inferred that more than one byte was needed to represent the character, it adopted 4 bytes. That is why sizeof('ã') gave 4 bytes: the compiler inferred a wider type for the literal, matching the 4 bytes of wchar_t on this compiler.
It is clear that on this compiler wchar_t has 4 bytes.
Every C and C++ library understands wchar_t as a type that stores characters, not numbers, even though underneath there are always only numbers: computers do not know what characters are; they just use a trick to show characters to the people who want to see them.
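For example, the wide versions of the standard I/O functions accept wchar_t as text. A minimal sketch (it assumes the environment's locale can display the character):

#include <cwchar>
#include <clocale>
using namespace std;

int main() {
    setlocale(LC_ALL, "");    // adopt the environment's locale so the glyph can be rendered
    wchar_t w = L'ã';
    wprintf(L"%lc\n", w);     // %lc treats the value as a character, like %c does for char
    wprintf(L"%d\n", (int)w); // the same value seen as a number (227, its code point)
}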
Again, in C you do as you wish. If you want to make every character occupy one byte, you can, even accented ones. Of course, there are only 256 possible values in a byte, so you cannot have all possible characters in that situation.
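A sketch of that one-byte-per-character approach, here assuming the Latin-1 (ISO-8859-1) encoding, in which 'ã' is the single byte 0xE3:

#include <iostream>
using namespace std;

int main() {
    unsigned char atilde = 0xE3; // 'ã' in Latin-1: one byte, accent and all
    cout << (int)atilde << "\n"; // prints: 227, the numeric value of that byte
    // a byte offers only 256 values, so Latin-1 covers Western European
    // characters but cannot also hold, say, Greek or Cyrillic ones
}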