Length of one character (ASCII vs. other encodings) in bytes

4

This question came up while reading another one. Coming from PHP, where in the past I had "problems" caused by character encoding (e.g. strpos vs. mb_strpos), I knew that all ASCII characters take 1 byte, but I thought special characters would take more; I associated a character being special with it also being multi-byte.

That is, if I save a simples.txt file containing an "a" character, it is 1 byte in size, but if I save it with an "ã" character, it is 2 bytes. Yet the example below indicates that the special character has 4 bytes.

#include <iostream>
using namespace std;

int main() {
    char a = 'a';
    cout << sizeof(char) << "\n"; // 1
    cout << sizeof(a) << "\n"; // 1
    cout << sizeof('ã')  << "\n"; // 4
}

What are we left with?

    
asked by anonymous 19.01.2017 / 12:11

2 answers

4

The char type in C, and consequently in C++, is poorly named. It should really have been called byte, because that is what it is. Using it as a character is just a detail.

Contrary to popular belief, C is a weakly typed language. It is statically typed, but weakly. People often confuse these terms. C can interpret the same data as if it were of a type or form different from the one originally intended. This can be seen in this code:

char a = 'a';
printf("%c\n", a); // prints "a"
printf("%d\n", a); // prints "97"

See it running on ideone. I also put it on GitHub for future reference.

The same data can be displayed as number or character.

Some C functions let you interpret the data as a character; in general you have to say explicitly that it should be treated that way. That is what %c is for: it indicates that the data must be treated as a character. Otherwise, a char is treated as a number.

Any character encoding that fits in 1 byte can be stored in a char. When C was created, only ASCII existed (at least, only ASCII was relevant).

More complete encodings appeared that use the whole byte to represent more characters. It got messy, and code pages were created. To "simplify" and allow even more characters, the multi-byte character was created. At that point it was no longer possible to use char as the type that stores a character, since char is only guaranteed to have 1 byte.

Nothing prevents you from using a string of chars and saying it represents a single character, but that will be your own convention, and your own functions will have to know what to do with it. Third-party C libraries, including the operating system's, will not know how to handle it, so nobody does this. A lot of people do not understand that C is a language for working with things at a low level: you can do whatever you want, any way you want. Stepping outside the standard is your problem.

When we need multi-byte characters, we usually use wchar_t. It can have a variable size depending on the implementation; the specification leaves this open. In some cases we use char16_t and char32_t, which have their sizes guaranteed by the specification. This is standardized.

Let's run this code to better understand:

char a = 'a';
char b = 'ã';
wchar_t c = 'a';
wchar_t d = 'ã';
cout << sizeof(char) << "\n"; // 1
cout << sizeof(a) << "\n"; // 1
cout << sizeof('a') << "\n"; // 1
cout << sizeof(b) << "\n"; // 1
cout << sizeof(c) << "\n"; // 4
cout << sizeof(d) << "\n"; // 4
cout << sizeof('ã') << "\n"; // 4

See it running on ideone. I also put it on GitHub for future reference.

Did you notice that the accent did not make it occupy more bytes? That b, declared as char, has only one byte, even with the accent? And that c has 4 bytes even though it holds a character that fits in ASCII? The size is determined by the data type of the variable. Where I explicitly said it is a char, it used 1 byte. Where the compiler could infer that a char is sufficient, it used 1 byte. Where I explicitly said it is a wchar_t, it occupied 4 bytes. And where the compiler inferred that more than one byte was needed to represent the character, it adopted 4 bytes. So your sizeof('ã') gave 4 bytes because the literal does not fit in a char, and the compiler gave it a 4-byte type.

It is clear that in this compiler wchar_t has 4 bytes.

Every C and C++ library understands wchar_t as a type for storing characters rather than numbers, although underneath they are always numbers: computers do not know what characters are, they just use a trick to display them to the people who want to see characters.

Again, in C you do as you wish. If you want to make every character occupy one byte, you can, even accented ones. Of course, there are only 256 possible values in a byte, so you cannot have all possible characters in that situation.

    
19.01.2017 / 13:55
2

TL;DR: it depends on the encoding and on some language/platform details.

Each UTF-8 character occupies 1 to 4 bytes (the original design allowed up to 6),

Each UTF-16 code unit occupies 16 bits (characters outside the BMP use two units)

Each UTF-32 character occupies 32 bits

Each character of an ASCII string occupies 1 byte

Source

Well, it is good to remember that each language/platform is free to decide how it will allocate memory for each of its types.

C

In the case of C, it does the minimum of work: it allocates enough space for the type and may add a few extra bytes of padding, to make caching and memory reads/writes more efficient.

See this question for more information on C

C #

In the case of C#, for example, all non-primitive objects have an 8- or 16-byte overhead.

Python

Objects

Python also uses a technique similar to C#'s. The answer to this question on SOEN (Stack Overflow in English) indicates that every object in Python occupies 16 extra bytes (on 64-bit). It appears that all objects store a reference count and a reference to the object's type. There is official Python documentation that explains how an object is structured.

I found a very detailed article on this subject

It seems that Python also pads objects, up to 256 bytes: if you allocate a 10-byte object, it will actually occupy 16 bytes.

Strings

It also gives more details about the size of a string.

An empty string occupies 37 bytes, and each additional character adds one byte to its size. Unicode strings are similar, but they have a 50-byte overhead and each additional character occupies 4 bytes (I believe the author made an error here). In Python 3 the overhead is 49 bytes.

The information seems somewhat contradictory to what is given in a SOEN question, but this depends on the version of Python you are using, so it is here for reference.

This other question in SOEN also has a table that explains how much space each object occupies.

Bytes  type        empty + scaling notes
24     int         NA
28     long        NA
37     str         + 1 byte per additional character
52     unicode     + 4 bytes per additional character
56     tuple       + 8 bytes per additional item
72     list        + 32 for first, 8 for each additional
232    set         sixth item increases to 744; 22nd, 2280; 86th, 8424
280    dict        sixth item increases to 1048; 22nd, 3352; 86th, 12568 *
64     class inst  has a __dict__ attr, same scaling as dict above
16     __slots__   class with slots has no dict, seems to store in 
                   mutable tuple-like structure.
120    func def    doesn't include default args and other attrs
904    class def   has a proxy __dict__ structure for class attrs
104    old class   makes sense, less stuff, has real dict though.
    
19.01.2017 / 13:32