What is the difference between files encoded with BOM and without BOM? [duplicate]

5

For a long time I have run into problems converting files between ISO, UTF-8, ANSI and other encodings. While searching for ways to solve these problems I found several approaches, using both tools and programming languages, but one thing I never looked into in depth is:
What is the difference between a file with a BOM and one without a BOM?

asked by anonymous 21.05.2014 / 16:06

3 answers

7

UTF-8 with a BOM (byte order mark) is encoded with the bytes EF BB BF at the beginning of the file. Officially, at least, there is no difference between UTF-8 and UTF-8 with BOM. Although it is used in practice, according to the Unicode Standard the byte order mark is not recommended for UTF-8 files.

In the 3.10 Unicode Encoding Schemes section, item D95 says, in free translation:

  

Its use at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme.
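To make the EF BB BF prefix concrete, here is a minimal Python sketch. It assumes Python's built-in `utf-8` and `utf-8-sig` codecs; the sample string is only an illustration and is not from the answer above.

```python
import codecs

text = "olá"  # illustrative sample text

plain = text.encode("utf-8")         # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# The only difference between the two is the 3-byte prefix.
same_payload = with_bom == codecs.BOM_UTF8 + plain

# The plain "utf-8" decoder keeps the BOM as the character U+FEFF ...
kept = with_bom.decode("utf-8")
# ... while "utf-8-sig" removes it, and also accepts input without a BOM.
stripped = with_bom.decode("utf-8-sig")
no_bom_ok = plain.decode("utf-8-sig")
```

Reading with `utf-8-sig` is a convenient choice when you do not know whether a UTF-8 file carries a BOM, since it handles both cases.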

    
21.05.2014 / 16:32
9

The BOM (byte order mark) was created to solve a problem with UTF-16 (and also UTF-32, although that format is rarely used for saving files).

Since each character in UTF-16 is made up of 2 bytes (or, in rarer cases, of a pair of 2-byte units), the bytes can be ordered in two ways: byte 1, byte 2; or byte 2, byte 1 (nobody argues about the order of the bits, at least...). So little-endian architectures prefer UTF-16LE (LE = little endian), which uses the order "byte 2, byte 1", the most natural one for the processor, and big-endian architectures prefer UTF-16BE.

To tell the two kinds of UTF-16 apart, the BOM is placed at the beginning of the file. It is a character (U+FEFF) that cannot be confused with its byte-swapped "inverse", so on reading it you can work out the byte order of the rest of the file.
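The two byte orders, and the way the BOM disambiguates them, can be sketched in Python as follows. This assumes the standard `utf-16-le`, `utf-16-be` and `utf-16` codecs; the sample characters are only illustrative.

```python
import codecs

ch = "A"  # U+0041

le = ch.encode("utf-16-le")  # low byte first:  41 00
be = ch.encode("utf-16-be")  # high byte first: 00 41

# With a BOM in front, the generic "utf-16" decoder picks the right order
# by itself and drops the BOM from the result.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

# A character outside the Basic Multilingual Plane needs a pair of
# 2-byte units, so 4 bytes in total.
pair = "\U0001F600".encode("utf-16-be")
```

Both decodes return the same text, even though the underlying bytes are in opposite order.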

UTF-8 was designed differently: its byte order does not depend on the computer's architecture. This is why many consider a BOM unnecessary in UTF-8 files.

The BOM, which occupies 2 bytes in UTF-16, takes up 3 bytes when encoded in UTF-8. So some programs, despite the recommendation against using a BOM in UTF-8, adopted it anyway, because when they open a file and find those 3 special bytes they know it is probably a UTF-8 file (it is very rare for a text to begin with ï»¿, which is how the BOM appears if it is read as cp1252).
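As a quick check of the sizes mentioned above, and of what the BOM looks like when misread, here is a small Python sketch. It assumes Python's `cp1252` codec as a stand-in for the editor or program doing the misreading.

```python
bom = "\ufeff"  # the byte order mark as a character

as_utf16 = bom.encode("utf-16-be")  # 2 bytes: FE FF
as_utf8 = bom.encode("utf-8")       # 3 bytes: EF BB BF

# A program that wrongly assumes cp1252 shows the 3 bytes as 3 characters.
misread = as_utf8.decode("cp1252")
```

Seeing those three characters at the very start of a page or file is a typical sign that UTF-8 with BOM was opened as a single-byte encoding.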

Now, whether or not you should use BOM in your files, the debate gets a bit philosophical because there are pros and cons ...

    
21.05.2014 / 17:03
9

BOM means Byte Order Mark .

In our world, people cannot even agree on whether the least significant bits of a byte should go on the left or on the right. Believe me, there are heated discussions, full of personal attacks, about which way is best.

Something similar happens with certain encodings. Some characters are represented by more than one byte. In UTF-32, for example, four bytes are used per character. Some people prefer the least significant bytes of each character to come first, others prefer them to come last.

Since neither way can be adopted as the universal one, we sometimes need to tell a parser the order in which the bytes should be read. We do this with a BOM. Without a BOM, the parser literally has to guess how to read the bytes. That is why, without it, text sometimes comes out "broken".

The BOM of a text usually sits in the preamble, the first few bytes of the text (two for UTF-16, three for UTF-8, four for UTF-32). The parser uses them to determine which encoding is in use.
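A parser's preamble check might look roughly like the Python sketch below. The function name `sniff_bom` and the table `BOMS` are hypothetical names chosen for illustration; only the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding) if encoding else None
```

When no BOM is present the function returns `(None, 0)`, and the caller is back to guessing or to relying on some external declaration of the encoding.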

As DBX8 noted in their answer, this should be irrelevant for UTF-8, whose code units are single bytes (a character takes one to four of them, always in the same order). The only advantage of including a BOM in UTF-8 is that it helps the parser recognize the encoding used.

    
21.05.2014 / 16:20