What calculation converts a code point to UTF-16?


I have a 32-bit integer representing a Unicode character and would like to convert this single character to its UTF-16 representation, that is, one or more 16-bit integers.

asked by anonymous 27.06.2017 / 14:52

1 answer


The Unicode Transformation Format, 16-bit (UTF-16) is defined in section 2.5 of the Unicode standard, as well as in RFC 2781. It works as follows:

  • Let U be the code point you want to encode. If U is less than 65,536 (0x10000), emit it directly as a single 16-bit code unit.
  • If U is greater than or equal to 65,536, compute U' = U - 65536. Since the largest valid code point is 0x10FFFF, the 12 most significant bits of U' are zero, so U' fits in 20 bits.
  • Emit two 16-bit code units, in this order:
  • The first (the high surrogate) has its six most significant bits equal to 1101 10 (i.e. 0xD800) and its ten least significant bits equal to the ten most significant bits of U'.
  • The second (the low surrogate) has its six most significant bits equal to 1101 11 (i.e. 0xDC00) and its ten least significant bits equal to the ten least significant bits of U'.
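For example, the code point U+1F600 is encoded like this: U' = 0x1F600 - 0x10000 = 0xF600; its ten most significant bits are 0x03D, so the first code unit is 0xD800 | 0x03D = 0xD83D, and its ten least significant bits are 0x200, so the second code unit is 0xDC00 | 0x200 = 0xDE00. U+1F600 therefore becomes the pair 0xD83D 0xDE00.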
In C:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    
    /* Writes the UTF-16 encoding of a code point to "out":
     * one or two 16-bit code units in the machine's native byte order. */
    void
    utf_16(uint32_t codepoint, FILE *out) {
        uint32_t U;
        uint16_t W;
    
        assert(codepoint <= 0x10FFFF);
        if (codepoint < 0x10000) {
            /* BMP code point: emitted as a single code unit. */
            W = (uint16_t) codepoint;
            fwrite(&W, sizeof(W), 1, out);
        } else {
            /* Supplementary code point: emitted as a surrogate pair. */
            U = codepoint - 0x10000;      /* U' fits in 20 bits */
            W = 0xD800 | (U >> 10);       /* high surrogate: top 10 bits of U' */
            fwrite(&W, sizeof(W), 1, out);
            W = 0xDC00 | (U & 0x3FF);     /* low surrogate: bottom 10 bits of U' */
            fwrite(&W, sizeof(W), 1, out);
        }
    }
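
As a usage sketch (assuming utf_16 is defined above in the same file), encoding U+1F600 should produce the surrogate pair 0xD83D 0xDE00. Note that fwrite emits each code unit in the machine's native byte order; producing UTF-16LE or UTF-16BE explicitly would require writing the two bytes of each unit in a fixed order.

    /* Usage sketch: writes the UTF-16 code units of U+1F600 to stdout.
     * On a little-endian machine the raw output bytes are 3D D8 00 DE. */
    int
    main(void) {
        utf_16(0x1F600, stdout);    /* expected code units: 0xD83D, 0xDE00 */
        return 0;
    }
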
    27.06.2017 / 15:38