I have a 32-bit integer representing a unicode character and would like to convert this single character to its utf-16 representation, that is, one or more 16-bit integers.
The Unicode Transformation Format, 16 bits (UTF-16) is defined in section 2.5 of the Unicode Standard, as well as in RFC 2781. It works as follows:
Call U the code point of the character you want to encode.

If U is less than 65,536 (0x10000), issue it as a single 16-bit value.

If U is greater than or equal to 65,536, compute U' = U - 65,536. This U', by the Unicode rules, will have its 12 most significant bits equal to zero (since the last valid code point is 0x10FFFF). Issue two 16-bit values, in order:

- The first has its six most significant bits equal to 1101 10 (binary) and its ten least significant bits equal to the ten most significant bits of U'.
- The second has its six most significant bits equal to 1101 11 (binary) and its ten least significant bits equal to the ten least significant bits of U'.
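As a quick check of these rules, take U+1F600: U' = 0x1F600 - 0x10000 = 0xF600. Its ten most significant bits are 0x03D and its ten least significant bits are 0x200, so the character encodes as the pair 0xD800 | 0x03D = 0xD83D followed by 0xDC00 | 0x200 = 0xDE00.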
In C:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

void
utf_16(uint32_t codepoint, FILE *out) {
    uint32_t U;
    uint16_t W;
    assert(codepoint <= 0x10FFFF);
    if (codepoint < 0x10000) {
        /* BMP code point: one 16-bit code unit. */
        W = (uint16_t) codepoint;
        fwrite(&W, sizeof(W), 1, out);
    } else {
        /* Supplementary code point: emit a surrogate pair. */
        U = codepoint - 0x10000;       /* U' fits in 20 bits */
        W = 0xD800 | (U >> 10);        /* 1101 10 + ten most significant bits of U' */
        fwrite(&W, sizeof(W), 1, out);
        W = 0xDC00 | (U & 0x3FF);      /* 1101 11 + ten least significant bits of U' */
        fwrite(&W, sizeof(W), 1, out);
    }
}
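A minimal caller sketch, assuming the function above (the file name and code points here are arbitrary). Note that fwrite() writes each 16-bit unit in the host's byte order, so on a little-endian machine this produces UTF-16LE with no byte order mark:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    FILE *out = fopen("out.utf16", "wb");   /* arbitrary output file */
    if (out == NULL)
        return 1;
    utf_16(0x0041, out);     /* 'A'     -> single unit 0x0041           */
    utf_16(0x1F600, out);    /* U+1F600 -> surrogate pair 0xD83D 0xDE00 */
    fclose(out);
    return 0;
}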