Developers’ Weblog

Sponsored by
HostEurope Logo

Developers’ Weblog

All 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

PSA: Referring to Unicode codepoints.

If your Unicode codepoint is, numerically, between 0 and 65533, inclusive, convert it to hexadecimal and zero-pad it to four nibbles. For example, the Euro sign € is Unicode codepoint #8364 which is 20AC hex; the Eszett ß is 223 which is DF hex, padded 00DF.
Then write an uppercase ‘U’, a plus sign ‘+’, and the four nibbles: U+20AC U+00DF
In mksh, JSON, etc. it’s a backslash ‘\’, a lower-case ‘u’ and four nibbles.

Otherwise, your Unicode codepoint will be, numerically, between 65536 and 1114111, inclusive, that is hex 10000 to 10FFFF. (There’s nothing on 65534 and 65535, nor above these figures.) In this case, convert it to hex, zero-pad it to eight nibbles and write it as an uppercase ‘U’, a hyphen-minus ‘-’ and the eight nibbles. In C-like escapes for environments supporting the Unicode SMP, that’s a backslash ‘\’, an upper-case ‘U’ and eight nibbles. Do not, in either case, use less (or more) hex digits than specified here. For example, there’s a famous Unicode codepoint U-0001F4A9 “PILE OF POO”. That’s not the same as U+1F4A9. The latter reads as U+1F4A “GREEK CAPITAL LETTER OMICRON WITH PSILI AND VARIA” and a digit 9 (Ὂ9). Be educated.

Since this wlog runs on MirBSD, which limits itself to the Unicode BMP voluntarily, and as nōn-BMP is not widespread anyway, I cannot reproduce the “PILE OF POO” here, but you can just duckduckgo it.

MirOS Logo