Monday, November 02, 2015

Unicode - UTF 8, 16 and 32 Details

  • UTF-8 is variable-width: 1 to 4 bytes per character.
  • UTF-16 is variable-width: 2 or 4 bytes per character.
  • UTF-32 is fixed-width: 4 bytes per character.
In brief, UTF-32 uses a 32-bit value for each character, which gives every character a fixed-width code.
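As a rough illustration of what fixed width buys you, here is a minimal Python sketch. The helper name `code_point_at` is made up for this example, and it assumes little-endian UTF-32 data without a byte order mark:

```python
import struct

def code_point_at(utf32_le: bytes, index: int) -> int:
    """Return the code point at position `index` in UTF-32LE data (hypothetical helper)."""
    # Every code point occupies exactly 4 bytes, so its offset is simply index * 4.
    (value,) = struct.unpack_from("<I", utf32_le, index * 4)
    return value

# "a", EURO SIGN (U+20AC), GRINNING FACE (U+1F600)
data = "a\u20ac\U0001F600".encode("utf-32-le")  # the explicit -le codec writes no BOM

print(len(data))                    # 3 characters * 4 bytes
print(hex(code_point_at(data, 2)))  # third character, found without scanning
```

No scanning is needed to reach the n-th character, which is not true of UTF-8 or UTF-16.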
UTF-16 uses 16-bit values by default, but that gives only 65,536 possible values, which is nowhere near enough for the full Unicode set. So characters outside that range are encoded as pairs of 16-bit values (surrogate pairs).
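The pairing works roughly as in the sketch below. The function name `to_utf16_units` is an assumption made for illustration, not a standard library call; Python's own `utf-16` codecs do this for you.

```python
def to_utf16_units(code_point: int) -> list:
    """Split one Unicode code point into its UTF-16 code units (illustrative helper)."""
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    # Otherwise spread the remaining 20 bits over a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]

print([hex(u) for u in to_utf16_units(0x20AC)])   # one unit
print([hex(u) for u in to_utf16_units(0x1F600)])  # two units (a surrogate pair)
```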
And UTF-8 uses 8-bit values by default, which means that the first 128 values (0 to 127) are single-byte characters. (The most significant bit is reserved to mark a byte as part of a multi-byte sequence, leaving 7 bits for the actual character value.) All other characters are encoded as sequences of 2 to 4 bytes.
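To make the byte layout concrete, here is a hand-rolled encoder sketch. `utf8_encode` is a hypothetical name; in real code you would just call `str.encode("utf-8")`. For brevity it assumes a valid code point and does not reject surrogates.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative, no surrogate checks)."""
    if cp < 0x80:
        # 0xxxxxxx: high bit clear, 7 bits of payload
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    # Compare against Python's built-in codec.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(hex(cp), len(utf8_encode(cp)), "byte(s)")
```

The lead byte tells a decoder how many continuation bytes (those starting with bits 10) follow.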
Unicode is the standard; the UTF-x encodings can be thought of as technical implementations of it, each suited to different practical purposes:
  • UTF-8 - "size optimized": best suited for Latin-based (or ASCII) data, where it takes only 1 byte per character; the size grows with the variety of symbols, up to 4 bytes per character in the worst case (the original design allowed up to 6 bytes, but the current standard limits it to 4)
  • UTF-16 - "balance": it takes a minimum of 2 bytes per character, which is enough for most characters of the mainstream languages, and that near-fixed size eases character handling (but the size is still variable and can grow to 4 bytes per character)
  • UTF-32 - "performance": the fixed character size (4 bytes) allows simple algorithms, at the cost of higher memory use
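A quick way to see these trade-offs is to measure the same text in all three encodings. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` and the sample strings are my own choices, and the little-endian codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of `text` in each UTF form (illustrative helper)."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}

samples = {
    "ASCII": "hello",
    "Cyrillic": "\u043f\u0440\u0438\u0432\u0435\u0442",  # 6 letters
    "Japanese": "\u65e5\u672c\u8a9e",                    # 3 characters
    "Emoji": "\U0001F600",                               # 1 character outside the 16-bit range
}

for label, text in samples.items():
    print(label, len(text), "chars:", encoded_sizes(text))
```

ASCII text is smallest in UTF-8, while the Japanese sample is smaller in UTF-16 than in UTF-8, so "size optimized" depends on the data.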
