Quote:
I would have to take the *real* unicode which is normally, "0x00A9", I.e., block 0, character A9, and instead encode it as, "0xE289A0", according to the page I read.
No, where did that idea come from? Click on Insert Unicode in the MUSHclient "debug simulated world input" box, and type in a9, and you get: \C2\A9
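If you want to check this outside MUSHclient, here is a minimal sketch in Python (my choice of language for illustration, nothing to do with how the client does it). It encodes code point A9 as UTF-8, then decodes the three bytes E2 89 A0 to see which character they really stand for. The helper name hex_bytes is made up for this example.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

# U+00A9 is the copyright sign; encode it as UTF-8.
copyright_utf8 = "\u00a9".encode("utf-8")
print(hex_bytes(copyright_utf8))        # C2 A9

# Decode the three bytes E2 89 A0 and report the resulting code point.
other = bytes.fromhex("E289A0").decode("utf-8")
print(f"U+{ord(other):04X}")            # U+2260
```

So E2 89 A0 is a valid UTF-8 sequence, but for U+2260 (the "not equal to" sign), not for A9.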
The whole point of UTF-8 is to be reasonably friendly to "legacy" applications that still use 8-bit character encoding. Bear in mind that for many C applications, hex 0x00 (ie. zero) is a string terminator. Thus, the string 0x00A9 will either be an empty string (the string terminator, followed by A9) or A9 followed by the string terminator, depending on the endianness of the CPU. So either way you couldn't embed 0x00A9 into the middle of a document. Imagine also the Unicode character 0x010A - the 2nd byte looks like a linefeed (0x0A).
UTF-8 is specifically designed so that a non-ASCII character never produces a 0x00 byte, or indeed anything that looks like a "control" code, in the text stream: every byte of a multi-byte sequence is in the range 0x80 to 0xFF.
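To make the zero-byte and linefeed problem concrete, here is a rough sketch in Python that prints the raw 16-bit (big-endian) bytes and the UTF-8 bytes for the two characters mentioned above. The helper names hex_bytes and has_control_bytes are invented for this example, and "control byte" is taken to mean anything below 0x20.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def has_control_bytes(data):
    """True if any byte is below 0x20 (NUL, linefeed and friends)."""
    return any(b < 0x20 for b in data)

for ch in ("\u00a9", "\u010a"):
    raw16 = ch.encode("utf-16-be")   # plain 2-byte form, high byte first
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  16-bit: {hex_bytes(raw16)}"
          f" (control bytes: {has_control_bytes(raw16)})"
          f"  UTF-8: {hex_bytes(utf8)}"
          f" (control bytes: {has_control_bytes(utf8)})")
```

The 16-bit forms come out as 00 A9 and 01 0A, each containing a byte that an 8-bit program could mistake for a control code, while the UTF-8 forms (C2 A9 and C4 8A) contain none.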
To use "straight" Unicode (say, for downloading HTML documents, or talking to a MUD), both ends would have to agree that each character on the screen needed 2 bytes (or maybe 3 or 4), so that you knew how many bytes represented a "character".
Quote:
Ok, I can see where that is useful in an OS, where specific bytes have special meanings, but for text transmission, it just bloats the data stream by anywhere from 1-4 extra characters for "everything" that isn't basic text. Hardly what I would call efficient.
Yes, up to a point. However, if you decide on 2 bytes for everything, then you are already taking up one of those extra bytes, so you are hardly better off, and you are worse off if a lot of the text is normal English text (eg. program code).
If all of your text was in Unicode characters that require 3 bytes in UTF-8 but only 2 in a 16-bit encoding (for example, Japanese), then you are probably better off - at least as far as space goes - using the 16-bit encoding.
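The trade-off is easy to measure. The following is only a sketch in Python: the two sample strings are arbitrary ones I picked for illustration, and encoded_sizes is a made-up helper name.

```python
def encoded_sizes(text):
    """Return (bytes as UTF-8, bytes as 16-bit UTF-16) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Arbitrary sample strings, chosen only for illustration.
samples = {
    "English (code)": "for i = 1, 10 do print(i) end",
    "Japanese": "こんにちは世界",
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: {len(text)} characters, "
          f"{utf8_size} bytes as UTF-8, {utf16_size} bytes as UTF-16")
```

The English line takes twice as many bytes in the 16-bit encoding as in UTF-8, while the 7-character Japanese sample takes 21 bytes in UTF-8 against 14 in 16-bit.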
There is a lot of detail in Unicode; try looking at:
http://www.unicode.org/
They give heaps of explanation there.