➜ Full Unicode support
It is now over 60 days since the last post. This thread is closed.
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #60 on Mon 09 Jun 2008 12:20 AM (UTC) |
Message
| Yes I tried that, and the problem is not easily solved. For example, some things like the PCRE regexp-matcher don't use wide characters, they use 8-bit strings. PCRE accepts UTF-8, but that means you need to convert back and forth between wide strings and UTF-8. And then there are the MUDs, most of which send 8-bit text, not UTF-8 or wide strings. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
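The round trip described in the post above (wide strings in the client, UTF-8 for the regexp matcher) might look roughly like this. This is only a sketch in Python rather than the client's own code: the helper names `wide_to_utf8` and `utf8_to_wide` are made up for illustration, and the wide string is assumed to be little-endian UTF-16, as on Windows.

```python
def wide_to_utf8(wide: bytes) -> bytes:
    """Convert UTF-16LE bytes (assumed Windows-style wide string) to UTF-8."""
    return wide.decode("utf-16-le").encode("utf-8")


def utf8_to_wide(narrow: bytes) -> bytes:
    """Convert UTF-8 bytes back to UTF-16LE bytes."""
    return narrow.decode("utf-8").encode("utf-16-le")


text = "naïve café"                   # 10 characters, two of them non-ASCII
wide = text.encode("utf-16-le")       # what a wide-character build would hold
narrow = wide_to_utf8(wide)           # what an 8-bit UTF-8 matcher would be given
round_trip = utf8_to_wide(narrow)     # back to the wide form afterwards

print(len(wide), len(narrow), round_trip == wide)
```

Every string handed to the matcher, and every result coming back, would pay this conversion, which is the overhead being described.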
|
Posted by
| Atltais
(8 posts) Bio
|
Date
| Reply #61 on Mon 09 Jun 2008 12:47 AM (UTC) Amended on Mon 09 Jun 2008 12:56 AM (UTC) by Atltais
|
Message
| Isn't Unicode (e: well, UTF-8 that is, which causes additional fun/grief because Windows uses UTF-16, which is a bit different) more or less backwards compatible with ASCII for characters up to 0x7F anyway?
Granted, you would have to go from UTF-16 to UTF-8 for regexp, true enough.
For most MUDs it shouldn't be a problem as long as they don't use characters over 0x7F. If they do (say, a non-Unicode, non-English MUD), you end up with a bit of a problem, which, I suppose, is one reason why other clients aren't Unicode. | Top |
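To illustrate the point about bytes over 0x7F: pure ASCII text decodes the same way under UTF-8, but a single-byte accented character from an 8-bit MUD is not valid UTF-8. The sketch below is Python, not anything from MUSHclient; `decode_mud_text` is a hypothetical helper, and falling back to Latin-1 is purely an assumption, since the MUD's real code page would have to be known or configured.

```python
def decode_mud_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Try UTF-8 first; on failure, assume a legacy 8-bit encoding.

    The Latin-1 default is an illustrative guess, not a detected value.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


ascii_line = b"You see a sword."
# ASCII bytes mean the same thing whether read as ASCII or as UTF-8.
same_under_both = ascii_line.decode("ascii") == ascii_line.decode("utf-8")

legacy_line = "café".encode("latin-1")    # b'caf\xe9', a lone byte above 0x7F
utf8_line = "café".encode("utf-8")        # b'caf\xc3\xa9'

print(same_under_both, decode_mud_text(legacy_line), decode_mud_text(utf8_line))
```

A fallback like this is a heuristic: some legacy byte sequences happen to be valid UTF-8 and would be decoded wrongly, so an explicit per-world encoding setting would be safer.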
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #62 on Mon 09 Jun 2008 06:33 AM (UTC) |
Message
| Unicode isn't one single thing for a start. Just check out www.unicode.org to see what I mean. Basically the idea is to represent various characters (glyphs) in a consistent way by assigning a different number to each one. But how that number is stored can vary somewhat. UTF-8 uses an encoding that is indeed identical to ASCII for characters <= 0x7F, however once you move to higher values you have heaps of options. Do you want 16-bit characters? 32-bit? Which order are the bytes in? Big-endian or little-endian?
Under the Windows compiler, enabling the UNICODE define switches the representation of characters from char (8 bit) to wchar_t (16 bit). Straight away a single 16-bit character won't work for Unicode characters > 0xFFFF. Also you can't just copy stuff from the MUD (8-bit characters) into the internal buffers (16-bit characters) without using a conversion call (such as MultiByteToWideChar).
It's a can of worms, one I don't propose to open in the near future. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
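The "heaps of options" in the post above can be seen by encoding one code point several ways. A small Python sketch (the helper name `encodings_of` is made up for illustration), using the euro sign U+20AC:

```python
def encodings_of(ch: str) -> dict:
    """Return the hex bytes of one character under several Unicode encodings."""
    names = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
    return {name: ch.encode(name).hex() for name in names}


forms = encodings_of("\u20ac")   # EURO SIGN, inside the 16-bit range
for name, hex_bytes in forms.items():
    print(f"{name:10} {hex_bytes}")
```

The number is the same each time; only the byte width and byte order change, which is why the receiving side has to know which form it is being given.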
|
Posted by
| Atltais
(8 posts) Bio
|
Date
| Reply #63 on Mon 09 Jun 2008 07:10 AM (UTC) Amended on Mon 09 Jun 2008 10:09 PM (UTC) by Atltais
|
Message
| It's a whole range, yes, and UTF-8 is a superset of ASCII, which, as I understand it, is one of its biggest advantages. The larger problem, I suppose, is that (as previously stated) Windows uses UTF-16 internally (source: http://msdn.microsoft.com/en-us/library/ms776459(VS.85).aspx), which complicates matters somewhat (additionally, you get into the endianness issue). With UTF-8 you get 'pretty much' any character in regular use (the entirety of the BMP; past this most fonts don't even have representations anyway, but that's getting wildly off topic. e: Plus, to my understanding, UTF-8 supports up to U+10FFFF anyway).
All in all, I suppose it's a relatively minor issue (since those honestly needing client-side UTF-8 support can't be all that numerous) and development time may be better spent elsewhere.
edit: That is to say, endianness doesn't matter in UTF-8 as it does in UTF-16/32, since UTF-8 is byte oriented. One thing to note, though, is that in both UTF-8 and UTF-16 characters aren't fixed width (as in size). Therefore UTF-16 can handle codes above U+FFFF by using surrogate pairs (and indeed, so can UTF-8, with four-byte sequences).
UTF-8 is as widely supported as it is simply because it's (more or less) backwards compatible with ASCII right out of the box, so it can take a standard ASCII string (if the characters are all <= 0x7F, that is) and be happy with it. In any case, it's quite an undertaking to convert a program as big as MUSHclient into a 100% UTF-8 program. | Top |
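The variable-width point can be worked through by hand. The following Python sketch (function names `utf16_units` and `utf8_length` are illustrative, not from any library) computes how many 16-bit units UTF-16 needs for a code point and how many bytes UTF-8 needs, following the standard surrogate-pair and byte-length rules:

```python
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Return the UTF-16 code units for a code point (one, or a surrogate pair)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x10000:
        return [cp]                       # fits in a single 16-bit unit
    offset = cp - 0x10000                 # 20 bits, split into two halves
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 uses for a code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    units = [hex(u) for u in utf16_units(cp)]
    print(f"U+{cp:04X}: UTF-16 units {units}, UTF-8 bytes {utf8_length(cp)}")
```

So code that assumes "one 16-bit unit per character" or "one byte per character" breaks in both encodings once text leaves the ASCII range.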
|
Posted by
| Fiendish
USA (2,534 posts) Bio
Global Moderator |
Date
| Reply #64 on Sun 05 Jun 2011 06:10 PM (UTC) |
Message
| |
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #65 on Sun 05 Jun 2011 09:40 PM (UTC) |
Message
| Changed to date_modified, thanks. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).