I got confused about Unicode stuff
Let's start with what Unicode actually is.
Unicode is just a character set. All it does is assign a unique number to each character. Examples:
A=0x41 (aka U+0041)
û=0xFB (aka U+00FB)
π=0x3C0 (aka U+03C0)
拢=0x62E2 (aka U+62E2)
etc
The numbers being assigned are referred to as "codepoints".
These codepoints conceptually have no upper limit, but in practice Unicode only defines them from U+0000 to U+10FFFF, so a single 32-bit unsigned integer can represent any Unicode character.
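If it helps to see that in code: a char32_t is wide enough for any codepoint, and a U'...' literal is simply the codepoint's number. This is only a small sketch; kMaxCodepoint and is_unicode_scalar are names I made up for illustration.

```cpp
// Highest codepoint Unicode defines (hypothetical constant name).
constexpr char32_t kMaxCodepoint = 0x10FFFF;

// True if 'cp' can stand for a character on its own.
// U+D800..U+DFFF are set aside for UTF-16 surrogates and never
// represent characters by themselves.
constexpr bool is_unicode_scalar(char32_t cp) {
    return cp <= kMaxCodepoint && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// A char32_t literal holds exactly the codepoint from the list above.
static_assert(U'A' == 0x41, "A is U+0041");
static_assert(U'\u00FB' == 0xFB, "u-circumflex is U+00FB");
static_assert(U'\u03C0' == 0x3C0, "pi is U+03C0");
```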
Now there are multiple ways that these numbers can be represented. The most straightforward way would be to have a 32-bit value for each character in the string. That way 1 value = 1 character. Simple, but horrendously inefficient with space, since most characters in typical text need only 1 or 2 bytes. Because of this, there are multiple encodings. The most common are:
- UTF-8
- UTF-16
- UTF-32
UTF-32 is the simplest... and is what I said above: One 4-byte value = one character.
UTF-16 uses 2-byte values... and the vast majority of the time one 2-byte value will be one character... but codepoints above U+FFFF have to be spread across two 2-byte values (a "surrogate pair").
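Here's a rough sketch of how a codepoint gets turned into UTF-16 values, including the two-value case (usually called a surrogate pair). encode_utf16 is a name I made up, and the function assumes it is handed a valid codepoint; it does no error checking.

```cpp
#include <string>

// Encode one codepoint as UTF-16 code units.
// Assumes 'cp' is a valid codepoint (no validation is done here).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 2-byte value.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Subtract 0x10000, leaving a 20-bit number, then split it:
        // the top 10 bits go in the first value, the low 10 in the second.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

For example, U+0041 comes out as the single value 0x0041, while U+1F600 comes out as the pair 0xD83D 0xDE00.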
UTF-8 is probably used most often. It uses 1-byte values... and for normal English text, that will mean that one byte = one character. However for codepoints above U+007F it has to use more bytes. Ultimately UTF-8 can use up to 4 bytes per character.
So UTF-8 is the most space efficient for English-heavy text... but is also the most "variable", since multi-byte characters are extremely common in everything except basic English.
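To make the "1 to 4 bytes" rule concrete, here's a minimal sketch of a UTF-8 encoder. encode_utf8 is my own name for it, and it assumes a valid codepoint (it doesn't reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Encode one codepoint as UTF-8 bytes.
// Assumes 'cp' is a valid codepoint (no validation is done here).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // U+0000..U+007F: one byte, identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // U+0080..U+07FF: two bytes.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // U+0800..U+FFFF: three bytes.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // U+10000..U+10FFFF: four bytes.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Run on the examples from the top of this post: A is the single byte 41, û becomes C3 BB, π becomes CF 80, and 拢 becomes E6 8B A2.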
Got it? Good. ;)
Now let's move on to the Windows mess.
(note the info I give below is specific to Windows, and the size of these data types and what they mean may differ on other platforms)
WinAPI is cursed with having to support a legacy program base, so all this stuff is especially confusing on Windows.
Nearly all WinAPI functions and structs that take strings come in 3 forms:
- TCHAR - the "normal" function, e.g. MessageBox
- char - the 'A' function, e.g. MessageBoxA
- wchar_t - the 'W' function, e.g. MessageBoxW
'char's are just the simple 8-bit characters you're probably used to dealing with.
'wchar_t's are the same idea, but are 16 bits wide (on Windows), which allows them to hold a wider range of characters.
'TCHAR' is a troublesome typedef that could be either char or wchar_t, depending on whether UNICODE is defined when you compile. That inconsistency makes it very difficult to use properly... and the weak typing of C/C++ makes it very easy to misuse by accident. I recommend avoiding it.
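To show the mechanism without needing the Windows headers, here's a sketch with stand-ins. DEMO_UNICODE, DemoTChar, DEMO_TEXT, DemoTString and demo_greeting are all hypothetical names of mine; the real headers switch on UNICODE / _UNICODE and give you TCHAR and TEXT(), but the idea is the same.

```cpp
#include <string>

// DEMO_UNICODE is a stand-in for the real UNICODE macro.
#ifdef DEMO_UNICODE
typedef wchar_t DemoTChar;       // "wide" build: 2-byte characters on Windows
#define DEMO_TEXT(s) L##s        // turns "hello" into L"hello"
#else
typedef char DemoTChar;          // "narrow" build: 1-byte characters
#define DEMO_TEXT(s) s           // leaves "hello" alone
#endif

// The same source line yields a different string type per build setting.
typedef std::basic_string<DemoTChar> DemoTString;

inline DemoTString demo_greeting() {
    return DEMO_TEXT("hello");
}
```

The catch is that code which forgets the macro (say, passing a plain "hello" where a DemoTChar string is expected) compiles in one configuration and breaks in the other, which is a big part of why I'd steer clear.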
Windows treats wchar_t strings as UTF-16 encoded. And Windows does everything 'under the hood' using UTF-16. Therefore the Wide version of these functions/structs is the most straightforward way to deal with Windows and have support for the Unicode character set.
Unfortunately... Windows does not treat chars as UTF-8. Again, this is due to legacy reasons. Instead, it interprets non-ASCII characters according to the code page from the user's locale setting, so the same bytes can mean different characters on different machines.
The TCHAR thing... don't get me started. It's a disaster.
I want to do this entirely with Unicode, if possible. I also use the TxtMessage() (the second function) to grab the first line from a "hi.txt" file and paste it in a MessageBox(), but it doesn't work with all texts, like this lenny face ( ͡° ͜ʖ ͡°).
You certainly can print that lenny face with MessageBoxW(). You just have to make sure your text is UTF-16 encoded.
Which raises the question... how is your text file encoded? You will need to know this before you can convert that encoding to UTF-16. It's probably UTF-8... but double check. What editor are you using to make the file, and how are you saving it?
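One quick way to double check is to look at the first few bytes of the file for a byte order mark (BOM). This is only a sketch: BomEncoding and detect_bom are names I made up, and a missing BOM proves nothing, since plenty of UTF-8 files are saved without one.

```cpp
#include <cstddef>
#include <string>

enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

// Inspect the raw leading bytes of a file for a byte order mark.
// Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE, so this
// sketch would report it as UTF-16LE.
BomEncoding detect_bom(const std::string& bytes) {
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomEncoding::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomEncoding::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomEncoding::Utf16BE;
    return BomEncoding::None;
}
```

If the file does start with a UTF-8 BOM, remember to skip those 3 bytes before converting, or the BOM will end up in your string.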
Windows supplies a function to convert char strings to UTF-16 strings called MultiByteToWideChar:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx
Usage example:
const char* narrow = "A string you want to convert to UTF-16";
wchar_t wide[100]; // a buffer large enough to hold the wide version of that string

int written = MultiByteToWideChar(
    CP_UTF8,  // tell it we want to convert from UTF-8 to UTF-16
    0,        // some extra flags (we don't really care about any of these)
    narrow,   // the string we want to convert
    -1,       // the length of that string (we can use -1 to autodetect, since the string is null terminated)
    wide,     // the buffer to put our generated UTF-16 string
    100       // the size of our buffer, in wchar_t's (not bytes)
);

// MultiByteToWideChar returns 0 on failure (for example, if the buffer is too small).
// Otherwise 'wide' now contains the null-terminated UTF-16 encoded string, so we can
// send it to MessageBoxW ('mywnd' is your window handle, or NULL):
if (written > 0)
    MessageBoxW( mywnd, wide, L"Alert!", MB_OK );
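If you're curious what a UTF-8 to UTF-16 conversion boils down to, here's a portable sketch of the idea. It is not how Windows implements MultiByteToWideChar, just my own illustration; utf8_to_utf16 is a made-up name, and the function assumes well-formed UTF-8 input (it does no validation beyond stopping at a truncated sequence).

```cpp
#include <cstddef>
#include <string>

// Decode well-formed UTF-8 bytes and re-encode them as UTF-16 code units.
// Assumption: the input is valid UTF-8; malformed input is not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte says how many bytes this character occupies.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        if (i + len > in.size()) break;  // truncated sequence: stop here
        // Each following byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Above U+FFFF: emit two 2-byte values.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows I'd still reach for MultiByteToWideChar rather than rolling your own, since it also deals with invalid input.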