First of all, if you want UTF-8 output from your console application, I think you should explicitly set your stdout stream to UTF-8 mode. If you don't, the C runtime will just use whatever default "ANSI" codepage happens to be configured on the system, which is probably something like Windows-1252. The reason Notepad++ detects this output as “UTF-8” is that it makes a best guess: pure US-ASCII is valid UTF-8 too, and UTF-8 is preferred these days, so it just shows “UTF-8” as long as the content might be valid UTF-8.
But you will see the difference, if you try to print some characters that don't fit into Windows-1252:
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
    _setmode(_fileno(stdout), _O_U8TEXT); /* switch stdout to UTF-8 mode */
    fputws(L"Line #1\n", stdout);
    fputws(L"\u2668\ufe0f\n", stdout);
    fputws(L"Line #2\n", stdout);
}
Try the above with and without the _setmode() call, and you'll see what I mean 😏
And please try this in a cmd.exe shell, to avoid confusion.
_____
Secondly, I think UTF-16 is always supposed to have a BOM, because UTF-16 can be LE (little endian) or BE (big endian), and the BOM at the start signals the actual endianness. UTF-8, on the other hand, does not need a BOM, because it is just individual bytes (possibly several bytes in a row that make up a single character, but still individual bytes), so it has no endianness. It is still possible to put a sort of “fake” BOM at the start of UTF-8 data, just to signal that it is UTF-8, but that is not recommended; the encoding should be signaled “out of band”.
_____
Finally, PowerShell. Note that there is “Windows PowerShell” (Version 5.x) and there is “PowerShell” (Version 7.x). “Windows PowerShell” (Version 5.x) is included with Windows for legacy reasons, but it is no longer actively developed and is very outdated. “PowerShell” (Version 7.x) is the modern variant, and the one you need to install separately.
“Windows PowerShell” (Version 5.x) has a lot of problems with how it interprets the output (stdout) of a command and how it converts that output into Unicode strings for internal processing. And this conversion always happens 😖
So, most people recommend simply doing this in Windows PowerShell to get around the undesired conversion:
PS> cmd.exe /c "program.exe > out.txt"
At the same time, with “PowerShell” (Version 7.x) it just works as it is supposed to work, as you can see here:
https://i.imgur.com/rEZXDt0.png
_____
Another method that seems to work with the classic “Windows PowerShell” (Version 5.x):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    if (!_isatty(_fileno(stdout))) {
        /* important: without the BOM, PowerShell gets the endianness wrong! */
        fputws(L"\uFEFF", stdout);
    }
    fputws(L"Line #1\n", stdout);
    fputws(L"\u2668\ufe0f\n", stdout);
    fputws(L"Line #3\n", stdout);
}
...and then:
PS> .\Application.exe | Set-Content out.txt -Encoding utf8
So, we have to make our application output UTF-16 (with a BOM!) for Windows PowerShell to parse it properly and not mangle our output. In the second step, we then use Set-Content with the -Encoding switch to explicitly convert it to UTF-8 for storage.