vscode adds UTF-16 BOM to stdout [Windows]

Posting in lounge since this isn't directly a C++ issue. Google and GPT-whatever have failed me.

In Visual Studio Code, I am attempting to run the .exe I just built (with Ctrl + Shift + B), and pipe its output to another program.

The issue is that when I run my program in the vscode terminal, the terminal adds a UTF-16 BOM to the beginning of the stdout stream, which is then misinterpreted by my other program, which expects simple, BOM-less input.

Furthermore, this is an issue even if I do: my_program.exe > temp.txt
PS C:\moo\code\test_kissfft> .\test_kissfft.exe  > temp.txt


If I redirect to file from vscode's terminal, the file encoding of temp.txt is UTF-16 BOM (according to Notepad++), but if I run the same program through cmd, the file encoding of temp.txt is UTF-8 (no BOM, half the file size), which is what I want.
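
The "half the file size" observation follows from the encodings themselves: UTF-16 spends two bytes on every ASCII character, plus two more for the BOM. A minimal portable sketch of that arithmetic, assuming ASCII-only output and ignoring any newline conversion the shell might do (the helper name `to_utf16le_with_bom` is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: re-encode ASCII-only text as UTF-16LE with a leading BOM.
std::vector<std::uint8_t> to_utf16le_with_bom(const std::string& ascii)
{
    std::vector<std::uint8_t> out = {0xFF, 0xFE}; // U+FEFF in little-endian byte order
    for (unsigned char c : ascii) {
        assert(c < 0x80);    // this sketch only handles 7-bit ASCII input
        out.push_back(c);    // low byte carries the ASCII value
        out.push_back(0x00); // high byte is zero for ASCII
    }
    return out;
}
```

For an n-byte ASCII string the result is 2 + 2n bytes, so a large text file roughly doubles in size.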

From VS Code, I have tried going to File > Preferences > Settings, but the "Files: Encoding" option is already UTF-8, not UTF-16 BOM.

So why is it forcing UTF-16 BOM on my stdout? I appreciate any help.
I'm trying to use new technology and it just frustrates me...
Of course, as soon as I post this, I have an epiphany: It isn't vscode messing up my standard output stream, it's the PowerShell session that the vscode terminal runs on top of. I think Microsoft just hates its 'power' users.

https://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom

And so now, my plea for help is for someone to tell me how to get PowerShell to output UTF-8 by default, not UTF-16 with a BOM. Either that, or I'll just try to get vscode to not use PowerShell.

Update:
I briefly went down the PowerShell rabbit hole, but decided it wasn't worth descending into that circle of hell.
https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8

I updated vscode to use cmd instead of PowerShell:
https://stackoverflow.com/questions/42729130/visual-studio-code-how-to-switch-from-powershell-exe-to-cmd-exe
Hariprasath Yadav wrote (modified):
1. Press Ctrl + Shift + P to show all commands.
2. Type profile in the displayed text box to filter the list.
3. Select Terminal: Select Default Profile.
4. Select Command Prompt (cmd.exe)
5. Click the Delete Icon (garbage can) in the shell pane to remove the existing terminal.
6. Press Ctrl + ` (or use the menu View → Terminal) to open a new terminal pane.


and my problem is solved. Maybe this will help someone else in the future.
First of all, if you want UTF-8 output from your console application, I think you should explicitly set your stdout to UTF-8 mode. If you don't do this, then the C runtime will just use whatever default "ANSI" code page happens to be configured on the system! This is probably something like Windows-1252. The reason Notepad++ detects this output as “UTF-8” is that it makes a best guess: pure US-ASCII is valid UTF-8 too, and UTF-8 is preferred these days, so it just shows “UTF-8” as long as the content might be valid UTF-8.

But you will see the difference, if you try to print some characters that don't fit into Windows-1252:

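#include <fcntl.h>  // _O_U8TEXT
#include <io.h>     // _setmode
#include <stdio.h>  // stdout, _fileno
#include <wchar.h>  // fputws

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT); // switch stdout to UTF-8 text mode (MSVC CRT)
    fputws(L"Line #1\n", stdout);
    fputws(L"\u2668\ufe0f\n", stdout);    // U+2668 U+FE0F, not representable in Windows-1252
    fputws(L"Line #2\n", stdout);
}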


Try the above with and without the _setmode() command, and you'll know what I mean 😏

And please try this with a cmd.exe shell, to avoid confusion.
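
To check what actually lands in the output, it helps to know which bytes UTF-8 mode should produce for that second line. Here is a small portable sketch of the UTF-8 encoding rule; `encode_utf8` is a name I made up for illustration, it is not part of the CRT:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp)
{
    // Only Unicode scalar values are valid input (no surrogates, max U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this rule, U+2668 becomes the bytes E2 99 A8 and U+FE0F becomes EF B8 8F, so a hex dump of the redirected file should show those six bytes between the two ASCII lines.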

_____

Secondly, I think UTF-16 is always supposed to have a BOM, because UTF-16 can be LE (little endian) or BE (big endian) and the BOM at the start is used to signal the actual endianness. Meanwhile, UTF-8 does not need a BOM because it is just individual bytes (maybe several bytes in a row that make up a single character, but still individual bytes), so it does not have an "endianness". It is still possible to have a sort of “fake” BOM in UTF-8, just to signal that it is UTF-8, but that is not recommended. The encoding should be signalled “out of band”.
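
For reference, the BOM is the code point U+FEFF, and its byte pattern depends on the encoding: FF FE in UTF-16LE, FE FF in UTF-16BE, and EF BB BF in UTF-8. A minimal sketch of a check that a consuming program could run on the first bytes of its input (`detect_bom` is a hypothetical name, and the sketch deliberately ignores UTF-32, whose little-endian BOM also starts with FF FE):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: name the encoding signalled by a leading BOM, if any.
std::string detect_bom(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    // Read a byte as an unsigned value so comparisons against 0xEF etc. work.
    auto at = [&](std::size_t i) {
        assert(i < n);
        return static_cast<unsigned char>(bytes[i]);
    };
    if (n >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return "UTF-8";
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return "UTF-16BE";
    return "none";
}
```

Plain ASCII or BOM-less UTF-8 input yields "none", which is the case the original poster's second program expects.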

_____

Finally, PowerShell. You have to note that there is “Windows PowerShell” (Version 5.x) and there is “PowerShell” (Version 7.x). The “Windows PowerShell” (Version 5.x) is included with Windows for legacy reasons, but is not developed anymore and very outdated. Meanwhile, “PowerShell” (Version 7.x) is the modern variant and the one that you need to install separately.

“Windows PowerShell” (Version 5.x) has a lot of problems with how it interprets the output (stdout) of a command and how it converts that output into Unicode strings for internal processing. Also, this conversion always happens 😖

So, most people recommend you should just do this in Windows PowerShell to get around undesired conversions:
PS> cmd.exe /c "program.exe > out.txt"

At the same time, with “PowerShell” (Version 7.x) it just works as it is supposed to work, as you can see here:
https://i.imgur.com/rEZXDt0.png

_____

Another method that seems to work with the classic “Windows PowerShell” (Version 5.x):
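#include <fcntl.h>  // _O_U16TEXT
#include <io.h>     // _setmode, _isatty
#include <stdio.h>  // stdout, _fileno
#include <wchar.h>  // fputws

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT); // switch stdout to UTF-16 text mode (MSVC CRT)
    if (!_isatty(_fileno(stdout))) {       // only when redirected or piped, not on a console
        fputws(L"\uFEFF", stdout); // <-- important, because otherwise PowerShell gets the endianness wrong!
    }
    fputws(L"Line #1\n", stdout);
    fputws(L"\u2668\ufe0f\n", stdout);
    fputws(L"Line #3\n", stdout);
}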

...and then:
PS> .\Application.exe | Set-Content out.txt -Encoding utf8

So, we have to make our application output UTF-16 (with a BOM!) for Windows PowerShell to properly parse it and not mangle our output. In the second step, we then use Set-Content with the -Encoding switch to explicitly convert it to UTF-8 for storage.
Try the above with and without the _setmode() command, and you'll know what I mean 😏

without _setmode, it shows:
Line #1
Line #2


with _setmode, it shows:
Line #1
⍰⍰
Line #2
(yes, I replaced the actual unrenderable data with literal question mark boxes)

But I see that if I paste it into a browser, the part that doesn't fit into Windows-1252 is actually a "♨️" character. Interesting, good to know. Thank you.