(By Disch) Don't write any variable larger than 1 byte to binary files


Hi everyone!
I had some problems with binary files, so I created a forum topic, and Disch was a great help. I thought his post deserved better than to stay buried in that one topic (a link to the topic is at the bottom of the article).
This article is background for this one:
Disch's tutorial to good binary files
Instead of "HOW to write data to binary files", this article covers "WHY we shouldn't write variables and data larger than 1 byte to binary files".
Here we go:

When you do a raw write of a block of memory, write() will look at the pointer you give it and blindly start copying X bytes to the file. This sort of works for POD (plain old data) types... but it utterly fails for complex types (like strings).

Let's take a look at why.

****Why you should not read/write complex non-POD structs/classes****

Reason #1: Complex types may contain dynamically allocated memory or other pointers

Here's a simplistic example:

class Foo
{
private:
    int* data;

public:
    Foo() { data = new int[10]; }
    ~Foo() { delete[] data; }
};

Here... our Foo class conceptually contains information for 10 ints (~40 bytes). Yet if you do sizeof(Foo)... it'll probably give you the size of one pointer (~4 bytes on a 32-bit build, ~8 bytes on a 64-bit one).

This is because the Foo class does not contain the data it's referring to... it merely contains a pointer to it. Therefore... a naive write to a file would simply write the pointer and not the actual data.

Attempting to read that data later would just result in having a pointer that points to random memory.

This is similar to what is happening with strings. The string data is actually not in the string class... but rather it is allocated dynamically.
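To make the mismatch concrete, here is a minimal sketch built on the same Foo. The deleted copy operations and the two helper functions (rawWriteSize and payloadSize) are my own additions for illustration, not part of the original post.

```cpp
#include <cstddef>

// Same shape as the Foo above: the object itself holds only a pointer.
class Foo
{
private:
    int* data;

public:
    Foo() { data = new int[10]; }
    ~Foo() { delete[] data; }

    // Added for safety in this sketch: copying would double-delete data.
    Foo(const Foo&) = delete;
    Foo& operator=(const Foo&) = delete;
};

// Hypothetical helper: the number of bytes a raw write of a Foo would copy.
constexpr std::size_t rawWriteSize() { return sizeof(Foo); }

// Hypothetical helper: the size of the ten ints Foo conceptually owns.
constexpr std::size_t payloadSize() { return 10 * sizeof(int); }
```

On typical compilers rawWriteSize() is just the size of one pointer, while payloadSize() is several times larger... so a raw write never touches the ints at all.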

Reason #2: Non-POD types may contain VTables and other "hidden" data that you absolutely must not touch

Trivial example:

class Foo
{
public:
    virtual ~Foo() { }
    int x;
};

sizeof(Foo) is likely going to be larger than sizeof(int) because Foo is now polymorphic... meaning each object typically carries a hidden pointer to a VTable. VTables are black magic and you absolutely must not tinker with them or you risk destroying your program.

But again... a naive read/write doesn't acknowledge that... and will simply try to read/write the full object... vtable and all. Resulting in massive screw ups.

So yeah. Naive reads/writes do not work with complex types unless they are POD.
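If you want the compiler to tell you whether a type is even a candidate for a raw byte copy, one option is std::is_trivially_copyable from <type_traits>. This is a sketch of that idea, not something from the original post. Plain, Poly and safeForRawCopy are hypothetical names. Note that passing this check still does not protect you from the padding, size and endianness problems described further down.

```cpp
#include <type_traits>

// A plain struct with no hidden data.
struct Plain
{
    int x;
};

// Same as the polymorphic Foo above, renamed so both can coexist.
class Poly
{
public:
    virtual ~Poly() { }
    int x;
};

// Hypothetical helper: true only if copying the raw bytes of T is
// well-defined at all. It says nothing about portability of those bytes.
template <typename T>
constexpr bool safeForRawCopy()
{
    return std::is_trivially_copyable<T>::value;
}

static_assert(safeForRawCopy<Plain>(), "Plain can be copied byte by byte");
static_assert(!safeForRawCopy<Poly>(), "Poly has a virtual destructor");
```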

But if you noticed, I said earlier that POD types only "sort of" work. What do I mean by that?

****Why you should not read/write POD structs/classes****

Well let's take a look at another trivial example:

struct Foo
{
    char a;  // 1 byte
    int b;   // 4 bytes
    char c;  // 1 byte
};

Here we have a POD struct. It would not suffer from any of the problems previously mentioned. I added comments to show how many bytes each individual var might take (technically this may vary, but it's typical).

So if a struct is just a collection of all these vars... you would expect the size of the struct to be equal to the sum of all of them... right? So sizeof(Foo) would be 6?

Well... on my machine sizeof(Foo) is 12. SURPRISE!

What's happening is that the compiler is adding padding to the struct so that variables are aligned on certain memory boundaries. This makes accessing them faster.
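You can ask your own compiler where it put the padding. The following is a small sketch using sizeof and offsetof. The names sumOfMembers, paddingBytes and printLayout are made up for this example, and the numbers you get depend on your platform.

```cpp
#include <cstddef>
#include <iostream>

struct Foo
{
    char a;
    int b;
    char c;
};

// What you might naively expect sizeof(Foo) to be.
constexpr std::size_t sumOfMembers = sizeof(char) + sizeof(int) + sizeof(char);

// Whatever is left over was inserted by the compiler as padding.
constexpr std::size_t paddingBytes = sizeof(Foo) - sumOfMembers;

// Hypothetical helper: print where each member actually lives.
void printLayout()
{
    std::cout << "sizeof(Foo)  = " << sizeof(Foo) << '\n'
              << "offset of a  = " << offsetof(Foo, a) << '\n'
              << "offset of b  = " << offsetof(Foo, b) << '\n'
              << "offset of c  = " << offsetof(Foo, c) << '\n'
              << "padding      = " << paddingBytes << '\n';
}
```

Calling printLayout() on two different compilers or targets is a quick way to see whether they agree on the layout.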

So when you do a naive, raw write to a file, it will also write the padding bytes. Of course when you read it... you'll read the padding bytes and it'll work as you'd expect.

So why did I say it only sorta works?

Well consider the following situation.

- You run your program and save a bunch of files.
- You port your program to another platform and/or change or update your compiler
- This new compiler happens to assign different padding to the struct
- You run the newly compiled program and try to load the files you saved in the old version of your program

Since the padding changed, the data is read differently (more or less data is read, or the padding is in different spots) - so the read fails and you get garbage.

There are ways you can tell the compiler to leave off the padding. But that raises other problems I won't get into now. Let's just say that memory alignment is important.

So okay... simply put... it's not a great idea to read/write structs in full. So just reading/writing individual vars works... right?

Well.....

****Why you should not read/write any variable larger than 1 byte****
There are 2 things you have to watch out for.

#1: ill-defined size of variables. int might be 4 bytes depending on your platform/compiler... or it might be 2 bytes or it might be 8 bytes.

So reading/writing a full int suffers the same problems as the 'padding' scenario above. If you have a file saved with version X of your program, then rebuild in version Y where int suddenly changed size on you.... your file will not load any more.

This can be solved by using the <cstdint> types like uint8_t, uint16_t, etc., each of which is guaranteed to have exactly the stated number of bits.
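As a quick sanity check, you can have the compiler confirm those sizes for you. This sketch assumes a platform with 8-bit bytes (which covers practically every desktop and mobile target). byteCount is a hypothetical helper name.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// The byte counts below only hold when a byte is 8 bits wide.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical helper: how many bytes a value of type T occupies.
template <typename T>
constexpr std::size_t byteCount()
{
    return sizeof(T);
}

// Unlike int, these never change from one compiler to the next.
static_assert(byteCount<std::uint8_t>()  == 1, "uint8_t is 1 byte");
static_assert(byteCount<std::uint16_t>() == 2, "uint16_t is 2 bytes");
static_assert(byteCount<std::uint32_t>() == 4, "uint32_t is 4 bytes");
static_assert(byteCount<std::uint64_t>() == 8, "uint64_t is 8 bytes");
```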

#2: endianness. Memory consists of a series of bytes. How an int is stored in memory is how it is stored in the file when you do a raw write. But how the int is stored in memory varies depending on the machine you're running on.

x86/x64 machines are little endian. So if you have int foo = 1;, foo will look like this in memory:
01 00 00 00
So if you save 'foo' to a file on your x86 machine.. then hand that file off to your buddy who is running a big endian machine... he'll read those same four bytes back in the same order.

However.. on a big endian machine.. 01 00 00 00 is not 1.... it's 0x01000000.. or 16777216
So yeah... your load fails and your program explodes.
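Here is a small sketch of that arithmetic. It takes the same four bytes and assembles them both ways using shifts, so it gives the same answer no matter what machine it runs on. The function names fromLittleEndian and fromBigEndian are made up for this example.

```cpp
#include <cstdint>

// Interpret four bytes, in file order, as a little endian 32-bit value:
// the first byte is the least significant one.
constexpr std::uint32_t fromLittleEndian(std::uint8_t b0, std::uint8_t b1,
                                         std::uint8_t b2, std::uint8_t b3)
{
    return  static_cast<std::uint32_t>(b0)
         | (static_cast<std::uint32_t>(b1) << 8)
         | (static_cast<std::uint32_t>(b2) << 16)
         | (static_cast<std::uint32_t>(b3) << 24);
}

// Interpret the same four bytes as a big endian 32-bit value:
// the first byte is the most significant one.
constexpr std::uint32_t fromBigEndian(std::uint8_t b0, std::uint8_t b1,
                                      std::uint8_t b2, std::uint8_t b3)
{
    return (static_cast<std::uint32_t>(b0) << 24)
         | (static_cast<std::uint32_t>(b1) << 16)
         | (static_cast<std::uint32_t>(b2) << 8)
         |  static_cast<std::uint32_t>(b3);
}

// The bytes 01 00 00 00 from the example above, read both ways.
static_assert(fromLittleEndian(0x01, 0x00, 0x00, 0x00) == 1u, "little endian");
static_assert(fromBigEndian(0x01, 0x00, 0x00, 0x00) == 16777216u, "big endian");
```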

This is why I make it a point to never read/write anything larger than a single byte to a binary file. Doing so ensures that your files load the same way regardless of platform or compiler.
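To give a rough idea of what "one byte at a time" looks like, here is a minimal sketch for a 32-bit unsigned value. It is my own illustration and not code from the original post. The names writeU32, readU32 and roundTripCheck are hypothetical, the low-byte-first order is just a choice this sketch makes, and a std::stringstream stands in for a real file. Error handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: always emits the low byte first, so the file
// contents do not depend on the endianness of the machine writing it.
void writeU32(std::ostream& out, std::uint32_t v)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.put(static_cast<char>((v >> shift) & 0xFF));
}

// Hypothetical helper: reads four bytes back in the same order.
// A failed read (EOF) is not detected in this sketch.
std::uint32_t readU32(std::istream& in)
{
    std::uint32_t v = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        const int byte = in.get();
        v |= static_cast<std::uint32_t>(static_cast<unsigned char>(byte)) << shift;
    }
    return v;
}

// Quick self-check: the value survives a trip through an in-memory stream.
void roundTripCheck(std::uint32_t v)
{
    std::stringstream s(std::ios::in | std::ios::out | std::ios::binary);
    writeU32(s, v);
    assert(readU32(s) == v);
}
```

With a real file you would pass a std::ofstream or std::ifstream opened with std::ios::binary instead of the stringstream.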

With that in mind.... I wrote an article that explains how to do all your binary file IO with just reading/writing individual bytes. This includes how to read/write strings.

The article is here:

http://www.cplusplus.com/articles/DzywvCM9/

And this is the original forum post made by Disch :
http://www.cplusplus.com/forum/beginner/108114/#msg587223
