File formatting: chars as numbers

I'm serializing a suffix tree. Some of the data is stored as chars to reduce memory requirements. As a result, when I write chars to files and read numbers from files into chars, the insertion and extraction operators interpret them as characters, not numbers. (There are reasons I'm not doing file I/O in binary mode with .write/.read.)

The obvious fix for writing is to cast the chars as shorts. And for reading, first read into a short, and then cast it as a char and save it. This feels a bit hackish, though, especially since reading goes from a single statement to two statements and requires an intermediate variable.

If this was C, I'd just use fprintf and fscanf with "%d" so that the data is always interpreted the way I want. Example:
fscanf (filePtr, "%d", &num);

It seems like there should be a similar method for C++, where some flag or something indicates my wish to interpret data as an integer, not a character.

Any ideas?

kbw (9488)

char c = 7;
std::cout << int(c);

zytrex (9)

I appreciate your reply, but I don't think you noticed the details of my question.

helios (17607)

This feels a bit hackish

"Hackish"? Not at all.
AFAIK, there's no way to read formatted input the way you want without first passing through a proper integer type. The operator>>() overload will assume that, since you passed a char, you want to interpret the input as a character, which is what makes the most sense in most situations.
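A minimal sketch of the "pass through a proper integer type" approach described above, assuming the values are meant to fit in an unsigned char. The helper name read_byte_as_number is hypothetical, not something from the standard library; the range check is an extra that the thread does not ask for.

```cpp
#include <cassert>
#include <istream>
#include <limits>
#include <sstream>

// Hypothetical helper: reads decimal text from `in` and stores it in `out`
// only if the value fits in an unsigned char. Returns false otherwise.
bool read_byte_as_number(std::istream& in, unsigned char& out) {
    int temp = 0;
    if (!(in >> temp)) {
        return false;  // not a number, or end of input
    }
    if (temp < 0 || temp > std::numeric_limits<unsigned char>::max()) {
        in.setstate(std::ios_base::failbit);  // report out-of-range input
        return false;
    }
    out = static_cast<unsigned char>(temp);
    return true;
}

// Small self-check: the text "8" is stored as the value 8, not as '8' (56).
void demo_read_byte() {
    std::istringstream in("8");
    unsigned char b = 0;
    assert(read_byte_as_number(in, b));
    assert(b == 8);
}
```

The intermediate int still exists, but it is hidden inside the helper, so each call site stays a single statement.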

By the way, I believe that lying in your format string (such as passing "%d" and then a floating point, or "%d" and then something of a different size) gets you undefined behavior.

zytrex (9)

That's unfortunate if it's true. It seems like this ability should be among the other formatting features of the insertion and extraction operators. Along with "left", "setw" and "hex" should be, perhaps, "integer" and "character".

As for lying in my format string, I think you misunderstood. I was not suggesting reading data of different sizes into different types. I was saying that if I'm reading from a text file into a char variable, using fscanf I could use "%d" so that the byte would be interpreted as an integral value or "%c" so the byte would be interpreted as an ascii character.
http://www.cplusplus.com/reference/clibrary/cstdio/fscanf/

kbw (9488)

You're asking fscanf to treat your char as an int. Are you sure it's writing to an int?

helios (17607)

As for lying in my format string, I think you misunderstood. I was not suggesting reading data of different sizes into different types. I was saying that if I'm reading from a text file into a char variable, using fscanf I could use "%d" so that the byte would be interpreted as an integral value or "%c" so the byte would be interpreted as an ascii character.

But this is what you want to do, right?

char c;
fscanf(file,"%d",&c);

That's what I meant by lying. You didn't pass an int *, you passed a char *. fscanf() would then proceed to

/* parse into (int)input */
int *p=(int *)variadic_parameter;
*p=input; /* Whoops! Buffer overflow! */

zytrex (9)

I guess I didn't properly explain. Char, short, int, long long, etc. are all used to store integral values, that is 0, 1, 2, etc. The difference is the range of integral values they can store. Char is 1 byte (2^8 values), short is 2 bytes (2^16 values), int is usually 4 bytes (2^32), long long is 8 bytes (2^64).

Char is special in that when the insertion and extraction operators see a char being written to an ostream or read from an istream, they interpret the number 65, for example, as the ASCII character 'A'.

char c = 65;
cout << c;

char c;
cin >> c;
// If user types A, c will store the value 65

For my program, I'm using the type "unsigned char" to store several integral values (0-255) in nodes in a suffix tree. I'm doing this to save memory since there are hundreds of millions of nodes. The problem arises when I want to save and load this data to and from disk.

(As I said before, there are certain reasons I'm not doing this in binary mode which have no impact on my actual question.)

So, if I have some value in a node that's being written to disk, the following happens.

/* ... program ... */

Node nd = Node();
nd.begin = 5; // nd.begin is an unsigned char type (0-255) used to index into a string.

/* ... program ... */

//// function "serialize" ////

fout << nd.begin;
// Because nd.begin is a char type, a character will be written to disk,
// not the number 5. See www.asciitable.com

// Simple fix is the following:

fout << int(nd.begin);
// Casts the char type as an int, that way the number 5 will be written to disk

/* ... program ... */

//// function "unserialize" ////

// Say, for example, the next thing that will be read from disk is the number 8.
// That is, if you open the file in a text editor, you will see an 8 there

fin >> nd.begin;
// Because nd.begin is a char, instead of storing the integral value 8 in nd.begin,
// it will store the value 56, which corresponds to the ascii character 8. Again,
// see www.asciitable.com

// Naive solution

int temp;
fin >> temp;
nd.begin = temp;
// By first reading into the int, the text '8' in the file will be stored in "temp" as
// the integral value 8. Then nd.begin is assigned that value from "temp". The
// problem is this requires the use of an extra intermediate variable, "temp".

Yes, I realize it is only one extra assignment, but the nodes have several of these char variables and there are hundreds of millions of nodes. And even if you think that doesn't matter, I would still like to know of a better solution at least for academic reasons.

If this were simply C, I could do the following:

/* ... program ... */

Node nd = Node();
nd.begin = 5; // nd.begin is an unsigned char type (0-255) used to index into a string.

/* ... program ... */

fprintf( file, "%d", nd.begin );
// Even though nd.begin is a char, the "%d" will make the data be written as,
// an integral value. So in the output file you will see a 5.

/* ... program ... */

// Again, for example, say the next thing to be read from the file is an 8.

fscanf( file, "%d", &(nd.begin) );
// Again, even though nd.begin is a char, the "%d" will make the value
// 8 be stored in nd.begin, instead of 56.

Since this is not C, I won't be using fprintf and fscanf, as they use a different type of file handle and I need to use fstream for other things as well.

My point is, in C there is a very simple mechanism for making a value be interpreted as an ASCII character or a 1 byte integral value. It seems like there should be a similar method in C++, one that doesn't force you to cast a char as something else when writing nor use an intermediate variable when reading.
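For what it's worth, something close to the "integer" manipulator wished for here can be sketched in user code. The names AsNumber and as_number below are hypothetical, not part of the standard library, and the sketch assumes the field is an unsigned char. The cast and the temporary int are still there; they are just moved inside the overloaded operators.

```cpp
#include <cassert>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>

// Hypothetical wrapper: refers to a byte that should be streamed as a number.
struct AsNumber {
    unsigned char& value;
};

inline AsNumber as_number(unsigned char& c) { return AsNumber{c}; }

// Writes the byte as decimal text, e.g. the value 5 becomes "5".
inline std::ostream& operator<<(std::ostream& out, const AsNumber& n) {
    return out << static_cast<int>(n.value);
}

// Reads decimal text into the byte; out-of-range input sets failbit
// and leaves the byte unchanged.
inline std::istream& operator>>(std::istream& in, AsNumber n) {
    int temp = 0;
    if (in >> temp) {
        if (temp < 0 || temp > std::numeric_limits<unsigned char>::max()) {
            in.setstate(std::ios_base::failbit);
        } else {
            n.value = static_cast<unsigned char>(temp);
        }
    }
    return in;
}

// Small self-check: write a byte as text, then read it back.
void demo_round_trip() {
    unsigned char begin = 5;
    std::stringstream file;
    file << as_number(begin) << ' ';
    assert(file.str() == "5 ");

    unsigned char loaded = 0;
    file >> as_number(loaded);
    assert(loaded == 5);
}
```

With this in place, the serialize and unserialize calls would read fout << as_number(nd.begin); and fin >> as_number(nd.begin);, one statement each.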

Kyon (912)

I'm not seeing the problem in casting to an int (or similar). There's no way you can say that it's "hacky" or "unprofessional", if you made the choice to use a char (for the obvious "benefit" of less memory space), you had your reason. If anyone would tell you to scrap parts that are easy to understand, to the point and more efficient because it doesn't look professional, eat them.

helios (17607)

Yes, I realize it is only one extra assignment, but the nodes have several of these char variables and there are hundreds of millions of nodes.

A MOV operation takes 1 cycle (possibly less). For a 1 GHz CPU, 10^8 MOV operations is .1 seconds.

fscanf( file, "%d", &(nd.begin) );
// Again, even though nd.begin is a char, the "%d" will make the value
// 8 be stored in nd.begin, instead of 56.

Do you take me for an idiot? I know what the printf()/scanf() family of functions do. That line you have there is unsafe.
Consider this function:

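#include <stddef.h> /* size_t */
#include <stdint.h> /* uint8_t, uint16_t, uint32_t */

void f(size_t s,void *p){
    switch (s){
        case 1:
            *(uint8_t *)p=0x12;
            break;
        case 2:
            *(uint16_t *)p=0x3456;
            break;
        case 4:
            *(uint32_t *)p=0x789ABCDE;
            break;
        default:
            break; /* a label must be followed by a statement */
    }
}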

If you don't pass the right value through s, the function produces undefined behavior. This is almost exactly what printf() does.

int a;
short b;
char c;
f(sizeof(int),&a);   //ok
f(sizeof(short),&b); //ok
f(sizeof(int),&c);   //wrong!

zytrex (9)

Helios, please don't get upset. I do appreciate your efforts in helping me resolve this. If I thought you guys were idiots, I wouldn't bother asking you questions. About the actual time involved in the extra statement, I realize it's nearly immeasurable. As I said before, my interest in this is also academic.

But I'm confused because in two of your responses you explained to me that it's unsafe to read an int (4 bytes) into a char (1 byte). I'm not an idiot either. I never once suggested doing that. What I have been saying all along is I want to read 1 byte of data into a char, and have that data be seen as an integral value, not an ascii character.

char c;
f( sizeof(int), &c ); //wrong!

Yes, that's wrong. But I'm not doing that. Using your example, the corresponding function call would be:

char c;
f( sizeof(char), &c );

The problem has never been about matching types with the correct number of bytes. It's simply about my desire to be able to use a 1 byte type (char) without automatic translation to ascii characters.

You're not suggesting that the "%d" in fscanf implies that the destination type is larger than 1 byte, are you? You can use fscanf to read integral values into shorts, ints, and long longs just fine. Clearly it must take the destination type into account. So why would it not be able to write a 1 byte integral value into a char variable?

helios (17607)

Clearly it must take the destination type into account.

If it did, you wouldn't need the format string to begin with. This is fscanf()'s prototype: int fscanf(FILE *stream,const char *format,...);
The function uses the format string, and only the format string to figure out what you're passing. If you lie in the format string, there's no telling what can happen.
For example, run this. I assure you you will get different output depending on whether the CPU is, to name two architectures, an x86 or a PowerPC:

#include <cstdio>

int main(){
	char a[4]={0,1,2,3};
	int b[4];
	for (int i=0;i<4;i++)
		b[i]=(unsigned char)a[i];
	printf("%02X%02X %02X%02X\n",b[0],b[1],b[2],b[3]);
	sscanf("4022250974","%d",&(a[0]));
	for (int i=0;i<4;i++)
		b[i]=(unsigned char)a[i];
	printf("%02X%02X %02X%02X\n",b[0],b[1],b[2],b[3]);
	return 0;
}

zytrex (9)

DEAD BEEF. Very nicely done.

Anyway, it seems you are demonstrating what happens when you write a number into a variable too small to hold it. 4022250974 obviously cannot be stored in a char. But as I've said several times, the integral values I'm storing in these unsigned chars are in the range of 0 to 255.

char a;
sscanf("123", "%d", &a);

What's wrong with that?

Disch (13742)

when you do "%d", scanf writes an int (4 bytes), even if you give it a smaller type. scanf has no way to know the actual size of the type you passed it, so it assumes it's an int. If you give it something smaller, it will overflow. That's why this is unsafe.

helios example again hopefully simplified:

char a[4] = {0,1,2,3};
sscanf("0","%d",&a[0]);  // you might think this changes only a[0]

printf("%02X %02X %02X %02X",a[0],a[1],a[2],a[3]); // surprise!
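For contrast, here is a sketch of a safe version of the same read, assuming the goal is to get a small decimal number into a single byte. The helper name parse_byte is hypothetical. "%d" is given a real int to write into, and only the checked value is copied into the byte, so the neighbouring array elements are left alone.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: parses decimal text into one byte via a real int.
bool parse_byte(const char* text, unsigned char& out) {
    int temp = 0;
    if (std::sscanf(text, "%d", &temp) != 1) {
        return false;  // no number found
    }
    if (temp < 0 || temp > 255) {
        return false;  // would not fit in an 8-bit byte
    }
    out = static_cast<unsigned char>(temp);
    return true;
}

// Small self-check: only a[0] changes.
void demo_neighbors_untouched() {
    unsigned char a[4] = {0, 1, 2, 3};
    assert(parse_byte("123", a[0]));
    assert(a[0] == 123 && a[1] == 1 && a[2] == 2 && a[3] == 3);
}
```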

helios (17607)

Anyway, it seems you are demonstrating what happens when you write a number into a variable too small to hold it. 4022250974 obviously cannot be stored in a char. But as I've said several times, the integral values I'm storing in these unsigned chars are in the range of 0 to 255.

It seems you haven't the faintest idea about what the compiler does or how it does it, and yet still seem determined to teach me about it. I suggest you stop and experiment before you make any more mistakes, such as smashing your stack, or saying stupid things to me.

zytrex (9)

"when you do "%d", scanf writes an int (4 bytes)"

Thank you Disch, I understand now. I know there are modifiers (h, l) for %d for integral sizes other than 4 bytes, but your statement prompted me to double check. Now I see that, while there are modifiers for short and long, there is not one for 1 byte.

Helios, thanks again for the help. Calling me stupid sure has helped me learn. Though, if I might make a suggestion, it would have simplified things a great deal if you had simply said that scanf with %d will always write more than 1 byte, making writing to a char variable that way unsafe. You know, like what Disch said.
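On the question of a 1-byte modifier: C99 added the hh length modifier, so "%hhd" expects a signed char * and "%hhu" an unsigned char *. The sketch below assumes a C library that implements it (C++11's <cstdio> is specified in terms of C99, but older toolchains may not support hh). The helper name parse_byte_hh is hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: "%hhu" tells sscanf the destination is an unsigned char,
// so the format string and the argument type agree.
bool parse_byte_hh(const char* text, unsigned char& out) {
    return std::sscanf(text, "%hhu", &out) == 1;
}

// Small self-check: only a[0] is written.
void demo_hh() {
    unsigned char a[4] = {0, 1, 2, 3};
    assert(parse_byte_hh("123", a[0]));
    assert(a[0] == 123 && a[1] == 1 && a[2] == 2 && a[3] == 3);
}
```

Note that this does no range checking: the sketch is only safe if the text is known to hold a value that fits in the byte. Reading through an int and checking the range is the more defensive route.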

helios (17607)

Calling me stupid sure has helped me learn. Though, if I might make a suggestion, it would have simplified things a great deal if you had simply said that scanf with %d will always write more than 1 byte, making writing to a char variable that way unsafe.

I said the same thing three times in three different ways. The second time was exactly that, by the way:

/* parse into (int)input */
int *p=(int *)variadic_parameter;
*p=input; /* Whoops! Buffer overflow! */

With all that, you kept insisting you were right, in spite of the fact that you obviously have no idea how variadic parameters or even assignment work, and kept trying to correct me instead of verifying your facts.
I'm not just calling you stupid, I'm calling you an ignorant who doesn't know he's ignorant.

zytrex (9)

Helios, you only showed examples of code that would cause an overflow by writing an int into a char. You never once said that scanf %d cannot be used to write into a char because it will always write more than 1 byte. You just kept saying you cannot put an int into a char. So I kept saying that I'm not trying to put an int into a char. And because you kept insisting on that, I figured you must have misunderstood what I was trying to do. That's why I tried to explain to you more completely what I was doing.

If you're so smart and I'm so stupid, then why was someone else able to immediately understand what I was confused about and solve it so simply?

"when you do "%d", scanf writes an int (4 bytes), even if you give it a smaller type. scanf has no way to know the actual size of the type you passed it, so it assumes it's an int. If you give it something smaller, it will overflow. That's why this is unsafe."

Straightforward, concise and helpful. You never said that or anything close to that. You just kept giving examples of code that would lead to bugs without ever explaining the cause.

It's as if I asked, "Why does the following code cause an overflow?"

char a [4] = "1234";

And your response was, "Because you cannot fit 5 characters into a character array of size 4."
Then I said, "But I'm not trying to fit 5 characters."
To which you responded with something rude and unhelpful.

Eventually, someone other than you looked and said, "That "1234" actually gets a null terminator added to the end of it, "1234\0", making it five characters. You need to leave room in the array for the null terminator." A clear and thorough explanation, the opposite of what you provided.

While you clearly know what you're talking about, Helios, you seem completely ignorant of the fact that you can give poor explanations and sometimes lack the ability to understand what is confusing the person you are trying to help. And when someone does not understand your explanation, you become very rude and start talking down to them, never once considering you might be at fault for giving a poor and confusing explanation.

You also seem ignorant of the fact that these forums are here to help people, people who are struggling with this problem or that, and are looking for someone to provide a little guidance. And you keep calling me ignorant. Are you a fool? Of course I'm ignorant! That's why I'm asking for help! Nobody knows everything. We are all, by definition, ignorant of the things we do not know. What you are especially ignorant of is how to be gracious and courteous. Instead you are rude and abusive.

helios (17607)

This isn't the beginners section. I don't have to assume you don't know how to do something in particular other than what you asked about.

It's as if I asked, "Why does the following code cause an overflow?"
And your response was, "Because you cannot fit 5 characters into a character array of size 4."
Then I said, "But I'm not trying to fit 5 characters."
To which you responded with something rude and unhelpful.

You didn't say "but I'm not trying to write an int to a char". You said that, and then proceeded to explain to me what fscanf() does. Incorrectly, of course.
And "rude and unhelpful"? Aside from the first few sentences, the post was dedicated to giving a simplified look into how variadic functions work, which I had just realized was what you didn't understand. Should I have assumed you also didn't know how assignment to pointers works? No! If you're a beginner, post in the beginners section.
And besides,

Of course I'm ignorant! That's why I'm asking for help!

When you were in school and the teacher said something that didn't make sense to you, did you get up from your seat and explain your version to him/her? No? Then may I ask WTF are you doing?
If you're ignorant, you don't question what you're told. If I say you're writing five characters, you're writing five goddamn characters. Not four or six. Instead of arguing with me and wasting my time, go and research why I'm saying that. Failing that, ask me "why do you say I'm writing five characters?" and I'll gladly explain to you why in as much detail as necessary until you understand why.

While you clearly know what you're talking about, Helios, you seem completely ignorant of the fact that you can give poor explanations and sometimes lack the ability to understand what is confusing the person you are trying to help.

I answer what I'm asked. No more and no less. You want more? Ask for more and it will be given to you. You want less? Ask for less, but don't complain about the answers being lacking.
If you're going to deal with programmers, you're going to have to learn eventually that we're very literal people. Ask the right questions and you'll get the right answers; ask the wrong questions and you'll get useless answers or confused looks. And we can't read your mind. Although sometimes it may seem otherwise, that's just undefined behavior and you shouldn't rely on it.

zytrex (9)

This isn't the beginners section. I don't have to assume you don't know how to do something in particular other than what you asked about.

Granted. But I'm not a beginner anyway. As I said in my original post, I'm serializing the data from a generalized suffix tree. That's hardly beginner stuff. But just because you have experience doesn't mean there aren't things you don't know, and of course things you don't even realize you don't know. And these are C++ forums, not C. I'm no expert in C. I mentioned fscanf merely as an analogy. The truth is, none of our discussion of fscanf actually addressed my original question.

You didn't say "but I'm not trying to write an int to a char".

I most certainly did. Here's the very beginning of my original question:

When saving the suffix tree to disk, the data stored in a char takes up 1 byte. So when later loading that data from disk, of course it will fit into a char. In another response I said:

As for lying in my format string, I think you misunderstood. I was not suggesting reading data of different sizes into different types.

And then, in my long example (some of the comments not shown here):

Node nd = Node();
nd.begin = 5; // nd.begin is an unsigned char type (0-255)

fout << nd.begin;
// Because nd.begin is a char type, a character will be written to disk, not the number 5.

fin >> nd.begin;
// Because nd.begin is a char, instead of storing the integral value 8 in nd.begin,
// it will store the value 56, which corresponds to the ascii character 8.

All of these make it crystal clear that I was never trying to write data into a char that could not actually fit into a char.

When you were in school and the teacher said something that didn't make sense to you, did you get up from your seat and explain your version to him/her? No? Then may I ask WTF are you doing?

What I was doing was explaining my current understanding of the topic. By doing so, the person with better understanding could tell me where I was incorrect.

If you're ignorant, you don't question what you're told.

A good programmer needs to understand, not simply memorize. Only a fool just accepts everything he's told without trying to understand.

Ask the right questions and you'll get the right answers; ask the wrong questions and you'll get useless answers or confused looks. And we can't read your mind. Although sometimes it may seem otherwise, that's just undefined behavior and you shouldn't rely on it.

If when referring to char a [4] = "1234"; a person says "I'm only writing 4 characters," it's perfectly obvious that person isn't aware that the null terminator is added automatically. And when referring to

char a;
sscanf("123", "%d", &a);

someone asks, "What's wrong with that?" It does not take a mind reader, but is rather completely obvious that person isn't aware that %d will always write more data than can fit into a char, regardless of what data is given to it.

Now that I know what I didn't know about %d, looking back at your explanations, I see how they make more sense. But without that critical bit of knowledge, those explanations were quite confusing since I could not see how they applied to what I was actually trying to do. That's why I kept trying to explain in different ways what I was doing.

I wasn't trying to piss you off, Helios. I respect you for spending time to try and help so many people. But you seem a little overly sensitive. When I didn't understand your explanation and tried to explain myself more completely, you acted like I was trying to insult you or teach you or took you for an idiot. Why would you jump to such a conclusion?

In school, the students who just sit quietly and don't bother asking questions when they don't understand something are the ones who do poorly. It can be difficult for teachers to gauge what parts of a lecture the students don't understand. That's why good teachers appreciate the students who ask a lot of questions, because those are often the same questions that the other students have as well.

Topic archived. No new replies allowed.
