Stripping whitespace

I have a parsing problem to do with stripping whitespace. I'm saving data for an application in JSON format. I'm writing a "schema" parser for JSON, which is why I'm not just using an existing one, so the parser isn't too extensive, nothing big.

The problem I've run into is dealing with whitespace. I was just calling "consume whitespace" wherever whitespace is valid, but that seemed like a horribly inefficient way to do it.

So I changed it to cull all of the whitespace in one go at the start. That all went well until I tried a large, complex schema, and my parser stumbled over something and gave me an error it shouldn't have, halfway through the file. Having a line number was rather essential to figuring out where the problem was, but wait! I had already stripped out the whitespace, so I had no way to tell where in the original file the character came from.

Now, it's very possible for me to use hacks such as deleting all the whitespace in the schema, but I'd rather have a nice parser than a hack with no proper feedback when you give it bad input, even if I can solve the current error.

Is there some straightforward, ingenious way that I'm missing to track the current character/line number while still stripping whitespace at the start of the parse? Or will it probably be as tricky as I think, and no more efficient than stripping whitespace during the parse?

// my character stream class, a basic stream with 1-character lookahead
class ParserInput {
public:
    void setStream(...); // initialize from whatever input source
    void consume() { // go to next character
        // how I would implement line/char tracking without whitespace stripping
        if (cur() == '\n') {
            ++currentLine;
            currentChar = 0;
        } else {
            ++currentChar;
        }
        ++charPtr;
    }
    char cur();  // current char
    char peek(); // lookahead
    // plus lots of convenience functions that call the ones above
};
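
For what it's worth, here is a rough sketch of one way the up-front strip could keep position information: every character that survives the strip gets a matching entry recording the line and column it came from. All names here (SourcePos, StrippedInput, stripWhitespace) are made up for illustration and are not part of the ParserInput class above. It also assumes that whitespace inside double-quoted strings has to be kept, and that '\n' ends a line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper types, invented for this sketch.
struct SourcePos {
    int line;   // 1-based line in the original input
    int column; // 1-based column in the original input
};

struct StrippedInput {
    std::string text;              // input with whitespace outside strings removed
    std::vector<SourcePos> origin; // origin[i] is where text[i] came from

    // Look up the original position of the i-th surviving character.
    const SourcePos& at(std::size_t i) const {
        assert(i < origin.size());
        return origin[i];
    }
};

// Removes whitespace outside double-quoted strings in a single pass and
// records the original line/column of every character that is kept.
StrippedInput stripWhitespace(const std::string& src) {
    StrippedInput out;
    int line = 1;
    int column = 1;
    bool inString = false; // currently inside a "..." literal
    bool escaped = false;  // previous character was a backslash inside a string

    for (char c : src) {
        const bool isSpace = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (inString || !isSpace) {
            out.text.push_back(c);
            out.origin.push_back({line, column});
        }

        // Track string state so that an escaped quote does not end the string.
        if (inString) {
            if (escaped) {
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                inString = false;
            }
        } else if (c == '"') {
            inString = true;
        }

        // Advance the position in the original input.
        if (c == '\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
    }
    return out;
}
```

The parser would then walk `text` by index and, on an error at index i, report `at(i).line` and `at(i).column`. The cost is one extra SourcePos per kept character, so this is probably no cheaper than skipping whitespace during the parse.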

helios (17607)

I don't know how you're doing it, but most parsers read from a stream one token at a time. The lexer (or "tokenizer") provides these tokens and at the same time automatically discards insignificant input, like comments and whitespace. Ignoring whitespace shouldn't be more than a loop inside the lexer.

See Parser::read() for a sample lexer:
http://www.cplusplus.com/forum/lounge/11845/#msg56331
Here's a more complex one (lines 175-212):
http://onslaught-vn.svn.sourceforge.net/viewvc/onslaught-vn/trunk/src/INIParser.ypp?revision=143&view=markup
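
As a rough illustration of the idea (not the code behind either link), a minimal lexer for a JSON-like grammar might look like the sketch below. The names Lexer, Token and TokenKind are invented for the example, and the number and word rules are deliberately loose: they group characters and leave validation to the parser. Whitespace is skipped in one loop at the top of next(), and because every character goes through advance(), each token knows the line and column where it started.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical token kinds for a small JSON-like grammar.
enum class TokenKind { Punct, String, Number, Word, End };

struct Token {
    TokenKind kind;
    std::string text; // raw token text (string contents keep their escapes)
    int line;         // 1-based line where the token starts
    int column;       // 1-based column where the token starts
};

class Lexer {
public:
    explicit Lexer(std::string source) : src_(std::move(source)) {}

    // Returns the next token, skipping any whitespace in front of it.
    Token next() {
        skipWhitespace();
        Token tok{TokenKind::End, "", line_, column_};
        if (atEnd()) return tok;

        const char c = src_[pos_];
        if (c == '"') {
            tok.kind = TokenKind::String;
            advance(); // opening quote
            while (!atEnd() && src_[pos_] != '"') {
                if (src_[pos_] == '\\' && pos_ + 1 < src_.size()) {
                    tok.text += src_[pos_]; // keep the backslash as written
                    advance();
                }
                tok.text += src_[pos_];
                advance();
            }
            if (!atEnd()) advance(); // closing quote
        } else if (c == '-' || std::isdigit(static_cast<unsigned char>(c))) {
            tok.kind = TokenKind::Number;
            while (!atEnd() && isNumberChar(src_[pos_])) {
                tok.text += src_[pos_];
                advance();
            }
        } else if (std::isalpha(static_cast<unsigned char>(c))) {
            tok.kind = TokenKind::Word; // e.g. true, false, null
            while (!atEnd() && std::isalpha(static_cast<unsigned char>(src_[pos_]))) {
                tok.text += src_[pos_];
                advance();
            }
        } else {
            tok.kind = TokenKind::Punct; // single character such as { } [ ] : ,
            tok.text = std::string(1, c);
            advance();
        }
        return tok;
    }

private:
    bool atEnd() const { return pos_ >= src_.size(); }

    static bool isNumberChar(char c) {
        return std::isdigit(static_cast<unsigned char>(c)) ||
               c == '+' || c == '-' || c == '.' || c == 'e' || c == 'E';
    }

    // Consumes one character and keeps the line/column counters in step.
    void advance() {
        if (src_[pos_] == '\n') {
            ++line_;
            column_ = 1;
        } else {
            ++column_;
        }
        ++pos_;
    }

    // The whole whitespace handling: a single loop inside the lexer.
    void skipWhitespace() {
        while (!atEnd() && std::isspace(static_cast<unsigned char>(src_[pos_]))) {
            advance();
        }
    }

    std::string src_;
    std::size_t pos_ = 0;
    int line_ = 1;
    int column_ = 1;
};
```

The parser then calls next() in a loop and can include tok.line and tok.column in any error message, so no separate whitespace-stripping pass is needed.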

stravant (53)

Right now I'm not using tokens, just reading directly from the input stream and translating that into data. Making a tokenizer seemed like overkill because what I'm parsing is a relatively simple context-free grammar.

I suppose I might as well make one, since I can probably reuse it later for other stuff. Now that I think of it, that would probably be the best way to do it even if it's a bit excessive. Thanks.
