aka. How to create a parser in C++ without using Flex or Bison PART 1.
Revision: 2nd
Why the Fortran would anyone want to do this? (Pardon my language). It's because C++ parsers that you make yourself:
-Produce more human friendly code
For starters, the files that Flex and Bison output can be very hard to understand, even if they do have some decent comments. This can make them hard to integrate into other programs. However, since you wrote most of this and hopefully read this article, you have an idea of what's going on.
They also have a tendency of being large.
-Can have more tokens of lookahead when generating an LALR parser
Bison supports only one token of lookahead when not generating a GLR parser. That means that if you have a grammar like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
|
%{
#include <stdio.h>
%}
%token AND PLOW HORSE MULE OX CART
%%
PHRASE : work_animal AND PLOW
| cart_animal AND CART
;
cart_animal : HORSE
| MULE
;
work_animal : HORSE
| OX
;
%%
| |
...Bison will not be able to create a parser out of it.
girlfriend:~ albatross$ bison /Users/albatross/Desktop/impossible.y
/Users/albatross/Desktop/impossible.y: conflicts: 1 reduce/reduce
/Users/albatross/Desktop/impossible.y:16.15-19: warning: rule never reduced because of conflicts: work_animal: HORSE
girlfriend:~ albatross$
|
If you eliminate all instances of AND, Bison will happily chew it up with no problems. That's one token of lookahead. Sometimes, you will need more than that.
-Are far more customizable
Let your imagination run wild on this one.
This article will teach you how to create a simple arithmetic parser, which you may then use as a framework for other parsers. So sit back, get a cup of something to drink and keep you awake through my inconsequential ramblings of redundancy, and enjoy those ramblings... OR ELSE!
Prerequisites:
Must be VERY familiar with the C++ standard library (especially deques and strings), and understand C++/ENGLISH (in other words, be a semi-experienced C++ programmer).
Oh, and you must know what Flex and Bison are, what they generate, and how to use them.
WANRING: "Evil" features of C++ may be used in this article. If you do not feel comfortable with using said techniques, you may now kiss the bride and then forever hold your peace. (out)
-General idea
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
|
using namespace std;
//Given this struct:
struct uraldamage
{
deque<float> deque1;
deque<int> deque2;
deque<bool> isoper;
int charcount;
int prioritycount;
int maxpriority;
};
//Your program should be structured like this:
int preprocess(string & stringtheory);
int lexer(string & kitten, uraldamage & groindamage);
int parser(uraldamage & donotusethisasavariablename);
int main()
{
cout<<"Gimme an arithmetic problem!\n";
uraldamage data;
string storeage;
getline(cin, storeage, '=');
preprocess(storeage);
lexer(storeage, data);
parser(data);
cout << data.deque1.front();
return 1337;
}
| |
Reading from files will be discussed in the next part.
NOTE: We're using deques instead of maps for a reason. Can you guess what it is?
-Your Preprocessor
This is just a simple lexer and occasionally parser that will make your code a bit easier to understand for your main lexer. It's always a good idea to have one; it makes your code more readable.
In this case, we want to get rid of all the white spaces and commas; our grammar doesn't need 'em. Just use cctype's
iswhitespace(a_char);
along with a counter
a_char==',';
to determine which characters to erase(counter,1). Return zero.
-Your Lexer
A parser without a lexer is like a banana split with a pickle instead of a banana (eww).
Here, we are converting our string into tokens: little bits of data that have a value associated with them that our parser can manage better.
Note: when I say read into a deque, I mean use push_back().
G#: when I say delete from a deque, I mean use erase().
First counter/integer refers to groindamage.charcount.
Second counter/integer refers to groindamage.prioritycount.
Third counter/integer refers to groindamage.maxpriority.
First, check if the first character in the string is a minus sign. If it is, then add a zero just before it, then read 0 and the ASCII value of '-' into deque1, 4 and 4 into deque2, and false and true into isoper. Then delete the minus sign in the string, and continue. You will also need to ensure that the next token gets a value of 3 instead of 0 (see later).
If you didn't find the minus, skip that bit.
Use find_first_of("+-*/%^()") to determine the location of the first operator. Store it in the first integer. Then, use atof() to read everything from the first character to but excluding that first operator into deque1. Read 0 into deque2, and false into isoper.
Then read the next character's ASCII value into deque 1, 0 into deque2, and true into isoper.
Then erase everything in your string up to and including your operator. Repeat until your string is empty, then return zero.
-Your Parser
Now for the fun bits. Anyone who has done any arithmetic will know that multiplication holds priority over addition and subtraction. They will also know that raising something to a power holds priority over multiplication, and that statements in parenthesis are always evaluated first. Those of us who are also looking carefully will notice that we also need to ensure that multiplication by minus one must occur before any of these operations, and that in our current system it will not happen first but rather LAST. Now we know why we had deque2: to determine the order of operations.
You will need to run over the contents of deque1, and clear out any parenthesis. Read token by token, setting the value of the corresponding element of deque2 to the value of prioritycounter + its original value (which should have been intialized to zero... is it too late to tell you that?).
If at any point, prioritycounter > maxpriority, then maxpriority = prioritycounter.
If you see an addition or subtraction sign, then set the corresponding deque2 value to (original value + 0 + prioritycounter).
If you see a multiplication or division or modulo sign, then set the corresponding deque2 value to (original value + 1 + prioritycounter).
If you see a power sign, then set the corresponding deque2 value to (original value + 2 + prioritycounter).
If you see an opening parenthesis, increment prioritycounter by three, and delete the token.
If you see a closing parenthesis, decrement prioritycounter by three, and delete the token.
Once that's done, the parser is going to run through the program (maxpriority + 1) times, starting from the tokens that have the highest deque2 value. It will evaluate an expression in deque1 that has values in isoper of
false true false
, and evaluate it to get a float back out. It will then eliminate those three elements using erase, and write in their place to deque1 the resulting float, in deque 2 the original priority minus one, and to isoper false.
When the end of the deque is reached, scan and evaluate the next lower priority level.
When you get to priority 0, then you can just proceed linearly, evaluating three tokens, pop_front()ing them, and push_fronting the resulting value and restarting the operation. When your deques all reduce to a size of one, then return zero.
-Afterward
The first time you compile this, you might get loads of fatal errors. Take your time and work through them; this was no easy task. The good news is that the worst is over now that you created your first parser: You have a framework for creating new parsers from that.
If this article is popular enough, I will create a second article on parsing a more human language, and while I'm at it solve the tokens of lookahead problem.
This article is subject to change without notice.
-Albatross