Regular expression algorithms are notoriously difficult to get right, and I cannot recommend anything but a basic toy for a beginner. (Sorry.)
Remember that the symbols used in a regex directly relate to the DFA graph. For example, a "+" means that you have a loop on a single node (an edge that leaves a node and returns directly to the same node) or node grouping (an edge leaves the last node in a group and returns directly to the first node in the group). Hence:
A+ : One or more 'A's: "AAAAA"
(ABC)+ : One or more 'ABC's "ABCABCABC"
Your algorithm must be able to construct a graph that represents a valid DFA, which means that you may have any number of edges leaving from a single node. In other words, you need to think harder about your design.
Things get trickier when you consider negative look-ahead and look-back, if you want to implement things like that.
In order to avoid memory leaks and pointer hell, use C++'s ability to construct classes as you need them. I recommend using a structure like:
1 2 3 4 5 6
|
struct State
{
std::string match_ranges; // "AZaz" for example, or just "A" for a single character
std::vector<std::size_t> transitions; // lookup index into States
...
};
| |
Keep your states in a vector and reference them by index:
|
std::vector<State> States;
| |
This can be easily wrapped by a RegEx class of some sort.
Don't loop on EOF. Loop on whether or not your read attempt succeeded:
8 9
|
while(input >> str){
words.push_back(str)
| |
If performing regular expression matching on input text, though, you should get all input as a single string:
1 2 3 4 5 6 7
|
std::string s;
{
std::ifstream f("words.txt");
std::stringstream ss;
ss << f.rdbuf();
s = ss.str();
}
| |
This gets you
everything, spaces, tabs, newlines, etc, all in a single string you can play with.
Good luck!