Extracting string literals from C/C++ source is often a "test yourself" or homework kind of activity.
Is this homework? (Even if it isn't, you will be much more satisfied if you solve it yourself.)
I will be glad to help as appropriate.
To get started, I did mine by reading line-by-line, but these kinds of algorithms are typically done as you started out (character by character). Either way works.
As you go, you need to count instances of the newline to know on what line the literal is encountered. You could also count characters if you like.
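As a rough illustration of the bookkeeping, here is a minimal sketch of a line and column tracker. The names `Position` and `positionAt` are made up for this example, and it assumes lines end with a plain '\n'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: remembers the 1-based line and column while scanning.
struct Position {
    std::size_t line = 1;
    std::size_t column = 1;

    // Call once for every character consumed from the input.
    void advance(char c) {
        if (c == '\n') { ++line; column = 1; }  // a newline starts the next line
        else           { ++column; }
    }
};

// Returns the position of the character at index `offset` in `text`.
Position positionAt(const std::string& text, std::size_t offset) {
    Position pos;
    for (std::size_t i = 0; i < offset && i < text.size(); ++i)
        pos.advance(text[i]);
    return pos;
}
```

In a real scanner you would call `advance` as you read each character rather than rescanning from the start; `positionAt` is only there to make the idea easy to try out.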
You must find and match all instances of the double-quote character (") that are not:
■ in a comment
■ explicitly quoted
■ implicitly quoted
This means that your code must keep track of these conditions.
Comments
All comments in C/C++ are initiated with the 'forward-slash' character (/). Hence, whenever you encounter this character, you need to read the next character to decide whether or not you are entering a comment. If the next character is another forward-slash, then just read characters until you hit the end of the line and then continue normally. If the next character is an asterisk, then you must read characters until you find the end of the comment (the character sequence */).
String literals cannot, by definition, appear inside commentary.
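The comment rules can be sketched roughly like this. `skipComment` is a hypothetical helper name, the whole source is assumed to be already loaded into a `std::string`, and the sketch deliberately ignores the corner case of a // comment that ends with a back-slash.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if a comment starts at text[i], return the index of the
// first character after it; otherwise return i unchanged.
std::size_t skipComment(const std::string& text, std::size_t i) {
    if (i + 1 >= text.size() || text[i] != '/') return i;

    if (text[i + 1] == '/') {
        // Line comment: stop at the newline so the caller can still count it.
        std::size_t end = text.find('\n', i + 2);
        return end == std::string::npos ? text.size() : end;
    }
    if (text[i + 1] == '*') {
        // Block comment: stop just past the closing "*/".
        std::size_t end = text.find("*/", i + 2);
        return end == std::string::npos ? text.size() : end + 2;
    }
    return i;  // a lone '/' is not a comment (division, for instance)
}
```

Note that a block comment may contain newlines, so the caller still has to count the ones in the skipped range to keep the line number right.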
Explicit quotes
The next thing you need to worry about is explicitly-quoted things. An explicitly-quoted item is preceded by the escape character, or 'back-slash' (\).
An escaped double-quote (\") is never counted as a string delimiter.
An escape may escape itself. (Hence, \\" does contain a string delimiter.)
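One way to get both rules right at once is to skip whatever character follows a back-slash without looking at it. A minimal sketch, with `findClosingQuote` being a name invented for this example:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: text[open] must be the opening double-quote.
// Returns the index of the matching closing quote, or std::string::npos
// if the literal is never terminated.
std::size_t findClosingQuote(const std::string& text, std::size_t open) {
    for (std::size_t i = open + 1; i < text.size(); ++i) {
        if (text[i] == '\\') ++i;            // skip the escaped character, whatever it is
        else if (text[i] == '"') return i;   // unescaped quote: end of the literal
    }
    return std::string::npos;
}
```

Because the character after a back-slash is consumed unseen, \" never ends the string, while in \\" the second back-slash is the escaped character and the quote that follows is a real delimiter.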
An escape may escape the end of a line. That is, a string literal may span multiple lines! For example:
const char* instructions = "usage:\
foo FILENAME\
\
Report the amount of fooey in FILENAME.";
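Note that the compiler deletes each back-slash/newline pair before it reads the literal, so the spliced string contains no newline characters (and any leading whitespace on the continuation lines becomes part of it). To get actual line breaks in the text you would have to write them explicitly:
const char* instructions = "usage:\n foo FILENAME\n \n Report the amount of fooey in FILENAME.";
For your program, what matters is that the literal keeps going on the next physical line, and that your line count keeps going with it.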
Keep in mind that a trailing back-slash may occur outside of a string literal as well (in a multi-line macro definition, for example). It should simply be skipped, the same as any other escaped character that is not one of the items above.
Implicit quotes
The last thing you need to worry about is the valid sequence '"', which C programmers tend to write as '\"', but is perfectly valid either way. This may occur as part of a multi-character constant, meaning you could get a character literal in source code that looks like '12"4' or some other such nonsense. So your code will have to recognize unescaped single quotes and scan character literals the same way it scans strings (except that nothing is printed for them).
My code does not handle multi-char constants very well. It could, but I didn't think of the possibility when writing the code. I suppose I'll have to amend it...
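For illustration only (this is a separate sketch, not the program I mentioned above), skipping a character literal can look like the following. `skipCharLiteral` is a made-up name, and the sketch assumes every single quote it is handed really does start a character literal.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: text[open] must be the single quote that starts a
// character literal. Returns the index just past the closing quote, or
// text.size() if the literal is never terminated.
std::size_t skipCharLiteral(const std::string& text, std::size_t open) {
    for (std::size_t i = open + 1; i < text.size(); ++i) {
        if (text[i] == '\\') ++i;                 // skip the escaped character
        else if (text[i] == '\'') return i + 1;   // unescaped quote closes it
    }
    return text.size();
}
```

Since it just reads up to the next unescaped single quote, multi-character constants such as '12"4' need no special treatment. One thing it does not account for is the C++14 digit separator (as in 1'000), where a single quote does not start a character literal at all.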
Your program should, then, start out in the normal state. If it finds a comment, read to the end of the comment. If it finds a character literal, read to the end of it without printing anything. If it finds a string, output the current line number (and any other information you want, such as filename and/or column number), then display characters until the string is properly terminated. Et cetera until EOF.
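Put together, that organization is a small state machine. The following is only a sketch under some assumptions: the whole file is already in a `std::string`, lines end with a plain '\n', and the names `Literal` and `extractLiterals` are invented for the example. It collects the literals instead of printing them, and it does not know about raw string literals (R"(...)") or digit separators.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: where a literal starts and its source text.
struct Literal {
    std::size_t line;  // 1-based line of the opening quote
    std::string text;  // characters between the quotes, escapes left as written
};

std::vector<Literal> extractLiterals(const std::string& src) {
    enum class State { Normal, LineComment, BlockComment, String, Char };
    State state = State::Normal;
    std::vector<Literal> found;
    std::size_t line = 1;

    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const bool hasNext = i + 1 < src.size();
        const char next = hasNext ? src[i + 1] : '\0';

        switch (state) {
        case State::Normal:
            if (c == '/' && next == '/')      { state = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { state = State::BlockComment; ++i; }
            else if (c == '"')  { state = State::String; found.push_back({line, ""}); }
            else if (c == '\'') { state = State::Char; }
            break;
        case State::LineComment:
            if (c == '\\' && next == '\n') { ++i; ++line; }  // comment continues
            else if (c == '\n') state = State::Normal;
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { state = State::Normal; ++i; }
            break;
        case State::String:
            if (c == '\\' && hasNext) {
                if (next == '\n') ++line;      // literal continues on the next line
                found.back().text += c;        // keep the escape as written
                found.back().text += next;
                ++i;
            } else if (c == '"') {
                state = State::Normal;         // unescaped quote ends the literal
            } else {
                found.back().text += c;
            }
            break;
        case State::Char:
            if (c == '\\' && hasNext) { if (next == '\n') ++line; ++i; }
            else if (c == '\'') state = State::Normal;
            break;
        }
        if (c == '\n') ++line;  // newlines skipped as escapes were counted above
    }
    return found;
}
```

A caller would loop over the returned vector and print each `line` and `text`. Adding a column number or filename means adding fields to `Literal` and filling them in where the opening quote is found.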
Hope this helps.