Parsing Include Files Recursively

I need a C function that parses files recursively, like a tree, by following every occurrence of the pattern import "anyfilename.anyextension" inside each file.

The results will be stored in a char* array to be printed.

The files imported inside a file should be listed in the order they appear.

EXAMPLE: the parent file is root.txt with the content:

import "a.txt"
import "b.txt"

and b.txt contains:

import "c.txt"


It should then print the list:

a.txt
c.txt
b.txt
root.txt

When I swap the order of the two imports in root.txt, it should print:

c.txt
b.txt
a.txt
root.txt

Basically it's just a recursive tree, but instead of printing the parent file name first, it should list the files imported inside it, in order, before the parent.

It should also ignore whitespace and accept several imports on a single line, so this is valid:

import "a.txt" import "b.txt"

As long as the pattern `import "filename.ext"` still holds, it will check the file named inside the double quotes.


So far I have the filename validation check here:

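#include <string.h>

/* Returns 1 if filename has a non-empty stem and a non-empty extension, else 0. */
int is_valid_filename(const char *filename) {
    size_t length = strlen(filename);
    if (length == 0) {
        return 0;
    }
    /* The last dot must be neither the first nor the last character. */
    const char *dot = strrchr(filename, '.');
    if (dot == NULL || dot == filename || dot == filename + length - 1) {
        return 0;
    }
    return 1;
}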
What should it do on repeats, for example if a includes b, b includes c, and c includes a?
Some of those cases will make it loop forever, and others may just pull the same file in many times.

generally speaking, though, if you want it backwards:
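#include <stdio.h>
#include <string.h>

void RecFile(const char *filename)
{
    char fileline[1000];
    FILE *f = fopen(filename, "r");   /* open file; carry on only if that worked */
    if (f == NULL)
        return;
    while (fgets(fileline, sizeof fileline, f))   /* all the lines in the file */
    {
        char *p = fileline;
        while ((p = strstr(p, "import")) != NULL)
        {
            /* look deeper: verify pattern, allow multiple patterns on the same line */
            char *end;
            p += strlen("import");
            p += strspn(p, " \t");
            if (*p == '"' && (end = strchr(p + 1, '"')) != NULL)   /* matched */
            {
                *end = '\0';
                RecFile(p + 1);   /* p + 1 is the extracted string */
                puts(p + 1);
                p = end + 1;
            }
        }
    }
    fclose(f);
}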


Is that the order you want, though?
If a includes b and c, do you want b, c, a?
And if b includes d: d, b, c, a?
There may be additional rules if you want some other order than this.
Also, the first time you run it, it's with a valid starter file, right? So here you would run it with RecFile("a.txt")? That is what I assumed.
Repeats should be reported as well for debugging and then discarded; only the first occurrence is retained.

The important thing is that once a file is included, later occurrences can be tracked but are otherwise ignored, like include_once in PHP or include guards (#ifndef) in C/C++.

In short, a file only needs to be imported once, so only its first inclusion goes into the array.
ok, so you need to store what you have seen and if you see it again, treat it differently...
@jonnin,

Yes. To simplify, a repeated file can be discarded directly, since this is include-once.

The debugging part I mentioned was just for convenience: logging repeated imports, whether they occur in the same file or in another file.

So if a file mistakenly has import "a.txt" multiple times somewhere, a.txt is only included and parsed once.
Do you need to spot cases such as

- relative directories leading to the same file?
import "a.txt"
import "../dir/a.txt"


- files either symlinked or hardlinked to the same file?
import "a.txt"
import "b.txt"

Where someone has done ln -s a.txt b.txt to create a symlink to a.txt called b.txt.

In both these cases, you end up with the same actual file imported twice.
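Both cases can be detected on a POSIX system by comparing device and inode numbers rather than names. The following is only a sketch under that assumption: same_file is a hypothetical helper name, and it relies on the POSIX stat() call, so it does not apply to Windows as written.

```c
#include <sys/stat.h>

/* Returns 1 if both paths resolve to the same underlying file,
   0 if they are different files, -1 if either path cannot be examined.
   stat() follows symlinks, so a symlink and its target compare equal,
   as do two hard links and two different spellings of one path. */
int same_file(const char *path_a, const char *path_b)
{
    struct stat a, b;

    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

An importer could call this against each file it has already processed before deciding to read a new one.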
Yes, I guess paths should be preserved, since the contents will be read and used afterwards.

A CRC32 check would be an option for detecting duplicate files: perhaps store each filename with its CRC32 for reference, and check it once the filename is known to be valid and to exist.

I wouldn't go as far as handling symlinks explicitly, but I guess CRC32 will take care of that as well.
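For reference, a bitwise CRC-32 in C might look like the sketch below. It implements the common CRC-32 variant (reflected polynomial 0xEDB88320, the one used by zlib and PNG); crc32_update and crc32_file are hypothetical names, not part of any library. Keep in mind that a content checksum also treats two distinct files with identical contents as duplicates, and that different contents can in principle collide.

```c
#include <stdint.h>
#include <stdio.h>

/* Feed len bytes into a running CRC-32. Start with crc = 0 and pass the
   previous return value back in to continue over several buffers. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* CRC-32 of a whole file. Returns 0 and stores the checksum in *out on
   success, -1 if the file cannot be opened or read. */
int crc32_file(const char *path, uint32_t *out)
{
    unsigned char buf[4096];
    uint32_t crc = 0;
    size_t n;
    int failed;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        crc = crc32_update(crc, buf, n);
    failed = ferror(f);
    fclose(f);
    if (failed)
        return -1;
    *out = crc;
    return 0;
}
```

The standard check value for this variant is 0xCBF43926 for the nine ASCII bytes "123456789".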
Is there a reason you're restricted to C rather than using C++?
Yes, I need it in C, but it would not hurt to see a C++ solution just for reference and comparison.

In a nutshell, it's basically a recursive include-once routine that searches each file for the pattern:

import "filepath"

and stores and lists the results in hierarchy order: the deepest child comes first, and the root parent is last on the list.

If a file declares multiple imports, they must be kept in the order they appear in that file, ignoring whitespace, and each filename must be valid, i.e. the file must exist.
I did forget that the parent file should also be listed, since it can contain code other than import statements,

so it will be listed:

a.txt
c.txt
b.txt
root.txt

If c.txt imports d.txt and e.txt, in that order, it should list:

a.txt
d.txt
e.txt
c.txt
b.txt
root.txt

That is because c.txt expands to d.txt and e.txt, which go above it, before c.txt itself is added to the list.

So: list what a file imports, in order, and then add the file itself.



An ongoing list of evolving requirements.
Before coding for this is started, I'd strongly suggest that the program is first designed, and that coding then proceeds from an acceptable design. IMO this is definitely not one of those programs where you can just start coding without some serious thought as to how the program is to work and what structures (for output etc.) and functions are required.

As a starter, IMO you probably need a function that takes a file name as a parameter, reads text from that file, and obtains the file names that follow import. This itself has some pitfalls: there may be one or more space/tab characters following import, a space is a valid filename character (at least on Windows), and a filename may itself be import.txt.
I do not think you need to validate legal file names. Attempts to open illegal names will fail the same way that attempts to open files that do not exist do.

Likewise, you do need to watch for edge cases, but the pattern is: import, {any amount of whitespace as you define it}, {" or ' or any other quote you want to support, such as the MS Word ones}, then text that you read until you reach end of line or another quote symbol. That will not trip up over import"import.txt". What will trip it is the reverse, import"import"anything.txt", unless you specifically back up to the point of failure and start anew. It has to spot {import"anything.txt"} after failing on import"import".
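The "back up and start anew" step can be sketched in C roughly as follows. This is an illustration, not a complete parser: next_import and looks_like_filename are hypothetical names, only the plain double quote is supported, and a candidate counts as failed when it has no usable extension (the same rule as is_valid_filename in the first post).

```c
#include <string.h>

/* Hypothetical validity rule: the name needs a dot that is neither
   its first nor its last character. */
static int looks_like_filename(const char *name)
{
    const char *dot = strrchr(name, '.');
    return dot != NULL && dot != name && dot[1] != '\0';
}

/* Find the next import "name" at or after *cursor.
   On success: copies the name into out, moves *cursor past the closing
   quote and returns 1. Returns 0 when no further import is found.
   out may be overwritten by a rejected candidate along the way. */
int next_import(const char **cursor, char *out, size_t outsz)
{
    const char *kw;

    while ((kw = strstr(*cursor, "import")) != NULL) {
        const char *p = kw + strlen("import");
        p += strspn(p, " \t");                    /* optional blanks */
        if (*p == '"') {
            const char *end = strchr(p + 1, '"');
            size_t len = end ? (size_t)(end - (p + 1)) : 0;
            if (end != NULL && len < outsz) {
                memcpy(out, p + 1, len);
                out[len] = '\0';
                if (looks_like_filename(out)) {
                    *cursor = end + 1;
                    return 1;
                }
            }
        }
        *cursor = kw + 1;   /* candidate failed: back up and rescan */
    }
    return 0;
}
```

Because a failed candidate only advances the cursor by one character past the start of the keyword, the second import hidden inside the first pair of quotes is still found, while a valid name such as import.txt is skipped over whole and never rescanned.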

Hopefully this makes sense. These edge cases are a pain, but to be honest, if someone is hell-bent on messing up a parser, they usually can, unless it's been overcoded to death. If there are security reasons to do that, fine, but otherwise I am always very, very tempted to fail out with a "knock it off" message.
@kbw

Not really, that was just for clarity. It's still a node tree in reverse: if the parent file is read first and it uses a function defined in an imported file, it will throw an error, because the imported file must be read first for the function to be known.

I have corrected my original post example to reflect this.

My example again:

root.txt contains
import "a.txt"
import "b.txt"

and b.txt contains:
import "c.txt"


Pass 1:
root.txt

Pass 2:
a.txt
b.txt
root.txt

Pass 3:
a.txt
c.txt
b.txt
root.txt



@seeplus

It's a simple requirement: list the child imports above the parent, recursively, in order from top to bottom, so that when the file nodes are actually parsed and read, the functions or code are read in the proper order. Include once, no duplicates.

The parent file should be included, because it contains some code itself and not just imports.


@jonnin

Yes, a file-exists function will be fine as the check for a valid filename.

As for edge cases, maybe regex could handle them...
Regex is a good idea, especially if you are already decent at it. You can also rig the old compiler-compiler tools to do stuff like this, if you know those. Lots of ways to do it, really, once you figure out exactly what "IT" is that you want to do :)
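Standard C has no regex library, so a regex approach in C needs something extra. One option, assuming a POSIX system, is <regex.h>. The sketch below is meant to be called on one line of text at a time; find_imports and IMPORT_NAME_LEN are hypothetical names.

```c
#include <regex.h>
#include <string.h>

#define IMPORT_NAME_LEN 256

/* Copies up to max import names found in text into names[].
   Returns the number stored, or -1 if the pattern fails to compile.
   Names of IMPORT_NAME_LEN characters or more are skipped. */
int find_imports(const char *text, char names[][IMPORT_NAME_LEN], int max)
{
    regex_t re;
    regmatch_t m[2];   /* m[0]: whole match, m[1]: text between the quotes */
    int count = 0;

    if (regcomp(&re, "import[[:space:]]*\"([^\"]+)\"", REG_EXTENDED) != 0)
        return -1;
    while (count < max && regexec(&re, text, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < IMPORT_NAME_LEN) {
            memcpy(names[count], text + m[1].rm_so, len);
            names[count][len] = '\0';
            ++count;
        }
        text += m[0].rm_eo;   /* continue after this match */
    }
    regfree(&re);
    return count;
}
```

The pattern handles whitespace and several imports per line, but it resumes after each match, so on its own it reports import as the name in import"import"anything.txt" and never sees anything.txt.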
As a C starter, consider the following (with minimal checking; compiles OK with MSVS as C17):

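#define _CRT_SECURE_NO_WARNINGS	/* MSVC only: allow fopen() without a warning */

#include <stdio.h>
#include <string.h>

#define MAXLINE 1000

const char word[] = "import";
const size_t wlen = sizeof word - 1;
/* Print fnam after every file it imports. */
void read(const char* fnam) {	/* note: shares its name with POSIX read() */
	FILE* f = fopen(fnam, "r");

	if (f == NULL) {
		printf("Cannot open file %s\n", fnam);
		return;
	}

	for (char line[MAXLINE]; fgets(line, sizeof line, f); )
		for (char* pos = line; pos && (pos = strstr(pos, word));) {	/* every "import" on the line */
			for (pos += wlen; *pos == ' ' || *pos == '\t'; ++pos);	/* skip blanks */

			if (*pos == '\"') {
				char* incnam = ++pos;	/* name starts after the opening quote */

				if ((pos = strchr(pos, '\"'))) {
					*pos++ = 0;	/* terminate the name at the closing quote */
					read(incnam);
				}
			}
		}

	puts(fnam);	/* listed only after all of its imports */
	fclose(f);
}

int main(void) {
	read("root.txt");
}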


Thanks seeplus!

It does not do include-once yet, and it should silently ignore invalid files, but I can add or modify those parts myself. The main objective here is resolved!

Awesome! 👏
Well, ignoring invalid files is easy: just comment out line 15 of the listing, the printf() that reports "Cannot open file".

Importing each name only once is a bit harder in C, as you need a list of already processed files. In C++ you'd probably use a std::set, but this isn't available in C and, in the best traditions, this is left as an exercise for the reader...
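One possible shape for that exercise, as a sketch rather than a finished solution: keep a small table of names already visited and search it linearly. The names remember, list_imports, order and norder are hypothetical, the table size is fixed, names are compared as plain strings (so a.txt and ./a.txt count as different files), and the copies are never freed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FILES 256
#define MAX_LINE  1000

char *order[MAX_FILES];          /* result: imports first, importer last */
int   norder;

static char *seen[MAX_FILES];    /* every name visited so far */
static int   nseen;

/* Returns a stored copy the first time a name is seen, NULL for a
   repeat (or when the table is full or memory runs out). */
static char *remember(const char *name)
{
    for (int i = 0; i < nseen; ++i)
        if (strcmp(seen[i], name) == 0)
            return NULL;
    if (nseen == MAX_FILES)
        return NULL;
    char *copy = malloc(strlen(name) + 1);
    if (copy != NULL) {
        strcpy(copy, name);
        seen[nseen++] = copy;
    }
    return copy;
}

/* Append fnam to order[] after everything it imports, each file once. */
void list_imports(const char *fnam)
{
    char line[MAX_LINE];
    char *self = remember(fnam);
    FILE *f;

    if (self == NULL)                      /* repeat: include once */
        return;
    if ((f = fopen(self, "r")) == NULL)    /* missing file: ignore it */
        return;
    while (fgets(line, sizeof line, f)) {
        char *pos = line;
        while ((pos = strstr(pos, "import")) != NULL) {
            pos += strlen("import");
            pos += strspn(pos, " \t");
            if (*pos != '"')
                continue;                  /* not an import statement */
            char *name = pos + 1;
            char *end = strchr(name, '"');
            if (end == NULL)
                break;                     /* no closing quote on this line */
            *end = '\0';
            list_imports(name);
            pos = end + 1;
        }
    }
    fclose(f);
    order[norder++] = self;
}
```

Because a name is recorded before its imports are followed, a cycle such as a importing b importing a stops at the second visit instead of recursing forever. After list_imports("root.txt") returns, order[0] to order[norder - 1] hold the names ready to print.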
A possible C++ version that uses a CRC checksum to skip already processed files could be:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <set>

#include "crc.hpp"	// not shown in this thread; provides calcCrc32()

constexpr char word[] { "import" };
constexpr auto wlen { sizeof word - 1 };

void read(const std::string& fnam) {
	static std::set<std::uint32_t> crcs;	// checksums of files already processed

	if (std::ifstream f { fnam })
		// Assumes calcCrc32() leaves f rewound and readable; otherwise getline() below reads nothing.
		if (crcs.insert(calcCrc32(f)).second) {
			for (std::string line; std::getline(f, line); )
				for (size_t pos {}; pos != std::string::npos && (pos = line.find(word, pos)) != std::string::npos; ) {
					for (pos += wlen; line[pos] == ' ' || line[pos] == '\t'; ++pos);	// skip blanks

					if (line[pos] == '\"')
						if (size_t incnam = ++pos; (pos = line.find('\"', pos)) != std::string::npos)
							read(line.substr(incnam, pos - incnam));
				}

			std::cout << fnam << '\n';	// listed only after all of its imports
		}
}

int main() {
	read("root.txt");
}

Thanks seeplus!