read in big CSV file performance issue

Hello all,
I need to read many big CSV files (ranging from a few MB to hundreds of MB) and process them in C++.
At first, I open the file with fstream, use getline to read each line, and use the following function to
split each row:
// Split str into tokens on any of the delimiter characters and append them
// to the given container (e.g. a vector<string>). If trimEmpty is true,
// empty tokens are skipped.
template < class ContainerT >
void split(ContainerT& tokens, const std::string& str, const std::string& delimiters = " ", bool trimEmpty = false)
{
	std::string::size_type pos, lastPos = 0, length = str.length();

	using value_type = typename ContainerT::value_type;
	using size_type = typename ContainerT::size_type;

	while (lastPos < length + 1)
	{
		// find the next delimiter; if none, the token runs to the end of str
		pos = str.find_first_of(delimiters, lastPos);
		if (pos == std::string::npos)
		{
			pos = length;
		}

		if (pos != lastPos || !trimEmpty)
			tokens.push_back(value_type(str.data() + lastPos,
			(size_type)pos - lastPos));

		lastPos = pos + 1;
	}
}

I tried boost::split, boost::tokenizer and boost::spirit and found that the above gives the
best performance so far.
After that, I considered reading the whole file into memory to process, rather than keeping the file open.
I use the following function to read in the whole file:
// Read the whole file into a temporary buffer, then copy it into the stringstream.
void ReadinFile(string const& filename, stringstream& result)
{
	// open at the end so tellg() gives the file size
	ifstream ifs(filename, ios::binary | ios::ate);
	ifstream::pos_type pos = ifs.tellg();

	char * buf = new char[pos];
	ifs.seekg(0, ios::beg);
	ifs.read(buf, pos);
	result.write(buf, pos);
	delete[] buf;
}


Both functions were copied from somewhere on the net. However, I find that there is not much
difference in performance between keeping the file open and reading in the whole file.

Below please find sample content of one type of file; I have 6 types to handle, but all are similar.
a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0
a1,10,2,5,5,1,1,2,0,0,12,0,50,18,106,33,100,29,45,9,8,0,1,1,0,0,0
a1,19,3,5,5,1,1,3,0,0,18,0,12,12,52,40,82,49,63,41,23,16,8,2,0,0,0
a1,28,4,5.5,5,1,1,4,0,0,24,0,2,3,17,16,53,53,63,62,43,44,18,22,4,0,4
a1,37,5,3,5,1,1,5,0,0,6,0,157,22,129,18,57,11,6,0,0,0,0,0,0,0,0
a1,46,6,4.5,5,1,1,6,0,0,12,0,41,19,121,31,90,34,37,15,6,4,0,2,0,0,0
a1,55,7,5.5,5,1,1,7,0,0,18,0,10,9,52,36,86,43,67,38,31,15,5,7,1,0,1
a1,64,8,5.5,5,1,1,8,0,0,24,0,0,3,18,23,44,55,72,57,55,43,8,19,1,2,3
a1,73,9,3.5,5,1,1,9,1,0,6,0,149,17,145,21,51,8,8,1,0,0,0,0,0,0,0
a1,82,10,4.5,5,1,1,10,1,0,12,0,47,17,115,35,96,36,32,10,8,3,1,0,0,0,0


The performance captured is as follows:

    Process 2100 files with boost::split (without reading in the whole file): 832 sec
    Process 2100 files with custom split (without reading in the whole file): 311 sec
    Process 2100 files with custom split (reading in the whole file): 342 sec


My questions are:
1 Why does reading in the whole file perform worse than not reading in the whole file?
2 Is there any better string split function?
3 The ReadinFile function needs to read into a buffer and then write to a stringstream to process;
is there any way to avoid this, i.e. read directly into the stringstream? (See the sketch after this list.)
4 I need to use getline to parse each line (ending with '\n') and use split to tokenize each row;
is there any function similar to getline that works on a string, e.g. a getline_str, so that
I can read from the string directly?
5 How about reading the whole file into a string, splitting the whole string into a vector<string> on '\n', and then splitting each string in the vector<string> on ',' to process? Will this perform better? And what is the limit (max size) of a string?
6 Or should I define a struct like this (based on the format)
struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

And read directly into a vector<MyStruct>? How do I do this?
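
For question 3, is something like this untested sketch the right way to do it? (ReadinFile2 is just a made-up name; the idea is to stream the file buffer straight into the stringstream, with no intermediate buffer.)

#include <fstream>
#include <sstream>
#include <string>

// Untested sketch: read the whole file straight into a stringstream,
// avoiding the temporary char buffer used in ReadinFile above.
void ReadinFile2(std::string const& filename, std::stringstream& result)
{
    std::ifstream ifs(filename, std::ios::binary);
    result << ifs.rdbuf();   // copy the entire stream buffer into result
}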

Thanks a lot.

Regds

LAM Chi-fung
Is the tokens vector preallocated to have enough memory so that it does not reallocate?
If not, how does it perform once you do that?
It would be helpful if you provided a small sample of your input file and an example of how you're calling your split() function.

@jlb, @jonnin
Each file contains lines like this:

a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0

and ranges from 10,000 to 300,000 lines.

And I use split as follows:
		stringstream in;
		ReadinFile(FilterBase.BaseFileName, in);

		while (getline(in, line, '\n'))
		{
			split(wStr, line, ",");
			FilterBase.FilterEle.push_back({ stoi(wStr[LineNoPos]), int(stof(wStr[3])*2.0) });
			wStr.clear();
		}


and the performance captured:

    Process 2100 files with boost::split (without reading in the whole file): 832 sec
    Process 2100 files with custom split (without reading in the whole file): 311 sec
    Process 2100 files with custom split (reading in the whole file): 342 sec

For simple file processing, your goal is to process the file as fast as you copy it. Put another way, you want the program to be I/O bound.

split() should clear tokens at the beginning of the function. Otherwise you are appending the tokens in the line to the collection and you may end up creating a collection of every token in the file rather than every token in the line.
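
A minimal sketch of that suggestion, i.e. the same split() from post #1 with the container cleared at the top:

#include <string>

template <class ContainerT>
void split(ContainerT& tokens, const std::string& str,
           const std::string& delimiters = " ", bool trimEmpty = false)
{
    tokens.clear();   // <-- drop tokens left over from the previous line

    std::string::size_type pos, lastPos = 0, length = str.length();
    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;

    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos)
            pos = length;

        if (pos != lastPos || !trimEmpty)
            tokens.push_back(value_type(str.data() + lastPos, (size_type)(pos - lastPos)));

        lastPos = pos + 1;
    }
}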

Definitely process the file one line at a time rather than reading the whole file into memory. If you read it into memory then there is inevitably a file size that will cause your program to run out of memory.

What type is wStr?

Did you try profiling the code to see where the bottlenecks are?
That did not really answer my question. Did you preallocate *ALL* the vectors involved that you are using push_back on? If not, it seems very likely that at least one of your bottlenecks will be the vectors.

300 seconds?! I have programs that read and process ~1 GB XML files in 30 seconds using (cough) just dumb read-the-whole-file-into-a-char* C code (I know the files are about a GB and I know I have more than 10 times that in my RAM, so it works for me). But that isn't the issue; reading in chunks is equally fast. Something else is going on here (processing, memory allocation, or some sort of massive inefficiency).

@jonnin
No, since I don't know how many lines the file has (all I can get is the file size). Therefore I push_back each time a line is processed. I tried to count the "\n" in the stringstream with std::count, without success.
Well, put something in there. Like 100k or 1 million; play with it and see what happens. If you put 1 million in there and it runs in 30 seconds instead of 300, you can continue this line of thought. If it still takes the same amount of time, that was not your problem. A profile would be nice, but this is pretty easy to test straight up.
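
A sketch of what that preallocation could look like; the sizes are guesses to experiment with, the container names come from the earlier snippet, and their element types are guesses:

#include <string>
#include <utility>
#include <vector>

// Sketch only: reserve capacity up front so push_back does not keep reallocating.
// The numbers are guesses to experiment with, not measured values.
void preallocate(std::vector<std::string>& wStr,
                 std::vector<std::pair<int, int>>& FilterEle)
{
    wStr.reserve(32);              // the sample lines have ~27 fields
    FilterEle.reserve(1000000);    // try 100k, 1M, ... and time each run
}
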
Last edited on
Each file contains lines like this:

Do you know how many fields are in each line and the type of each field? Are all of the file formats the same?

This seems to be a simple CSV type of file and should be easy and straightforward to parse using stringstreams and the extraction operators. IMO, the "split" function in post #1 is probably overly complicated and could probably be much simpler.
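
For example, a rough sketch of that idea for one line of the sample data (the variable names are made up, and only the first four fields are pulled out here):

#include <sstream>
#include <string>

// Rough sketch: parse one CSV line with a stringstream and the extraction
// operators.  Names are invented; the layout follows the sample lines
// ("a1,1,1,3.5,5,...").
bool parseLine(const std::string& line, std::string& id, int& lineNo, float& score)
{
    std::istringstream ss(line);
    char comma;
    int skipped;

    std::getline(ss, id, ',');             // first field, up to the first comma ("a1")
    ss >> lineNo >> comma;                 // second field (int), then eat the ','
    ss >> skipped >> comma;                // third field, not needed here
    ss >> score;                           // fourth field (e.g. 3.5)

    return bool(ss);                       // false if any extraction failed
}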


@jlb, @jonnin,

Eventually I count the lines with count(vect_char.begin(), vect_char.end(), '\n')

I read the whole file into a vector<char>, count the lines, then break it into a vector<string> of lines and resize the result vector to the line count. Then I break each line into another vector<string> on ',' and process it. Now, processing 6.3 GB (2100 files in total) takes around 200 sec. Is there any method that can further reduce the time?
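
The counting step looks roughly like this (simplified sketch; the real code keeps the buffer around to build the lines afterwards):

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Simplified sketch: read the whole file into a vector<char> and count '\n'
// so the result vector can be resized before parsing.
std::size_t countLines(const std::string& filename)
{
    std::ifstream ifs(filename, std::ios::binary);
    std::vector<char> vect_char((std::istreambuf_iterator<char>(ifs)),
                                std::istreambuf_iterator<char>());
    return std::count(vect_char.begin(), vect_char.end(), '\n');
}
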
You seem to be wasting most of your time counting lines, creating a vector<string>, doing some kind of resize, then breaking each line into another vector<string>, and then processing it. Looks like a lot of waste to me.

I recommend you read the file a line at a time, processing those lines as you go.

You haven't answered the questions I posed in my last post.

Also, you really haven't shown enough of your file contents; please post at least 10 representative lines so that we may be able to see some patterns.

@jlb
Below please find sample content of one type of file; I have 6 types to handle, but all are similar.

a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0
a1,10,2,5,5,1,1,2,0,0,12,0,50,18,106,33,100,29,45,9,8,0,1,1,0,0,0
a1,19,3,5,5,1,1,3,0,0,18,0,12,12,52,40,82,49,63,41,23,16,8,2,0,0,0
a1,28,4,5.5,5,1,1,4,0,0,24,0,2,3,17,16,53,53,63,62,43,44,18,22,4,0,4
a1,37,5,3,5,1,1,5,0,0,6,0,157,22,129,18,57,11,6,0,0,0,0,0,0,0,0
a1,46,6,4.5,5,1,1,6,0,0,12,0,41,19,121,31,90,34,37,15,6,4,0,2,0,0,0
a1,55,7,5.5,5,1,1,7,0,0,18,0,10,9,52,36,86,43,67,38,31,15,5,7,1,0,1
a1,64,8,5.5,5,1,1,8,0,0,24,0,0,3,18,23,44,55,72,57,55,43,8,19,1,2,3
a1,73,9,3.5,5,1,1,9,1,0,6,0,149,17,145,21,51,8,8,1,0,0,0,0,0,0,0
a1,82,10,4.5,5,1,1,10,1,0,12,0,47,17,115,35,96,36,32,10,8,3,1,0,0,0,0


I tried reading line by line (not the whole file into memory) before, and it took 312 sec to process 2100 files.
Or should I define a struct like this (based on the format):
struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

and read directly into a vector<MyStruct>? How do I do this?
I have 6 types to handle, but all are similar.

Similar doesn't cut it. In order to properly handle the files you really must know the exact format of the files. Also do you need to process every field in the above file or are you just interested in particular fields?

I tried reading line by line (not the whole file into memory) before, and it took 312 sec to process 2100 files.

That's probably because you're handling the data too much. You should only read the file once, and you should only parse each line once. In your last description of your file handling you were reading the data multiple times and parsing the strings multiple times; this duplication wastes time. Instead of creating a vector<string> containing each line, then making a vector<string> holding each "word", then converting those words to "numbers", just read the line into a string and use a stringstream to parse that line into the proper types of variables.

You may also want to consider creating a structure to hold the data from each line, using meaningful names for each field in the line.
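
A sketch of that idea, based on the sample lines above (the field names are invented, since only the layout is known: text, two ints, a float, then 23 more ints):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Sketch only: one record per CSV line, with invented field names.
struct Record {
    std::string id;        // 1st field, e.g. "a1"
    int         lineNo;    // 2nd field
    int         seq;       // 3rd field
    float       score;     // 4th field
    int         rest[23];  // remaining fields
};

// Read the file once, one getline + one stringstream per line.
std::vector<Record> loadFile(const std::string& filename)
{
    std::vector<Record> records;
    std::ifstream ifs(filename);
    std::string line;

    while (std::getline(ifs, line))
    {
        std::istringstream ss(line);
        Record r;
        char comma;

        std::getline(ss, r.id, ',');
        ss >> r.lineNo >> comma >> r.seq >> comma >> r.score;
        for (int& v : r.rest)
            ss >> comma >> v;

        if (ss)
            records.push_back(r);
    }
    return records;
}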

@jlb

That's probably because you're handling the data too much. You should only read the file once. You should only

No, my first try was to open the file, read one line into a stringstream (using getline) and
parse the line (using the split function and extracting 4 items from it). But this still took 312 sec, so I considered reading the whole file into memory (into a string/vector<char>/char*), hoping to improve performance. Later I found that if I read the whole file into a string (or stringstream) and still use getline, performance is bad. So I use split to break the huge buffer into a vector<string> and each string into items.
At some point you have to decide if it's fast enough. As I said, an as-fast-as-it-can-go but sort of dumb program of mine took 30 sec for 1 GB. You do 6 GB; 6*30 is 180 seconds and yours took 200 seconds. We can trim it, and with extreme effort we might even cut it in half, but is it worth it? Do you have a real-time need?

It sounds like too much work (don't count line endings and break it all up, just process it directly). But rewriting it again to get it from 200 down to 150 ... is it really worth it?

If your system can handle it (RAID? SSD drives? Etc.?) you might be able to multi-thread it and do 4 files at a time in parallel, or however many your system can handle (4 feels good on most typical desktops right now). That might do it in 60 seconds or so, but only if your disks are pretty hot stuff.

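A rough sketch of what that could look like with std::async (the batch size of 4 and the processFile placeholder are just examples to experiment with):

#include <future>
#include <string>
#include <vector>

// Placeholder for whatever per-file work is being done.
void processFile(const std::string& filename)
{
    // ... open and parse one CSV file here ...
}

// Rough sketch: process the files in batches of N in parallel.
void processAll(const std::vector<std::string>& files, std::size_t N = 4)
{
    for (std::size_t i = 0; i < files.size(); i += N)
    {
        std::vector<std::future<void>> batch;
        for (std::size_t j = i; j < files.size() && j < i + N; ++j)
            batch.push_back(std::async(std::launch::async, processFile, files[j]));
        for (auto& f : batch)
            f.get();   // wait for this batch before starting the next
    }
}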

No, my first try was to open the file, read one line into a stringstream (using getline) and
parse the line (using the split function and extracting 4 items from it). But this still took 312 sec, so I considered reading the whole file into memory (into a string/vector<char>/char*), hoping to improve performance.


Have you even profiled the code to see where the program is spending most of its time?

@jlb
Eventually I got the job done with traditional C stuff:
 
fscanf(stream, "%*[^,],%d,%*d,%f,%*s",

I ignore the fields I don't use (sacrificing flexibility to gain performance) and it is now well below 150 sec for 2100 files (6.3 GB of data). I need to squeeze performance because I have over 200000 such files to handle.
You might want to give stream a large buffer with setvbuf(). Sometimes that greatly improves performance. Start with 64k. Measure the performance, double the size and measure again. Keep going until the performance tops off or the extra space isn't worth it.
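
For instance, a sketch of that suggestion (64 KiB is just the starting size to measure from, and the buffer has to outlive all use of the stream):

#include <cstdio>
#include <vector>

// Sketch: give the FILE* a larger stdio buffer right after opening it,
// before any reads.  Try 64k, then 128k, 256k, ... and measure each.
FILE* openWithBigBuffer(const char* filename, std::vector<char>& iobuf)
{
    FILE* stream = std::fopen(filename, "r");
    if (stream)
    {
        iobuf.resize(64 * 1024);
        std::setvbuf(stream, iobuf.data(), _IOFBF, iobuf.size());
    }
    return stream;
}
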
This is pretty much the same performance as

int x, y, z;
double d;
string s;

cin >> x >> y >> d >> s;

or some similar statement that directly reads into the correct types. C is very marginally faster than C++ for a few things (usually when the C++ uses a container with hidden overhead and the C avoids that with a POD struct or array), but I doubt those few nanoseconds are what was at play here. Feel free to use the C++ version if you find it cleaner looking.
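
For the format above, a stream-based equivalent of that fscanf call might look roughly like this (untested sketch; which fields are kept is guessed from the earlier snippet):

#include <istream>
#include <limits>

// Untested sketch mirroring "%*[^,],%d,%*d,%f,%*s": skip the first field,
// read an int, skip an int, read a float, then discard the rest of the line.
bool readRecord(std::istream& in, int& lineNo, float& score)
{
    char comma;
    int skipped;

    in.ignore(std::numeric_limits<std::streamsize>::max(), ',');   // skip "a1,"
    in >> lineNo >> comma >> skipped >> comma >> score;
    in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // rest of the line
    return bool(in);
}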

