Unsolved problem, don't let it sink: Read huge txt's into memory efficiently?

Jan 2, 2011 at 7:33pm
I am using the method below to read a large space-delimited txt file (about 900 MB). It took me 879 seconds to load the data into memory. I am wondering if there is a more efficient way to read the txt file?

Another associated question is: is it a good idea to store such a huge data set using a 2D vector?


Here is my code:
void Grid::loadGrid(const char* filePathGrid)
{
        // 2D vector to contain the matrix
        vector<vector<float>> data;
        
        unsigned nrows, ncols;
        double xllcorner, yllcorner;
        int cellsize, nodataValue;
        const int nRowHeader = 6;
	string line, strtmp;

	ifstream DEMFile;
	DEMFile.open(filePathGrid);
	if (DEMFile.is_open())
	{			
		// read the header (6 lines)
		for (int index = 0; index < nRowHeader; index++)
		{
			getline(DEMFile, line);	
			istringstream  ss(line);
			switch (index)
			{
				case 0: 
					while (ss >> strtmp)
					{						
						istringstream(strtmp) >> ncols;						
					}
					break;
				case 1:
					while (ss >> strtmp)
					{
						istringstream(strtmp) >> nrows;						
					}
					break;
				case 2: 
					while (ss >> strtmp)
					{
						istringstream(strtmp) >> xllcorner;						
					}
					break;					
				case 3:
					while (ss >> strtmp)
					{
						istringstream(strtmp) >> yllcorner;						
					}
					break;						
				case 4:
					while (ss >> strtmp)
					{
						istringstream(strtmp) >> cellsize;						
					}
					break;						
				case 5:
					while (ss >> strtmp)
					{
						istringstream(strtmp) >> nodataValue;						
					}
					break;					
			}			
		}

		// Read in the elevation values
		if (ncols * nrows > 0)
		{		
			// Set up sizes. (rows x cols)
			data.resize(nrows);
			for (unsigned row = 0; row < nrows; ++row)
			{
				data[row].resize(ncols);
			}

			// Load values in	
			unsigned row = 0;
			while (row < nrows)
			{							
				getline(DEMFile, line);
				istringstream ss(line);
				for (unsigned col =0; col < ncols; col++)
				{
					ss >> data[row][col];
				}
				row ++;
			}
			DEMFile.close();			
		}		
	}
	else cout << "Unable to open file"; 
}




Below is the sample data:
// header
ncols 19092
nrows 6219
xllcorner 581585.1569801
yllcorner 4612170.4651427
cellsize 2
NODATA_value -9999
//data body
....................
....................
Jan 2, 2011 at 8:20pm
Have you tried with the highest optimization setting? The reason I'm asking is that stream usage often benefits a lot from optimization. Which compiler are you using?
Jan 2, 2011 at 8:54pm
With respect to your second question:

I would have chosen to use a one-dimensional vector, and then index it by (row*ncols + col).

This will at least reduce memory consumption, but it may also have a significant impact on speed.

I don't remember whether a 'vector of vectors' is an idiom endorsed by the standard, but there is a risk that too much copying and memory reallocation is going on if there is no special handling of the 'vector of vectors' case.
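Roughly something like this (just a sketch with made-up names, not your actual code):

#include <vector>

// Sketch: a flat 1D grid where element (row, col) lives at index row*ncols + col.
struct FlatGrid
{
    unsigned nrows, ncols;
    std::vector<float> cells;

    FlatGrid(unsigned r, unsigned c) : nrows(r), ncols(c), cells(r * c) {}

    float& at(unsigned row, unsigned col) { return cells[row * ncols + col]; }
};

The reading loop then stores into grid.at(row, col) instead of data[row][col]; the whole grid lives in one contiguous allocation.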

Jan 2, 2011 at 8:54pm
Have you tried with the highest optimization setting? The reason I'm asking is that stream usage often benefits a lot from optimization. Which compiler are you using?

Sorry, I don't know how to use those "optimization settings" you referred to..., and the compiler I am using is Visual Studio 2008.
Jan 2, 2011 at 9:01pm
I know nothing about Visual Studio, but for now, you could check out this one:

http://efreedom.com/Question/1-1416891/Optimization-Options-Work-VSCPlusPlus-2008
Jan 2, 2011 at 9:03pm
I also recommend you try to change your implementation to use a one-dimensional vector as per my second post.
Jan 2, 2011 at 9:04pm
With respect to your second question:

I would have chosen to use a one-dimensional vector, and then index it by (row*ncols + col).

This will at least reduce memory consumption, but it may also have a significant impact on speed.

I don't remember whether a 'vector of vectors' is an idiom endorsed by the standard, but there is a risk that too much copying and memory reallocation is going on if there is no special handling of the 'vector of vectors' case.

I am new to C++; I followed the suggestion from a post in this forum (I could not find it now...) to use a 2D vector to contain the large data set. But I will try your suggestion. Thanks for your help, and have a nice day!
Jan 2, 2011 at 9:59pm
I also recommend you try to change your implementation to use a one-dimensional vector as per my second post.

I modified my 2D vector into a 1D one; however, the speed is almost the same...
Jan 2, 2011 at 10:51pm
Alright,

Then I suggest that you change

			while (row < nrows)
			{							
				getline(DEMFile, line);
				istringstream ss(line);
				for (unsigned col =0; col < ncols; col++)
				{
					ss >> data[row][col];
				}
				row ++;
			}


to

			istringstream ss;
			while (row < nrows)
			{
				getline(DEMFile, line);
				ss.clear();     // reset the eof/fail flags before reusing the stream
				ss.str(line);
				for (unsigned col = 0; col < ncols; col++)
				{
					ss >> data[row][col];
				}
				row++;
			}


It could be quite expensive to recreate a string stream from scratch at every line.

If that does not help significantly, I urge you to find out how to max out your optimization settings, and see what that does.
Jan 3, 2011 at 4:55am
while (ss >> strtmp) //what are you doing here?
{
	istringstream(strtmp) >> ncols; //the value of ncols will be over written
}
while( ss>>ncols ) //quasi-equivalent code
  ;
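
Since each header line is just a keyword followed by its value (e.g. "ncols 19092"), you could also skip the inner loop entirely. A tiny sketch reusing your own variables (the extra 'label' string is mine):

string label;
getline(DEMFile, line);
istringstream header(line);
header >> label >> ncols;   // reads the keyword, then the number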

Try to use a binary file instead of plain text
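
For example, once the text has been parsed one time you could dump the grid as raw floats and reload it with a single read on later runs. A rough sketch (illustrative names, no error checking, C++03 style so it builds in VS2008):

#include <cstddef>
#include <fstream>
#include <vector>

// Write the parsed grid out once as raw bytes.
void saveBinary(const char* path, const std::vector<float>& cells)
{
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&cells[0]), cells.size() * sizeof(float));
}

// Later runs reload it with one read instead of millions of string conversions.
void loadBinary(const char* path, std::vector<float>& cells, std::size_t count)
{
    cells.resize(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(&cells[0]), count * sizeof(float));
}

The one-time text-to-binary conversion still costs you the slow parse, but every run after that is basically a single large read.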
Jan 3, 2011 at 3:29pm
I am trying to read in a string (e.g. "1234") and convert it to a number.
Is C++ really not capable of loading large txt files into memory efficiently? I doubt that...
Jan 3, 2011 at 3:43pm
Don't use a std::vector. Use a std::deque.
The vector doesn't play well with high memory usage, but the deque does.

Another option is to memory map the file.
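
On Windows that means CreateFileMapping/MapViewOfFile. Very roughly, with all error checking omitted (filePathGrid is the path from your loadGrid function), the mapping part is:

#include <windows.h>

HANDLE file    = CreateFileA(filePathGrid, GENERIC_READ, FILE_SHARE_READ, NULL,
                             OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
const char* bytes = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

// ... scan/convert the numbers straight out of 'bytes', no getline/stringstream copies ...

UnmapViewOfFile(bytes);
CloseHandle(mapping);
CloseHandle(file);

You still have to parse the mapped text yourself; the win is that the OS pages the file in for you instead of you copying it line by line.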
Jan 3, 2011 at 4:24pm
Another option is to memory map the file.

Could you please give a simple code example of the "memory map" technique? Sorry, I am new to C++. Thanks!
Jan 3, 2011 at 6:49pm
Hi all,
Is this problem really a big challenge? Please help, don't let it sink before it is solved, thanks!
Jan 3, 2011 at 8:05pm
+1 for using std::deque as Duoas proposes.

Looking at your code, you load the data in parsed form, so it is never completely loaded in memory per se.

std::deque is better suited for arbitrary growth, because it does not copy all elements every time it has to increase the space. It just allocates additional storage and links it. I think it is implemented as a vector of pointers to fixed-size arrays; when it has to grow, it creates a new array and adds its pointer to the vector.

Regards
Jan 3, 2011 at 8:10pm
Reading 884 MB with a simple

while( input.read(&c, sizeof(char)) ) //1 byte at a time
  v.push_back(c);

it took ~ 1m30s
with 4 bytes at a time -> 36s

Maybe if you get rid of those string-to-number conversions (or get better hardware)...
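
For example, a rough sketch that slurps everything after the six header lines into one buffer and converts with strtod instead of per-line stringstreams (it reuses the variables from your loadGrid; 'data' here is assumed to be the flat 1D vector suggested earlier; no error checking):

// (needs #include <cstdlib> for strtod)
streampos bodyStart = DEMFile.tellg();          // header already read with getline
DEMFile.seekg(0, ios::end);
size_t bodySize = static_cast<size_t>(DEMFile.tellg() - bodyStart);
DEMFile.seekg(bodyStart);

vector<char> buffer(bodySize + 1, '\0');        // extra '\0' terminates the text
DEMFile.read(&buffer[0], bodySize);

char* p = &buffer[0];
for (unsigned i = 0; i < nrows * ncols; ++i)
        data[i] = static_cast<float>(strtod(p, &p));   // data[row*ncols + col]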
Jan 3, 2011 at 8:52pm
Reading 884 MB with a simple

while( input.read(&c, sizeof(char)) ) //1 byte at a time
  v.push_back(c);

it took ~ 1m30s
with 4 bytes at a time -> 36s


ne555, your response gives me hope! Could you please kindly help further by providing code to load the following sample data I made? I guess the code you posted reads in strings, right? However, I need to convert the strings of numbers to float numbers. I tried to mimic your code but failed.

ncols 5
nrows 3
xllcorner 581585.1569801
yllcorner 4612170.4651427
cellsize 2
NODATA_value -9999
1.0 1.1 1.2 1.3 1.4
3.2 3.5 2.3 3.1 4.4
2.3 2.5 2.6 2.9 5.1
Jan 3, 2011 at 8:55pm
+1 for using std::deque as Duoas proposes.

Looking at your code, you load the data in parsed form, so it is never completely loaded in memory per se.

std::deque is better suited for arbitrary growth, because it does not copy all elements every time it has to increase the space. It just allocates additional storage and links it. I think it is implemented as a vector of pointers to fixed-size arrays; when it has to grow, it creates a new array and adds its pointer to the vector.

Regards

Simeonz, and Duoas,
Thanks for your responses. I changed my vector to a std::deque; however, it seems the code is even slower... Probably I am an idiot, I am really new to C++.


Jan 3, 2011 at 9:37pm
Well, actually that is because the vector in your code doesn't grow. My mistake.

EDIT: Is there anything in the number structure that can be exploited? What is the range? Are all numbers given with the same precision, like d.d?
Jan 3, 2011 at 9:49pm
Simeonz, not really. Some of them are integers (-9999), and some of them have four decimals (dd.ffff).