Reading a particular line from an output file

Hi,

I have an output file with more than 1,000,000,000 lines, which I read from another C++ program. While reading it through cin, I want to jump directly to, say, the 5,000,000th line and start reading data from there. Is this possible? Could someone please post a short C++ example?

Thanks in advance!
I would imagine you could only do this if the lines are of fixed length. Is that the case?
If not, you don't have any other choice than building an index with line offsets beforehand.
@ Galik: Yes, thankfully the lines are of fixed length.
@ Athar: Could you tell me how to build an index with the line offsets?
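One way to build such an index, as a minimal sketch: scan the file once, remember the byte offset where each line starts, and later seek straight to the stored offset. The function names build_line_index and read_line_at are made up for illustration, and the sketch assumes the file uses plain '\n' line endings.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Scan the file once and record the byte offset at which every line starts.
// offsets[i] is the position of line i (0-based).
std::vector<std::int64_t> build_line_index(const std::string& path)
{
	std::vector<std::int64_t> offsets;
	std::ifstream in(path, std::ios::binary); // binary: offsets are raw byte positions
	std::string line;
	std::int64_t pos = 0;
	while (std::getline(in, line))
	{
		offsets.push_back(pos);
		pos += static_cast<std::int64_t>(line.size()) + 1; // +1 for the '\n' getline consumed
	}
	return offsets;
}

// Jump straight to line n (0-based) using the index and read it into `out`.
bool read_line_at(const std::string& path, const std::vector<std::int64_t>& offsets,
                  std::size_t n, std::string& out)
{
	if (n >= offsets.size()) return false;
	std::ifstream in(path, std::ios::binary);
	in.seekg(static_cast<std::streamoff>(offsets[n]));
	return static_cast<bool>(std::getline(in, out));
}
```

For a file with billions of lines a full index would itself be very large, so recording only every Nth line (or the first line of each block you care about) is probably the more practical variant.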
Actually you might have a problem with cin. Is there no way you can read directly from the file?

I don't think seekg() will work on cin. But it should work on a normal file.
You might have some luck using the ignore() function. I don't know how efficient it will be but it should be much faster than reading everything.
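For the seekg() route on a normal file, a rough sketch could look like the following. It assumes every line occupies exactly the same number of bytes, newline included; read_fixed_line is a hypothetical helper name, not something from the standard library.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Read line `line_no` (0-based) from a file in which every line occupies
// exactly `line_bytes` bytes, newline included.
bool read_fixed_line(const std::string& path, std::size_t line_no,
                     std::size_t line_bytes, std::string& out)
{
	std::ifstream in(path, std::ios::binary); // binary: no newline translation
	if (!in) return false;
	in.seekg(static_cast<std::streamoff>(line_no) *
	         static_cast<std::streamoff>(line_bytes));
	return static_cast<bool>(std::getline(in, out)); // false if we seeked past the end
}
```

The multiplication is done in std::streamoff rather than int so that large line numbers do not overflow where the implementation provides a 64-bit offset type.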

I am pretty sure it will work on cin too, at least it does in my small test.

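#include <cstddef>
#include <iostream>
#include <string>

int main()
{
	std::size_t line_length = 11; // characters per line, including the newline
	std::size_t skip_lines = 500; // number of lines to skip

	// Discard the first skip_lines lines without storing them.
	std::cin.ignore(skip_lines * line_length);

	std::string line;
	if(!std::getline(std::cin, line))
	{
		// ignore() only sets eofbit on a short input, so test the read instead.
		std::cerr << "error: input ended before line " << skip_lines << std::endl;
		return 1;
	}
	std::cout << "line = " << line << std::endl;

	return 0;
}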
That is a ridiculously large file. I would think you could even run into problems with some file systems not being able to handle files that large.
Also, if you could identify proximity to the required line from the content of another line then you could use some kind of successive approximation method to cut down access time considerably.

But that very much depends on each line having some kind of location-bearing information relative to the lines below and above it.
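As a rough sketch of what such a successive approximation could look like: bisect on byte offsets, and at each probe read the next whole line to see whether its key is already at or past the target. This assumes the second whitespace-separated field of every line is a key that never decreases through the file and that every line parses; at_or_past and find_first_at_least are hypothetical names.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// True if the line examined at byte `pos` has key >= target, or if there is
// no such line. For pos > 0 the (possibly partial) line containing `pos` is
// discarded first, so the line examined is the next one to start after it.
static bool at_or_past(std::ifstream& in, std::int64_t pos, double target)
{
	in.clear(); // reset eof/fail bits left by earlier probes
	in.seekg(static_cast<std::streamoff>(pos));
	std::string line;
	if (pos > 0 && !std::getline(in, line)) return true;
	if (!std::getline(in, line)) return true;
	std::istringstream fields(line);
	double x = 0, key = 0;
	if (!(fields >> x >> key)) return true;
	return key >= target;
}

// Byte offset of the first line whose key is >= target, or -1 if none.
std::int64_t find_first_at_least(const std::string& path, double target)
{
	std::ifstream in(path, std::ios::binary);
	if (!in) return -1;
	in.seekg(0, std::ios::end);
	std::int64_t hi = static_cast<std::int64_t>(in.tellg());
	if (hi <= 0) return -1;
	if (at_or_past(in, 0, target)) return 0;

	// Invariant: at_or_past(lo) is false and at_or_past(hi) is true.
	std::int64_t lo = 0;
	while (hi - lo > 1)
	{
		std::int64_t mid = lo + (hi - lo) / 2;
		if (at_or_past(in, mid, target)) hi = mid; else lo = mid;
	}

	// hi now lies in (or at the start of) the last line below the target;
	// discarding that line leaves the stream at the first match.
	in.clear();
	in.seekg(static_cast<std::streamoff>(hi));
	std::string skipped;
	if (!std::getline(in, skipped)) return -1;
	std::int64_t found = static_cast<std::int64_t>(in.tellg());
	if (found < 0 || in.peek() == std::char_traits<char>::eof()) return -1;
	return found;
}
```

The number of probes grows with the logarithm of the file size, so even a file of several hundred gigabytes needs only a few dozen seeks.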
@mugga

Is this your particle data? Is it ordered according to some sequence (say) each entry is an increasing value of y?
@galik
I have posted my question, and my code as well. Please explain what's wrong with my code.
I'd recommend using system calls to seek and read the file (e.g. Windows' SetFilePointerEx() and ReadFile()), rather than standard C++ functions. 32-bit implementations tend to have problems handling files larger than 2 GiB.
@ Galik - Thanks for all the help.
Yes, this is my particle data that I had asked about previously. And yes, it is ordered in a particular sequence.
The ordering is done in increasing values of the parameter 'y'.
The output file starts off with y = 0, and I want the particles with y = 10 (say). If I read each line, it takes a massive amount of time to reach y = 10 because of the large no. of particles. I also know the no. of particles contained in each value of y ( = total no. of particles / no. of layers in the y direction). So if the value of y does not match the value I want, I can actually skip that many particles!
Could you help me out with this?

@ Helios - thanks...will try that out as well
@mugga

You say the lines are of fixed length. From your previous data you have this:


21.2342 11.2430 23.5453 0.005 2.25 86


Can you absolutely guarantee that the number of digits in each tab-separated field is fixed? For instance can the last number ever go above 99? Or can any of the xyz values go above 99?

Because if the lines are absolutely fixed length this will be much easier. And I mean not just a fixed number of values but a fixed number of characters?
@ Galik

Well, the problem is that the values of x,y and z are floating point values. But sometimes the value of x may be 21.2342 and some other time it might be 21.23423. So basically, the no. of digits may not always remain fixed, and hence the no. of characters in each line might not remain fixed. Can this problem still be resolved?

Apologies for the late reply!
I suspected as much. The problem can still be solved in that the access can be made more time efficient. That is as long as your file does not break the physical capabilities of the iostream library implementation on your system!

Because you have variable length lines it is not possible to hit the exact line with a seekg(). However you can make a reasonable guess. If you were to undershoot then you could guarantee not to miss the data and still get reasonably close to it.

I have another question. Can you predict with accuracy the value of y for each line of data? Does y increment by a fixed amount?
Oh! I thought that I could possibly use seekg() as I was under the impression that it seeks based on the size of the parameters ( I thought that since the values of x, y, z etc are all floating point values, the memory size of those parameters would be the same i.e. 8 bytes, irrespective of the no. of digits it has, and hence seekg() would probably work! )

Anyways, yes we can predict the value of y of each line accurately. And yes, the value of y increases by a fixed amount each time. Will this ease the problem somehow?
We can still use seekg(), but not to hit the exact line as the number of characters varies.

What I would be tempted to do is read in a portion at the beginning of the file to calculate a good average line length, or write a separate program to do that.
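A small sketch of that sampling step, assuming '\n' line endings; average_line_length is a made-up helper name and the sample size is whatever you consider representative:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Average bytes per line (newline included) over the first `sample_lines`
// lines of the file. Returns 0.0 if no line could be read.
double average_line_length(const std::string& path, std::size_t sample_lines)
{
	std::ifstream in(path, std::ios::binary);
	std::string line;
	std::uint64_t bytes = 0;
	std::size_t lines = 0;
	while (lines < sample_lines && std::getline(in, line))
	{
		bytes += line.size() + 1; // +1 for the '\n' getline consumed
		++lines;
	}
	return lines ? static_cast<double>(bytes) / static_cast<double>(lines) : 0.0;
}
```

If you would rather be sure to undershoot, tracking the shortest line seen in the sample instead of the average is one possible variation.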

Then I would seekg() to a location calculated from the average line length, and the number of required lines to skip. The number of skip lines can be calculated from the initial value of y (zero) and the value of y you are seeking and the number of y lines per layer.

Once at the guessed position you could then track back to the start location if you went too far with the seekg().

After that it's a simple sequential read.
Hey,

Sorry, but i didn't get this part -
"Once at the guessed position you could then track back to the start location if you went too far with the seekg()."
Could you please explain what you meant again.

Thanks!
Well you can calculate exactly which line you want to jump to.


You know how many y layers lie between 0.0 and the one you seek: the value you are searching for divided by the y increment of each layer.


y_layers_to_skip = y_search / y_layer_inc


You know the number of particles (lines) per layer:


y_per_layer = total_particles / total_y_layers


So you can calculate exactly how many lines to skip:


lines_to_skip = y_per_layer * y_layers_to_skip


And you know the average number of characters per line so your seekg() value can be calculated:


skip = lines_to_skip * average_line_length
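Put together, the arithmetic above might be written like this. It is only a sketch: seek_offset is a hypothetical name, the rounding of the layer count is my assumption, and everything is kept in 64-bit integers because the product can easily exceed what a 32-bit int holds.

```cpp
#include <cmath>
#include <cstdint>

// Estimated byte offset of the first line of the layer with y == y_search.
std::int64_t seek_offset(double y_search, double y_layer_inc,
                         std::int64_t total_particles, std::int64_t total_y_layers,
                         double average_line_length)
{
	// Round to the nearest whole layer to absorb floating point noise.
	std::int64_t y_layers_to_skip = std::llround(y_search / y_layer_inc);
	std::int64_t y_per_layer = total_particles / total_y_layers;
	std::int64_t lines_to_skip = y_per_layer * y_layers_to_skip;
	return static_cast<std::int64_t>(
		static_cast<double>(lines_to_skip) * average_line_length);
}
```

For example, with 2,000,000,000 particles in 1,000 layers, a layer increment of 0.5, a target of y = 10 and 84 characters per line, the result is 3,360,000,000, which no longer fits in a signed 32-bit integer.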


So you can jump into the file there:

std::ifstream ifs("particle_data.txt");

ifs.seekg(skip);


However we are only using an *average* line length so we might be slightly off.

We can align our data to the start of a line with a getline() read:

std::string line;
std::getline(ifs, line); // Align to start of  line 


We may be one or two lines too far. So we could then loop, reading lines in reverse, until we reach a value of y that is less than the one we are searching for.

Then we know we are at the very beginning of the layer and we can simply do a sequential read forward from there on in.

Does that make sense?
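Here is a rough sketch of the whole procedure. Instead of literally reading lines in reverse, it tracks back by seeking a fixed number of bytes earlier until the first whole line it finds lies below the target layer, then reads forward. That substitution, the name read_layer, and the parameters step (bytes to back off per attempt) and eps (tolerance when comparing y) are my assumptions; the sketch also assumes y is the second field on each line and never decreases through the file.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Collect every line whose y lies within eps of y_target, starting the
// search near byte `guess` instead of at the beginning of the file.
std::vector<std::string> read_layer(const std::string& path, std::int64_t guess,
                                    std::int64_t step, double y_target, double eps)
{
	std::ifstream in(path, std::ios::binary);
	std::vector<std::string> layer;
	std::string line;
	double x = 0, y = 0;
	std::int64_t start = guess;

	// Track back until the first whole line after `start` is below the
	// target layer, or until we reach the beginning of the file.
	while (start > 0)
	{
		in.clear();
		in.seekg(static_cast<std::streamoff>(start));
		std::getline(in, line); // align: discard the partial line
		bool ok = static_cast<bool>(std::getline(in, line));
		std::istringstream fields(line);
		if (ok && (fields >> x >> y) && y < y_target - eps) break;
		start = start > step ? start - step : 0;
	}
	if (start == 0)
	{
		in.clear();
		in.seekg(0);
	}

	// Simple sequential read from here on.
	while (std::getline(in, line))
	{
		std::istringstream fields(line);
		if (!(fields >> x >> y)) continue;
		if (y < y_target - eps) continue; // still below the layer
		if (y > y_target + eps) break;    // past the layer, done
		layer.push_back(line);
	}
	return layer;
}
```

If the guess is good, the backward loop runs once or twice; a guess that is far off only costs extra seeks, not a wrong result.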
Hey,

Thanks for the detailed explanation. I took the code that you gave me previously (in the other thread) and modified it a bit for this situation. However, the seekg() call doesn't seem to be working. The code compiles and runs, but it still takes a long time to read out the desired particles. Could you please look at the code and tell me what the problem could be?

#include <string>
#include <sstream>
#include <iostream>
#include <stdio.h>
using namespace std;

struct particle
{
	double x;
	double y;
	double z;
	double radius;
	double dencity;
	int type;
};

int main(int argc, char* argv[])
{
	if(argc < 2)
	{
		cerr << "Error, need to supply y as argument." << endl;
		return 1;
	}

	double y;
	bool next = false;
	int count = 0;
	istringstream iss(argv[1]);
	if(iss >> y)
	{
		string line;
		while(getline(cin, line))
		{
			istringstream iss(line);
			particle p;
			iss >> p.x;
			iss >> p.y; 
			iss >> p.z;
			iss >> p.radius;
			iss >> p.dencity;
			iss >> p.type;

			if (p.y != y && count == 0) // For jumping once only
			{	
				int line = int (2024807438/15000); //no. of particles / layers
				int size = 84;	// average size of each line
				int jump = int (y/0.0021334 + 0.5);  // layers to be skipped
				int pos = line * size * jump;	// total jump
				iss.seekg(pos);  // this doesnt seem to work. If I do a cout here to check if its working or not, the cout does output only once, but the program still takes a long amount of time for larger values of y.
				count++;
			}

			getline(cin, line);
			if(p.y == y)
			{
				cout << p.x << '\t';
				cout << p.y << '\t';
				cout << p.z << '\t';
				cout << p.radius << '\t';
				cout << p.dencity << '\t';
				cout << p.type << endl;
				next = true;
			}

			if (p.y != y && next == true)
			{
				exit(EXIT_FAILURE);
			}
		}
	}
	else
	{
		cerr << "Argument y was not valid." << endl;
		return 1;
	}

	return 0;
}


Thanks in advance