Importing 1000 Files, How To Skip "Bad" Lines In File?


Hi everyone,

I'm a moderately experienced C++ programmer working on code which must do the following:
(a) Import data from a lot of little files
(b) Load that data into various objects
(c) Do stuff with that data

The code I've written does (a), (b), and (c) pretty well, but I've noticed a problem with (a), which I want to ask you guys about.

Suppose I have 1000 source files. My program successfully processes Files #1 through #500. But when it reaches File #501, my program chokes and seg faults, and I automatically lose ALL the data I've collected. This is a big problem, because there is a LOT of data to process. It may take me three or four hours just to reach File #500.

When the program reads a file, each individual line is loaded into a string called Line, which is then parsed for individual values. If I'm reading gdb right (output below), the parsing is causing the trouble (lines 15-16 of the code below). As for the line in the file that's causing the trouble, I don't see any format problems with the line itself. When I run the program multiple times, it is the exact same line that causes the seg fault every time.

What would be awesome would be a way to tell the program, "if you see a line which confuses you, skip that line, don't just automatically crash!" Skipping the entire file would be okay too.

Below is the code I'm using. Below that is the gdb analysis of why my program is choking. Any help or advice would be appreciated!

 1  vector<string> ListOfFiles;
 2  string Line, Value;
 3  vector<string> ValRow;
 4
 5  // Load all the file names into ListOfFiles
 6
 7  for(int i=0; i<ListOfFiles.size(); i++)
 8    {
 9      Line.clear();
10      ifstream In_Flows((ListOfFiles[i]).c_str());
11      while (getline(In_Flows, Line))
12        {
13          istringstream linestream(Line);
14          ValRow.clear();
15          while(getline(linestream, Value, ','))
16            { ValRow.push_back(Value); }
17        }
18      // Load contents of ValRow into objects
19    }


GDB Output:
Program received signal SIGSEGV, Segmentation fault.
0xff056b20 in realfree () from /lib/libc.so.1
(gdb) bt
#0 0xff056b20 in realfree () from /lib/libc.so.1
#1 0xff0573d4 in cleanfree () from /lib/libc.so.1
#2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
#3 0xff05641c in malloc () from /lib/libc.so.1
#4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
#5 0xff318fe4 in std::string::_Rep::_S_create ()
from /usr/local/lib/libstdc++.so.6
#6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
#7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
#8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, ATAFlag=true)
at ReadTheFile.h:119
#9 0x0001a2f0 in main (argc=2, argv=0xffbffccc) at Main.cpp:60
(gdb) up
#1 0xff0573d4 in cleanfree () from /lib/libc.so.1
(gdb) up
#2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
(gdb) up
#3 0xff05641c in malloc () from /lib/libc.so.1
(gdb) up
#4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
(gdb) up
#5 0xff318fe4 in std::string::_Rep::_S_create ()
from /usr/local/lib/libstdc++.so.6
(gdb) up
#6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
(gdb) up
#7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
(gdb) up
#8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, Flag=true)
at ReadTheFile.h:119
119 while(getline(linestream, Value, ','))
(gdb) print Line
$1 = {static npos = 4294967295,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0xd756c "DataPoint0,DataPoint1,DataPoint2,DataPoint3,DataPoint4,DataPoint5,DataPoint6,DataPoint7,DataPoint8,DataPoint9"}}
(gdb)
The easiest way is to use a try/catch block:

try
{
    // parse the line
}
catch (const std::exception& e) // catch by const reference, not by value
{
    // handle known exceptions
}
catch (...)
{
    // handle anything else
}

This would suppress all errors, but you should probably write a handler function for the errors that you catch. Don't forget to close the files when you catch an error in a file.
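
For example, here is a minimal sketch of skipping bad lines this way (ParseLine is a hypothetical helper standing in for your own parsing code, and the expected field count of 10 is made up; note that try/catch only handles C++ exceptions, not signals like SIGSEGV):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split one comma-separated line into fields,
// throwing if the line doesn't look right (here: wrong field count).
std::vector<std::string> ParseLine(const std::string& line, std::size_t expected)
{
    std::vector<std::string> fields;
    std::istringstream linestream(line);
    std::string value;
    while (std::getline(linestream, value, ','))
        fields.push_back(value);
    if (fields.size() != expected)
        throw std::runtime_error("unexpected field count");
    return fields;
}

void ReadOneFile(const std::string& name)
{
    std::ifstream in(name.c_str());
    std::string line;
    while (std::getline(in, line))
    {
        try
        {
            std::vector<std::string> row = ParseLine(line, 10);
            // ... load row into objects ...
        }
        catch (const std::exception& e)
        {
            // Report the bad line and keep going instead of dying
            std::cerr << "Skipping bad line in " << name
                      << ": " << e.what() << '\n';
        }
    }
}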
I can't see any obvious mistakes there. I just had a couple of ideas that you might want to try:

vector<string> ListOfFiles;
string Line, Value;
vector<string> ValRow;

// Move this outside so that the object is not being constantly
// constructed and destructed - maybe draining a resource?
ifstream In_Flows;

// Load all the file names into ListOfFiles

for(int i=0; i<ListOfFiles.size(); i++)
  {
    Line.clear();

    In_Flows.clear(); // reset eofbit/failbit left over from the previous file
    In_Flows.open((ListOfFiles[i]).c_str());

    if(In_Flows) // check before using getline
      {
        while (getline(In_Flows, Line))
          {
            istringstream linestream(Line);
            ValRow.clear();
            while(getline(linestream, Value, ','))
              { ValRow.push_back(Value); }
          }
      }
    In_Flows.close(); // Explicitly close

    // Load contents of ValRow into objects
  }


Hope that might help.
Looking at your debug output, it seems like you are running out of memory.

Maybe you need to write your collected data to file every once in a while to free some RAM?
Are you storing a lot of information in vectors? They want contiguous memory; maybe try a deque...
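
To make that concrete, here is a sketch of the swap (DataRow is a made-up name for however you store one parsed line). A vector keeps all its elements in one contiguous buffer and roughly doubles that buffer when it grows, briefly needing old plus new memory at once; a deque grows in fixed-size chunks with no giant reallocation:

#include <deque>
#include <string>
#include <vector>

typedef std::vector<std::string> DataRow; // hypothetical: one parsed line

// Was: std::vector<DataRow> AllRows;
// A deque allocates in chunks, so it copes better with a fragmented
// or nearly-full heap than one ever-growing contiguous buffer does.
std::deque<DataRow> AllRows;

void StoreRow(const DataRow& row)
{
    AllRows.push_back(row); // same call as with a vector
}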
@RedX: Certainly not. You can't block SIGKILL, and it's a BAD idea to ignore SIGSEGV. In fact, it's a bad idea to ignore ANY signals. Catch them, yes; ignore them, no.

I recommend that you load the files you need as you need them and afterward delete them. It saves memory, even though it's slower.
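
For the "delete them afterward" part, note that clear() alone usually keeps a vector's capacity allocated; before C++11 the usual way to actually give the memory back is the swap trick. A sketch, assuming the per-file data lives in something like your ValRow:

// After loading the file's contents into objects, give the memory back.
// ValRow.clear() would destroy the elements but usually keeps the capacity,
// so swap with an empty temporary instead (the pre-C++11 idiom):
std::vector<std::string>().swap(ValRow);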

Or upgrade your system.

-Albatross
You are absolutely right, Albatross, that ignoring those signals is bad, but I was unaware that try/catch would also catch those interrupts; see http://www.cplusplus.com/forum/general/5957/. catch(...) is always a bad idea, since you can't identify what went wrong, but sometimes it's just needed (or easier than finding out what's wrong ;-) ).

I also thought that a probable reason his program is crashing is memory shortage.
Galik mentioned something I noticed - in your example code, you were not explicitly closing each file when you were done, so I was wondering whether you might be running out of file handles at file #500.

You have no checks to see if the file opened successfully - attempting to read from a bad file handle will almost certainly cause a crash.
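
Reusing the names from the code above, a minimal sketch of that check at the top of the loop body, so files that fail to open are skipped instead of read:

In_Flows.open(ListOfFiles[i].c_str());
if (!In_Flows.is_open())
{
    std::cerr << "Could not open " << ListOfFiles[i] << " - skipping\n";
    continue;
}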
This is all excellent, thanks so much, guys! I'll implement your suggestions...

Galik, you say you see memory problems in the gdb output; can you tell me what you see?
moorecm +1

Use a std::deque or std::list to store your data.

[edit] The file is local to the loop, so it should automatically close at the end of each loop.
phummon2 wrote:
Galik, you say you see memory problems in the gdb output; can you tell me what you see?


Here is the stack unwinding (back trace) from the error:


#0 0xff056b20 in realfree () from /lib/libc.so.1
#1 0xff0573d4 in cleanfree () from /lib/libc.so.1
#2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
#3 0xff05641c in malloc () from /lib/libc.so.1
#4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
#5 0xff318fe4 in std::string::_Rep::_S_create ()
from /usr/local/lib/libstdc++.so.6


I noticed that the error seems to have occurred while calling std::string::_Rep::_S_create() from the standard library.

It looks a bit like something bad happened while trying to either grab or release memory during a std::string operation.

Duoas wrote:
[edit] The file is local to the loop, so it should automatically close at the end of each loop.


Yes indeed.

But I had the thought that by calling the fstream constructor/destructor each time around the loop, the implementation may be triggering the problem. Perhaps a new file descriptor is chosen after each constructor call? Then that might max out at 500? Or something else?

However, by moving the fstream out of the loop and explicitly calling open() and close(), you avoid thrashing the constructor/destructor, which might make some difference - even though it should not.


[Edit] Mind you, I can't reproduce such an error here...
No, it doesn't make any difference. It is actually recommended to create/destroy the fstream in the loop rather than play with closing and reopening one outside the loop. (I'm not sure why, though... unless it just avoids carryover state problems from the previous iteration.)
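
One concrete carryover problem, for what it's worth: once getline() runs a stream to end-of-file, eofbit and failbit are set, and with a pre-C++11 library (as here) neither close() nor open() resets them - so a reused stream silently reads nothing from every later file unless you clear() it first. A sketch, reusing ListOfFiles from above:

std::ifstream in;
for (std::size_t i = 0; i < ListOfFiles.size(); ++i)
{
    in.clear();                      // wipe eofbit/failbit from the previous file
    in.open(ListOfFiles[i].c_str());
    // ... read ...
    in.close();
}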
Duoas wrote:
It is actually recommended to create/destroy the fstream in the loop rather than play with closing and reopening one outside the loop. (I'm not sure why, though... unless it just avoids carryover state problems from the previous iteration.)


Perhaps that's because letting a local go out of scope is less error prone. (Although, I'd bet if you open the same object twice it closes the first one anyway.)
Galik wrote:
I noticed that the error seems to have occurred while calling std::string::_Rep::_S_create() from the standard library.

It looks a bit like something bad happened while trying to either grab or release memory during a std::string operation.


Galik,

You're the best, thanks a million, my friend! I think you're right: my program eats system memory like a beast; it must be tripping up over the allocation.

Thanks to everyone who took the time to look at this! :)