refining data structures

I'm relativlely inexperienced with C/C++, just a bit of dabbling now and then.

I've written a program that helps me deal with data duplication in a large storage system. It works well except that it does slow down the more data you put through it. I can see why the slowdown happens: it runs through a linked list for each block of data before adding to the list. Consequently the more data that it looks at, the longer it takes to go through the list each time.

Can anyone suggest a better data structure for this purpose? Here is a brief description of my current structure.

A Single list of hash nodes
A hash node contains a hash, byte count and another linked list of filenames

The second linked list (filenames) is not the problem (I believe) as it is only accessed when certain conditions are met and doesn't occur that often.

Any tips for dealing with this information better would be appreciated. If I need to supply any more information, please let me know.

Thanks
Let me see if I understand what you're trying to do: you have a bunch of data blocks, and you want to know if there are exact duplicates, right?
Here's what you can do: hash the data blocks and add them to an std::set. You'll probably need to provide your own comparison function. As for hashing algorithms, there's many. MD5 and SHA-1 are good and easy to find. In fact, here's an implementation of SHA-1: http://www.packetizer.com/security/sha1/
Last edited on
Thanks for the reply, yes thats pretty much what I'm doing. I've got the hashing and comparison all sorted, the only thing is to try and refactor the method I'm using to store the hashes.

As an example, the program can process data at around 10MB/s, however by the time it gets past around 200MB, it's only doing 2-3MB/s.

Basically, when a hash is generated, I iterate through the linked list to see if it has already been recorded. If it hasn't then I add it, if it has then I associate the filename with the hash.

Would using std::set be faster than scanning through a linked list? I don't have much experience with many of the std library types/methods.

Thanks
Yes. std::set can verify whether an element exists in O(log n), as opposed to the O(n) of your current method. So for example, it would take 23 comparisons to verify against 10,000,000 elements.
Last edited on
Excellent, I'll give that a go. Thanks for the help

Topic archived. No new replies allowed.