Finding Duplicates in sets of data

@xorebxebx

I think the main reason our solution is slow is that as the range of the numbers grows, the underlying tree in the map gets deeper and deeper, so searching for a key takes more time. Cache misses can cause slowdowns too, though.
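For reference, the kind of map-based duplicate counter we've been timing looks roughly like this (a minimal sketch with illustrative names, not the actual test program):

#include <map>
#include <vector>
#include <cstddef>

// Sketch only: count how many distinct values occur more than once.
// Each operator[] walks the map's balanced tree, so one lookup costs
// O(log d), where d is the number of distinct values seen so far.
int count_duplicates(const std::vector<int>& v)
{
    std::map<int, int> freq;
    for (std::size_t i = 0; i < v.size(); ++i)
        ++freq[v[i]];                        // tree lookup + possible insert

    int dups = 0;
    for (std::map<int, int>::const_iterator it = freq.begin();
         it != freq.end(); ++it)
        if (it->second > 1)
            ++dups;
    return dups;
}
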
I made a goof. I posted the same timings for the non-destructive test as for the destructive test. The non-destructive test is slightly slower.

I redid the test program slightly to graph number of elements against time:

http://imagebin.org/108999
Weird. m4ster r0shi's complexity seems to be slightly below O(n). Or maybe it's an optical illusion provoked by r0mai's being slightly above O(n).
One thing to remember is that as the size of the vector increases, the range of the numbers that it contains increases also. So this broadening in two dimensions may be working against a logarithmic result.
I'm guessing my lack of a degree is why I'm struggling to follow the recent comments at this point.
Obviously. To find duplicates, the algorithm needs to do two things: traverse the vector once, and access the map once for each element. Traversal is O(n) and each map access is O(log n), so that's O(n log n) on average.
m4ster r0shi's algorithm does a quicksort, which should be O(n log n). It then iterates through the vector, which should be linear. But as the algorithm iterates through the vector it makes a 'pit stop' every time the stored number changes. If the range of numbers is small then the number of 'pit stops' should be small, but as the range of numbers increases so will the number of 'pit stops'. I think that will also be linear. So I *think* we may have something like an O(n log n + n²) thing going on. But my math is a bit rusty.
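In code, the sort-and-scan approach I'm describing looks roughly like this (just a sketch, not m4ster r0shi's actual code):

#include <algorithm>
#include <vector>
#include <cstddef>

// Sketch only: sort first (O(n log n)), then walk the vector once and
// take a "pit stop" at the end of every run of equal values.
int count_duplicates_sorted(std::vector<int> v)
{
    std::sort(v.begin(), v.end());

    int dups = 0;
    std::size_t i = 0;
    while (i < v.size())
    {
        std::size_t run_start = i;
        while (i < v.size() && v[i] == v[run_start])
            ++i;                             // scan one run of equal values
        if (i - run_start > 1)
            ++dups;                          // this value occurred more than once
    }
    return dups;
}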

EDIT: Having said that, I just re-ran it with the vector pre-sorted before calling the algorithm. That should remove the sort's O(n log n) factor. It looks strongly linear:

http://imagebin.org/109015

So I guess my math isn't up to scratch...
Ok, I think I'm making some progress here... I made something that counts 16-bit integers. There are two versions: one that works exactly like the one that counts 8-bit integers (fast), and one that is slightly different (slow, but not as memory-consuming). The second one actually works like this:

It takes the 16bit integers given:

aa 12
bc 3d
ff a1
aa 11
ff 03
bc 2f

and stores them like this:

aa -> 12, 11
bc -> 3d, 2f
ff -> a1, 03

Then, instead of using a counter of 65536 ints, it uses a counter of 256 ints to count each subset separately. Of course it doesn't make much sense to do this here, as an array of 65536 ints is not considered expensive, but if I wanted to sort 32-bit ints then having an array of 4294967296 ints as a counter is way too much, so breaking it into smaller pieces is necessary, and that's what I demonstrate here. I guess the next step would be writing something generic enough to work with any (raw) data type as long as you provide the number of bytes as an argument (see the sketch after the code below). I believe this will take some time...

#include <iostream>

#include <vector>
//#include <deque>

#include <cstdlib>
#include <ctime>
#include <cstring>
using namespace std;

typedef unsigned char BYTE;
typedef unsigned short int WORD;

const int SIZE=1000000;
const bool print_results=false;

// Fast version: one counter slot for every possible 16-bit value.
void count16_fast(vector<WORD> & word_vec)
{
    int counter[65536]={0};

    int start=clock();
    //cout << start << endl;

    for (int i=0; i<SIZE; i++)
        counter[word_vec[i]]++;

    int stop=clock();
    //cout << stop << endl;

    if (print_results)
    {
        for (int i=0; i<65536; i++)
            cout << i << ' ' << counter[i] << endl;
    }

    cout << "fast counting time: " << stop-start << " ms" << endl;
}

// Cheap version: bucket each word by its high byte, then count the low
// bytes of each bucket with a small 256-entry counter.
void count16_cheap(vector<WORD> & word_vec)
{
    int total_time=0;
    int start;
    int stop;

    start=clock();
    //cout << start << endl;

    vector<BYTE> counter[256];
    //deque<BYTE> counter[256];

    WORD cur_word;

    for (int i=0; i<SIZE; i++)
    {
        cur_word=word_vec[i];
        counter[(cur_word>>8)].push_back(cur_word&255);
    }

    int mini_counter[256];

    stop=clock();
    //cout << stop << endl;

    total_time+=stop-start;

    for (int i=0; i<256; i++)
    {
        start=clock();
        //cout << start << endl;

        memset(mini_counter,0,256*sizeof(int));

        vector<BYTE> & cur_vec=counter[i];
        //deque<BYTE> & cur_vec=counter[i];

        for (int j=0; j<cur_vec.size(); j++)
        {
            mini_counter[cur_vec[j]]++;
        }

        stop=clock();
        //cout << stop << endl;

        total_time+=stop-start;

        if (print_results)
        {
            for (int j=0; j<256; j++)
            {
                cur_word=(i<<8)+j;
                cout << cur_word << ' '
                    << mini_counter[j] << endl;
            }
        }
    }

    cout << "cheap counting time: " << total_time << " ms" << endl;
}

//my RAND_MAX is ~32K so if I want uniform distribution
//of WORDs I have to do tricks like this one...

const int MY_RAND_MAX=32767;

class WORD_RNG
{
private:

    static int sub_seeds[2];
    static int main_seed;

public:

    static void init()
    {
        memset(sub_seeds,0,2*sizeof(int));
        main_seed=0;
    }

    static int draw()
    {
        srand(main_seed++);

        int cur_seq=rand()%2;

        srand(sub_seeds[cur_seq]++);

        return rand()%(MY_RAND_MAX+1)+cur_seq*(MY_RAND_MAX+1);
    }

    static void seed(int s)
    {
        main_seed=s%MY_RAND_MAX;
    }
};

int WORD_RNG::sub_seeds[2];
int WORD_RNG::main_seed;

int main()
{
    WORD_RNG::init();
    WORD_RNG::seed(time(0));

    vector<WORD> word_vec;

    word_vec.resize(SIZE);
    for (int i=0; i<SIZE; i++)
    {
        word_vec[i]=WORD_RNG::draw();
    }

    count16_fast(word_vec);

    cout << "hit enter to continue..." << endl;
    cin.get();

    count16_cheap(word_vec);

    cout << "hit enter to quit..." << endl;
    cin.get();

    return 0;
}
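
The generic version might end up looking something like the sketch below: split on the most significant remaining byte and recurse until a single byte is left, then count that with a 256-entry array. (This is only a rough illustration with made-up names, not the final code.)

#include <iostream>
#include <vector>
#include <cstddef>

// Sketch only: generalizes the "split on the high byte, then count the
// rest" idea to values that are num_bytes wide. The interface and names
// are guesses at what a generic version might look like.
void count_bytes(const std::vector<unsigned long>& values,
                 int num_bytes, unsigned long prefix = 0)
{
    if (num_bytes == 1)
    {
        // base case: a plain 256-entry counter for the last byte
        int counter[256] = {0};
        for (std::size_t i = 0; i < values.size(); ++i)
            ++counter[values[i] & 255];
        for (int b = 0; b < 256; ++b)
            if (counter[b] > 0)
                std::cout << ((prefix << 8) | b) << ' ' << counter[b] << '\n';
        return;
    }

    // split on the most significant remaining byte, keep only the lower bytes
    int shift = 8 * (num_bytes - 1);
    std::vector<unsigned long> bucket[256];
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        unsigned long v = values[i];
        bucket[(v >> shift) & 255].push_back(v & ((1UL << shift) - 1));
    }

    for (int b = 0; b < 256; ++b)
        if (!bucket[b].empty())
            count_bytes(bucket[b], num_bytes - 1, (prefix << 8) | b);
}

For the 16-bit case above this reduces to exactly the high-byte/low-byte split.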

closed account (EzwRko23)
Ok, have you ever heard of hashing? I presume that if you think about this problem for another day or two, you'll reinvent a hashmap. You're on the right track. ;)

Actually, the optimal way to do this grouping is to hash into n buckets, with n chosen so that the whole hashmap fits in the L1 data cache, so probably n < 1000 is a good choice. Then hash each bucket again, this time with counting. Such a two-phase algorithm could group over a million distinct elements of any type with almost sequential access to RAM and most of the work done on data in L1.
If more elements are needed, add another phase; with three-phase grouping, you could handle over a billion elements.

I'll post code for this soon.
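
In the meantime, here is a rough sketch of the two-phase idea. The bucket count, the hash, and the types are arbitrary choices for illustration, and the per-bucket counting is done by sorting each (small) bucket just to keep the sketch short; the real thing would hash again inside each bucket, with counting.

#include <algorithm>
#include <vector>
#include <cstddef>

// Sketch only. Phase 1 scatters values into a small number of buckets so
// that later passes touch memory bucket by bucket. Phase 2 finds the
// duplicates inside each (small) bucket; here that is done by sorting the
// bucket, just to keep the sketch short.
const int NUM_BUCKETS = 512;                 // assumption: small enough to stay cache-friendly

int count_duplicates_two_phase(const std::vector<unsigned int>& values)
{
    std::vector<unsigned int> bucket[NUM_BUCKETS];

    // phase 1: distribute values by a cheap hash
    for (std::size_t i = 0; i < values.size(); ++i)
        bucket[values[i] % NUM_BUCKETS].push_back(values[i]);

    // phase 2: handle each bucket separately
    int dups = 0;
    for (int b = 0; b < NUM_BUCKETS; ++b)
    {
        std::vector<unsigned int>& v = bucket[b];
        std::sort(v.begin(), v.end());
        for (std::size_t i = 1; i < v.size(); ++i)
            if (v[i] == v[i - 1] && (i < 2 || v[i] != v[i - 2]))
                ++dups;                      // count each duplicated value once
    }
    return dups;
}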


I think the main reason our solution is slow is that as the range of the numbers grows, the underlying tree in the map gets deeper and deeper, so searching for a key takes more time.

This is the cause of the slowdown in the version with std::map. But my version uses a hashmap, and hashmap lookups are O(1), so the whole algorithm would be linear if memory had the same speed regardless of how much of it is used.

The slowdown in my version is caused by:
1. Resizing the hashmap (rehashing). This is costly, and it need not be done in a tree map. By giving a size hint we could avoid it (see the sketch after this list), but you must know the approximate number of distinct items in advance, and that is not always the case. Anyway, it is just a constant factor.
2. Poor memory locality because of the random data in the array. Small hashmaps are fast because they fit in the L1 cache; once you cross the L2 cache size, every lookup means a cache miss, which costs many, many CPU cycles spent just waiting for memory.
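
To make point 1 concrete, the size hint could look like this (std::unordered_map is used here just for illustration, and guessing the distinct-item count as the input size is an assumption):

#include <unordered_map>
#include <vector>
#include <cstddef>

// Sketch of the size hint from point 1: reserving buckets up front means
// no rehashing happens during the counting loop. Guessing the number of
// distinct items as values.size() is an assumption for illustration.
int count_duplicates_hashed(const std::vector<int>& values)
{
    std::unordered_map<int, int> freq;
    freq.reserve(values.size());             // size hint: avoids rehashing while inserting

    for (std::size_t i = 0; i < values.size(); ++i)
        ++freq[values[i]];                   // average O(1) per lookup

    int dups = 0;
    for (std::unordered_map<int, int>::const_iterator it = freq.begin();
         it != freq.end(); ++it)
        if (it->second > 1)
            ++dups;
    return dups;
}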
