MultiMap removing duplicates

Pages: 1 234 5 6

Hmmm then can you post your full code?

Thanks a TONN again. It worked :D

std::vector<MyPred>* vPred = new std::vector<MyPred>;

vPred->push_back(MyPred("a", "b"));

// ... stuff ...

delete vPred;

Galik (2254)

Ah, its probably because I forgot to mention your constructor needs to operate as a default constructor. You need to make a slight change:

struct MyPred
{
	std::string x;
	std::string y;

	// add default values
	MyPred(const std::string& x = "", const std::string& y = ""): x(x), y(y) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y;
	}

	bool operator<(const MyPred& p) const
	{
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		return false;
	}
};

heidiK (109)

it worked :D.

This code is for finding duplicates

for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred->end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

and once finding the difference between a and z, I can push_back the THEN unique element into the original vector like this:

for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred.end(), *i);
		if(ret.second - ret.first != 1)
		{
			for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
			{
				// overwrite original vector vPred with unique values
				*u++ = *i;
			}
		}
		else
		{
			uPred.push_back(*i);
		}

is this what you meant?

Galik (2254)

You will need slightly different code to iterate over your vector containing the duplicates but use that technique to overwrite the original vector to save space.

To convert a std::string to an integer you can use this:

#include <sstream>

// ... stuff ...

std::istringstream iss(my_string);
int v;
if(iss >> v)
{
    // conversion was a success and v is a valid integer.
}

heidiK (109)

AAAAAAAAAAAAAAAAAAA WHAT IS THIS :-?, scaryyyy

#include <sstream>

// ... stuff ...

std::istringstream iss(my_string);
int v;
if(iss >> v)
{
    // conversion was a success and v is a valid integer.
}

which country are you from? I live in London and it is 9pm now. I will compile the whole code tomorrow morning and will show it to you before I go to office on Monday. You have been an excellent teacher :D and I am not a very good student :P

Galik (2254)

A std::istringstream is an input stream for a string. It works just like inputting from a file on disk, except that you are reading from a string rather than a file. So you can read different data types from the stream:
http://www.cplusplus.com/reference/iostream/istream/operator%3E%3E/

heidiK (109)

Thank youuuuu :)

heidiK (109)

Good Morning Galik.

I tried the code you sent me yesterday with 60000 values and around 20 duplicates and it took a loooooottttttt of time ;'(. Here is the code.

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
#include <sstream>

struct MyPred
{
	std::string x;
	std::string y;

	// add default values
	MyPred(const std::string& x = "", const std::string& y = ""): x(x), y(y) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y;
	}

	bool operator<(const MyPred& p) const
	{
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred>* vPred = new std::vector<MyPred>;
	MyPred pred;

	for(int i = 0; i < 60000; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	for(int i = 10; i < 25; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	for(int i = 35000; i < 35005; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	/*vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("a", "b"));
	vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("b", "a"));
	vPred->push_back(MyPred("a", "b"));
	vPred->push_back(MyPred("a", "c"));
	vPred->push_back(MyPred("b", "b"));
	vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("a", "a"));*/

	// The values need to be in order for equal_range() to work
	std::sort(vPred->begin(), vPred->end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred> dPred; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred->end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	/*for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}*/

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	/*std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}*/

	delete vPred;
}

could you please have a look at it? It is I think taking too much time sorting the vector and then finding/displaying the duplicates.

Galik (2254)

The code you presented here runs very quickly indeed for me. But then you have given it test data that is almost completely sorted. I put in some timing

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
#include <sstream>
#include <ctime>
#include <iomanip>

struct MyPred
{
	std::string x;
	std::string y;

	// add default values
	MyPred(const std::string& x = "", const std::string& y = ""): x(x), y(y) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y;
	}

	bool operator<(const MyPred& p) const
	{
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred>* vPred = new std::vector<MyPred>;
	MyPred pred;

	std::cout.precision(2);
	std::cout << std::fixed;

	std::cout << "Initialising data: ";
	clock_t begin = clock();

	for(int i = 0; i < 60000; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	for(int i = 10; i < 25; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	for(int i = 35000; i < 35005; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	clock_t end = clock();
	std::cout << double(end - begin)/CLOCKS_PER_SEC << "sec for " << vPred->size() << " items\n";

	/*vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("a", "b"));
	vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("b", "a"));
	vPred->push_back(MyPred("a", "b"));
	vPred->push_back(MyPred("a", "c"));
	vPred->push_back(MyPred("b", "b"));
	vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("a", "a"));*/

	std::cout << "Sorting data     : ";
	begin = clock();

	// The values need to be in order for equal_range() to work
	std::sort(vPred->begin(), vPred->end());

	end = clock();
	std::cout << double(end - begin)/CLOCKS_PER_SEC << "sec for " << vPred->size() << " items\n";

	std::cout << "Searching data   : ";
	begin = clock();

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred> dPred; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred->end(), *i);

		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	end = clock();
	std::cout << double(end - begin)/CLOCKS_PER_SEC << "sec\n";

	std::cout << "Printing data    :\n";
	begin = clock();


	std::cout << "vPred: Sorted input\n";
	/*for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}*/

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	end = clock();
	std::cout << '\n' << double(end - begin)/CLOCKS_PER_SEC << "sec\n";
	/*std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}*/

	delete vPred;
}

Initialising data: 0.33sec for 60020 items
Sorting data     : 0.10sec for 60020 items
Searching data   : 0.07sec
Printing data    :
vPred: Sorted input
dPred: Only the values that were duplicated
[10, 11]
[10, 11]
[11, 12]
[11, 12]
[12, 13]
[12, 13]
[13, 14]
[13, 14]
[14, 15]
[14, 15]
[15, 16]
[15, 16]
[16, 17]
[16, 17]
[17, 18]
[17, 18]
[18, 19]
[18, 19]
[19, 20]
[19, 20]
[20, 21]
[20, 21]
[21, 22]
[21, 22]
[22, 23]
[22, 23]
[23, 24]
[23, 24]
[24, 25]
[24, 25]
[35000, 35001]
[35000, 35001]
[35001, 35002]
[35001, 35002]
[35002, 35003]
[35002, 35003]
[35003, 35004]
[35003, 35004]
[35004, 35005]
[35004, 35005]

0.00sec

I'll try it again soon with some random data. That should slow it down a bit.

heidiK (109)

mine is still running looking for duplicates :D. It has been more than 15-20 minits :P

Also I tried with different strings, not just single characters and the duplicate output was not correct.

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>

struct MyPred
{
	std::string a;
	std::string x;
	std::string y;
	std::string z;

	MyPred(const std::string& a, const std::string& x, const std::string& y, const std::string& z): a(a), x(x), y(y), z(z) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y && z == p.z; // a == p.a && 
	}

	bool operator<(const MyPred& p) const
	{
		//if(a < p.a) return true;
		//if(a > p.a) return false;
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		if(z < p.z) return true;
		if(z > p.z) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred>* vPred = new std::vector<MyPred>;
	vPred->push_back(MyPred("a2c", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("j4h", "b", "c", "j87h"));
	vPred->push_back(MyPred("d4f", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("n7s", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("l9m", "b", "c", "j87h"));
	vPred->push_back(MyPred("p24a", "x", "c", "p43a"));
	vPred->push_back(MyPred("q56r", "l", "m", "q90r"));
	vPred->push_back(MyPred("g11v", "f", "h", "g63v"));
	vPred->push_back(MyPred("u3w", "v", "d", "u11w"));

	// The values need to be in order for equal_range() to work
	std::sort(vPred->begin(), vPred->end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred> dPred; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		/*ret = std::equal_range(i, vPred.end(), *i);
		if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
		else
		{
			dPred.push_back(*i);
		}*/

		ret = std::equal_range(i, vPred->end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	delete vPred;
}


vPred: Sorted input
[a2c, 1Gak, c, d4f]
[d4f, 1Gak, c, d4f]
[n7s, 1Gak, c, d4f]
[j4h, b, c, j87h]
[l9m, b, c, j87h]
[g11v, f, h, g63v]
[q56r, l, m, q90r]
[u3w, v, d, u11w]
[p24a, x, c, p43a]

dPred: Only the values that were duplicated
[a2c, 1Gak, c, d4f]
[a2c, 1Gak, c, d4f] // 'a' should be d4f
[a2c, 1Gak, c, d4f] // 'a' should be n7s
[j4h, b, c, j87h]
[j4h, b, c, j87h] // 'a' should be l9m 

uPred: Only the values that were unique
[g11v, f, h, g63v]
[q56r, l, m, q90r]
[u3w, v, d, u11w]
[p24a, x, c, p43a]

Galik (2254)

I think the duplicate problem is because the original code assumed that duplicates were identical in every field. But, of course, now the a field is not considered in testing for duplication.

I think you can modify line 74:

dPred.push_back(*j); //put each duplicate onto a new vector (using j not i)!

heidiK (109)

Thanks it worked :)

dPred.push_back(*j); //put each duplicate onto a new vector (using j not i)!

Galik (2254)

I don't know what your speed problem is I can't slow it down much even giving it less efficient data.

heidiK (109)

it may be because of my laptop. It may run faster on my office PC. I HOPE SO. Could you please send me the random numbers. I mean that bit of code?

Galik (2254)

I tried various things. This produces strings 8 characters long and with random digits between 0 and 3.

	std::ostringstream oss;
	for(size_t i = 0; i < 60000; i++)
	{
		oss.str(""); // clear buffer
		for(size_t j = 0; j < 8; ++j)
		{
			oss << (rand() % 4); // 0 - 3
		}
		pred.x = oss.str();
		oss.str(""); // clear buffer
		for(size_t j = 0; j < 6; ++j)
		{
			oss << (rand() % 4); // 0 - 3
		}
		pred.y = oss.str();
		vPred->push_back(pred);
	}

Initialising data: 0.25sec for 60000 items
Sorting data     : 0.21sec for 60000 items
Searching data   : 0.23sec
Printing data    :
vPred: Sorted input
dPred: Only the values that were duplicated
[00110203, 020303]
[00110203, 020303]
[01212311, 000301]
[01212311, 000301]
[01303023, 313131]
[01303023, 313131]
[10310321, 233031]
[10310321, 233031]
[12213103, 030233]
[12213103, 030233]
[20132230, 302210]
[20132230, 302210]
[21313200, 021322]
[21313200, 021322]
[22102231, 321002]
[22102231, 321002]
[30111202, 011003]
[30111202, 011003]
[32122110, 223330]
[32122110, 223330]
[33321312, 010130]
[33321312, 010130]

0.00sec

Last edited on

heidiK (109)

You have almost solved the problem for me which I have been trying to solve for almost 2 weeks :P. Could you please do the rest as well :P i-e finding the difference between the integer part of 'a' and 'z' which are essentially the start and end date/time. yyyy/mm/dd hh/mm/ss/. I have to find the difference between minits (of duplicate elements). Then I have to push_back the duplicate with minimum time difference to the original vector so that the original vector now contains unique and only one copy of each element.

heidiK (109)

The string part of struct members should all be capital otherwise it again sorts based on case sensitivity.

Galik (2254)

The sort will always be case sensitive unless you take specific action to make it otherwise.

heidiK wrote:
finding the difference between the integer part of 'a' and 'z' which are essentially the start and end date/time. yyyy/mm/dd hh/mm/ss/.

Well I can point you in the right direction maybe. You need to look in <ctime> you can set the elements of a struct tm to the values for year, month, date, hour, minute and second. Then you can convert that to a time_t type which is the same date recorded in seconds since 1900. Then you can just compare those:

tm date;

date.tm_year = // extracted value - 1900 (dont forget to subtract 1900!!)
date.tm_mon = // extracted value
date.tm_mday = // extracted value
date.tm_hour = // extracted value
date.tm_min = // extracted value
date.tm_sec = // extracted value

time_t time = mktime(&date);

// now time contains the number of seconds between 1900 and the date in your struct tm.

heidiK (109)

Please please pleaseeeee do it for me, pleaseeeeeeeeee. I am just reading from a file with 60000 records (string lines) and date/time is stored as a string like any other string. Pleaseeeeeeeeeeeeeeeeeee.

I am very thankful for everything you have done.

Pages: 1 234 5 6