Finding Duplicates in Sets of Data

So, are you ready to compare the memory footprint of your entire OS with the memory footprint taken by the JVM plus my OS?
Okay, but you'll have to use your own system, because I have no idea what sort of configuration you've got there. Let me know if you get something really surprising.

Comparing the size of the heap of the C++ app with the runtime part of the Java app is pointless.
I don't see how. The point is to know exactly how much memory is needed to run the program.
closed account (EzwRko23)

The point is to know exactly how much memory is needed to run the program.


Then you are comparing not languages or their memory management models but OSes.
This is pointless, because every OS requires some minimal amount of RAM to load. Some very little, like the JVMs in mobile phones; some very much, like Vista (which at startup occupies about 900 MB of RAM on my laptop - I've got no idea where it all goes). OSes also differ greatly in capabilities - some are small and do little, some are large and have plenty of functions. The JVM adds quite a lot of capabilities not easily available in the C++ world, like serialization, reflection, or code hot-swap support. In times when you can get a GB of RAM for less than $20, saying "whoa, your app requires an OS that has 100 MB higher RAM requirements than mine" is just plain stupid.

Much more interesting is comparing how the memory requirements grow with the size of the problem. And in this case, properly tuned Java is not much heavier than C++ (and in some extreme cases can be lighter: memory fragmentation is your enemy). At least the difference is a fair price for getting memory management safety, and not the >10x found in some flawed benchmarks (like the Computer Language Shootout game).
Last edited on
Well, it appears that Scala is indeed faster on a parallel implementation than single-threaded C++.

What about using parallel C/C++?

http://msdn.microsoft.com/en-us/library/dd728073.aspx

I have never actually used a parallel_for, but at an uneducated glance it appears that the syntax is only slightly nastier than the Scala version.

However, given that C/C++ is (in)famous for its nasty syntax - I think many C++ programmers would agree here - it appears that C/C++ has viable parallel solutions as well.
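For illustration, here is a minimal parallel_for sketch (assuming MSVC 2010's <ppl.h> from the MSDN link above; the doubling loop is just a made-up body):

#include <ppl.h>
#include <vector>

int main() {
    std::vector<int> v(1000, 1);

    // parallel_for splits the index range [0, v.size()) across worker threads
    Concurrency::parallel_for(size_t(0), v.size(), [&](size_t i) {
        v[i] *= 2; // each iteration writes a distinct element, so there is no race
    });
}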
Last edited on
closed account (EzwRko23)
All right, but you forgot that this was **not** a parallel Scala implementation. The Scala code I posted is perfectly single-threaded. Hmm, anyway, that would be a very interesting test. If I have some spare time, I'll make a parallel implementation.
Last edited on
Then you are comparing not languages or their memory management models but OSes.
Not really. Not if both runs use the same OS, which is why I said you should try doing that on your own system.

In times when you can get a GB of RAM for less than $20, saying "whoa, your app requires an OS that has even 100 MB higher RAM requirements than mine" is just plain stupid.
I'll tell you what's plain stupid: uttering that phrase in an optimization thread.
closed account (EzwRko23)
Not if both runs use the same OS


But how? The JVM **IS** an OS, in much the same way that Vista is an OS for running C++ apps.
So Vista + JVM is a different OS than plain Vista. You pay for more functionality. OK, you may ask how much you pay for running this app on Java, compared with the case where the JVM is not needed.

So here is the measurement taken from the Windows Task Manager when I run the app for a 10 million element array: 59 MB

Strange? No, the JVM probably got cached by the OS, and because I have several other Java apps running, the only increase in RAM comes from the heap data. Anyway, a plain SE JVM, run for the first time, does not require more than about 10 MB of RAM.
Last edited on
JVM **IS** an OS
(ASCII art of a huge incredulous face)

No, it isn't.

An operating system (OS) is a set of system software running on a computer that manages the system hardware and that provides a hardware abstraction platform and common system services for the execution of application software.
A run-time system (also called runtime system or just runtime) is a collection of software designed to support the execution of computer programs written in some computer language.

Venn diagram: ( runtime ( operating system ) )
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <ppl.h>
#include <algorithm>
#include <map>

template<class T>
void findDuplicates_r0mai(const std::vector<T>& v) {

    std::map<T, unsigned> counter;

    for(unsigned i = 0; i < v.size(); ++i) {
        ++counter[v[i]];
    }

    std::for_each(counter.begin(), counter.end(), [](const std::pair<const T, unsigned>& p) {
        std::cout << p.first << ' ' << p.second << '\n';
    });

}

template<class T>
void findDuplicates_r0mai_parallel(const std::vector<T>& v) {

    std::map<T, unsigned> counter;
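    // note: the parallel loop below makes every worker thread update this one shared map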

    Concurrency::parallel_for(unsigned(0), unsigned(v.size()), [&](unsigned i) {
        ++counter[v[i]];
    });
    
    std::for_each(counter.begin(), counter.end(), [](const std::pair<const T, unsigned>& p) {
        std::cout << p.first << ' ' << p.second << '\n';
    });
}

int main() {
    srand(time(0));

    const int SIZE = 100000;

    std::vector<int> v(SIZE);
    
    for(unsigned i = 0; i < SIZE; ++i) {
        v[i] = rand() % 100;
    }

    clock_t start = clock();
    findDuplicates_r0mai_parallel(v);
    clock_t parallel_time = clock()-start;

    std::cout << "-------------------------\n";

    start = clock();
    findDuplicates_r0mai(v);
    clock_t non_parallel_time =  clock()-start;

    std::cout << "PARALLEL: " << parallel_time << std::endl;
    std::cout << "NON PARALLEL: " << non_parallel_time << std::endl;

    std::cin.ignore();
}


PARALLEL: 315
NON_PARALLEL: 421


Is it safe, though? Multiple threads are modifying the std::map at the same time. MSDN didn't say anything about it, so I'm not sure.
Ugh. That time sucks.
But no, it's not safe AFAIK. Actually, I'm surprised it didn't crash. Without synchronization I would have expected concurrent increments to clobber each other. Maybe Concurrency::parallel_for() adds synchronization automatically, although I highly doubt it.
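The classic lost-update interleaving, for reference: ++counter compiles to a read-modify-write, so two threads can read the same old value and one increment disappears. A tiny made-up demo of that effect (it uses C++11 std::thread rather than PPL, purely for brevity; the names counter and bump are invented):

#include <iostream>
#include <thread>

int counter = 0; // deliberately unsynchronized

void bump() {
    for (int i = 0; i < 1000000; ++i)
        ++counter; // read-modify-write: increments from the other thread can be lost
}

int main() {
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    std::cout << counter << '\n'; // typically prints less than 2000000
}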
Last edited on
Commenting out the outputs gives:
PARALLEL: 244
NON_PARALLEL: 365

I've run it several times now, and it never crashed and always gave correct results.
closed account (EzwRko23)
It's not safe. You need to lock the map while adding to it. std::map is not thread-safe.
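One safe alternative, as a rough sketch (assuming MSVC's Concurrency::combinable from <ppl.h>; the function name is made up and this is untested): give every thread its own private map, then merge them on a single thread at the end.

#include <algorithm>
#include <iostream>
#include <map>
#include <ppl.h>
#include <vector>

template<class T>
void findDuplicates_combinable(const std::vector<T>& v) {
    // one private map per worker thread, so no locking is needed while counting
    Concurrency::combinable<std::map<T, unsigned> > partial;

    Concurrency::parallel_for(size_t(0), v.size(), [&](size_t i) {
        ++partial.local()[v[i]]; // touches only the calling thread's own map
    });

    // merge the per-thread maps sequentially
    std::map<T, unsigned> counter;
    partial.combine_each([&](const std::map<T, unsigned>& m) {
        for (typename std::map<T, unsigned>::const_iterator it = m.begin();
             it != m.end(); ++it)
            counter[it->first] += it->second;
    });

    std::for_each(counter.begin(), counter.end(),
        [](const std::pair<const T, unsigned>& p) {
            std::cout << p.first << ' ' << p.second << '\n';
        });
}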

On the topic of whether the JVM is an OS or not - name it as you wish, but the JVM has many features of an operating system:
- security control
- code loading / unloading
- memory management
- concurrency control
- monitoring capabilities
- common public API and ABI

With some enhancements and additions (like a filesystem and device drivers), the JVM could be run on bare metal, without an underlying OS. There exist such research operating systems: http://en.wikipedia.org/wiki/SavaJe


This is interesting. I am getting rather different results to mcleano. On my system, m4ster_r0shi is performing rather better than r0mai. That is even true of the non-destructive version, which performs the same as the destructive version. The copy must be very fast.

It seems that r0mai's code works very well for a small integer value range. mcleano's test only used values between 0 and 99. I have broadened the integer value range to make it as large as the test sample, so I am testing 1,000,000 values between 0 and 999,999. Then the std::map seems to feel the strain.
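As an aside, for the small-range case there is an even simpler option - a sketch with a made-up function name: when the values are known to lie in [0, range), a flat array of counters replaces the std::map entirely.

#include <iostream>
#include <vector>

void findDuplicates_array(const std::vector<int>& v, int range) {
    std::vector<unsigned> counter(range, 0);

    for (size_t i = 0; i < v.size(); ++i)
        ++counter[v[i]]; // O(1) direct indexing; the whole table fits in cache

    for (int value = 0; value < range; ++value)
        if (counter[value] > 0)
            std::cout << value << " has " << counter[value] << "\n";
}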

Here is the test code:

#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <algorithm>
#include <map>

template<class T>
void findDuplicates_r0mai(const std::vector<T>& v)
{

	std::map<T, unsigned> counter;
	typedef typename std::map<T, unsigned>::iterator iter;

	for(unsigned i = 0; i < v.size(); ++i)
	{
		++counter[v[i]];
	}

	for(iter i = counter.begin(); i != counter.end(); ++i)
	{
		std::cout << i->first << " has " << i->second << "\n";
	}

}

/**
 * Non-Destructive version
 */
template <typename T>
void findDuplicates_m4ster_r0shi_nd(std::vector<T> values)
{
	typedef typename std::vector<T>::iterator iter;

	std::sort(values.begin(), values.end());
	iter i = values.begin();
	iter end = values.end();
	while(i != end)
	{
		iter it = std::upper_bound(i, end, *i);
		std::cout << *i << " has " << it-i << "\n";
		i=it;
	}
}

/**
 * Destructive version
 */
template <typename T>
void findDuplicates_m4ster_r0shi(std::vector<T>& values)
{
	typedef typename std::vector<T>::iterator iter;

	std::sort(values.begin(), values.end());
	iter i = values.begin();
	iter end = values.end();
	while(i != end)
	{
		iter it = std::upper_bound(i, end, *i);
		std::cout << *i << " has " << it-i << "\n";
		i=it;
	}
}

int main()
{
	srand(time(0));

	const size_t SIZE = 1000000;

	std::vector<int> v(SIZE);

	for(size_t i = 0; i < SIZE; ++i)
	{
		v[i] = rand() % SIZE;
	}

	std::vector<int> test_v(SIZE);

	clock_t start = clock();
	findDuplicates_r0mai(v);
	double r0mai = ((double)clock() - start) / CLOCKS_PER_SEC;

	start = clock();
	findDuplicates_m4ster_r0shi_nd(v);
	double m4ster_r0shi_nd = ((double)clock() - start) / CLOCKS_PER_SEC;

	start = clock();
	findDuplicates_m4ster_r0shi(v);
	double m4ster_r0shi = ((double)clock() - start) / CLOCKS_PER_SEC;

	std::cerr << "r0mai          : " << r0mai << std::endl;
	std::cerr << "m4ster_r0shi_nd: " << m4ster_r0shi << std::endl;
	std::cerr << "m4ster_r0shi   : " << m4ster_r0shi << std::endl;
}
-O0 - no optimisation
r0mai          : 2.6
m4ster_r0shi_nd: 1.64
m4ster_r0shi   : 1.64

-O3 - max optimisation
r0mai          : 1.91
m4ster_r0shi_nd: 0.55
m4ster_r0shi   : 0.55


EDIT: I should note that I ran this by redirecting the output to /dev/null to minimise the I/O component:

./test > /dev/null

EDIT 2: I just added optimisation times. It seems that m4ster_r0shi's method benefits a lot more from optimisation.
Last edited on
- security control
- concurrency control
- monitoring capabilities
- common public API and ABI
None of those features are unique to OSes.

- code loading / unloading
- memory management
These, however, are required of any runtime. What exactly can I do that doesn't involve loading code and allocating memory?

With some enhancements and additions (like filesystem and device drivers) JVM could be run on a bare metal, without underlying OS.
Name one interpreter that couldn't conceivably be transformed into an OS.

There exist such research operating systems: http://en.wikipedia.org/wiki/SavaJe
I always get a chuckle out of such projects. What kind of masochist would submit themselves to that kind of pain?
Is it safe, though?


Well, that is not a problem for me, since I use my own implementation of a vector/hashed list.

In my implementation, if you do not resize a vector/hashed list during execution, it is thread-safe. To ensure that, you only need to reserve enough memory in advance, which is no problem for the task at hand.

As for the Scala implementation being faster than C++ on a single thread: I definitely believe you. Is this an issue with the implementation of the vector class (which is still not a fundamental difference)?

Again, as I use my own alternative to the vector class, this wouldn't be a problem for me. Perhaps the real performance issue is the way the code gets compiled down to machine code. Is that more effective in Scala?
Last edited on
This appears to be an issue with the implementation of the vector class (which is still not a fundamental difference).
About that, interesting thought: using std::vector::operator[]() turned out to be slightly faster (~20 ms for every 10M elements) than iterating a pointer, and using an std::map<const T *,unsigned> was over 5 times slower for T==int. It seems pointer comparisons are really slow.
I just figured that to make my equivalent of the vector class thread-safe, all I need to do is to add a mutex variable to my "vector".

Then my vector resizing looks like:
this->theMutex.lock();
this->old_implementation_of_resize(newSize);
this->theMutex.unlock();


Just a note that the theMutex object should be an instance of a wrapper class for mutexes that takes care of all the #ifdefs, depending on the OS you work on.

Since everybody knows what the functionality of a mutex is, this code is as portable as it gets.

Note: I still haven't been able to write a working mutex on Windows XP. Linux poses no trouble, though.
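For what it's worth, a minimal sketch of such a wrapper (a hypothetical class, shown only to illustrate the #ifdef idea: Win32 critical sections on Windows, pthreads elsewhere):

#ifdef _WIN32
    #include <windows.h>
#else
    #include <pthread.h>
#endif

class Mutex {
public:
#ifdef _WIN32
    Mutex()  { InitializeCriticalSection(&cs_); }
    ~Mutex() { DeleteCriticalSection(&cs_); }
    void lock()   { EnterCriticalSection(&cs_); }
    void unlock() { LeaveCriticalSection(&cs_); }
private:
    CRITICAL_SECTION cs_;
#else
    Mutex()  { pthread_mutex_init(&m_, 0); }
    ~Mutex() { pthread_mutex_destroy(&m_); }
    void lock()   { pthread_mutex_lock(&m_); }
    void unlock() { pthread_mutex_unlock(&m_); }
private:
    pthread_mutex_t m_;
#endif
    // non-copyable
    Mutex(const Mutex&);
    Mutex& operator=(const Mutex&);
};

A side note on the resize snippet above: if old_implementation_of_resize() can throw, the unlock() call is skipped, so a small RAII guard that unlocks in its destructor is safer than calling lock()/unlock() by hand.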
closed account (EzwRko23)

As for the Scala implementation being faster than C++ on a single thread: I definitely believe you. Is this an issue with the implementation of the vector class (which is still not a fundamental difference)?


I think this is mainly due to the fact that std::map is implemented differently from scala.collection.mutable.Map. I don't claim outperforming Scala in C++ is not possible - indeed it is, but you have to:
1. really know what you are doing
2. write lots of hand-tuned code, just for the purpose of this benchmark

On average, when using the language "normally" and writing simple and readable code, they are in the same league performance-wise.


Again, as I use my own alternative to the vector class, this wouldn't be a problem for me. Perhaps the real performance issue is the way the code gets compiled down to machine code. Is that more effective in Scala?


This is just a different compiler, with a different set of optimizations. Sometimes it will produce faster code, and sometimes slower, but the differences are not as huge as they were 5 or 10 years ago. With every new version of Java, the HotSpot compiler gets smarter and smarter.
closed account (z05DSL3A)
Jesus, you think he's trolling!
closed account (EzwRko23)
@Galik:

Of course the method with sorting will be faster when the range of randomized numbers is large.
This is caused by memory access locality - quicksort is a very cache-friendly algorithm. In r0mai's solution (and mine, too), every element is added to or updated in the map, which places it at an effectively random location. If the map is small, it fits in the cache and access is extremely fast. If it does not, there are lots of cache misses, and a cache miss may cost several tens or hundreds of CPU cycles.
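To see roughly what this costs, here is a small made-up micro-benchmark sketch: it sums the same array once in order and once through a shuffled index table (the shuffled run pays for the cache misses, plus the cost of reading the index table itself; std::random_shuffle is pre-C++17).

#include <algorithm>
#include <ctime>
#include <iostream>
#include <vector>

int main() {
    const size_t N = 10000000;
    std::vector<int> data(N, 1);

    std::vector<size_t> idx(N);
    for (size_t i = 0; i < N; ++i) idx[i] = i;
    std::random_shuffle(idx.begin(), idx.end()); // random access pattern

    long long sum = 0;
    clock_t start = clock();
    for (size_t i = 0; i < N; ++i) sum += data[i]; // sequential: prefetch-friendly
    std::cout << "sequential: " << clock() - start << " (" << sum << ")\n";

    sum = 0;
    start = clock();
    for (size_t i = 0; i < N; ++i) sum += data[idx[i]]; // random: mostly cache misses
    std::cout << "shuffled  : " << clock() - start << " (" << sum << ")\n";
}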
Last edited on