Are Zig & Rust really 6.5x and 3.7x faster than C++? (sieve of Eratosthenes)

https://www.youtube.com/watch?v=pSvSXBorw4A
At 20:27 you can see the comparative speed/score. I can't believe Zig & Rust are so much faster than C++!

It's a competition using the sieve of Eratosthenes problem: which programming language solves it fastest?

My simple code:
//sieve of Eratosthenes
//edit: fixed code.
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
	auto start = high_resolution_clock::now();
	const int n = 1'000'000; // 1 million
	cout << "n = " << n << "\n";
	static bool ar[n + 1] = {}; // all elements 0.
	int sr = sqrt(n);
	
	for (int i = 2; i <= sr; i++) 
		for (int j = i + i; j <= n; j += i) 
			if(!ar[j]) 
				ar[j] = 1; //not prime			
								
	auto stop = high_resolution_clock::now();
	auto duration = duration_cast<microseconds>(stop- start );
	cout << "Time taken = " << duration.count() << " microseconds\n";
	cout << "avg passes in 1 sec/score : " << 1e6/duration.count();
		
}

Edit: according to Dave (the YouTuber), the programs were tested on a 32-thread computer. I checked the top C++ program on GitHub, and it's a single-threaded program. Can anyone check whether the Rust and Zig programs are single- or multi-threaded? If you don't know Zig/Rust, just look for the multithreading keywords in them.
Edit 2: forgot to mention, notice at 2:10 of the YouTube video that the score is counted by running your program for 5 seconds, and at 20:22 you can see the scores of the top 5 languages. Zig: 10205, Rust: 5857, C: 1600, C++: 1564, Java: 1457.
Last edited on
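For reference, the scoring described in the video (count how many complete sieve passes finish inside a fixed time window) can be sketched like this. `run_passes` and the choice of `std::vector<char>` are illustrative, not the contest code:

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Sketch of the drag-race scoring loop: run complete sieve passes on a
// fresh buffer until the window elapses, and report how many finished.
unsigned run_passes(unsigned n, std::chrono::milliseconds window) {
    unsigned passes = 0;
    const auto deadline = std::chrono::steady_clock::now() + window;
    const auto sr = static_cast<unsigned>(std::sqrt(n));
    while (std::chrono::steady_clock::now() < deadline) {
        std::vector<char> composite(n + 1, 0);  // fresh buffer each pass
        for (unsigned i = 2; i <= sr; ++i)
            if (!composite[i])
                for (unsigned j = i * i; j <= n; j += i)
                    composite[j] = 1;
        ++passes;
    }
    return passes;
}
```

Calling `run_passes(1'000'000, std::chrono::milliseconds{5000})` would mirror the contest's 5-second window.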
It's a pointless comparison without actual code.
The code is linked below the video. https://github.com/PlummersSoftwareLLC/Primes
The actual video tries to cram a lot of stuff in at once and doesn't go over the actual methodology of the benchmarks, which I haven't looked through in detail yet.

I don't want to sound like I'm defending C++ for the sake of C++, but a 6.5x or 3.7x performance increase just sounds silly. I would try to compile it myself but have other stuff I need to do...

Edit: Apparently the link in the description is now a community-driven repo.
The original was: https://github.com/plummerssoftwarellc/Primes/tree/original
Last edited on
If it's more than a fraction faster, something is going on, like failing to compile with the right flags, or debug mode, or something. Assembly might be more than 2x faster, but I can't see anything else getting those values.
C or Rust or whatnot could edge C++ out, sure, but not by those amounts.
Another issue is that you only test one thing here: integer math. If they do win, you still have a hundred more things to check before you can say X is faster than Y. That's true if C++ wins as well; it doesn't mean C++ is generally faster either.

The latest results I saw were here, with C++ second behind Lisp by such a small amount you could say they were effectively equal.
https://plummerssoftwarellc.github.io/PrimeView/report?id=3181&hi=False&hf=False&hp=False&fi=&fp=&fa=&ff=&fb=&tp=False&sc=pp&sd=True
Unless I'm missing something, a few of them showed 5 with no decimal places, but even so it's not a significant difference.

Edit: it's unclear how their output is sorted, so C++ may not be second. I gave up trying to make sense of it. But almost all the languages were within the margin of error of 5.0 seconds for whatever they tested, with a few stragglers pulling anywhere from 5.1 to 6.x, and a tiny number coming in way off base at 386 speeds.
Last edited on
With these sorts of comparisons, it's often the algorithm used that is the main factor in the performance. Different algorithms for the same result can have a dramatic effect on performance.

Also, how is performance measured? There was no timing code in the originally posted version.

But the most important point about the original code is that it doesn't produce prime numbers! It produces the odd numbers and 2!
Try this. On my laptop, with n = 200,000,000, I get an execution time of 1648 ms:

#include <iostream>
#include <chrono>
#include <cmath>

template<class TimeUnit = std::chrono::milliseconds>
class Timer1 {
public:
	Timer1() {
		m_start = std::chrono::steady_clock::now();
	}

	~Timer1() {
		std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
		std::cout << "\n** Running time: " << std::chrono::duration_cast<TimeUnit>(stop - m_start).count() << '\n';
	}

private:
	std::chrono::steady_clock::time_point m_start;
};

int main() {
	constexpr unsigned n { 200'000'000 };
	const auto sr { unsigned(sqrt(n)) };
	static bool prime[n + 1] {};	// Move off the stack

	{
		Timer1 tm;

		for (unsigned i { 2 }; i <= sr; ++i)
			if (prime[i] == false)
				for (unsigned j { i * i }; j <= n; j += i)
					prime[j] = true;
	}

	// Print all prime numbers
	/*
	for (unsigned p { 2 }; p <= n; ++p)
		if (prime[p] == false)
			std::cout << p << ' ';
	*/
}


With n = 100 and showing the primes, this displays:



** Running time: 0
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97


so the algorithm is correct.
Last edited on
edit: fixed code.
//sieve of Eratosthenes
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
	auto start = high_resolution_clock::now();
	const int n = 1'000'000; // 1 million
	cout << "n = " << n << "\n";
	static bool ar[n + 1] = {}; // all elements 0.
	int sr = sqrt(n);
	
	for (int i = 2; i <= sr; i++) 
		for (int j = i + i; j <= n; j += i) 
			if(!ar[j]) 
				ar[j] = 1; //not prime			
								
	auto stop = high_resolution_clock::now();
	auto duration = duration_cast<microseconds>(stop- start );
	cout << "Time taken = " << duration.count() << " microseconds\n";
	cout << "avg passes in 1 sec/score : " << 1e6/duration.count();
}

output:
n = 1000000
Time taken = 15626 microseconds
avg passes in 1 sec/score : 63.9959

Top 5 scores shown in the YouTube video at 20:22:
Rank Language Score
1 Zig 10205
2 Rust 5857
3 C 1600
4 C++ 1564
5 Java 1457

Last edited on
And is that the threaded C++ or the non-threaded Zig? I was getting lost trying to track them by algorithm, threaded or not, etc.

Your code, my box, g++ on Cygwin:
n = 1000000
Time taken = 464 microseconds
passes in 1 sec = 2155.17

But I believe your code cheats. The static bool array is reused in the condition, so after the first pass you skip the inner loop's work forever.
But then it gets REALLY WEIRD.
Taking the static off ar, I get this:
n = 1000000
Time taken = 112 microseconds
passes in 1 sec = 8928.57
which also looks like it's keeping ar solved between runs.
I think you have to put braces around ar so it's scoped and recreated, and not make it static, so the problem is actually solved each time.

Also, your code prints 15, 25, and many other non-primes... so it's irrelevant how fast it ran.
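The static-array point can be demonstrated directly. This hypothetical `pass_with_static_buffer` counts how many marking writes each call performs; because the buffer persists between calls, only the first call does any real work:

```cpp
// Demonstration of the "static cheat": the array persists across calls,
// so only the first pass does the marking work; later passes find every
// composite already marked and do almost nothing.
int pass_with_static_buffer() {
    const int n = 1000;
    static bool ar[n + 1] = {};   // NOT reset between passes
    int writes = 0;
    for (int i = 2; i * i <= n; ++i)
        for (int j = i + i; j <= n; j += i)
            if (!ar[j]) { ar[j] = true; ++writes; }
    return writes;
}
```

The first call returns the number of composites it marked; every later call returns 0, which is why a timing loop around such a function measures almost nothing.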

And taking a blunt axe to seeplus's code above, then, to see how many times it can do a million in a second...
#include <iostream>
#include <chrono>
#include <cmath>

template<class TimeUnit = std::chrono::nanoseconds>
class Timer1 {
public:
	Timer1() {
		m_start = std::chrono::steady_clock::now();
	}

	~Timer1() {
		std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
		std::cout << "\n** Running time: " << std::chrono::duration_cast<TimeUnit>(stop - m_start).count() << '\n';
		std::cout << "\n** Running time sec: " << std::chrono::duration_cast< std::chrono::seconds>(stop - m_start).count() << '\n';
	}

private:
	std::chrono::steady_clock::time_point m_start;
};

int main() 
{
	Timer1 t;
	constexpr unsigned n { 1'000'000 };
	int ctr{};
	auto m_start = std::chrono::steady_clock::now();
	while(std::chrono::duration_cast<std::chrono::milliseconds>
		(std::chrono::steady_clock::now() - m_start).count() < 1000)		
	{		
		//Timer1 tm;
		const auto sr { unsigned(sqrt(n)) };		
		bool prime[n + 1] {};	
		for (unsigned i { 2 }; i <= sr; ++i)
			if (prime[i] == false)
				for (unsigned j { i * i }; j <= n; j += i)
					prime[j] = true;
	 ctr++;
	}
	std::cout <<"count per sec:"<< ctr << std::endl;

	// Print all prime numbers
	/*
	for (unsigned p { 2 }; p <= n; ++p)
		if (prime[p] == false)
			std::cout << p << ' ';
	*/
}



I get: (and the site's shell gets 90/second, wow...)
C:\c>a
count per sec:24854473

** Running time: 1000166800

** Running time sec: 1

Presumably you could roughly multiply the above by the number of cores for a threaded version, so my box with 20 logical CPUs would get 10 to 20 times more (I'm a little fuzzy on how the real CPU cores vs the virtual ones work).
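A minimal sketch of what a threaded entry would do, assuming the usual drag-race rules (each thread runs independent full passes for the window and the counts are summed); the thread count, window, and `threaded_passes` name are illustrative:

```cpp
#include <atomic>
#include <chrono>
#include <cmath>
#include <thread>
#include <vector>

// Each thread runs its own complete sieve passes until the deadline;
// pass counts are summed. Embarrassingly parallel, so the score should
// scale roughly with core count (minus cache contention).
unsigned long long threaded_passes(unsigned n, std::chrono::milliseconds window,
                                   unsigned nthreads) {
    std::atomic<unsigned long long> total{0};
    auto worker = [&] {
        const auto deadline = std::chrono::steady_clock::now() + window;
        const auto sr = static_cast<unsigned>(std::sqrt(n));
        while (std::chrono::steady_clock::now() < deadline) {
            std::vector<char> composite(n + 1, 0);  // fresh buffer each pass
            for (unsigned i = 2; i <= sr; ++i)
                if (!composite[i])
                    for (unsigned j = i * i; j <= n; j += i)
                        composite[j] = 1;
            total.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return total.load();
}
```

Compile with `-pthread` on GCC/Clang.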
Last edited on
Well, my poor laptop gets a count of 32,365 per sec.


count per sec:32365

** Running time: 1000949827

** Running time sec: 1


But really, unless you run the test code on the same system you're not comparing like with like. Running the same code on Coliru gives a count of only 87!
http://coliru.stacked-crooked.com/a/d047bddfcd641c07
Pretty wild differences. I mean, my machine is good, it's an i9 with 20 CPUs at 3.6 GHz, but 24 million vs 90? Or 32K? That is out there. Must be some cache or prediction magic, but I've got nothing on what it could be. Technically the count includes a few extra nanoseconds' worth, so it's not exact on that front either.
Last edited on
Thanks jonnin for pointing that out. Fixed.
I only get around 330 using GCC 8.3 with -O3.

I know my computer isn't the fastest, but for those of you getting crazy high numbers, I would suspect the compiler is able to optimize it away somehow.

To test this hypothesis I created a benchmark on quick-bench.com:
https://quick-bench.com/q/bYy8yk84l0o9KMAdwgSvWWXNK28

Both Test1 and Test2 repeatedly test the same code as above:
const auto sr { unsigned(std::sqrt(n)) };
bool prime[n + 1] {};

for (unsigned i { 2 }; i <= sr; ++i)
	if (prime[i] == false)
		for (unsigned j { i * i }; j <= n; j += i)
			prime[j] = true;

The only difference is that Test2 has the following line at the end:
benchmark::DoNotOptimize(prime);
which prevents it from optimizing away the array (and all the code that fills it).

The result is that Test1 takes virtually no time at all compared to Test2 when using GCC 12.2.

If you rerun the benchmark with Clang, or with GCC 9.5 or earlier, Test1 becomes as slow as Test2.
Last edited on
I'm using g++ 10.2, so probably some minor updates from 8.x.
Using g++ -O3 -s -Wall -Wextra -pedantic -std=c++20 -march=native.
That -march=native helps some programs really scream, while for others you can't tell any difference.

My CPU has about 2x the cache of the i7 models. I suspect that's part of it.
I'm not sure how to use that site to get a benchmark. I tried, but are you supposed to run it locally or just hit benchmark, and if so, where does it say anything? I see in a mouseover that Test1 is 2 billion times faster than Test2, if it's talking about my box, with Test1 being 0.0000041 something.

This (the sieve project) really feels like a compiler test.
Last edited on
jonnin wrote:
Not sure how to use that site to get a benchmark. I tried, but are you supposed to run it locally or just hit benchmark, and if so, where does it say anything?

You don't have to do anything. The benchmark has already run (on the server) and the result is shown on the right-hand side.

You can rerun the benchmark if you want. Note that the results are cached so if you don't change anything you will need to check "Clear cached results" before clicking "Run Benchmark".

I don't think you can compare two different runs of the benchmark because it will essentially run it for as long as it thinks is necessary in order to get a reliable result. The result shows you the relative difference in speed between the different functions that have been registered (in this case Test1 and Test2).

If you look at the chart on the right you see that Test2 has a big yellow bar while the bar for Test1 is so small that it's not even visible (at least for me). If you hover over them you'll see things like "Test1: 1900000000000 times faster than Test2". I don't think that exact number is very important; if you rerun the same benchmark you'll probably get slightly different numbers. The important thing to note is just that Test1 is vastly faster than Test2, which is ridiculous if you think they do the same amount of work.

What I think is happening in Test1 is that the compiler simply optimizes away the array and all the code used to fill it, because the result is never used. In Test2, benchmark::DoNotOptimize(prime); is used to prevent this optimization. My understanding is that DoNotOptimize is essentially a dummy function that does nothing, but that the compiler cannot "see through", so it cannot know whether it will read or modify the entire array. Of course it doesn't, but the compiler doesn't know that, so it cannot do the same optimization.
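The same effect can be sketched without the Google Benchmark library, using the GCC/Clang inline-asm trick that DoNotOptimize is commonly built on (this is an assumption about a typical implementation, not quick-bench's exact code):

```cpp
#include <cmath>
#include <vector>

// Empty asm that claims to read or write arbitrary memory: the compiler
// must keep the stores into the buffer instead of deleting them as dead
// code. GCC/Clang only; MSVC needs a different trick.
inline void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// One sieve pass whose result is kept observable to the optimizer.
unsigned sieve_pass(unsigned n) {
    std::vector<char> composite(n + 1, 0);
    const auto sr = static_cast<unsigned>(std::sqrt(n));
    for (unsigned i = 2; i <= sr; ++i)
        if (!composite[i])
            for (unsigned j = i * i; j <= n; j += i)
                composite[j] = 1;
    escape(composite.data());   // prevents the fill loop from being elided
    unsigned count = 0;
    for (unsigned p = 2; p <= n; ++p)
        if (!composite[p]) ++count;
    return count;
}
```

Without the `escape` call, a sufficiently clever compiler is free to delete the whole fill loop, which is exactly what the Test1 numbers suggest.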


To "prove" that this is what's happening I decided to adjust the coliru.stacked-crooked.com example that seeplus posted above.

http://coliru.stacked-crooked.com/a/a4a902b7696e73c4
This is the same as above except that it enables optimizations (-O3):
count per sec:1329774

http://coliru.stacked-crooked.com/a/a52250863b6aa912
This is the same as the previous one except that it disables the prime-calculating code in the loop by commenting it out:
count per sec:1328966

As you can see, the counts are very similar.
Last edited on
Never underestimate the power of AI in modern C++ compilers...

Also note that in my code there's a typo: i * i in the inner loop should be i + i. Aahhh...
Last edited on
With an optimisation that deals with 2 and the evens first and then sieves only the odds, the count increases to 1,338,215 on Coliru:
http://coliru.stacked-crooked.com/a/585ac08981d3aa8c

Every little helps!
Last edited on
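The "handle 2 and the evens first, then sieve only the odds" idea can be sketched as follows. This is a generic odds-only sieve, not the exact code behind the Coliru link; index k in the bitmap represents the odd number 2k+3:

```cpp
#include <cmath>
#include <vector>

// Odds-only sieve: only odd candidates are stored, halving the memory
// traffic. The prime 2 is counted separately.
unsigned count_primes_odds_only(unsigned n) {
    if (n < 2) return 0;
    if (n < 3) return 1;                 // just the prime 2
    const unsigned m = (n - 3) / 2 + 1;  // number of odds in [3, n]
    std::vector<char> composite(m, 0);
    for (unsigned i = 3; i * i <= n; i += 2) {
        if (!composite[(i - 3) / 2]) {
            // Start at i*i (odd) and step by 2*i to stay on odd multiples.
            for (unsigned long long j = 1ull * i * i; j <= n; j += 2ull * i)
                composite[(j - 3) / 2] = 1;
        }
    }
    unsigned count = 1;                  // the prime 2
    for (unsigned k = 0; k < m; ++k)
        if (!composite[k]) ++count;
    return count;
}
```

Stepping by 2i skips the even multiples entirely, which is where the "every little helps" gain comes from.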
It's within the margin of error.

http://coliru.stacked-crooked.com/a/b767aa049bdbbed9 (Runs old version 5 times)
count per sec:1349777
count per sec:1350175
count per sec:1332340
count per sec:1345124
count per sec:1346014


http://coliru.stacked-crooked.com/a/54c569b134db6b92 (Runs new version 5 times)
count per sec:1347120
count per sec:1329117
count per sec:1340763
count per sec:1341318
count per sec:1339301


If you want to try to optimize the code, you would at least have to come up with a benchmark where the code under test doesn't get optimized away completely, because right now you're essentially testing how many times you can execute the loop condition within one second. Modifying code that gets optimized away is unlikely to make it run any faster.
Last edited on
Why is there such a big difference between the outputs (pass counts) of Coliru and my PC?
//sieve of Eratosthenes
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
	auto t1 = high_resolution_clock::now();
	auto t2 = high_resolution_clock::now();
	auto duration = duration_cast<seconds>(t2- t1);
	int passes = 0;
while(1>duration.count()){
	const int n = 1'000'000; // 1 million
	bool ar[n + 1] = {}; // all elements 0.
	int sr = sqrt(n);
	for (int i = 2; i <= sr; i++){
			if(ar[i]==1)continue;
			for (int j = i + i; j <= n; j += i) 
				if(!ar[j]) 
					ar[j] = 1; //not prime		
	}
	t2 = high_resolution_clock::now();
	duration = duration_cast<seconds>(t2-t1);
	passes++;
}
	auto t3 = high_resolution_clock::now();
	auto ending = duration_cast<microseconds>(t3- t1);	
	cout << "avg passes in 1 sec/score : " << passes << "\n";
	cout << "total time = " << ending.count() << " microseconds\n";	
}

Coliru output:
avg passes in 1 sec/score : 1345280
total time = 1000001 microseconds

Output from my PC:
avg passes in 1 sec/score : 393
total time = 1000076 microseconds

--------------------------------
Process exited after 1.013 seconds with return value 0
Press any key to continue . . .

Edit: I tried a bitset instead of the bool array, but it became slower!
Last edited on
A bitset or vector<bool> has to fetch the compressed bool out of a byte, which costs a few extra operations, and that adds up over many iterations.
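The extra work is visible if you write out what a packed-bit container has to do on every access. This `BitSieve` is a hypothetical minimal version, roughly what vector<bool> and bitset do internally: a read needs a load plus shift and mask, and a write is a read-modify-write, versus a single byte load or store for a plain bool array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal packed-bit buffer: bit i lives in byte i/8 at position i%8.
struct BitSieve {
    std::vector<std::uint8_t> bits;
    explicit BitSieve(std::size_t n) : bits((n >> 3) + 1, 0) {}
    // Read: load the byte, shift the wanted bit down, mask it off.
    bool get(std::size_t i) const { return (bits[i >> 3] >> (i & 7)) & 1; }
    // Write: read-modify-write of the whole byte.
    void set(std::size_t i) { bits[i >> 3] |= std::uint8_t(1u << (i & 7)); }
};
```

The trade-off is 8x less memory (better cache fit for huge n) against more ALU work per access, which is why packed bits can win or lose depending on the sieve size.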

I replaced i * i with i << 1 (i.e. i + i) in the modified version, and replaced the array of bool with a heap pointer that is destroyed every run.
That put me down to 556/sec.

bool* prime = new bool[n + 1]{};
for (unsigned i { 2 }; i <= sr; ++i)
	if (prime[i] == false)
		for (unsigned j { i << 1 }; j <= n; j += i)
			prime[j] = true;
ctr++;
delete[] prime;	// delete[] for arrays; plain delete here is undefined behaviour
Last edited on
Why is there such a big difference between the outputs (pass counts) of Coliru and my PC?


For your PC, are you compiling as release (as opposed to debug), with full optimisation for speed enabled?
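One hedged way to check (with GCC/Clang; MSVC uses different macros) is the `__OPTIMIZE__` predefined macro, which those compilers set whenever an -O level above -O0 is in effect. A typical release command line for the code in this thread would be something like g++ -O3 -march=native prog.cpp.

```cpp
#include <string>

// Reports which kind of build this translation unit was compiled as.
// __OPTIMIZE__ is predefined by GCC and Clang for -O1 and above.
std::string build_type() {
#ifdef __OPTIMIZE__
    return "optimized (release-style) build";
#else
    return "unoptimized (debug-style) build: timings are meaningless";
#endif
}
```

Printing `build_type()` at the top of the benchmark makes an accidental -O0 run obvious immediately.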