are zig & rust 6.5 & 3.7 times faster than C++ respectively? sieve of Eratosthenes

Pages: 12
anup30 wrote:
why so big difference between outputs(passes count) of coliru & my pc!

It's because of what I said above on the previous page.

The compiler that coliru uses, with enough optimizations turned on, is able to completely optimize away the code that deals with the array. This optimization is valid because the array is never used so no one will notice if it's never created or filled with values (except that it will run much faster).

On your computer you probably use a different compiler or have not enabled enough optimizations.
Last edited on
Try breaking some data dependencies for a modest improvement.
The resulting code might be more friendly to out-of-order execution.
Probably the compiler optimizer can do this.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
void sieve8(bool* prime, int sr, int n)
{
  for (int i { 2 }; i <= sr; ++i)
  {
    if (prime[i]) continue; 

    int const increment = i * 8;	

    int j0 = i * i; 
    int j1 = i * (i + 1);
    int j2 = i * (i + 2); 
    int j3 = i * (i + 3);
    int j4 = i * (i + 4);
    int j5 = i * (i + 5);
    int j6 = i * (i + 6);
    int j7 = i * (i + 7);

    while (j7 <= n)
    {
      prime[j0] = true;
      prime[j1] = true;
      prime[j2] = true;
      prime[j3] = true;
      prime[j4] = true;
      prime[j5] = true;
      prime[j6] = true;
      prime[j7] = true;
      j0 += increment;
      j1 += increment;
      j2 += increment;
      j3 += increment;
      j4 += increment;
      j5 += increment;
      j6 += increment;
      j7 += increment;
    }

    if (j0 <= n) prime[j0] = true;
    if (j1 <= n) prime[j1] = true;
    if (j2 <= n) prime[j2] = true;
    if (j3 <= n) prime[j3] = true;
    if (j4 <= n) prime[j4] = true;
    if (j5 <= n) prime[j5] = true;
    if (j6 <= n) prime[j6] = true;
  }
}
Last edited on
Changing the code to use the prime array (by printing the first 20 primes at the end) dramatically changes the count (time to display not included).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstring>

namespace chr = std::chrono;

template<class TimeUnit = chr::nanoseconds>
class Timer1 {
public:
	Timer1() {
		m_start = chr::steady_clock::now();
	}

	~Timer1() {
		chr::steady_clock::time_point stop = chr::steady_clock::now();
		std::cout << "\n** Running time: " << chr::duration_cast<TimeUnit>(stop - m_start).count() << '\n';
		std::cout << "\n** Running time sec: " << chr::duration_cast<chr::seconds>(stop - m_start).count() << '\n';
	}

private:
	chr::steady_clock::time_point m_start;
};

int main() {
	constexpr unsigned n { 1'000'000 };
	unsigned ctr {};
	bool prime[n + 1] {};
	const auto sr { unsigned(sqrt(n)) };

	{
		Timer1 t;

		for (auto m_start { chr::steady_clock::now() }; chr::duration_cast<chr::milliseconds>(chr::steady_clock::now() - m_start).count() < 1000; ++ctr) {
			memset(prime, 0, n + 1);

			for (unsigned i { 4 }; i <= n; i += 2)
				prime[i] = true;

			for (unsigned i { 3 }; i <= sr; i += 2)
				if (prime[i] == false)
					for (unsigned j { i << 1 }; j <= n; j += i)
						prime[j] = true;
		}
	}

	std::cout << "count per sec:" << ctr << '\n';

	// Print first 20 prime numbers to use prime array
	for (unsigned p { 2 }, cnt {}; p <= n && cnt < 20; ++p)
		if (prime[p] == false) {
			std::cout << p << ' ';
			++cnt;
		}
}


I'm now getting 472:


** Running time: 1000335838

** Running time sec: 1
count per sec:472
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71


and now Coliru gets 299:
http://coliru.stacked-crooked.com/a/f198b2bc37bf6c76
Last edited on
mbozzi's "partially unrolled" ? approach moved me from ~550 to ~600
however I did change the bottom if group to be like prime[j6] = (j6 <= n);
this is with me still allocate/delete the bool pointer 'array' each loop, so same on that front for both runs.

Anything special about 8 at a time?
Last edited on
Anything special about 8 at a time?

I got progressive improvement for 2, 4, and 8, but didn't feel like going any further. There's not much reason to stick with powers of two here, that's just what I tried.

To find out the "theoretically correct" number you'd need to compare the specifics of your processor against the generated code.

It helps because, when you unroll a loop like this:
while (true) { a = a + b; }
You get a bunch of repeated additions:
a = a + b; a = a + b; a = a + b; /* ... */

The problem with this code, from a performance perspective, is that each subsequent add relies on the result of the previous. Therefore the processor has to wait for the previous add to completely finish before starting to execute the next one. The delay is called a pipeline stall and this particular case is caused by loop-carried dependence.

By splitting it into
a0 = a0 + b;
a1 = a1 + b;
a2 = a2 + b;
The processor is free to get started executing line 2 and 3 before line 1 is finished. (All three lines are independent.)
Last edited on
I looked at the Rust, C++ and C implementations.

It wasn't obvious to me that the Rust and C++ versions were running similar code. So, imho, that makes the performance comparison irrelevant and the difference cannot be attributed to the programming language alone.
@seeplus, the test run is 5 sec. mentioned at 2:10 of the video. so your program already beats C & should become 3rd is the list. try submit your solution for 5sec. i think multithreading can beat rust & zig score also.
@anup30 If you're wanting to compare seeplus's code against the competition entrants, it should probably be updated to conform to the rules. In particular, from just after 13:40 in the video:

DavesGarage wrote:
... may only use a single bit to represent a sieve element, which avoids people wastefully using bytes or even complete words to represent a number in the sieve. It also forces the language to do some bitwise math to access a sieve element, which exercises both the language and the compiler ...

The original C++ code uses std::vector<bool> to achieve this, but other entrants have implemented their own bit-wise storage mechanisms.

For the rest of the rules, as Dave says, see the Github repo's docs.
Last edited on
Well simply you can use a std::bitset instead of an array. This reduces the amount of memory used for storage from 1,000,001 bytes to 125,008 bytes but at the cost of decreased performance. This is a common trade-off between performance and resources used.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstring>
#include <bitset>

namespace chr = std::chrono;

template<class TimeUnit = chr::nanoseconds>
class Timer1 {
public:
	Timer1() {
		m_start = chr::steady_clock::now();
	}

	~Timer1() {
		chr::steady_clock::time_point stop = chr::steady_clock::now();
		std::cout << "\n** Running time: " << chr::duration_cast<TimeUnit>(stop - m_start).count() << '\n';
		std::cout << "\n** Running time sec: " << chr::duration_cast<chr::seconds>(stop - m_start).count() << '\n';
	}

private:
	chr::steady_clock::time_point m_start;
};

int main() {
	constexpr unsigned n { 1'000'000 };
	unsigned ctr {};
	std::bitset<n + 1> prime;
	const auto sr { unsigned(sqrt(n)) };

	bool prime1[n + 1] {};
	std::cout << sizeof(prime) << "  " << sizeof(prime1) << '\n';

	{
		Timer1 t;

		for (auto m_start { chr::steady_clock::now() }; chr::duration_cast<chr::milliseconds>(chr::steady_clock::now() - m_start).count() < 1000; ++ctr) {
			prime.reset();

			for (unsigned i { 4 }; i <= n; i += 2)
				prime[i] = true;

			for (unsigned i { 3 }; i <= sr; i += 2)
				if (prime[i] == false)
					for (unsigned j { i << 1 }; j <= n; j += i)
						prime[j] = true;
		}
	}

	std::cout << "count per sec:" << ctr << '\n';

	// Print first 20 prime numbers to use prime array
	for (unsigned p { 2 }, cnt {}; p <= n && cnt < 20; ++p)
		if (prime[p] == false) {
			std::cout << p << ' ';
			++cnt;
		}
}


See http://coliru.stacked-crooked.com/a/7fab59c8581bfb86
The original code:
dhayden@DHAYDENHTZK3M2 ~/tmp
$ cat foo.cxx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
//sieve of Eratosthenes
//edit: fixed code.
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
        auto start = high_resolution_clock::now();
        const int n = 1'000'000; // 1 million
        cout << "n = " << n << "\n";
        static bool ar[n + 1] = {}; // all elements 0.
        int sr = sqrt(n);

        for (int i = 2; i <= sr; i++)
                for (int j = i + i; j <= n; j += i)
                        if(!ar[j])
                                ar[j] = 1; //not prime

        auto stop = high_resolution_clock::now();
        auto duration = duration_cast<microseconds>(stop- start );
        cout << "Time taken = " << duration.count() << " microseconds\n";
        cout << "avg passes in 1 sce/score : " << 1e6/duration.count();

}


dhayden@DHAYDENHTZK3M2 ~/tmp
$ g++ -O4 -o foo.exe foo.cxx

dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 9886 microseconds
avg passes in 1 sce/score : 101.153
dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 9961 microseconds
avg passes in 1 sce/score : 100.392
dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 10056 microseconds
avg passes in 1 sce/score : 99.4431


Not timing the initial cout:
dhayden@DHAYDENHTZK3M2 ~/tmp
$ cat foo.cxx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
//sieve of Eratosthenes
//edit: fixed code.
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
        const int n = 1'000'000; // 1 million
        cout << "n = " << n << "\n";
        auto start = high_resolution_clock::now();
        static bool ar[n + 1] = {}; // all elements 0.
        int sr = sqrt(n);

        for (int i = 2; i <= sr; i++)
                for (int j = i + i; j <= n; j += i)
                        if(!ar[j])
                                ar[j] = 1; //not prime

        auto stop = high_resolution_clock::now();
        auto duration = duration_cast<microseconds>(stop- start );
        cout << "Time taken = " << duration.count() << " microseconds\n";
        cout << "avg passes in 1 sce/score : " << 1e6/duration.count();

}

dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 9195 microseconds
avg passes in 1 sce/score : 108.755
dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 9790 microseconds
avg passes in 1 sce/score : 102.145
dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 9941 microseconds
avg passes in 1 sce/score : 100.594


Not timing the sqrt() to get a sense of how long it takes.
$ cat foo.cxx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
//sieve of Eratosthenes
//edit: fixed code.
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
        const int n = 1'000'000; // 1 million
        cout << "n = " << n << "\n";
        static bool ar[n + 1] = {}; // all elements 0.
        int sr = sqrt(n);
        auto start = high_resolution_clock::now();

        for (int i = 2; i <= sr; i++)
                for (int j = i + i; j <= n; j += i)
                        if(!ar[j])
                                ar[j] = 1; //not prime

        auto stop = high_resolution_clock::now();
        auto duration = duration_cast<microseconds>(stop- start );
        cout << "Time taken = " << duration.count() << " microseconds\n";
        cout << "avg passes in 1 sce/score : " << 1e6/duration.count();

}

dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 8789 microseconds
avg passes in 1 sce/score : 113.779
dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 10270 microseconds
avg passes in 1 sce/score : 97.371
dhayden@DHAYDENHTZK3M2 ~/tmp
$ ./foo
n = 1000000
Time taken = 9101 microseconds
avg passes in 1 sce/score : 109.878


Including the initial output statement in the time invalid. The fact that sqrt() takes a sizeable fraction of the time means that the test should use a larger value of n. So the test is pretty suspect.
@peter87, the forced use of bitset is weird, because regular bool array usage only takes about 1 Megabytess. the daves garrage is using rygen 5950x (with 32 threads) still wanting the array size to be about 128 Kb. even it's not standard way of doing it, you should say something like this: "program runtime executable can use 200KB memory at max". it's 2023 - mobiles have gigabytes of memory. it's not especially programmed for embedded systems! and the topic was top 5 fastest programming languages.(then he suddenly included memory efficiency in it.) and you should say the rules/requirements at the beginning of the video, but he says after discussing about C & assembly(initially i thought it were about only for assembly - not required for all languages). for this i have no respect now for daves garrage.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
// sieve of Eratosthenes
// single threaded. score 2787 (1973 if bool ar[n + 1] is declared inside the while loop)
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
using namespace std::chrono;

int main() {
	auto t1 = high_resolution_clock::now();
	auto t2 = high_resolution_clock::now();
	auto duration = duration_cast<seconds>(t2- t1);
	int passes = 0;
	const int n = 1'000'000; // 1 million
	bool ar[n + 1] = {}; // all elements 0.
	while(duration.count()<5){
		//const int n = 1'000'000; // 1 million
		//bool ar[n + 1] = {}; // all elements 0.
		int sr = sqrt(n);
		for (int i = 2; i <= sr; i++){
			if(ar[i]==1)continue;
			for (int j = i + i; j <= n; j += i) 
				if(!ar[j]) 
					ar[j] = 1; //composite		
		}
		t2 = high_resolution_clock::now();
		duration = duration_cast<seconds>(t2-t1);
		passes++;
	}
	auto t3 = high_resolution_clock::now();
	auto ending = duration_cast<microseconds>(t3- t1);	
	cout << "total passes in 5 sce/score : " << passes << "\n";	
	cout << "total time = " << ending.count() << " microseconds\n";
	cout<<"primes below 100: \n";
	for(int i=2; i<100; i++){
		if(!ar[i]) cout << i << ", ";
	}
	cout <<"\n";
	cout<<"biggest prime in range: \n";
	for(int i=n; i>1; i--){
		if(!ar[i]){
			cout<<i <<"\n"; // 999983
			break;
		}
	}
}
total passes in 5 sce/score : 2787
total time = 5000343 microseconds
primes below 100:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
biggest prime in range:
999983

--------------------------------
Process exited after 5.018 seconds with return value 0
Press any key to continue . . .

bitset version:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
// single thread. 5sec, using bitset, score: 218
#include <iostream>
#include <cmath>
#include <chrono>
#include <bitset> ///
using namespace std;
using namespace std::chrono;

int main() {
	auto t1 = high_resolution_clock::now();
	auto t2 = high_resolution_clock::now();
	auto duration = duration_cast<seconds>(t2- t1);
	int passes = 0;
	const int n = 1'000'000; // 1 million
	int sr = sqrt(n);
	//bool ar[n + 1] = {}; // all elements 0. 
	bitset<n+1> ar{}; //set all to 0 /// normal array i == n-i in bitset
	while(duration.count()<5){
		for (int i = 2; i <= sr; i++){
				if(ar[i]==1)continue;
				for (int j = i + i; j <= n; j += i) 
					if(!ar[j]) 
						ar[j] = 1; //composite		
		}
		t2 = high_resolution_clock::now();
		duration = duration_cast<seconds>(t2-t1);
		passes++;
		//break; /////////// run 1 time
	}
	auto t3 = high_resolution_clock::now();
	auto ending = duration_cast<microseconds>(t3- t1);	
	cout << "total passes in 5 sec/score : " << passes << "\n";	
	cout << "total time = " << ending.count() << " microseconds\n";
	cout<<"primes below 100: ";
	for(int i=2; i<=100; i++){
		if(!ar[i]){ 
			cout << i << ", ";
		}		
	}cout<<"\n";
	cout<<"biggest prime in range: ";
	for(int i=n; i>1; i--){
		if(!ar[i]){
			cout<<i <<"\n"; // 999983
			break;
		}
	}
}
total passes in 5 sec/score : 218
total time = 5000334 microseconds
primes below 100: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
biggest prime in range: 999983

--------------------------------
Process exited after 5.021 seconds with return value 0
Press any key to continue . . .

10 threaded regular bool array:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
//test, score 7724, 5s, t=10, n=1e6, 
#include <iostream>
#include <cmath>
#include <chrono>
#include <thread>
using namespace std;
using namespace std::chrono;

void foo(bool ar[], int n, int sr, int s, int e){
	if(s<3)s=3;
	if(e>n)e=n;
	for (int i = 2; i <= sr; i++){
		if(ar[i]==1)continue;
		if(s%i)	s=s+i-(s%i);
		for (int j = s; j <= e; j += i)	ar[j] = 1; //composite
	}
}

int main() {
	auto t1 = high_resolution_clock::now();
	auto t2 = high_resolution_clock::now();
	auto duration = duration_cast<seconds>(t2- t1);
	int passes = 0;
	const int n = 1e6; // 1 million
	const int tcount = 10; // thread count
	int npt = 1e5; // n/tcount, numbers per thread
	bool ar[n + 1] = {}; // all elements 0.
	int sr = sqrt(n);
	std::thread t[tcount]; //not started yet
	while(duration.count()<5){
		int s = 1, e = npt;	
		for (int i = 0; i < tcount; i++) {
			t[i] = std::thread(foo, ar, n, sr, s, e);
			s += npt;
			e += npt;
		}
		for (int i = 0; i < tcount; i++) {
			t[i].join();
		}
		t2 = high_resolution_clock::now();
		duration = duration_cast<seconds>(t2 - t1);
		passes++;		
	}
	auto t3 = high_resolution_clock::now();
	auto ending = duration_cast<microseconds>(t3 - t1);
	cout << "passes /score : " << passes << "\n";
	cout << "total time = " << ending.count() << " microseconds\n";
	cout<< "\nfirst few primes = ";
	for(int i=2;i<100;i++){
		if(!ar[i]) cout << i << " ";
	}
	cout<< "\nbiggest prime in range = "; //999983
	for(int i=n;i>=2;i--){
		if(!ar[i]){
			cout << i <<"\n";
			break;
		}
	}	
}
passes /score : 7724
total time = 5000334 microseconds

first few primes = 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
biggest prime in range = 999983

--------------------------------
Process exited after 5.011 seconds with return value 0
Press any key to continue . . .
all outputs from my rygen 5600g
dhayden wrote:
g++ -O4 -o foo.exe foo.cxx
Maybe try -O5
:)
Last edited on
Ganado wrote:
Maybe try -O5
You got me! :) I'm used to the Solaris compiler where -O4 is legitimate.
Topic archived. No new replies allowed.
Pages: 12