### are zig & rust 6.5 & 3.7 times faster than C++ respectively? sieve of Eratosthenes

Pages: 12
anup30 wrote:
why so big difference between outputs(passes count) of coliru & my pc!

It's because of what I said above on the previous page.

The compiler that coliru uses, with enough optimizations turned on, is able to completely optimize away the code that deals with the array. This optimization is valid because the array is never used so no one will notice if it's never created or filled with values (except that it will run much faster).

On your computer you probably use a different compiler or have not enabled enough optimizations.
Last edited on
Try breaking some data dependencies for a modest improvement.
The resulting code might be more friendly to out-of-order execution.
Probably the compiler optimizer can do this.

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546`` ``````void sieve8(bool* prime, int sr, int n) { for (int i { 2 }; i <= sr; ++i) { if (prime[i]) continue; int const increment = i * 8; int j0 = i * i; int j1 = i * (i + 1); int j2 = i * (i + 2); int j3 = i * (i + 3); int j4 = i * (i + 4); int j5 = i * (i + 5); int j6 = i * (i + 6); int j7 = i * (i + 7); while (j7 <= n) { prime[j0] = true; prime[j1] = true; prime[j2] = true; prime[j3] = true; prime[j4] = true; prime[j5] = true; prime[j6] = true; prime[j7] = true; j0 += increment; j1 += increment; j2 += increment; j3 += increment; j4 += increment; j5 += increment; j6 += increment; j7 += increment; } if (j0 <= n) prime[j0] = true; if (j1 <= n) prime[j1] = true; if (j2 <= n) prime[j2] = true; if (j3 <= n) prime[j3] = true; if (j4 <= n) prime[j4] = true; if (j5 <= n) prime[j5] = true; if (j6 <= n) prime[j6] = true; } }``````
Last edited on
Changing the code to use the prime array (by printing the first 20 primes at the end) dramatically changes the count (time to display not included).

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455`` ``````#include #include #include #include namespace chr = std::chrono; template class Timer1 { public: Timer1() { m_start = chr::steady_clock::now(); } ~Timer1() { chr::steady_clock::time_point stop = chr::steady_clock::now(); std::cout << "\n** Running time: " << chr::duration_cast(stop - m_start).count() << '\n'; std::cout << "\n** Running time sec: " << chr::duration_cast(stop - m_start).count() << '\n'; } private: chr::steady_clock::time_point m_start; }; int main() { constexpr unsigned n { 1'000'000 }; unsigned ctr {}; bool prime[n + 1] {}; const auto sr { unsigned(sqrt(n)) }; { Timer1 t; for (auto m_start { chr::steady_clock::now() }; chr::duration_cast(chr::steady_clock::now() - m_start).count() < 1000; ++ctr) { memset(prime, 0, n + 1); for (unsigned i { 4 }; i <= n; i += 2) prime[i] = true; for (unsigned i { 3 }; i <= sr; i += 2) if (prime[i] == false) for (unsigned j { i << 1 }; j <= n; j += i) prime[j] = true; } } std::cout << "count per sec:" << ctr << '\n'; // Print first 20 prime numbers to use prime array for (unsigned p { 2 }, cnt {}; p <= n && cnt < 20; ++p) if (prime[p] == false) { std::cout << p << ' '; ++cnt; } }``````

I'm now getting 472:

 ``` ** Running time: 1000335838 ** Running time sec: 1 count per sec:472 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 ```

and now Coliru gets 299:
http://coliru.stacked-crooked.com/a/f198b2bc37bf6c76
Last edited on
mbozzi's "partially unrolled" ? approach moved me from ~550 to ~600
however I did change the bottom if group to be like prime[j6] = (j6 <= n);
this is with me still allocate/delete the bool pointer 'array' each loop, so same on that front for both runs.

Anything special about 8 at a time?
Last edited on
 Anything special about 8 at a time?

I got progressive improvement for 2, 4, and 8, but didn't feel like going any further. There's not much reason to stick with powers of two here, that's just what I tried.

To find out the "theoretically correct" number you'd need to compare the specifics of your processor against the generated code.

It helps because, when you unroll a loop like this:
while (true) { a = a + b; }
You get a bunch of repeated additions:
a = a + b; a = a + b; a = a + b; /* ... */

The problem with this code, from a performance perspective, is that each subsequent add relies on the result of the previous. Therefore the processor has to wait for the previous add to completely finish before starting to execute the next one. The delay is called a pipeline stall and this particular case is caused by loop-carried dependence.

By splitting it into
a0 = a0 + b;
a1 = a1 + b;
a2 = a2 + b;
The processor is free to get started executing line 2 and 3 before line 1 is finished. (All three lines are independent.)
Last edited on
I looked at the Rust, C++ and C implementations.

It wasn't obvious to me that the Rust and C++ versions were running similar code. So, imho, that makes the performance comparison irrelevant and the difference cannot be attributed to the programming language alone.
@seeplus, the test run is 5 sec. mentioned at 2:10 of the video. so your program already beats C & should become 3rd is the list. try submit your solution for 5sec. i think multithreading can beat rust & zig score also.
@anup30 If you're wanting to compare seeplus's code against the competition entrants, it should probably be updated to conform to the rules. In particular, from just after 13:40 in the video:

DavesGarage wrote:
... may only use a single bit to represent a sieve element, which avoids people wastefully using bytes or even complete words to represent a number in the sieve. It also forces the language to do some bitwise math to access a sieve element, which exercises both the language and the compiler ...

The original C++ code uses std::vector<bool> to achieve this, but other entrants have implemented their own bit-wise storage mechanisms.

For the rest of the rules, as Dave says, see the Github repo's docs.
Last edited on
Well simply you can use a std::bitset instead of an array. This reduces the amount of memory used for storage from 1,000,001 bytes to 125,008 bytes but at the cost of decreased performance. This is a common trade-off between performance and resources used.

 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859`` ``````#include #include #include #include #include namespace chr = std::chrono; template class Timer1 { public: Timer1() { m_start = chr::steady_clock::now(); } ~Timer1() { chr::steady_clock::time_point stop = chr::steady_clock::now(); std::cout << "\n** Running time: " << chr::duration_cast(stop - m_start).count() << '\n'; std::cout << "\n** Running time sec: " << chr::duration_cast(stop - m_start).count() << '\n'; } private: chr::steady_clock::time_point m_start; }; int main() { constexpr unsigned n { 1'000'000 }; unsigned ctr {}; std::bitset prime; const auto sr { unsigned(sqrt(n)) }; bool prime1[n + 1] {}; std::cout << sizeof(prime) << " " << sizeof(prime1) << '\n'; { Timer1 t; for (auto m_start { chr::steady_clock::now() }; chr::duration_cast(chr::steady_clock::now() - m_start).count() < 1000; ++ctr) { prime.reset(); for (unsigned i { 4 }; i <= n; i += 2) prime[i] = true; for (unsigned i { 3 }; i <= sr; i += 2) if (prime[i] == false) for (unsigned j { i << 1 }; j <= n; j += i) prime[j] = true; } } std::cout << "count per sec:" << ctr << '\n'; // Print first 20 prime numbers to use prime array for (unsigned p { 2 }, cnt {}; p <= n && cnt < 20; ++p) if (prime[p] == false) { std::cout << p << ' '; ++cnt; } }``````

See http://coliru.stacked-crooked.com/a/7fab59c8581bfb86
The original code:
dhayden@DHAYDENHTZK3M2 ~/tmp
\$ cat foo.cxx
 ``1234567891011121314151617181920212223242526`` ``````//sieve of Eratosthenes //edit: fixed code. #include #include #include using namespace std; using namespace std::chrono; int main() { auto start = high_resolution_clock::now(); const int n = 1'000'000; // 1 million cout << "n = " << n << "\n"; static bool ar[n + 1] = {}; // all elements 0. int sr = sqrt(n); for (int i = 2; i <= sr; i++) for (int j = i + i; j <= n; j += i) if(!ar[j]) ar[j] = 1; //not prime auto stop = high_resolution_clock::now(); auto duration = duration_cast(stop- start ); cout << "Time taken = " << duration.count() << " microseconds\n"; cout << "avg passes in 1 sce/score : " << 1e6/duration.count(); }``````

 ```dhayden@DHAYDENHTZK3M2 ~/tmp \$ g++ -O4 -o foo.exe foo.cxx dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 9886 microseconds avg passes in 1 sce/score : 101.153 dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 9961 microseconds avg passes in 1 sce/score : 100.392 dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 10056 microseconds avg passes in 1 sce/score : 99.4431```

Not timing the initial cout:
dhayden@DHAYDENHTZK3M2 ~/tmp
\$ cat foo.cxx
 ``1234567891011121314151617181920212223242526`` ``````//sieve of Eratosthenes //edit: fixed code. #include #include #include using namespace std; using namespace std::chrono; int main() { const int n = 1'000'000; // 1 million cout << "n = " << n << "\n"; auto start = high_resolution_clock::now(); static bool ar[n + 1] = {}; // all elements 0. int sr = sqrt(n); for (int i = 2; i <= sr; i++) for (int j = i + i; j <= n; j += i) if(!ar[j]) ar[j] = 1; //not prime auto stop = high_resolution_clock::now(); auto duration = duration_cast(stop- start ); cout << "Time taken = " << duration.count() << " microseconds\n"; cout << "avg passes in 1 sce/score : " << 1e6/duration.count(); }``````

 ```dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 9195 microseconds avg passes in 1 sce/score : 108.755 dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 9790 microseconds avg passes in 1 sce/score : 102.145 dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 9941 microseconds avg passes in 1 sce/score : 100.594```

Not timing the sqrt() to get a sense of how long it takes.
\$ cat foo.cxx
 ``1234567891011121314151617181920212223242526`` ``````//sieve of Eratosthenes //edit: fixed code. #include #include #include using namespace std; using namespace std::chrono; int main() { const int n = 1'000'000; // 1 million cout << "n = " << n << "\n"; static bool ar[n + 1] = {}; // all elements 0. int sr = sqrt(n); auto start = high_resolution_clock::now(); for (int i = 2; i <= sr; i++) for (int j = i + i; j <= n; j += i) if(!ar[j]) ar[j] = 1; //not prime auto stop = high_resolution_clock::now(); auto duration = duration_cast(stop- start ); cout << "Time taken = " << duration.count() << " microseconds\n"; cout << "avg passes in 1 sce/score : " << 1e6/duration.count(); }``````

 ```dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 8789 microseconds avg passes in 1 sce/score : 113.779 dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 10270 microseconds avg passes in 1 sce/score : 97.371 dhayden@DHAYDENHTZK3M2 ~/tmp \$ ./foo n = 1000000 Time taken = 9101 microseconds avg passes in 1 sce/score : 109.878```

Including the initial output statement in the time invalid. The fact that sqrt() takes a sizeable fraction of the time means that the test should use a larger value of n. So the test is pretty suspect.
@peter87, the forced use of bitset is weird, because regular bool array usage only takes about 1 Megabytess. the daves garrage is using rygen 5950x (with 32 threads) still wanting the array size to be about 128 Kb. even it's not standard way of doing it, you should say something like this: "program runtime executable can use 200KB memory at max". it's 2023 - mobiles have gigabytes of memory. it's not especially programmed for embedded systems! and the topic was top 5 fastest programming languages.(then he suddenly included memory efficiency in it.) and you should say the rules/requirements at the beginning of the video, but he says after discussing about C & assembly(initially i thought it were about only for assembly - not required for all languages). for this i have no respect now for daves garrage.

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546`` ``````// sieve of Eratosthenes // single threaded. score 2787 (1973 if bool ar[n + 1] is declared inside the while loop) #include #include #include using namespace std; using namespace std::chrono; int main() { auto t1 = high_resolution_clock::now(); auto t2 = high_resolution_clock::now(); auto duration = duration_cast(t2- t1); int passes = 0; const int n = 1'000'000; // 1 million bool ar[n + 1] = {}; // all elements 0. while(duration.count()<5){ //const int n = 1'000'000; // 1 million //bool ar[n + 1] = {}; // all elements 0. int sr = sqrt(n); for (int i = 2; i <= sr; i++){ if(ar[i]==1)continue; for (int j = i + i; j <= n; j += i) if(!ar[j]) ar[j] = 1; //composite } t2 = high_resolution_clock::now(); duration = duration_cast(t2-t1); passes++; } auto t3 = high_resolution_clock::now(); auto ending = duration_cast(t3- t1); cout << "total passes in 5 sce/score : " << passes << "\n"; cout << "total time = " << ending.count() << " microseconds\n"; cout<<"primes below 100: \n"; for(int i=2; i<100; i++){ if(!ar[i]) cout << i << ", "; } cout <<"\n"; cout<<"biggest prime in range: \n"; for(int i=n; i>1; i--){ if(!ar[i]){ cout<
 ```total passes in 5 sce/score : 2787 total time = 5000343 microseconds primes below 100: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, biggest prime in range: 999983 -------------------------------- Process exited after 5.018 seconds with return value 0 Press any key to continue . . .```

bitset version:
 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344454647`` ``````// single thread. 5sec, using bitset, score: 218 #include #include #include #include /// using namespace std; using namespace std::chrono; int main() { auto t1 = high_resolution_clock::now(); auto t2 = high_resolution_clock::now(); auto duration = duration_cast(t2- t1); int passes = 0; const int n = 1'000'000; // 1 million int sr = sqrt(n); //bool ar[n + 1] = {}; // all elements 0. bitset ar{}; //set all to 0 /// normal array i == n-i in bitset while(duration.count()<5){ for (int i = 2; i <= sr; i++){ if(ar[i]==1)continue; for (int j = i + i; j <= n; j += i) if(!ar[j]) ar[j] = 1; //composite } t2 = high_resolution_clock::now(); duration = duration_cast(t2-t1); passes++; //break; /////////// run 1 time } auto t3 = high_resolution_clock::now(); auto ending = duration_cast(t3- t1); cout << "total passes in 5 sec/score : " << passes << "\n"; cout << "total time = " << ending.count() << " microseconds\n"; cout<<"primes below 100: "; for(int i=2; i<=100; i++){ if(!ar[i]){ cout << i << ", "; } }cout<<"\n"; cout<<"biggest prime in range: "; for(int i=n; i>1; i--){ if(!ar[i]){ cout<
 ```total passes in 5 sec/score : 218 total time = 5000334 microseconds primes below 100: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, biggest prime in range: 999983 -------------------------------- Process exited after 5.021 seconds with return value 0 Press any key to continue . . .```

 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859`` ``````//test, score 7724, 5s, t=10, n=1e6, #include #include #include #include using namespace std; using namespace std::chrono; void foo(bool ar[], int n, int sr, int s, int e){ if(s<3)s=3; if(e>n)e=n; for (int i = 2; i <= sr; i++){ if(ar[i]==1)continue; if(s%i) s=s+i-(s%i); for (int j = s; j <= e; j += i) ar[j] = 1; //composite } } int main() { auto t1 = high_resolution_clock::now(); auto t2 = high_resolution_clock::now(); auto duration = duration_cast(t2- t1); int passes = 0; const int n = 1e6; // 1 million const int tcount = 10; // thread count int npt = 1e5; // n/tcount, numbers per thread bool ar[n + 1] = {}; // all elements 0. int sr = sqrt(n); std::thread t[tcount]; //not started yet while(duration.count()<5){ int s = 1, e = npt; for (int i = 0; i < tcount; i++) { t[i] = std::thread(foo, ar, n, sr, s, e); s += npt; e += npt; } for (int i = 0; i < tcount; i++) { t[i].join(); } t2 = high_resolution_clock::now(); duration = duration_cast(t2 - t1); passes++; } auto t3 = high_resolution_clock::now(); auto ending = duration_cast(t3 - t1); cout << "passes /score : " << passes << "\n"; cout << "total time = " << ending.count() << " microseconds\n"; cout<< "\nfirst few primes = "; for(int i=2;i<100;i++){ if(!ar[i]) cout << i << " "; } cout<< "\nbiggest prime in range = "; //999983 for(int i=n;i>=2;i--){ if(!ar[i]){ cout << i <<"\n"; break; } } }``````
 ```passes /score : 7724 total time = 5000334 microseconds first few primes = 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 biggest prime in range = 999983 -------------------------------- Process exited after 5.011 seconds with return value 0 Press any key to continue . . .```
all outputs from my rygen 5600g
dhayden wrote:
g++ -O4 -o foo.exe foo.cxx
Maybe try -O5
:)
Last edited on