In the out() function below, which sums elements, you can see I have a switch case to do the summation in a single line, and in the default case (for any more than 4 elements) it uses a loop.
My question is: does omitting the loop for known sizes actually optimize the code?
Maybe, maybe not. Don't be the compiler; just code a loop. If you find that your program seems to have some speed issues, use a profiler, and fix the code that actually needs optimizing.
The optimizer will unroll loops to the extent that it is advantageous to do so (if we make the function available for inline substitution, at the call-site).
#include <cstddef> // std::size_t
#include <numeric> // std::accumulate
// uncomment the static and in the code generated (last part)
// we can see that the loop is unrolled for the first eight
// possible iterations of the loop
// (the optimiser has added cases 5, 6, 7, 8)
static int foo( const int array[], std::size_t index )
{
    int sum = 0 ;
    switch(index)
    {
        case 4 : sum += array[3] ; [[fallthrough]];
        case 3 : sum += array[2] ; [[fallthrough]];
        case 2 : sum += array[1] ; [[fallthrough]];
        case 1 : sum += array[0] ; [[fallthrough]];
        case 0 : return sum ;
        default : return std::accumulate( array, array+index, 0 ) ;
    }
}
// uncomment the static and in the code generated (last part)
// we can see that the loop is unrolled for the first eight
// possible iterations of the loop
static int bar( const int array[], std::size_t index )
{ return std::accumulate( array, array+index, 0 ) ; }
// generates identical (well, almost identical; the order of summation is different)
// code for these two functions
int foobaz( const int array[] ) { return foo( array, 4 ) ; }
int barbaz( const int array[] ) { return bar( array, 4 ) ; }
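Before comparing generated code, it can be worth confirming that the two approaches really compute the same result. The following is only a sketch: sum_switch, sum_loop and check_sums_agree are hypothetical names that mirror the shape of foo and bar, and the test data is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical stand-in for the hand-unrolled version (same shape as foo).
int sum_switch(const int array[], std::size_t count)
{
    int sum = 0;
    switch (count)
    {
        case 4: sum += array[3]; [[fallthrough]];
        case 3: sum += array[2]; [[fallthrough]];
        case 2: sum += array[1]; [[fallthrough]];
        case 1: sum += array[0]; [[fallthrough]];
        case 0: return sum;
        default: return std::accumulate(array, array + count, 0);
    }
}

// Hypothetical stand-in for the plain loop version (same shape as bar).
int sum_loop(const int array[], std::size_t count)
{
    return std::accumulate(array, array + count, 0);
}

// Checks every count from 0 up to the array size, so both the switch
// cases and the default branch are exercised.
void check_sums_agree()
{
    const int data[] = {3, -1, 4, 1, -5, 9, 2, 6};
    const std::size_t size = sizeof(data) / sizeof(data[0]);
    for (std::size_t count = 0; count <= size; ++count)
        assert(sum_switch(data, count) == sum_loop(data, count));
}
```

Calling check_sums_agree() once from a test or from main is enough. It says nothing about speed; that is what the profiler and the generated assembly are for.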
It looks like you got the assembly for me, and I agree it looks about the same.
I had kept the switch statement just in case, since I was a bit paranoid about performance in an audio application.
I remember something about loop unrolling from class but have forgotten most of it; this seems like a good example. The function I'm using will be an inline member of a struct, and I am using GCC, so after looking at the generated code in the demo I feel more confident that those optimizations will take place for me.
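For the inline-member case, a minimal sketch might look like the following. Frame, samples and check_frame_sum are hypothetical names, not the actual struct from the audio code. Because the element count is a compile-time constant here, the optimizer knows the trip count without any help; whether it actually unrolls still depends on the compiler and optimization level, so checking the generated code remains the way to be sure.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical struct: a fixed-size block of samples with an inline sum.
template <std::size_t N>
struct Frame
{
    int samples[N];

    // Defined inside the class, so implicitly inline.
    int sum() const
    {
        return std::accumulate(samples, samples + N, 0);
    }
};

// Small usage check with made-up values.
void check_frame_sum()
{
    const Frame<4> frame{{1, 2, 3, 4}};
    assert(frame.sum() == 10);
}
```

A plain loop like this keeps the code short, and the switch can be added back later if a profiler ever points at this function.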