> Are you sure you profiled correctly?
Helios,
Yes. This one function represents over 50% of the programs runtime for runs that total 90+hrs.
It is essential 4 loops -- 2 of which contains small nested loops -- running over a very large dataset that is effectively 7 parallel vectors that are applied (mult/dot) in various combinations to an eighth, 2D vector -- that is represented as a sparse array. All doubles.
{huge chunk of detail and rant elided :)]
Basically, I'm working with, by choice, for good reasons, MSVC 9. Maybe C++ compiler optimisers have got a lot more clever since; and maybe with the rationalisations to the STL that comes from C++11, a modern compiler would do a better job; but what I'm seeing is 30 lines of C++ consisting of 4 loops relatively simple loops; expanding to 300 lines of assembler with 20+ embedded method calls pre-inlining; which then with inlining and optimisation becomes 450 lines of interleaved and unintelligible mess that is far from optimal.
Simple example. there are 16 SSE register (32 if you split them) and yet only 4 are ever used. The routine contains two constants -- one static; one computed up front.-- that are reused in all four loops. They ought to be loaded into registers once; but instead they are reloaded over and over. Even in the optimised code. (And they are both marked const!)
Another: the body of one of the loops is Y[ i ] = X[ i ] * c; The result of the expansion of the iterators followed by the optimisers attempts, has that loop as (pseudo-code):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
|
move reg1, i
move reg2, &X
move reg3, &Y
move reg4, n
move xmm3, c
loop:
move xmm0, xmm3
mult xmm0, qword ptr [ reg2 + reg1 ]
move qword ptr [ reg3 + i ], xmm0
incr reg1
cmp reg1, reg4
jle loop
| |
Looks fine, but the problem is that the loop body has a cheap SSE reg->reg transfer, followed by an expensive memory to reg multiply and immediately another expensive reg to memory store.
The two consecutive memory operations stall the pipeline. A simple reordering of the SSE operations make the loop 25% more efficient.
1 2 3 4 5 6 7 8 9 10 11
|
...
loop:
move xmm0, qword ptr, [ reg2 + reg1 ] ; xmm0 <- X[ i ]
mult xmm0, xmm3 ; x[ i ] * c
move qword ptr [ reg3 + reg 1 ], xmm0 ; Y[ i ] <- X[ i ] * c
incr reg1
cmp reg1, reg4
jl loop
| |
(Please note: the above is a gross simplification of the actual loop that unrolls the loop by a factor of 4; the point is that there is a lot of fat in there to be excised.)
Combine that with the desire to do SSE vectorisation to do two operations at a time; and the need to do some interlocking (lock prefix) of some instructions and I hope it becomes obvious that this is not "premature optimisation" :)
Anyway, if there are any x64 assemblers with experience of interfacing with C++ -- if anyone actually read this far, I guess they might have -- then I'd love to make contact.
Cheers, Buk.