I am studying the best way to use vectorization (SIMD)
in order to combine thread and SIMD parallelization, since Intel suggests the potential speedup shown below.
I know there are some solid books on thread parallelism (TBB or OpenMP), but books about SIMD are not as common.
Are there any definitive books (or documents) about SIMD?
I have also noticed that an efficient SIMD implementation is difficult, and that there are many techniques that make it easier (e.g. parallel STL, the OpenMP SIMD pragma, etc.).
I would like to know their pros and cons.
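
To make the OpenMP SIMD pragma mentioned above concrete, here is a minimal sketch. The function name `saxpy_omp_simd` and the operation are made up for illustration and are not taken from any real code base. It assumes the pragma is enabled with `-fopenmp-simd` on GCC/Clang or `-qopenmp-simd` on the Intel compiler; without such a flag the pragma is simply ignored and the loop still computes the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: y[i] = a * x[i] + y[i] for every element.
void saxpy_omp_simd(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());

    // Take raw pointers and the size once, outside the loop.
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();

    // The pragma tells the compiler that the iterations are independent,
    // so it is allowed to execute several of them in one SIMD instruction.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling `saxpy_omp_simd(2.0f, x, y)` updates `y` in place. The pragma is a promise from the programmer rather than something the compiler verifies, so it should only be used on loops whose iterations really are independent.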
This is a beginner-level question, so I would appreciate any help.
I know that compilers usually vectorize automatically (e.g. the Intel compiler does so at -O2 and above).
However, I could not get auto-vectorization even for simple for loops (the compiler's vectorization "report" options told me so).
(This may be due to iterator access in STL containers.)
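
As an illustration of the kind of simple loop in question, the sketch below shows one commonly suggested way to write a loop over `std::vector` so that the auto-vectorizer has an easier job. The function name `add_arrays` is hypothetical, and whether the loop is actually vectorized depends on the compiler and its flags. The vectorization report can be requested with, for example, `-fopt-info-vec-missed` on GCC or `-Rpass-missed=loop-vectorize` on Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: out[i] = a[i] + b[i] for every element.
void add_arrays(const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());

    // Hoist the size and the data pointers out of the loop so the compiler
    // sees a plain counted loop over contiguous memory, with no container
    // member calls inside the loop body.
    const std::size_t n = a.size();
    const double* ap = a.data();
    const double* bp = b.data();
    double* op = out.data();

    // No pragma here: this relies on the compiler's own analysis, which may
    // add a runtime overlap check because the pointers could alias.
    for (std::size_t i = 0; i < n; ++i) {
        op[i] = ap[i] + bp[i];
    }
}
```

If the report still says the loop was not vectorized, the reason it gives (for example possible aliasing or an unknown trip count) is usually the most useful starting point.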
If anyone knows of best practices or documents on this,
I would appreciate it if you could tell me.