Okay, I found something: phi in main() is uninitialized before it's passed to r2cfft(). Zeroing it with memset gets rid of the first avalanche of errors. At a glance, Pi appears to have the exact same problem. This is all I have time for right now.
This is some real shitty code, OP. Sorry for being so blunt.
1. Liberally sprinkling #pragma omp parallel for collapse(n) over every loop won't make your code faster. The added complexity might even make it slower.
2. No attempt at object orientation. How many times do you need to write j + width*i before realizing you need a matrix class? (See the sketch after this list.)
3. Meaningless variable names.
4. Repetition, repetition, repetition. The error I mentioned above could have been avoided if you'd just created a function that both allocated and initialized the memory.
5. Confusing function parameters where it's unclear which are inputs and which are outputs.
6. Functions with too many parameters. (Would be solved by encapsulating in classes.)
7. Meaningless comments that only serve to add visual noise.
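To make points 2 and 4 concrete, here's a minimal sketch of the kind of class I mean. The name Grid and all the details are mine, not from your code:

```cpp
#include <cstddef>
#include <vector>

class Grid {
public:
    // Allocates AND zero-initializes in one place, so a forgotten
    // memset (the uninitialized-phi bug above) can't happen.
    Grid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Hides the j + cols*i index arithmetic behind one operator.
    double& operator()(std::size_t i, std::size_t j)       { return data_[j + cols_ * i]; }
    double  operator()(std::size_t i, std::size_t j) const { return data_[j + cols_ * i]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Raw pointer for C APIs like FFTW that want contiguous storage.
    double* data() { return data_.data(); }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};
```

With something like this, phi(i, j) replaces phi[j + width*i] everywhere, every matrix starts zeroed, and half your function parameters collapse into one object.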
Honestly, it's so bad I might do a full refactor later, if I can find the time.
I will try [commenting out the omp statement] |
That's not necessary. Without -fopenmp the compiler ignores the OpenMP pragmas entirely, so commenting them out would have exactly the same effect.
Also, I read about someone having a similar issue, and it was suggested that the segmentation fault might have to do with multiple threads updating one value at the same time. Can this be the issue here? |
I don't think so. Unlike that poster's code, your code doesn't contain race conditions. At least not in the parts I checked.
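For reference, this is the kind of thing that poster was probably hitting. A contrived example, not taken from your code:

```cpp
#include <cstdio>

int main() {
    const int N = 1000000;

    // RACE: every thread does an unsynchronized read-modify-write on the
    // shared variable `sum`, so updates get lost and the result varies
    // from run to run.
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        sum += 1.0;

    // FIX: a reduction gives each thread a private copy and combines
    // them at the end.
    double sum2 = 0.0;
    #pragma omp parallel for reduction(+:sum2)
    for (int i = 0; i < N; ++i)
        sum2 += 1.0;

    std::printf("racy: %.0f  correct: %.0f\n", sum, sum2);
}
```

Your loops write each element from exactly one thread, which is why I don't think this applies here.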
I just don't see how code that runs fine serially with no segmentation fault creates this problem when parallelized. |
That's the fun part of undefined behavior: code that reads uninitialized memory can appear to work for years, until something unrelated (like adding threads) changes the memory layout or timing and the bug finally surfaces.
EDIT:
This is something I am trying hard to work on. I noticed that with -fopenmp turned off, the code takes 18 mins, versus 20 mins for the fully serial code (granted, the potentialk function is still acting funky and the results are incorrect, since nothing is being updated in the potential solver). I am wondering whether this has anything to do with FFTW's built-in threading, or whether it is actually the OpenMP in the code. If that's possible, how can I improve the performance of my specific code? My sole purpose is to reduce my runtime significantly and thereby improve the resolution of my results. What would you suggest? |
The immediate problem is how you structured the parallelization. OpenMP isn't magic: you can't expect efficient use of threads by just making every single for loop parallel and synchronizing after each one. Unfortunately, designing performant parallel code is a subject complex enough for a book, not something I can usefully communicate in a forum post.
For now, let's just say that you first need to understand which data depends on the value of which data. Things that can be computed independently are good parallelization candidates.
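A contrived sketch of what I mean (not your code; the arrays and function names are made up):

```cpp
void fine_grained(double* a, double* b, double* c, int n) {
    // One parallel region per loop: each forks threads, runs a cheap
    // body, and joins at an implicit barrier. The per-loop overhead can
    // eat the speedup.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) a[i] = i * 0.5;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) b[i] = i * 2.0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];   // needs a and b
}

void coarser(double* a, double* b, double* c, int n) {
    // a and b don't depend on each other, so one loop can compute both.
    // c genuinely depends on them, so it stays behind the implicit
    // barrier that ends the first worksharing loop. The result: one
    // parallel region instead of three.
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i) { a[i] = i * 0.5; b[i] = i * 2.0; }

        #pragma omp for
        for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
    }
}
```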
In your case, you're seeing poor parallel scaling because you're treating your CPU as if it were a GPU. GPUs are very good at massive parallelism (hundreds or thousands of threads) over very short, simple functions, e.g. computing each cell of a large matrix multiplication in parallel. CPUs are exactly the opposite: they're very good at low parallelism (on desktop computers usually 4-8 threads, rarely more than 32) over very complex functions, e.g. computing two independent large matrix multiplications in parallel.
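In OpenMP terms, the CPU-friendly shape looks something like this (again a made-up example, not your code):

```cpp
#include <vector>

using Mat = std::vector<double>;  // row-major n*n matrix

static void matmul(const Mat& A, const Mat& B, Mat& C, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main() {
    const int n = 512;
    Mat A(n * n, 1.0), B(n * n, 2.0), C1(n * n, 0.0), C2(n * n, 0.0);

    // The two products write to different outputs and never modify
    // their inputs, so they are independent: each is a big, CPU-sized
    // chunk of work that can safely run on its own thread.
    #pragma omp parallel sections
    {
        #pragma omp section
        matmul(A, B, C1, n);

        #pragma omp section
        matmul(B, A, C2, n);
    }
}
```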