Making a for loop execute faster

Hi,

I have the code below:
for (i=1;i<512;i++)
  for (j=1;j<512;j++){
     temp=y[j+1][i]+y[j-1][i]+y[j][i-1]+y[j][i]-2;
     x[j][i]=y[j][i]+temp;
  }


and I would like to execute it faster using OpenMP.
So what I have tried so far is:
#pragma omp parallel for num_threads(2)
for (i=1;i<512;i++)
  #pragma omp parallel for num_threads(2)
  for (j=1;j<512;j++){
     temp=y[j+1][i]+y[j-1][i]+y[j][i-1]+y[j][i]-2;
     x[j][i]=y[j][i]+temp;
  }


Is there a better way of doing this? Also, I am not certain whether the code I posted is correct.

Any help, please?

Thanks.
That won't work. Parallelism isn't nearly as simple as you want it to be. I'm resisting the urge to simply link you the OpenMP manual...

a) You're nesting threads. Are you sure you want to do that? It's often better to divide the outer loop over 4 threads. Are you even sure you have 4 processors? Are you sure you want to use (all) 4? (If you're using OpenMP 3.0+, nested parallelism isn't simply odd, but fully stupid.)

b) By the looks of it, i and j are defined outside the loop. Are they shared by threads? Should they be? I doubt it...

c) What about temp? Not only is it never reset, it is also shared between threads. I'm guessing this was already a mistake in the sequential version and temp was never meant to persist between iterations. If it was intentional, your result is now completely messed up, because parallel loops don't guarantee sequential ordering. What if i = 500 is executed before i = 1? Does your result still work?

Honestly, I suggest you read a proper manual. OpenMP makes parallelism look easy, but that's just the implementation. The algorithm design is the actual core of parallelism. Data management is possibly the most difficult part of concurrency.
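For what it's worth, here is a minimal sketch of points (a) through (c) combined. The array declarations are a guess, since you never posted them; y gets one extra row here so that y[j+1] stays in bounds:

#include <omp.h>

// Assumed declarations; the original post never shows them.
// y has 513 rows so that y[j+1] is valid when j reaches 511.
int x[512][512];
int y[513][512];

void update()
{
    // One level of parallelism only: split the outer loop over the threads.
    // i, j and temp are declared inside the loops, so each thread gets its
    // own copies; nothing is shared by accident, and no critical section
    // or private() clause is needed.
    #pragma omp parallel for
    for (int i = 1; i < 512; i++)
        for (int j = 1; j < 512; j++) {
            int temp = y[j+1][i] + y[j-1][i] + y[j][i-1] + y[j][i] - 2;
            x[j][i] = y[j][i] + temp;
        }
}

Because every x[j][i] is written by exactly one iteration and y is only read, no synchronization is needed at all.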
You can't do that because temp would be modified by every thread, causing concurrency issues.
OK, I thought it was easy... I am using an older version than 3.0.

Let's say that we have "pr" processors, with pr > 1.

I parallelize the outer loop.

int i,j,temp;
#pragma omp parallel for num_threads(pr) shared(x) private(j,i) schedule(dynamic, 1)
for (i=0;i<512;i++)
  for (j=0;j<512;j++){

     #pragma omp critical
     temp=y[j+1][i]+y[j-1][i]+y[j][i-1]+y[j][i]-2;

     x[j][i]=y[j][i]+temp;
  }


temp is not private to each thread, so I add a #pragma omp critical so that the assignment is executed by only one thread at a time.

Any comments, please?
So you basically made sure that the loop you were trying to parallelize can only be executed by one thread at a time? How is that supposed to help? The threads will just spend 99% of their time fighting for the critical section. Just eliminate the outer variables, which you should do anyway - parallelization or not.
And you're accessing elements -1 and 512 in y... unless y is some unusual construct, that's not valid. Likewise, if x is a regular 2D array, your loop order is extremely cache-unfriendly (plus it prevents automatic SSE vectorization), so reverse it for a massive performance gain.

#pragma omp parallel for
for (int j=1;j<511;j++)
    for (int i=1;i<512;i++)
    {
        x[j][i]=y[j][i]+int(y[j+1][i]+y[j-1][i]+y[j][i-1]+y[j][i]-2);
    } //do j=1,511 and i=0 separately


Edit: just to give an idea of how important it is to be careful about what you're doing when parallelizing: the above variant is about 700 times faster on my machine (and actually correct).
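If you want to check a claim like that on your own machine, here is a minimal timing sketch using omp_get_wtime(). The array declarations are the same guesses as in the sketch above, and this is a rough measurement rather than a proper benchmark:

#include <omp.h>
#include <cstdio>

int x[512][512];
int y[513][512];

int main()
{
    const int reps = 1000;  // repeat so the measurement isn't just noise

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel for
        for (int j = 1; j < 511; j++)
            for (int i = 1; i < 512; i++)
                x[j][i] = y[j][i] + (y[j+1][i] + y[j-1][i] + y[j][i-1] + y[j][i] - 2);
    }
    double t1 = omp_get_wtime();

    std::printf("%d repetitions: %f s\n", reps, t1 - t0);
    return 0;
}

Time the serial version the same way (drop the pragma) and compare; results will vary with compiler flags and core count.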