Loop Interchange
The following two small programs, written in F90 and C, simply perform the same two loop operations on matrix a,b,c. Familiarize with them, compile your preferred F90 and C compiler and then run them.
Ifort crashes with optimization on because of the call to rand()!
program loop_interchange ! loop interchange example integer:: n=4000,m=4000 ! global dimensions INTEGER:: I,J ! running indexes real,allocatable :: a(:,:),b(:,:),c(:,:) double precision :: time1=0d0,time2=0.d0 allocate(a(n,m)) allocate(b(n,m)) allocate(c(n,m)) a=0.0 b=rand() c=rand() call cpu_time(time1) DO I = 1, N DO J = 1, M A(I, J) = B(I, J) + C(I, J) ENDDO ENDDO call cpu_time(time2) print*,"control value:", A(N,M), "time to execute first loop (i,j) is: ", time2-time1 a=0.0 call cpu_time(time1) DO J = 1, M DO I = 1, N A(I, J) = B(I, J) + C(I, J) ENDDO ENDDO call cpu_time(time2) print*,"control value:", A(N,M), "time to execute second loop (j,i) is: ", time2-time1 end program | #include <stdio.h> #include <stdlib.h> #include <time.h> #define N 4000 #define M 4000 main() { int i,j; float *a,*b,*c; double wall_time1, wall_time2, wall_time; a=(float *)malloc(N*M*sizeof(float)); b=(float *)malloc(N*M*sizeof(float)); c=(float *)malloc(N*M*sizeof(float)); for(i=0;i<N;i++){ for(j=0;j<M;j++){ b[j+N*i] = (float)rand()/(float)RAND_MAX; c[j+N*i] = (float)rand()/(float)RAND_MAX; } } wall_time1=(double)clock()/(double)CLOCKS_PER_SEC; for(i=0;i<N;i++) { for(j=0;j<M;j++) a[j+N*i] = b[j+N*i] + c[j+N*i]; } wall_time2=(double)clock()/(double)CLOCKS_PER_SEC; wall_time=(wall_time2-wall_time1); printf("%f\t %e\n",a[1],wall_time); wall_time1=(double)clock()/(double)CLOCKS_PER_SEC; for (j=0;j<M;j++){ for (i=0;i<N;i++) a[j+N*i] = b[j+N*i] + c[j+N*i]; } wall_time2=(double)clock()/(double)CLOCKS_PER_SEC; wall_time=(wall_time2-wall_time1); printf("%f\t %e\n",a[1],wall_time); } |
A typical output for the C version (compiled with gcc 4.1.2-42) is:
$ ./loop-gcc control value: 1.581539 time to execute first loop (i,j) is: 1.900000e-01 control value: 1.581539 time to execute second loop (j,i) is: 2.050000e+00
while the fortran one (compiled with gfortran 4.1.2-42) gives something like:
$ ./loop-gfortran control value: 0.1315453 time to execute first loop (i,j) is: 2.23166000000000 control value: 0.1315453 time to execute second loop (j,i) is: 0.209968000000000
Answer now the following questions:
- Which is the fastest loop in the C version ? and in the Fortran one?
- Explain why
- Is there any differences between different compilers result?
- Did you observe any improvement enabling standard optimization flags ? -O2 /- O3
- On which loop ? Do you have any explanation about the behavior observed ?