Loop Interchange

The following two small programs, written in F90 and C, simply perform the same two loop operations on matrix a,b,c. Familiarize with them, compile your preferred F90 and C compiler and then run them.

Ifort crashes with optimization on because of the call to rand()!

program loop_interchange

! loop interchange example 
      integer:: n=4000,m=4000 ! global 
        dimensions 
        INTEGER:: I,J ! running indexes 
        real,allocatable :: a(:,:),b(:,:),c(:,:) 
        double precision :: time1=0d0,time2=0.d0 
        allocate(a(n,m)) 
        allocate(b(n,m)) 
        allocate(c(n,m)) 
        a=0.0 
        b=rand() 
        c=rand() 
        call cpu_time(time1) 
        DO I = 1, N 
           DO J = 1, M 
                A(I, J) = B(I, J) + C(I, J) 
           ENDDO 
        ENDDO 
        call cpu_time(time2) 
       print*,"control value:", A(N,M), 
       "time to execute  first loop (i,j) is: ", 
       time2-time1 
       a=0.0  

        call cpu_time(time1)  
	DO J = 1, M 
  	   DO I = 1, N  
    		A(I, J) = B(I, J) + C(I, J) 
           ENDDO 
	ENDDO 
	call cpu_time(time2) 
      print*,"control value:", A(N,M), 
      "time to execute second loop (j,i) is: ", 
      time2-time1 





        end program 
#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#define N 4000 
#define M 4000 

main() 
{ 
  int i,j; 
  float *a,*b,*c; 
  double wall_time1, wall_time2, wall_time; 

  a=(float *)malloc(N*M*sizeof(float)); 
  b=(float *)malloc(N*M*sizeof(float)); 
  c=(float *)malloc(N*M*sizeof(float)); 

  for(i=0;i<N;i++){ 
    for(j=0;j<M;j++){ 
      b[j+N*i] = (float)rand()/(float)RAND_MAX; 
      c[j+N*i] = (float)rand()/(float)RAND_MAX; 
    } 
  } 
  wall_time1=(double)clock()/(double)CLOCKS_PER_SEC; 

  for(i=0;i<N;i++) { 
    for(j=0;j<M;j++) 
      a[j+N*i] = b[j+N*i] + c[j+N*i]; 
  } 
  wall_time2=(double)clock()/(double)CLOCKS_PER_SEC; 
  wall_time=(wall_time2-wall_time1); 
  printf("%f\t %e\n",a[1],wall_time); 

  wall_time1=(double)clock()/(double)CLOCKS_PER_SEC; 

  for (j=0;j<M;j++){ 
    for (i=0;i<N;i++) 
      a[j+N*i] = b[j+N*i] + c[j+N*i]; 
  } 
  wall_time2=(double)clock()/(double)CLOCKS_PER_SEC; 
  wall_time=(wall_time2-wall_time1); 
  printf("%f\t %e\n",a[1],wall_time);  
} 

A typical output for the C version (compiled with gcc 4.1.2-42) is:

$ ./loop-gcc 
 control value: 1.581539       time to execute  first loop (i,j) is:    1.900000e-01
 control value: 1.581539       time to execute second loop (j,i) is:   2.050000e+00

while the fortran one (compiled with gfortran 4.1.2-42) gives something like:

$ ./loop-gfortran 
 control value:  0.1315453     time to execute  first loop (i,j) is:    2.23166000000000     
 control value:  0.1315453     time to execute second loop (j,i) is:   0.209968000000000     

Answer now the following questions:

  1. Which is the fastest loop in the C version ? and in the Fortran one?
  2. Explain why
  3. Is there any differences between different compilers result?
  4. Did you observe any improvement enabling standard optimization flags ? -O2 /- O3
  5. On which loop ? Do you have any explanation about the behavior observed ?