Loop Interchange 2

In the following program you will see a standard triple-loop implementation (three nested do statements in F90) of an element-wise multiply-add on three-index arrays. The goal of this exercise is to find out which index permutation gives the best performance. The code implements two of the six possible permutations (3! = 6). Familiarize yourself with it, then compile and run it.

program loop_interchange2

!!
!! small program to find out the best permutation of  i,j,k
!!

  real*8,allocatable :: A(:,:,:),B(:,:,:),C(:,:,:)
  real*8 :: t1,t2
  integer :: i,j,k
  integer :: nsize

  write(*,*) 'provide an integer (suggested range 100-250)'
  write(*,*) 'larger values can be very memory and time-consuming'
  write(*,*) 'check available memory on your system before playing with larger values'
  read(*,*) nsize

  allocate(A(nsize,nsize,nsize))
  allocate(B(nsize,nsize,nsize))
  allocate(C(nsize,nsize,nsize))

call cpu_time(t1)
DO i=1,nsize
   DO j=1,nsize
      DO k=1,nsize
         A (i,j,k) =0
         B (i,j,k) =rand()
         C (i,j,k) =rand()
      END DO
   END DO
END DO
call cpu_time(t2)

print*,'initialisation time=', t2-t1

call cpu_time(t1)
DO i=1,nsize
   DO j=1,nsize
      DO k=1,nsize
         A (i,j,k) = A (i,j,k)+ B (i,j,k)* C (i,j,k)
      END DO
   END DO
END DO
call cpu_time(t2)

  print*,A(nsize,nsize,nsize),'ijk', t2-t1


call cpu_time(t1)
DO k=1,nsize
   DO j=1,nsize
      DO i=1,nsize
         A (i,j,k) = A (i,j,k)+ B (i,j,k)* C (i,j,k)
      END DO
   END DO
END DO
call cpu_time(t2)

  print*,A(nsize,nsize,nsize),'kji', t2-t1

end program loop_interchange2

Next, complete the code by adding the four missing permutations and run the program again. Report the times you measured on your page, in a table like the one below:

NOTE: For each permutation, be sure to reorder the three DO statements (over i, j, k) to match it. The statement inside the loops stays the same.

permutation    time measured
i,j,k 
i,k,j 
k,i,j 
k,j,i 
j,k,i 
j,i,k
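
As an illustration of how one of the missing permutations could be added, the following is a minimal stand-alone sketch of the j,k,i ordering. It is not part of the original program: the program name permutation_sketch, the fixed nsize of 100, and the use of the standard random_number intrinsic in place of rand() are all assumptions made to keep the example short and portable.

```fortran
program permutation_sketch

!!
!! hypothetical stand-alone sketch: times only the j,k,i permutation
!!

  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nsize = 100     ! assumed size, within the suggested range
  real(dp), allocatable :: A(:,:,:), B(:,:,:), C(:,:,:)
  real(dp) :: t1, t2
  integer :: i, j, k

  allocate(A(nsize,nsize,nsize), B(nsize,nsize,nsize), C(nsize,nsize,nsize))

  ! whole-array initialisation; random_number is used here instead of rand()
  A = 0.0_dp
  call random_number(B)
  call random_number(C)

  ! j,k,i permutation: j is the outermost loop, i the innermost.
  ! Only the order of the DO statements changes, not the loop body.
  call cpu_time(t1)
  do j = 1, nsize
     do k = 1, nsize
        do i = 1, nsize
           A(i,j,k) = A(i,j,k) + B(i,j,k)*C(i,j,k)
        end do
     end do
  end do
  call cpu_time(t2)

  ! printing an element of A keeps the compiler from discarding the loop
  print *, A(nsize,nsize,nsize), 'jki', t2-t1

  deallocate(A, B, C)

end program permutation_sketch
```

In the exercise itself, the same loop nest would simply be appended to loop_interchange2 after the existing 'kji' block, reusing its arrays and timers. The remaining three permutations follow the same pattern.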