Bloqued

In this exercise a simple program named blocked.f90 implements two different transpose algorithm: the first one is the naive implementation:

!straightforward method
 call cpu_time(t1)
 do i=1,N
    do j=1,M
       y(i,j)=x(j,i)
    end do
 end do
 call cpu_time(t2)

 print*, '#direct and simple=',t2-t1

while the second one is a little bit more complicated one using a blocked algorithm:

  call cpu_time(t1)
  do ib = 1, nb
     ioff = (ib-1) * bsize
     do jb = 1, mb
        joff = (jb-1) * bsize
        do j = 1, bsize

! load the direct matrix in a block
           do i = 1, bsize
              buf(i,j) = x(i+ioff, j+joff)
           enddo
        enddo
! transpose the block
        do j = 1, bsize
           do i = 1, j-1
              bswp = buf(i,j)
              buf(i,j) = buf(j,i)
              buf(j,i) = bswp
           enddo
        enddo
!load the transposed blocked in the right place of the final transposed matrix
        do i=1,bsize
           do j=1,bsize
              y(j+joff, i+ioff) = buf(j,i)
           enddo
        enddo
     enddo
  enddo
  call cpu_time(t2)

Take a look at it, then compile and run it. The program will ask you about the size of the squared (for simplicity) blocks. A typical output goes as follows:

 #direct and simple= 0.057992
 give the size of the block (assumed squared for simplicity)
  please be sure that size is a perfect divisor of the size of the matrix
128
  #blocked algorithm
         512         128  6.998999999999998E-003
  • basic exercise:

You are requested to run the code several times exploring which is the size that minimizes the execution time. It would also be interesting to make a plot showing the execution time as a function of the block size.

  • advanced exercise:

Modify the blocking algorithm to deal with any size of the blocks.