Bloqued
In this exercise a simple program named blocked.f90 implements two different transpose algorithm: the first one is the naive implementation:
!straightforward method call cpu_time(t1) do i=1,N do j=1,M y(i,j)=x(j,i) end do end do call cpu_time(t2) print*, '#direct and simple=',t2-t1
while the second one is a little bit more complicated one using a blocked algorithm:
call cpu_time(t1) do ib = 1, nb ioff = (ib-1) * bsize do jb = 1, mb joff = (jb-1) * bsize do j = 1, bsize ! load the direct matrix in a block do i = 1, bsize buf(i,j) = x(i+ioff, j+joff) enddo enddo ! transpose the block do j = 1, bsize do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo enddo !load the transposed blocked in the right place of the final transposed matrix do i=1,bsize do j=1,bsize y(j+joff, i+ioff) = buf(j,i) enddo enddo enddo enddo call cpu_time(t2)
Take a look at it, then compile and run it. The program will ask you about the size of the squared (for simplicity) blocks. A typical output goes as follows:
#direct and simple= 0.057992 give the size of the block (assumed squared for simplicity) please be sure that size is a perfect divisor of the size of the matrix 128 #blocked algorithm 512 128 6.998999999999998E-003
- basic exercise:
You are requested to run the code several times exploring which is the size that minimizes the execution time. It would also be interesting to make a plot showing the execution time as a function of the block size.
- advanced exercise:
Modify the blocking algorithm to deal with any size of the blocks.