DISTRIBUTED MEMORY CONCURRENT COMPUTERS*

Jaeyoung Choi<br>Department of Computer Science<br>University of Tennessee<br>107 Ayres Hall<br>Knoxville, TN 37996-1301<br>Jack Dongarra<br>Mathematical Sciences Section Engineering Physics and Mathematics Division Oak Ridge National Laboratory<br>Oak Ridge, TN 37831-6367<br>AND<br>Department of Computer Science<br>University of Tennessee<br>107 Ayres Hall<br>Knoxville, TN 37996-1301<br>D. W. Walker<br>Mathematical Sciences Section<br>Engineering Physics and Mathematics Division<br>Oak Ridge National Laboratory<br>Oak Ridge, TN 37831-6367



> The submitted manuscript has been authored by a contractor of the U.S. Government under contract No. DE-ACO5-84OR2 1400 . Accordingly. the U.S. Government retains a nonexclusive. royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so. for U.S. Government purposes."

[^0]
# Parallel Matrix Transpose Algorithms on Distributed Memory Concurrent Computers 

Jaeyoung Choi ${ }^{1}$,<br>${ }^{2}$ Mathematical Sciences Section<br>Oak Ridge National Laboratory<br>P.O. Box 2008, Bldg. 6012<br>Oak Ridge, TN 37831-6367

Jack J. Dongarra ${ }^{1,2}$,
David W. Walker ${ }^{2}$

${ }^{1}$ Department of Computer Science<br>University of Tennessee at Knoxville<br>107 Ayres Hall<br>Knoxville, TN 37996-1301


#### Abstract

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a $P \times Q$ processor template with a block scattered data distribution. $P, Q$, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, $\mathbf{C}=\mathbf{A} \cdot \mathbf{B}$, the algorithms are used to compute parallel multiplications of transposed matrices, $\mathbf{C}=\mathbf{A}^{T} \cdot \mathbf{B}^{T}$, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.


## 1 Introduction

Matrix transposition is a fundamental matrix operation of linear algebra and arises in many scientific and engineering applications. On a uniprocessor, an algorithm involving a transposed matrix may not actually require the matrix data to be transposed in physical memory. Instead, it may be accessed simply by exchanging the row and column indices. However, in a distributed-memory multiprocessor environment, we cannot simply interchange the global row and column indices. Instead, the data must be physically moved from one processor to another.

Transposition of a matrix is a redistribution of its elements. Many researchers have considered the
problem for different architectures. In 1972, Eklundh [7] considered the problem of directly accessing rows or columns of a matrix when its size is larger than the available high-speed storage. O'Leary [10] implemented an algorithm for transposing an $N \times N$ matrix on a one-dimensional systolic array. Azari, Bojanczyk and Lee [1] developed an algorithm for transposing an $M \times N$ matrix on an $N \times N$ mesh-connected array processor, and Johnsson and Ho [9] presented an algorithm for a Boolean n-cube, or hypercube.

Current advanced architecture computers possess hierarchical memories in which accesses to data in the upper levels of the memory hierarchy (registers, cache, and/or local memory) are faster than those in lower levels (shared or off-processor memory). To exploit the power of such machines, block-partitioned algorithms are preferred for dense linear algebra computations, in which operations are performed on submatrices, rather than individual matrix elements. In distributing matrix data over processors we therefore assume a block scattered decomposition [4,6]. The block scattered decomposition can reproduce the most common data distributions used in dense linear algebra, as described briefly in the next section.

In this paper, the parallel matrix transpose algorithms are presented based on the block scattered decomposition. The algorithms are implemented on the Intel Touchstone Delta computer. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of the number of rows and columns ( $P$ and $Q$ ) of the processor template. If $P$ and $Q$ are relatively prime, the matrix transpose algorithm involves complete exchange communication. This is called all-to-all personalized communication, in which each of $N_{p}=P \cdot Q$ processors is required to send distinct subblocks to each of the remaining $N_{p}-1$
processors，and receive distinct subblocks from each of them．Bokhari and Berryman［2］have developed bi－ nary exchange and quadrant exchange algorithms on a circuit switched mesh，where $P$ and $Q$ are powers of 2 ．The complete exchange communication in our transpose algorithms arises for any processor configu－ ration，and is not limited to the case where $P$ and $Q$ are powers of 2 ．We implemented the complicated two－ dimensional complete exchange communication prob－ lem by generalizing the one－dimensional complete ex－ change communication based on direct point－to－point communication．Details are discussed in Section 3．1．

We have presented the Parallel Universal Matrix Multiplication Algorithms（PUMMA）in［5］for per－ forming $\mathbf{C} \Leftarrow \alpha o p(\mathbf{A}) \cdot o p(\mathbf{B})+\beta \mathbf{C}$ ，where $o p(\mathbf{X})=$ $\mathbf{X}$ or $\mathbf{X}^{T}$ ，based on the block scattered decompo－ sition．One of the cases in the PUMMA package， $\mathbf{C} \Leftarrow \alpha \mathbf{A}^{T} \cdot \mathbf{B}^{T}+\beta \mathbf{C}$ ，is implemented in two steps （ $\mathbf{T} \Leftarrow \alpha \mathbf{B} \cdot \mathbf{A} ; \mathbf{C} \Leftarrow \mathbf{T}^{T}+\beta \mathbf{C}$ ）．The second step in－ volves parallel matrix transposition．The performance of this algorithm for evaluating $\mathbf{C}=\mathbf{A}^{T} \cdot \mathbf{B}^{T}$ is com－ pared with the algorithm for evaluating $\mathbf{C}=\mathbf{A} \cdot \mathbf{B}$ on the Intel Delta machine in Section 4.

## 2 Design Issues

The way in which an algorithm＇s data are dis－ tributed over the processors of a concurrent computer has a major impact on the load balance and communi－ cation characteristics of the concurrent algorithm，and hence largely determines its performance and scalabil－ ity．The block scattered decomposition provides a sim－ ple，yet general－purpose way of distributing a block－ partitioned matrix on distributed memory concurrent computers．In the block scattered decomposition，de－ scribed in detail in［4，6］，an $M \times N$ matrix is parti－ tioned into blocks of size $r \times s$ ，and blocks separated by a fixed stride in the column and row directions are assigned to the same processor．If the stride in the column and row directions is $P$ and $Q$ blocks respec－ tively，then we require that $P \cdot Q$ equal the number of processors，$N_{p}$ ．Thus，it is useful to imagine the pro－ cessors arranged as a $P \times Q$ mesh，or template．The processor at position $(p, q)(0 \leq p<P, 0 \leq q<Q)$ in the template is assigned the blocks indexed by，

$$
\begin{equation*}
(p+i \cdot P, q+j \cdot Q) \tag{1}
\end{equation*}
$$

where $i=0, \ldots,\left\lfloor\left(M_{b}-p-1\right) / P\right\rfloor, j=0, \ldots,\left\lfloor\left(N_{b}-\right.\right.$ $q-1) / Q\rfloor$ ，and $M_{b} \times N_{b}$ is the size in blocks of the matrix $\left(M_{b}=\lceil M / r\rceil, N_{b}=\lceil N / s\rceil\right)$ ．

Blocks are scattered in this way so that good load balance can be maintained in parallel algorithms，such

|  | 0 | 1 | 2 | 3 | 4 | 5 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 8 | 令眺 | \％ | 0 | 1 | 2 |
| 1 | 全交 | \％\％ | 駺 | 3 | 4 | 5 |
| 2 | 0 | 1 | 2 | 0 | \％ | 2 |
| 3 | 3 | 4 | 5 | 8 | 4， | 2k |
| 4 | ¢ | \％ | \％ | 0 | 1 | 2 |
| 5 |  | S\％紋 | 等等 | 3 | 4 | 5 |

（a）block distribution over template

（b）LCM block distribution Figure 1：A matrix with $6 \times 6$ blocks is distributed over a $2 \times 3$ processor template．（a）Each shaded and un－ shaded area represents different templates．The num－ bered squares represent blocks of elements，and the number indicates at which location in the processor template the block is stored－all blocks labeled with the same number are stored in the same processor． The slanted numbers，on the left and on the top of the matrix，represent indices of row block and column block，respectively．（b）The matrix has $1 \times 1$ LCM blocks．Blocks belong to the same processor if the rel－ ative locations of blocks are the same in each square $L C M$ block．The definition of the $L C M$ block is de－ fined in the text．
as LU factorization［3，6］．The nonscattered decompo－ sition（or pure block distribution）is just a special case of the scattered decomposition in which the block size is given by $r=\lceil M / P\rceil$ and $s=\lceil N / Q\rceil$ ．A purely scattered decomposition（or two－dimensional wrapped distribution）is another special case in which the block size is given by $r=s=1$ ．

If $P$ and $Q$ are relatively prime，the matrix trans－ pose algorithm involves a two－dimensional complete exchange communication，where each of $N_{p}$ proces－ sors is required to send distinct subblocks to each of the remaining $N_{p}-1$ processors，and receive distinct subblocks from each of them．We implemented the complicated two－dimensional complete exchange algo－ rithm by generalizing the one－dimensional complete exchange algorithm．

## 3 Matrix Transpose Algorithms

We assume that a matrix is distributed over a two－ dimensional processor mesh，or template，so that in general each processor has several blocks of the matrix as shown in Figure 1 （a），where a matrix with $6 \times 6$ blocks is distributed over a $2 \times 3$ template．Denot－ ing the least common multiple of $P$ and $Q$ by $L C M$ ， we refer to a square of $L C M \times L C M$ blocks as an $L C M$ block．Thus，the matrix may be viewed as a $1 \times 1$ array of $L C M$ blocks，as shown in Figure 1 （b）．


Figure 2: An example of matrix transpose for a block scattered decomposition, when $P=2, Q=3$, and $M_{b}=N_{b}=6$.

The concept of the $L C M$ block was introduced in [5], and is very useful for implementing algorithms that use a block scattered data distribution. Blocks belong to the same processor if their relative locations are the same in each square $L C M$ block. An algorithm may be developed for the first $L C M$ block, and then it can be directly applied to the other $L C M$ blocks, which all have the same structure and the same data distribution as the first $L C M$ block. That is, when an operation is executed on a block of the first $L C M$ block, the same operation can be done simultaneously on other blocks, which have the same relative location in each $L C M$ block.

We now describe parallel matrix transpose algorithms. A matrix A, distributed over a $P \times Q$ processor template, has $M_{b} \times N_{b}$ blocks and each block consists of $r \times s$ elements, where $r$ and $s$ are arbitrary. Figure 2 (a) shows an example of a matrix transpose on a $2 \times 3$ template. If $\mathbf{A}$ is transposed, the transposed matrix $\mathbf{A}^{T}$ is distributed over the same $P \times Q$ template, and it has $N_{b} \times M_{b}$ blocks and each block has $s \times r$ elements. The elements of each block remain in the same block, but may be in a different processor, and each block is itself transposed. Figure 2 (b) shows the same example from the processor point-ofview. If $P$ and $Q$ are relatively prime, as shown in the figure, blocks in the first processor $P_{0}$ are scattered to all processors. As shown in Figure 3, which is the same example on a $3 \times 3$ square template, the blocks in each processor are not dispersed, but they

(a) matrix transpose from matrix point-of-view


Figure 3: An example of matrix transpose for a block scattered decomposition, when $P=3, Q=3$, and $M_{b}=N_{b}=6$.
are moved as one entity to a different processor. As shown in Figure 3 , which is the same example on a $3 \times 3$ square template, the blocks in each processor are not dispersed, but they are moved as one entity to a different processor. Parallel matrix transpose algorithms for the block scattered data distribution have several communication patterns determined by the greatest common divisor (GCD) of $P$ and $Q$.

## 3.1 $P$ and $Q$ : relatively prime

We start with the simple case where $P$ and $Q$ are relatively prime, i. e. $G C D=1$. In this case blocks in $P_{0}$ are scattered to all processors after being locally transposed as shown in Figure 2 (b). This case involves the two-dimensional complete exchange communication. That is, every processor needs to communicates with every other processor. The complete exchange problem is implemented by direct communication between sender and receiver.

Figure 4 shows the pseudocode from the processor point-of-view, where $P\langle p, q\rangle$ represents $P_{\mathrm{MOD}(p, P), \mathrm{MOD}(q, Q)}$ in the processor template. Processor $P(p, q)(0 \leq p<P$ and $0 \leq q<Q)$ starts to transpose blocks whose transposed blocks belong to itself. Then it deals with blocks whose transposition are in processors in the same column of the template ( $P\langle p-i, q\rangle, 0 \leq i<P$ ). The processor sends blocks to its top neighbor, $P(p-1, q)$, and receives blocks from its bottom neighbor, $P\langle p+1, q\rangle$. Before sending the blocks, it is necessary to copy the blocks to be sent

```
DO \(J=0, Q-1\)
    DO \(I=0, P-1\)
        [ Copy all blocks of \(\mathbf{A}\) required by
        \(P\langle p+I, q-J\rangle\) to T 1\(]\)
        [Send \(\mathbf{T} 1\) to \(P\langle p+I, q-J\rangle]\)
        [ Receive T2 from \(P\langle p-I, q+J\rangle\) ]
        [ Copy T2 to C]
    END DO
END DO
```

Figure 4: A parallel matrix transpose algorithm from the processor point-of-view, when $P$ and $Q$ are relatively prime. $P(p, q\rangle$ represents $P_{\operatorname{MOD}(p, P), \mathrm{MOD}(q, Q)}$. Processor $P_{p, q}(0 \leq p<P$ and $0 \leq q<Q)$ communicates with $P\langle p+I, q-J\rangle$ to send, and $P(p-I, q+J)$ to receive based on point-to-point communication.
into a contiguous message buffer. Next it sends blocks to the next top processor, $P\langle p-2, q\rangle$, and receives blocks from the next bottom processor, $P\langle p+2, q\rangle$.

After it completes its operations with the processors in the same column, it sends blocks to the processors to the left in the template ( $P\langle p-i, q-1\rangle, 0 \leq i<P$ ), and receives blocks from the processors to the right $(P\langle p+i, q+1\rangle)$. All operations are completed in $P \times$ $Q=L C M$ steps.

We interpret the algorithm from the matrix point-of-view. In the first $L C M$ block, the above algorithm performs the operation by transposing one (wrapped) diagonal blocks at one step. The first step of the algorithm in Figure 4 requires no explicit communication. It corresponds to an in-place transpose of the diagonal blocks of $\mathbf{A}(\mathbf{A}(i, i))$ (See Fig. 5(a)). Then every $P$ diagonal blocks of $\mathbf{A}(\mathbf{A}(i, j), \operatorname{MOD}(j-i, P)=0)$ (See Fig. 5(b)) are transposed in the first outer loop ( $J=0$ ) of Figure 4. In the next outer loop $(J=1)$, the next $P$ diagonal blocks $(\mathbf{A}(i, j), \operatorname{MOD}(j-i, P)=1)$ are transposed. In Figures 5 (c) and (d), $P_{0}(P\langle 0,0\rangle)$ sends blocks to $P_{2}(P(0,2))$, and receives from $P_{1}$ $(P(0,1))$, where $P_{0}, P_{1}$ and $P_{2}$ are in the same row. Then $P_{0}$ sends blocks to $P_{5}(P\langle 1,2\rangle)$, and receives from $P_{4}(P\langle 1,1\rangle)$, and so on. The pseudocode for the algorithm from the matrix point-of-view is shown in Figure 6. Processors need to determine a diagonal block of $\mathbf{A}(\mathbf{A}(i, j), \operatorname{MOD}(j-i, L C M)=K)$ which they start to transpose in the outer $J$ loop in order to communicate with other processors in the same row of the template. The three lines before the inner DO-


Figure 5: Snapshots of matrix transposition when $P=2, Q=3$ and $M_{b}=N_{b}=6$. The small slanted number in the upper left corner in each block represents processor identification number. One wrapped block diagonal is transposed in each step. The darkly shaded area represents blocks to be shifted, and lightly shaded area stands for their transpositions.
loop compute the value of $K$.

## 3.2 $P$ and $Q$ : not relatively prime

In the previous section, we have investigated the case when $P$ and $Q$ are relatively prime, which involves complete exchange communication. In this section we consider the case of $G C D>1$. The former algorithm is a special case ( $G C D=1$ ) of this algorithm.

Figure 7 shows an example of transposing a $12 \times$ 12 matrix on a $4 \times 6$ template from the processor point-of-view. Each processor has its own $3 \times 2$ $(=L C M / P \times L C M / Q)$ array of blocks. The processors can transpose the matrix with $6(=L C M / P$.

```
\(\operatorname{DO} J=0, Q-1\)
    \(K=J\)
    WHILE \((\operatorname{MOD}(K, P) \neq 0)\)
        DO \(K=\operatorname{MOD}(K+Q, L C M)\) END DO
    \(\mathrm{DO} I=0, P-1\)
        [Copy every ( \(K: N_{b}: L C M\) )-th wrapped
        diagonal blocks in \(P_{p, q}\) to T1]
        [ Move T1 from \(P_{p, q}\) to \(P\langle p+I, q-J\rangle\) ]
        [Copy the received T 1 to C ]
        \(K=\operatorname{MOD}(K+Q, L C M)\)
    END DO
END DO
```

Figure 6: A parallel matrix transpose algorithm from the matrix point-of-view, when $P$ and $Q$ are relatively prime. One diagonal block is transposed at one step. The 'While' statement should be executed until $\mathrm{MOD}(K, P)$ becomes 0 . (start : limit : intv) represents values of $x$, where $x=$ start + intv $y, y=0,1, \cdots$, and $x$ can't exceed limit.
$L C M / Q=L C M / G C D)$ communications steps. A processor $P\langle p, q\rangle$ starts to communicate with $P\langle\tilde{p}, \tilde{q}\rangle$, where $\tilde{p}$ and $\tilde{q}$ are computed from $p$ and $q$ (details are explained later of this section). Once $P\langle\tilde{p}, \tilde{q}\rangle$ is determined, it communicates with other processors, whose vertical and horizontal intervals are $G C D$ from $P\langle\tilde{p}, \tilde{q}\rangle$. The two loops of the algorithm in Figure 4 are changed from $Q$ and $P$ to $L C M / P$ and $L C M / Q$. The pseudocode of the algorithm is shown in Figure 8.

Figure 9 shows two snapshots of the same example, from the matrix point-of-view, to transpose the zeroth and the first diagonal blocks of $\mathbf{A}(\mathbf{A}(i, j)$, $\operatorname{MOD}(j-i, L C M)=0$ and 1 , respectively.) The pro-


Figure 7: A matrix transpose example on a $4 \times 6$ template.

```
PARDO K=1,GCD
    g=MOD}(q-p,GCD
    \tilde{p}=\operatorname{MOD}(p+g,P);\tilde{q}=\operatorname{MOD}(q-g,Q)
    DO J = 0,LCM/P-1
        DO I=0,LCM/Q-1
        [ Copy all blocks of A to T1
        required by }\mp@subsup{P}{{\hat{p}+I\timesGCD,\tilde{q}-J\timesGCD)}{}
        [Send T1 to P}\mp@subsup{P}{{\tilde{p}+I\timesGCD,\tilde{q}-J\timesGCD)}{}
        [Receive T2 from }\mp@subsup{P}{{\tilde{p}-I\timesGCD,\tilde{q}+J\timesGCD\rangle}{}
        [Copy T2 to C]
        END DO
    END DO
END PARDO
```

Figure 8: A modified matrix transpose algorithm from the processor point-of-view. Operations of $G C D$ groups of processors are overlapped.

(b) processor template
(a) LCM Blcok

Figure 9: A snapshot of matrix transposition for transposing the first wrapped block diagonals, when $P=4$, $Q=6$ and $M_{b}=N_{b}=12$. In this example, transposing of even numbered wrapped block diagonals can be overlapped with that of odd numbered.


Figure 10: Matrix transposition when $P=Q=$ $G C D=3$. Processors transpose $3(=G C D)$ diagonal blocks at one step, so that the transposition is done in one step.

```
PARDO \(K=1, G C D\)
    \(g=\operatorname{MOD}(q-p, G C D)\)
\(\tilde{p}=\operatorname{MOD}(p+g, P) ; \tilde{q}=\operatorname{MOD}(q-g, Q)\)
    DO \(J=0, L C M / P-1\)
    \(K=J\)
    WHILE (MOD \((K-g, P) \neq 0)\)
        DO \(K=\operatorname{MOD}(K+Q, L C M)\) END DO
    DO \(I=0, L C M / Q-1\)
        [ Copy every ( \(K: N_{b}: L C M\) )-th diagonal
        blocks in \(P(p, q\rangle\) to T1]
        [ Move T1 from \(P\langle p, q\rangle\) to
        \(P(\tilde{p}+I \times G C D, \tilde{q}-J \times G C D\rangle]\)
        [Copy the received T1 to C]
        \(K=\operatorname{MOD}(K+Q, L C M)\)
    END DO
END DO
END PARDO
```

Figure 11: A modified matrix transpose algorithm from matrix point-of-view. $G C D$ diagonal blocks are transposed simultaneously.
cessors which have the blocks to send out are shaded at the bottom. In the example, only $P \cdot Q / G C D$ processors are involved in block communication in each step. Processors are split into $G C D$ groups of processors, and a processor $P\langle p, q\rangle$ belongs to a group $g$ if it has the same value of $g=\operatorname{MOD}(q-p, G C D)$. Processors in a group $g$ send and receive their blocks to other processors in another group $g^{\prime}=\operatorname{MOD}(G C D-g, G C D)$. The operations of each group can be overlapped.

The problem is interpreted from the matrix point-of-view. In general, for transposing the $k$-th diagonal block of $\mathbf{A}(\mathbf{A}(i, j), \operatorname{MOD}(j-i, L C M)=k)$, a group of processors $g_{k}=\mathrm{MOD}(k, G C D)$ send the blocks to another group $g_{k}^{\prime}=\operatorname{MOD}\left(G C D-g_{k}, G C D\right)$. Since the operations are overlapped over different groups of processors, processors transpose $G C D$ diagonal blocks simultaneously. So, the matrix can be transposed with $L C M / G C D$ steps. For the extreme case of $P=Q=$ $G C D=3$ as shown in Figure 10, processors transpose $3(=G C D)$ diagonal blocks at one step. That is, the transposition is done in one step. A processor $P(p, q)$ exchanges data with processor $P\langle q, p\rangle$. The pseudocode of the algorithm from the matrix point-of-view is shown in Figure 11. The code includes the case of $G C D=1$.

| 96 processors |  | 64 processors |  | 48 processors |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $P \times Q$ | Time | $P \times Q$ | Time | $P \times Q$ | Time |
| $6 \times 16$ | 0.404 | $4 \times 16$ | 0.596 | $4 \times 12$ | 0.652 |
| $8 \times 12$ | 0.330 | $8 \times 8$ | 0.572 | $6 \times 8$ | 0.546 |
| $12 \times 8$ | 0.307 | $16 \times 4$ | 0.475 | $8 \times 6$ | 0.527 |
| $16 \times 6$ | 0.381 |  |  | $12 \times 4$ | 0.547 |

Table 1: Dependence of performance on template configuration for fixed number of processors ( $M=N=$ 2400, Unit:second).

## 4 Results

In this section we present performance results of the parallel matrix transpose algorithms on the Intel Touchstone Delta computer. The performance of the transpose algorithms cannot be represented in floating point operations per second (flops), since there is no multiplications or additions in the transpose algorithms. The algorithms are combined with a matrix multiplication routine in the PUMMA to compute $\mathbf{C}=\alpha \mathbf{A}^{T} \cdot \mathbf{B}^{T}+\beta \mathbf{C}$ in two steps $(\mathbf{T} \Leftarrow \alpha \mathbf{B} \cdot \mathbf{A}$; $\mathbf{C} \Leftarrow \mathbf{T}^{T}+\beta \mathbf{C}$ ). We assume that $\alpha=1$ and $\beta=0$ in our test. The performance of $\mathbf{A}^{T} \cdot \mathbf{B}^{T}$ is compared with that of $\mathbf{A} \cdot \mathbf{B}$.

Matrix elements are generated uniformly on the interval $[-1,1]$ in double precision. Conversions between measured runtimes and performance in gigafiops (Gflops) are made assuming an operation count of $2 M N L$ for the multiplication of a $M \times L$ by a $L \times N$ matrix. In our test examples, all processors have the same number of blocks so there is no load imbalance. The algorithms were implemented with force type communication [8].

First, we considered how, for a fixed number of processors $N_{p}=P \times Q$, performance depends on the configuration of the processor template. Some typical results are presented in Table 1 for a fixed number of processors. In the test, the block size is fixed at $5 \times 5$ elements. It may be seen that the template configuration does have some effect on performance. The performance difference is between 19 and $24 \%$. For rectangular templates with different aspect ratios, the algorithm prefers those with small $Q$ to those with small $P$. On the Delta, communication speed along vertical links seems faster than along horizontal links.

Figures 12 and 13 show the performance of the routines on $15 \times 16(G C D=1$, i.e., $P$ and $Q$ are relatively prime), and $16 \times 16(P=Q=G C D=16)$ templates, respectively. In all cases the block size is


Figure 12: Performance comparison of $\mathbf{A} \cdot \mathbf{B}$ and $\mathbf{A}^{T} \cdot \mathbf{B}^{T}$ on $15 \times 16$ template. $(P=15, Q=16$, and $G C D=1$ ). $\mathbf{C} \Leftarrow \mathbf{A}^{T} \cdot \mathbf{B}^{T}$ is implemented in two steps, $\mathbf{T} \Leftarrow \mathbf{B} \cdot \mathbf{A}$, and then $\mathbf{C} \Leftarrow \mathbf{T}^{T}$.


Figure 13: Performance comparison of $\mathbf{A} \cdot \mathbf{B}$ and $\mathbf{A}^{T}$. $\mathbf{B}^{T}$ on $16 \times 16$ template. $(P=Q=G C D=16)$.
fixed at $5 \times 5$ elements. The solid and the dashed lines show the performance of $\mathbf{A}^{T} \cdot \mathbf{B}^{T}$ and $\mathbf{A} \cdot \mathbf{B}$, respectively. The difference of the two lines shows the loss of performance due to matrix transposition.

The transposed multiplication routine shows good performance relative to matrix multiplication. The loss of performance due to the matrix transpose routine is about 2 or $3 \%$. The transpose routine has excellent performance if $P$ and $Q$ are relatively prime. In other cases ( $G C D \geq 2$ ), network congestion may degrade the performance of the routine.

Table 2 shows how the block size affects the performance of the algorithms. It includes three cases of the block size, two extreme cases - the smallest and largest

| $P \times Q$ | Matrix Size | Block Size | Time |
| :---: | :---: | :---: | :---: |
| $12 \times 16$ | $4800 \times 4800$ | $5 \times 1$ | 1.280 |
|  |  | $5 \times 5$ | 0.893 |
|  |  | $100 \times 100$ | 0.882 |
| $14 \times 16$ | $5600 \times 5600$ | $1 \times 1$ | 1.484 |
|  |  | $5 \times 5$ | 1.193 |
|  |  | $50 \times 50$ | 1.161 |
| $15 \times 16$ | $6000 \times 6000$ | $1 \times 1$ | 1.740 |
|  |  | $5 \times 5$ | 1.437 |
|  |  | $25 \times 25$ | 1.426 |
| $16 \times 16$ | $6400 \times 6400$ | $1 \times 1$ | 1.967 |
|  |  | $5 \times 5$ | 1.967 |
|  |  | $400 \times 400$ | 1.967 |

Table 2: Dependence of performance on block size.

| $P \times Q$ | Matrix Size | $\mathbf{A} \cdot \mathbf{B}(\%)$ | $\mathbf{A}^{T} \cdot \mathbf{B}^{T}(\%)$ |
| :---: | :---: | :---: | :---: |
| $1 \times 1$ | $500 \times 500$ | $36.70(100.0)$ | $35.04(100.0)$ |
| $12 \times 16$ | $6720 \times 6720$ | $32.09(87.4)$ | $31.64(90.3)$ |
| $14 \times 16$ | $6720 \times 6720$ | $32.52(88.6)$ | $32.11(91.6)$ |
| $15 \times 16$ | $7200 \times 7200$ | $32.78(89.3)$ | $32.43(92.6)$ |
| $16 \times 16$ | $8000 \times 8000$ | $31.22(85.1)$ | $30.38(86.7)$ |

Table 3: Performance per node in Mflops. Block size is fixed to $5 \times 5$ elements. $1 \times 1$ template gives performance of assembly-coded matrix multiplication. Numbers in parentheses represent efficiency compared with the performance on 1 processor.
possible block sizes - and $5 \times 5$ block of elements. If $P=Q$, processors directly copy all blocks at once, so block size does not affect the performance. For the case of the smallest block size ( $1 \times 1$ element) when $P \neq Q$, processors make a copy element by element, so it takes a little more time to make a copy. The routines with the smallest block sizes are slower than those with the largest possible block sizes by between $15 \%$ and $31 \%$. This difference is negligible, compared with the total elapsed time of the matrix multiplication.

Performance per node is shown in Table 3. The $1 \times 1$ template gives the performance of the assembly-coded level 3 BLAS matrix multiplication routine for the two cases $\mathbf{A} \cdot \mathbf{B}$ and $\mathbf{A}^{T} \cdot \mathbf{B}^{T}$. Processors have about $85 \%$ efficiency for $\mathbf{A} \cdot \mathbf{B}$, and $87 \%$ for $\mathbf{A}^{T} \cdot \mathbf{B}^{T}$ if $P=Q=16$. The routines perform better on templates for which $P \neq Q$. Processors achieve about $89 \%$, and $93 \%$ of efficiency for each case if $P$ and $Q$ are relatively prime.

## 5 Conclusions and Remarks

We have presented parallel matrix transpose algorithms based on the block scattered decomposition. The algorithms have good performance for arbitrary processor configurations on the Intel Delta computer.

If $P$ and $Q$ are relatively prime, the transpose routine involves complete exchange communication on a two-dimensional template. We have approached this complicated problem with a direct point-to-point communication scheme (see Section 2). When $P$ and $Q$ are not relatively prime $(G C D>1)$, the processors' operations are overlapped over different groups, so that only $L C M / G C D$ communications are required.

In our Fortran implementation, we assume that the first dimension of the matrix may be different from the number of rows of the matrix in a processor. Even when $P=Q$, the processor needs to copy blocks of $A$ to a communication buffer before sending, and copy the received buffer to blocks of $\mathbf{C}$ after receiving.

The parallel matrix transpose algorithms have been combined with matrix multiplication routines. The integrated routines comprise a general-purpose matrix multiplication package, called PUMMA [5], for MIMD message-passing computers. The package has good performance for a wide range of decomposition parameters, that is, its performance depends weakly on processor configuration and block size.

The PUMMA package is currently available only for double precision real data, but will be implemented in the near future for other data types, i.e., single precision real and complex, and double precision complex. To obtain a copy of the software and a description of how to use it, send the message "send pumma from misc" to netliboornl.gov.

## Acknowledgments

The authors would like to thank Eduardo D'Azevedo at ORNL for his helpful suggestions to improve the quality of the paper. This research was performed in part using the Intel Touchstone Delta System operated by the California Institute of Technology on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided through the Center for Research on Parallel Computing.

## References

[1] N. G. Azari, A. W. Bojanczyk, and S.-Y. Lee, Synchronous and Asynchronous Algorithms for Matrix Transposition on MCAP, SPIE Vol. 975, Advanced

Algorithms and Architecture for Signal Processing III, pp.277-288, 1988.
[2] S. H. Bokhari and H. Berryman, Complete Exchange on a Circuit Switched Mesh, Proceedings of the 1992 Scalable High Performance Computing Conference, IEEE Press, pp.300-306, 1992.
[3] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker, ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers, Proceedings of Fourth Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia), IEEE Computer Society Press, Los Alamitos, California, October 19-21, pp.120-127, 1992.
[4] J. Choi, J. J. Dongarra, and D. W. Walker, The Design of Scalable Sofiware Libraries for Distributed Memory Concurrent Computers, Proceedings of Environment and Tools for Parallel Scientific Computing Workshop, (Saint Hilaire du Touvet, France), Elsevier Science Publishers, September 7-8, pp.3-15, 1992.
[5] J. Choi, J. J. Dongarra, and D. W. Walker, PUMMA : Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers, Technical Report TM-12252, Oak Ridge National Laboratory, Mathematical Sciences Section, August, 1993.
[6] J. J. Dongarra, R. van de Geijn, and D. Walker, A look at Scalable Linear Algebra Libraries, Proceedings of the 1992 Scalable High Performance Computing Conference, IEEE Press, pp.372-379, 1992.
[7] J. O. Eklundh, A Fast Computer Method for Matrix Transposing, IEEE Transactions on Computers, Volume 21, pp.801-803, 1972.
[8] Intel Corporation, Touchstone Delta Fortran Calls Reference Manual, April, 1991.
[9] S. L. Johnsson and C.-T. Ho, Algorithms for Matrix Transposition on Boolean $N$-cube Configured Ensemble Architecture, SIAM J. Matrix Anal. Appl, Volume 9, No 3, pp.419-454, July, 1988.
[10] D. P. O'Leary, Systolic Arrays for Matrix Transpose and Other Reorderings, IEEE Transactions on Computers, Volume 36, pp.117-122, January, 1987.

## DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.


[^0]:    * This work was supported in part by DARPA and ARO under contract number DAAL03-91-C0047, and in part by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with the Martin Marietta Energy Systems, Inc.

