Best Point-to-Point Intranode Message-Passing Performance
on the IBM p690 (32-way SMP "Turbo" node / 1.3 GHz POWER4 processor)
These results represent the highest measured bandwidths for each
of the experiments:
0 to 1
0 to 2
0 to 16
0 to 1, 2 to 3, ..., 30 to 31 simultaneously
0 to 16, 1 to 17, ..., 15 to 31 simultaneously
0 to 1 to 2 to ... to 31 to 0 simultaneously
0 to 1 in 8-processor LPAR
0 to 2 in 8-processor LPAR
0 to 1, 2 to 3, ..., 6 to 7 simultaneously in 8-processor LPAR
0 to 1 in four 8-processor LPARs (same p690) simultaneously
0 to 2 in four 8-processor LPARs (same p690) simultaneously
Each experiment uses the single-iteration, nonoverlap communication
protocols, both with and without cache invalidation.
Results are presented separately for bidirectional and
unidirectional protocols, using MPI.
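The multi-pair patterns in the experiment list can be written down as
(sender, receiver) rank pairs. The following is a minimal sketch, not the
benchmark's own code: the function names are hypothetical, and it assumes
the cross-node pattern pairs each rank i in the lower half with rank i + 16.

```python
def adjacent_pairs(n):
    """Pairs 0 to 1, 2 to 3, ..., n-2 to n-1 (n assumed even)."""
    return [(i, i + 1) for i in range(0, n, 2)]


def cross_half_pairs(n):
    """Pairs 0 to n/2, 1 to n/2+1, ..., n/2-1 to n-1 (assumed pairing)."""
    half = n // 2
    return [(i, i + half) for i in range(half)]


def ring_pairs(n):
    """Ring 0 to 1 to 2 to ... to n-1 to 0: each rank sends to its successor."""
    return [(i, (i + 1) % n) for i in range(n)]


if __name__ == "__main__":
    # 32-way node patterns
    print(adjacent_pairs(32))
    print(cross_half_pairs(32))
    print(ring_pairs(32))
    # 8-processor LPAR pattern
    print(adjacent_pairs(8))
```

In an MPI program, each rank would look itself up in such a list to decide
whether it sends, receives, or (in the ring) does both.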
Note that the unidirectional bandwidth is also the "bidirectional"
bandwidth when using the optimal unidirectional protocol to complete
a swap. In some situations, this gives a larger bandwidth (a faster
protocol) than using the optimal bidirectional protocol.
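The arithmetic behind this note can be sketched as follows. This assumes
(it is not stated on this page) that the bandwidth of a swap is the total
number of bytes exchanged divided by the total time, and that the swap is
completed as two back-to-back one-way transfers. The function name and the
numbers in the usage line are hypothetical.

```python
def swap_bandwidth_sequential(n_bytes, uni_bandwidth):
    """Effective bandwidth of a swap done as two sequential one-way sends.

    Assumption: swap bandwidth = total bytes exchanged / total time.
    """
    one_way_time = n_bytes / uni_bandwidth   # time for one direction
    swap_time = 2 * one_way_time             # second direction follows the first
    return (2 * n_bytes) / swap_time         # 2N bytes moved in total


if __name__ == "__main__":
    # Hypothetical values: 1 MB message, 500 MB/s one-way bandwidth.
    print(swap_bandwidth_sequential(1_000_000, 500e6))
```

Under these assumptions the result equals the one-way bandwidth, so a
bidirectional protocol is only the faster way to swap if its measured
bandwidth exceeds the best unidirectional figure.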
MPI on 32-processor "Turbo" node
Bandwidth for Bidirectional Protocols
Bandwidth for Unidirectional Protocols
MPI on 8-processor LPAR (within a 32-processor "Turbo" node)
Bandwidth for Bidirectional Protocols
Bandwidth for Unidirectional Protocols
MPI on four 8-processor LPARs (within a single 32-processor "Turbo" node)