
PSTSWM on the Cray X1


SSP and SMP Experiments

These experiments look at performance when using a single SSP (single-streaming processor) instead of an MSP (multi-streaming processor), and when running multiple instances of the job on the same SMP node. On the X1, this means solving 4 instances of the same problem on all 4 MSP processors or 16 instances on all 16 SSP processors in an SMP node. We also include data from running 4 instances of the same problem on all 4 SSP processors that make up a single MSP. The intent is to determine whether memory contention between the different processes degrades performance. 16 MByte pages are used in all experiments described here.

The first set of graphs compares performance when using a single MSP processor, a single SSP processor, 4 MSP processors, 4 SSP processors, and 16 SSP processors when the number of vertical levels is specified at runtime.

From these data, there is no significant per-processor performance difference between running with just one MSP and with four MSPs, indicating no performance degradation due to memory contention. In contrast, there is a small performance degradation when using four SSPs instead of one, and an even greater degradation when using sixteen SSPs instead of one. The performance degradation for the SSP experiments is likely due to contention for access to both Ecache and main memory.
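One way to quantify the degradation described above is as the fractional loss in per-processor performance relative to the single-processor run. The following is a minimal sketch only; the function name and the sample rates are hypothetical and are not measurements from these experiments.

```python
def contention_loss(single_rate, concurrent_rate):
    """Fractional per-processor performance loss when other processes
    share the node, relative to running a single process alone.

    Both arguments are per-processor rates in the same units
    (for example, MFlop/s per processor). A result of 0.0 means no
    measurable contention; 0.25 means each processor runs 25% slower.
    """
    if single_rate <= 0:
        raise ValueError("single_rate must be positive")
    return (single_rate - concurrent_rate) / single_rate


# Hypothetical placeholder rates, NOT data from the graphs.
alone = 100.0      # one process on the node
sixteen_up = 80.0  # per-processor rate with 16 concurrent processes
loss = contention_loss(alone, sixteen_up)
print(f"per-processor loss: {loss:.0%}")
```

Applied to the graphs, a value near zero corresponds to the MSP result (one versus four MSPs), while the SSP runs would give values that grow from the 4-SSP case to the 16-SSP case.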

The next graph shows the per-processor performance ratio between experiments using a single MSP and a single SSP, between a single MSP and 4 SSP processors, and between 4 MSP processors and 16 SSP processors, for problems with 18 vertical levels. It is meant as a visual aid for the following discussion comparing the performance of the MSP and SSP experiments. Similar results hold for the other numbers of vertical levels, and the earlier graphs can also be used to estimate the ratios.

The single-MSP performance is not four times greater than the single-SSP performance, typically ranging from 2 to 3 times greater, so there is some inefficiency in the compiler's attempt to exploit all of the MSP hardware. As the single-SSP experiment does not have to share access to the Ecache or the main memory, it is perhaps more reasonable to compare data for the 4-SSP experiments with the 1-MSP experiments. In this comparison, the 1-MSP performance is typically 2 to 3.5 times better per processor. This is still less than the factor of four needed to match four SSPs, so by this throughput metric using an MSP is still not as effective as using the SSPs directly. In contrast, when comparing 4 MSPs with 16 SSPs, the MSP performance can be as much as 4.7 times the SSP performance, in which case the memory contention caused by assigning separate processes to SSPs outweighs the inefficiency of assigning a single process to an MSP.

Note that none of these experiments is definitive. Using SSPs directly in a parallel code requires more explicit parallelism, which often comes with its own overhead. Additionally, parallelizing a fixed-size problem across SSPs rather than MSPs assigns a smaller subproblem to each SSP than to each MSP. While this decreases memory contention for the SSP runs, it also decreases loop lengths, which likely offsets any gains in memory performance. Overall, our current feeling is that assigning a process to an MSP is likely to be the more efficient approach.
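The comparison against a factor of four can be made explicit. Since one MSP occupies the hardware of four SSPs, an MSP matches the aggregate throughput of its four SSPs only when the per-processor ratio reaches 4. The sketch below is illustrative; the function names are hypothetical, and the sample ratios are the endpoints quoted in the text above rather than values read from the graphs.

```python
SSPS_PER_MSP = 4  # each X1 MSP is made up of 4 SSPs


def msp_relative_throughput(per_processor_ratio):
    """Throughput of one MSP relative to its 4 SSPs used separately.

    per_processor_ratio is (performance of one MSP process) divided by
    (performance of one SSP process). A result above 1.0 means the MSP
    delivers more work from the same hardware.
    """
    return per_processor_ratio / SSPS_PER_MSP


def preferred_mode(per_processor_ratio):
    """Return which mode gives higher throughput for a given ratio."""
    rel = msp_relative_throughput(per_processor_ratio)
    if rel > 1.0:
        return "MSP"
    if rel < 1.0:
        return "SSP"
    return "tie"


# Ratios quoted in the discussion above.
for label, ratio in [("1 MSP vs 4 SSPs, upper end", 3.5),
                     ("4 MSPs vs 16 SSPs, best case", 4.7)]:
    print(label, msp_relative_throughput(ratio), preferred_mode(ratio))
```

This captures only the raw throughput metric; it does not account for the extra overhead of explicit parallelism or the shorter loop lengths noted above.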

The second set of graphs compares performance when using a single MSP processor, a single SSP processor, 4 MSP processors, 4 SSP processors, and 16 SSP processors when the number of vertical levels is specified at compile time. The fourth graph, as above, shows the per-processor performance ratio between experiments using a single MSP and a single SSP, between a single MSP and 4 SSP processors, and between 4 MSP processors and 16 SSP processors, for problems with 18 vertical levels.

From these data, the same general conclusions hold as when the number of vertical levels is specified at runtime. The only slight change is that specifying the number of vertical levels at compile time improves the per-processor performance ratio of experiments using a single MSP relative to experiments using four SSP processors. The performance of the MSP experiments is still not four times better than that of the SSP experiments, but it is much closer.


   
 
URL: http://www.csm.ornl.gov/evaluation/PHOENIX/PSTSWM-ssp-smp.CRAYX1.html
Updated: Thursday, 19-Jun-2003 13:08:27 EDT
