The performance data presented here were collected by Patrick H. Worley on the AlphaServer SC and IBM SP systems at Oak Ridge National Laboratory in September 2000, using version 940.2a of LS-DYNA from Livermore Software Technology Corporation and data sets provided by the Computational Materials Science Group at Oak Ridge National Laboratory. Two crash simulations were used.
- A car-to-barrier crash simulation employing 125K finite elements.
- A car-to-car crash simulation employing 250K finite elements.
Performance is reported as seconds of simulated (model) time completed per day of wall-clock time, extrapolated from timing runs of 0.0025, 0.005, and 0.01 model seconds. A few longer simulations were also timed to validate the extrapolation, and the agreement was very good. Even these "short" runs take 7-8 hours on one processor.
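The source does not state how the extrapolation was carried out. One plausible approach, sketched below, is to fit a straight line to wall-clock time versus simulated time and convert the slope into simulated seconds per day. The function names and the timing values are hypothetical and are not measurements from these experiments.

```python
SECONDS_PER_DAY = 86400.0

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def simulated_seconds_per_day(model_seconds, wall_seconds):
    """Extrapolate throughput from short timing runs.

    The slope is wall-clock seconds per simulated second; the intercept
    absorbs fixed startup cost, so it is excluded from the rate.
    """
    _, slope = fit_line(model_seconds, wall_seconds)
    return SECONDS_PER_DAY / slope

# Hypothetical timings (NOT data from the study), one per run length.
durations = [0.0025, 0.005, 0.01]      # simulated (model) seconds
wall = [7300.0, 14500.0, 28900.0]      # wall-clock seconds

print(simulated_seconds_per_day(durations, wall))
```

Using the slope rather than a single run's ratio keeps any fixed startup cost from biasing the extrapolated rate, which is one reason to time several run lengths.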
The performance results are graphed as functions of the number of 4-way SMP nodes. Typically, applications do not share nodes, so even if an application does not use all of the processors in a node, the entire node is "consumed". Thus node count is the relevant coordinate when comparing these IBM and Compaq systems. Note that the "0" node value corresponds to the serial performance, i.e., the performance when using one processor in one SMP node.
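The node accounting described above can be made concrete with a minimal sketch. It assumes processes are packed onto whole nodes at a fixed number of processes per node; the function and constant names are illustrative only.

```python
PROCESSORS_PER_NODE = 4  # both systems are described as having 4-way SMP nodes

def nodes_consumed(processes, processes_per_node):
    """Whole nodes occupied when each node runs `processes_per_node` processes."""
    if not 1 <= processes_per_node <= PROCESSORS_PER_NODE:
        raise ValueError("processes_per_node must be between 1 and 4")
    if processes < 1:
        raise ValueError("processes must be positive")
    # Ceiling division: a partly filled node is still fully consumed.
    return -(-processes // processes_per_node)

print(nodes_consumed(12, 4))  # 3
print(nodes_consumed(12, 3))  # 4
print(nodes_consumed(12, 2))  # 6
```

The same 12-process job therefore occupies anywhere from 3 to 6 nodes, which is why node count rather than process count is used as the coordinate when comparing the systems.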
As these are fixed-size problems, the exploitable parallelism is limited and scaling degrades as the parallel overheads grow relative to the parallel computation. The overhead costs are also affected by the number of processors used per SMP node, since multiple processes per node compete for node memory bandwidth and access to the interconnection network. Consequently, it is sometimes more efficient to use fewer than all of the processors in a node. Finally, the finite element mesh is relatively unstructured, and the load balance and associated communication costs vary with the number of processors employed. This results in somewhat erratic performance behavior as a function of the node count.
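One common way to quantify this kind of scaling degradation for a fixed-size problem is parallel efficiency: the speedup over the serial rate divided by the number of processors. The sketch below is not taken from the original analysis, and the rates are hypothetical placeholders rather than measured values.

```python
def speedup(serial_rate, parallel_rate):
    """Ratio of parallel to serial throughput (simulated seconds per day)."""
    return parallel_rate / serial_rate

def parallel_efficiency(serial_rate, parallel_rate, processors):
    """Speedup per processor; 1.0 would mean perfect scaling."""
    return speedup(serial_rate, parallel_rate) / processors

# Hypothetical throughputs (NOT data from the study).
serial = 0.03
rates = {4: 0.11, 16: 0.36, 64: 0.90}  # processors -> simulated seconds per day

for processors, rate in sorted(rates.items()):
    eff = parallel_efficiency(serial, rate, processors)
    print(processors, round(eff, 2))
```

Efficiency that falls as processors are added is the signature of overheads growing relative to the useful parallel work.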
Note that all I/O was isolated to the beginning and end of the benchmark runs. While this does not reflect how the code is used in production, it allows us to eliminate differences in the performance due to differences in the I/O systems on the respective platforms. Moreover, runs with typical I/O requirements are not significantly slower than these "minimal I/O" benchmark runs.
![]()
![]()
The AlphaServer SC has a consistent advantage over the SP for small node counts for both simulations, running up to 40% faster in the most extreme case. However, beyond 10 nodes, the comparison becomes sensitive to the node count and the simulation, with each platform being better for specific instances. In this regime, the communication costs and load imbalances are constraining performance, and raw processor performance is not the primary determining factor.
The next four graphs examine the performance in more detail. For each experiment and each system, the performance data for a fixed number of processors per node is plotted as a function of the number of nodes.
![]()
![]()
![]()
![]()
From these data it is clear that the performance behavior is significantly different on the two systems. On the IBM, 4 processors per node is competitive for all node counts, with 3 processors per node being slightly preferred for the largest numbers of nodes. In contrast, on the Compaq, 4 processors per node is the worst choice for any but a very small number of nodes. Three processors per node is the best choice up to 20 nodes, while 2 processors per node is optimal for higher node counts. Understanding why the Compaq is unable to use all 4 processors in a node successfully could lead to significant improvements in LS-DYNA performance on this system.
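A comparison of this kind can be automated by picking, for each node count, the processes-per-node setting with the highest throughput. The sketch below assumes the measurements are stored in a dictionary keyed by (nodes, processes per node); the layout, function name, and sample values are all hypothetical.

```python
def best_processes_per_node(results):
    """Map each node count to its best (processes_per_node, rate) pair.

    `results` maps (nodes, processes_per_node) -> throughput, where a
    higher throughput (simulated seconds per day) is better.
    """
    best = {}
    for (nodes, per_node), rate in results.items():
        if nodes not in best or rate > best[nodes][1]:
            best[nodes] = (per_node, rate)
    return best

# Hypothetical throughputs (NOT data from the study).
sample = {
    (8, 2): 0.40, (8, 3): 0.52, (8, 4): 0.47,
    (32, 2): 0.95, (32, 3): 0.90, (32, 4): 0.70,
}

for nodes, (per_node, rate) in sorted(best_processes_per_node(sample).items()):
    print(nodes, per_node, rate)
```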
For a different perspective, the next four graphs plot performance as a function of the number of processors (not nodes). The number of processors determines issues such as load imbalance and interprocessor communication. The only performance differences due to the number of processes per node should be the relative cost of internode and intranode communication (better for larger numbers of processes per node) and contention for access to the interconnect (better for smaller numbers of processes per node). Two different experiments were run on the AlphaServer SC when using 4 processors per node, one with the default environment variables and one with
setenv LIBELAN_WAITTYPE 1500
As can be seen, this has a significant impact on performance for this case. However, it has little impact when using 1, 2, or 3 processes per node.
![]()
![]()
![]()
![]()
These results confirm the previous conclusions. On the IBM, performance is primarily a function of the number of processors, not the number of processors per node. In contrast, beyond 20 processors, there is a significant penalty for using all 4 processors in a node on the AlphaServer SC, especially when using the default communication library environment variables. Even 3 processors per node shows measurable performance degradation. Given that the AlphaServer SC interconnect outperforms the IBM interconnect, this behavior appears to indicate a performance bug of some sort.