I finally finished the chunk experiment. To remind you,I completed a
physics/dynamics split based on ccm3.10.46, and looked at a very simple
chunking algorithm. For physpkg, each latitude line is divided into runs
of contiguous longitudes of length <= pcols <= plon. These chunks
are assigned to the same processes that own the corresponding latitudes
in dynpkg, so no interprocessor communication is required. The only costs
are the local copy costs. (Note that the data structures and logic also
support
chunks that span latitude lines, but I did not examine that option
in these experiments.)
Two advantages are possible from this chunking:
a) improved performance in the physics due to better serial performance for certain loop lengths and improved load balance in OMP parallel DO loops (from having more, smaller, tasks).
b) additional processors can be used for physpkg, allowing more than 64 processors to be used.
The point of the exercise is to
i) verify that chunking works as expected
ii) identify what performance improvements are realizable using only this approach, and to what extent load imbalance and 2D parallelization of the dynamics is required to effectively use more than 32 processors.
I present a lot of data in what follows. For easier navigation,here are the conclusions:
1) Chunking works, providing improved serial and load balancing performance
and improved scalability.
2) The I/O costs are approximately 10 secs/day, independent of the
number of processors. This is 25% of the total run time for 64 processors
for the minimum timing.
3) Beyond 64 processors (16 MPI processes in the optimal algorithm),
the cost of dynpkg is greater than that of physpkg, and is growing. A more
efficient (2D parallelization?) is required for further scaling.
4) Tphysbc and Tphysac scale extremely well. However, LSM does not,
primarily because of very poor load balance with the default latitude decomposition.
Conclusion: Chunking is a good first step, but load balancing and a
more efficient dynpkg are required to
achieve any significant increase in performance and scalability.
--------------- Details ------------------------------
I ran 10 day experiments on the Compaq AlphaServer SC
for a range of processor numbers and numbers of OMP threads.
The AlphaServer is a cluster of 4-processor SMP nodes,
so up to 4 OMP threads can be used.
Results are described for accumulated time spent in
1) full run: inital, inti, initext, intht, stepon
2) stepon
3) physpkg
4) dynpkg
5) stepon I/O and statistics calculations:
wshist, write_restart, wrapup,
engystats
The time spent in physpkg and dynpkg is sensitive to load imbalance.
While not the full story, I give both the min and max times
over all of the processes for these values as max:min. Note that
the sum of the physpkg and dynpkg timings for a given process is
invariably a constant over all processes. That is, a low physpkg
timing results in a high dynpkg timing, as the process waits
in dynpkg for the other processes to cath up.
--------------------------- Expt. 1 ------------------------------
Using the standard full latitude line chunks and a pure MPI
implementation gave the following results:
seconds per day
Nodes
4
8 16
32 64
(1) 105 80
65
(2) 103 78
64
(3) 58:39 35:22
21:13
(4) 34:53 34:47
33:41
(5) 9
9 9
16 nodes corresponds to 64 processors, so this is the maximum
number that can be used when using all processors in a node
and not using OpenMP.
--------------------------- Expt. 2 ------------------------------
Using the standard full latitude line chunks and 4 OMP threads
per node. Thus each node corresponds to a single MPI process.
OMP loops are over latitude in both dynpkg and physpkg.
seconds per day
Nodes
4
8 16
32 64
(1) 100 64
44 43
43
(2) 99 63
43 42
42
(3) 65:50 36:24
19:12 18:10
18:9
(4) 24:39 18:29
14:22 14:22
15:24
(5) 9
9 9
9 9
While we can use up to 64 nodes, 32 nodes only has 2 latitudes
to be parallelized over per node, and 64 nodes only has 1
latitude per parallel DO loop, so there is no advantage to
using more than 16 nodes.
--------------------------- Expt. 3 ------------------------------
The optimal results are presented when varying number of threads
per node (1,2, or 4) and the chunk size (1,2,4,8,16,32,64,128).
seconds per day
Nodes
4
8 16
32 64
(1) 90 58
40 36
37
(2) 89 57
39 35
36
(3) 55:45 30:21
16:11 11:7
10:5
(4) 24:34 18:26
14:19 14:18
17:21
(5) 9
9 9
9 9
Optimal algorithm:
threads 4 4
4 4
4
pcols 16 16
16 32
32
The optimization process improves the performance compared
to that when chunks are not used, but it does not improve the
scalability a great deal. There is a constant 1 sec/day
from the initialization and 9 sec/day for the
I/O that are relatively insensitive to the number of nodes used.
The physpkg cost does decrease with the number of nodes,
but dynpkg cost eventually starts increasing. In particular,
beyond 16 nodes, more time is spent in dynpkg than in physpkg.
Once sulfer cycle and other additional physics are
introduced into the model, this will not be as limiting,
but it is definitely a factor with this version of the code.
--------------------------- Expt. 4 ------------------------------
The optimal algorithms were rerun with some additional
instrumentation in physpkg:
(6) copy costs introduced to support physics/dynamics split
(much of which can be removed once the split is
done right).
(7) tphysbc
(8) lsm
(9) tphysac
These are also sensitive to load imbalance, and the values
corresponding to the min and max physpkg load imbalance are described.
Note that this is not the max/min imbalance for tphysbc, for
example, but rather the values for the processes demonstrating
the max/min physpkg imbalance.
seconds per day
Nodes
4
8 16
32 64
(1) 90 58
40 36
37
(2) 89 57
39 35
36
(3) 55:45 30:21
16:11 11:7
10:5
(4) 24:34 18:26
14:19 14:18
17:21
(5) 9
9 9
9 9
(6) 7.6 3.2
1.2 .6
.3
(7) 31:33 15:16
7:8 3:4
2:2
(8) 13:1 10:.3
7:.4 5:.3
5:.3
(9) 2:2 1:1
.5:.5 .3:.3
.1:.2
Thus, the copy costs, tphysbc, and tphysac scale very well.
The bottleneck is LSM. Note that I did not change MPT (the
determiner of the "chunk" parameter in LSM), so this might
be optimized to improve the performance. However, the load
imbalance is very bad with the typical latitude assignment.
As this is a static assignment issue, a static load balance
algorithm for LSM should improve the scalability of physpkg.
--------------- Other Comments -----------------------------
Experiments using CRM indicated that chunk size was
crucial for getting good performance on the IBM. I
have not had time to run these experiments there, and
do not intend to at this time. Instead, I will try to
get the modifications checked into the CCM 3.11 trunk
and examine performance in the context of the current
version of the code.