Optimization of Communications towards Scalable Algorithms on Post Petascale Supercomputers

Kengo Nakajima (The University of Tokyo, Japan)

Parallel multigrid method is expected to be a powerful tool for large-scale computations, but includes both of serial and parallel communication processes which are generally expensive. The serial communication is the data transfers through memory hierarchies of each processor, while the parallel one is by message passing between computing nodes using MPI. This presentation summarizes recent efforts of optimization of serial and parallel communications in parallel MGCG (conjugate gradient with multigrid preconditioning) solvers with geometric multigrid procedures using up to 4,096 nodes (65,536 cores) of Fujitsu PRIMEHPC FX10 [1]. Performance of both of flat MPI and HB MxN (M: number of threads on each MPI process, N: number of MPI processes on each node) has been evaluated. In the present work, new format for sparse matrix storage based on sliced ELL, which has been well-utilized for optimization of SpMV, is proposed for optimization of serial communication on memories, and hierarchical coarse grid aggregation (hCGA) is introduced for optimization of parallel communication by message passing.

The parallel MGCG solver using the sliced ELL format provided performance improvement in both weak scaling (25%-31%) and strong scaling (9%-22%) compared to the code using the original ELL format. Moreover, hCGA provided excellent performance improvement in both weak scaling (1.61 times) and strong scaling (6.27 times) for flat MPI parallel programming model. hCGA was also effective for improvement of parallel communications. Computational amount of coarse grid solver for each core of flat MPI is 256 (=16x16) times as large as that of HB 16x1. Therefore, hCGA is expected to be really effective for HB 16x1 with more than 2.50x10^5 nodes of Fujitsu FX10, where the peak performance is more than 60 PFLOPS. CGA and hCGA include a various types of parameters, and the optimum values of those were derived through empirical studies in the present work. Development of methods for automatic selection of these parameters is also an interesting technical issue for future work. Optimum parameters can be estimated based on calculation of computational amounts, performance models, parameters of hardware, and some measured performance of the system. But it is not so straightforward. Because some of these parameters also make effects on convergence, construction of such methods for automatic selection is really challenging.

[1] Nakajima, K., Optimization of Serial and Parallel Communications for Parallel Geometric Multigrid Method, Proceedings of the 20th IEEE International Conference for Parallel and Distributed Systems (ICPADS 2014) (2014) 25-32.