Given this decomposition, the physics computations are independent between processors, and no interprocessor communication is required. However, much of the physics is related to solar radiation, so there is a significant load imbalance between night and day grid points. To alleviate this, each processor swaps half of its grid points with the processor in the same row holding grid points that are 180 degrees away in longitude, swapping them back when the physics computations are complete.
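The effect of the swap can be illustrated with a minimal sketch of a single processor row. Everything named here is an assumption made for illustration and is not taken from CCM/MP-2D: the helper names, the even number of processors per row, the equal contiguous longitude blocks, and the crude rule that a point is in daylight when it lies within 90 degrees of a subsolar longitude.

```python
def partner(col, ncols):
    """Column of the processor whose longitudes are 180 degrees away (ncols assumed even)."""
    return (col + ncols // 2) % ncols


def block_longitudes(col, ncols, nlon):
    """Longitudes (degrees) of the contiguous block owned by processor `col`."""
    per_proc = nlon // ncols
    return [360.0 * i / nlon for i in range(col * per_proc, (col + 1) * per_proc)]


def is_day(lon_deg, subsolar_lon_deg=0.0):
    """Hypothetical daylight test: within 90 degrees of the subsolar longitude."""
    diff = (lon_deg - subsolar_lon_deg + 180.0) % 360.0 - 180.0
    return abs(diff) < 90.0


def after_swap(col, ncols, nlon):
    """Points held during physics: first half of own block, second half of partner's block."""
    own = block_longitudes(col, ncols, nlon)
    other = block_longitudes(partner(col, ncols), ncols, nlon)
    half = len(own) // 2
    return own[:half] + other[half:]


def day_counts(blocks):
    """Number of daylight points in each processor's working set."""
    return [sum(is_day(lon) for lon in block) for block in blocks]


NCOLS, NLON = 8, 64  # illustrative sizes only
before = [block_longitudes(c, NCOLS, NLON) for c in range(NCOLS)]
after = [after_swap(c, NCOLS, NLON) for c in range(NCOLS)]
print("day points per processor before swap:", day_counts(before))
print("day points per processor after swap: ", day_counts(after))
```

Because each processor pairs with the one half a row away, a block that is entirely in daylight is always matched with one that is entirely or mostly in darkness, so every working set ends up with roughly half day and half night points. Swapping back afterwards is the same exchange applied a second time.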
The semi-Lagrangian algorithm also uses the physical grid. For each grid point, a trajectory is calculated back in time, to determine what grid cell to use to interpolate the current values. This calculation is independent between grid points, but the data needed to calculate the trajectories and to interpolate the fields may not be local to the processor holding the grid point. The current parallel algorithm fills halo regions of sufficient thickness around each patch that, once filled, all needed information is local to each processor. Typically, this only requires communication with nearest neighbors in the logical processor grid. However, near the poles the halo region for a patch must include the entire polar cap. This requires communication between all processors assigned patches near the pole, resulting in a load imbalance in the cost of filling the halo regions between the polar and equatorial processors.
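The halo idea can be sketched in one dimension. The example below is a simplified illustration, not the CCM/MP-2D implementation: it assumes a periodic 1D grid split into equal patches, a constant wind expressed as a Courant number (grid cells travelled per time step), and linear interpolation. All function names are hypothetical. The polar cap case, where the halo must span many patches, is not modeled.

```python
import math


def fill_halos(patches, halo):
    """Extend each patch with `halo` points from its left and right neighbors (periodic)."""
    if halo < 1 or any(len(p) < halo for p in patches):
        raise ValueError("halo must be between 1 and the patch length")
    n = len(patches)
    return [patches[(p - 1) % n][-halo:] + patches[p] + patches[(p + 1) % n][:halo]
            for p in range(n)]


def advect_patch(ext, halo, courant):
    """One semi-Lagrangian step on a halo-extended patch, using local data only."""
    if math.floor(abs(courant)) + 1 > halo:
        raise ValueError("halo too thin for this Courant number")
    out = []
    for i in range(halo, len(ext) - halo):
        x_dep = i - courant          # departure point, in local index units
        j = math.floor(x_dep)        # grid cell containing the departure point
        w = x_dep - j                # linear interpolation weight
        out.append((1.0 - w) * ext[j] + w * ext[j + 1])
    return out


def advect_global(field, courant):
    """Reference: the same step on the undecomposed periodic field."""
    n = len(field)
    out = []
    for i in range(n):
        x_dep = i - courant
        j = math.floor(x_dep)
        w = x_dep - j
        out.append((1.0 - w) * field[j % n] + w * field[(j + 1) % n])
    return out


field = [math.sin(2.0 * math.pi * i / 16) for i in range(16)]
patches = [field[i:i + 4] for i in range(0, 16, 4)]
extended = fill_halos(patches, halo=2)
local = [v for ext in extended for v in advect_patch(ext, halo=2, courant=1.5)]
print(max(abs(a - b) for a, b in zip(local, advect_global(field, 1.5))))
```

In this sketch the required halo thickness grows with the Courant number. On a latitude-longitude grid the physical distance between longitude points shrinks toward the poles, which is one way to see why a halo that is thin near the equator may have to cover the whole polar cap.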
Two different approaches are supported in CCM/MP-2D for computing the FFTs used in the spectral transform method: distributed and transpose. The distributed algorithm computes the FFT using the given domain decomposition, communicating between processors in the same row to share data and intermediate results. The transpose algorithm "rotates" the domain decomposition within a processor row, undecomposing the longitude coordinate, and decomposing over the vertical levels and the different fields. Using this scheme, each processor has a set of independent FFTs to calculate. When the transforms are complete, the rotation is reversed, undecomposing the vertical levels and the fields, and decomposing over the wavenumber coordinate.
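The data movement of the transpose algorithm can be sketched as follows. This is an illustration under stated assumptions rather than the CCM/MP-2D code: the processor row is simulated with Python lists, each (vertical level, field) pair is flattened into a single "line" index, a naive DFT stands in for the FFT, no wavenumber truncation is applied, and all function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform, standing in for an FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose_to_lines(lon_blocks):
    """lon_blocks[p][line] is processor p's longitude block of that line.
    Returns, per processor, {line: full longitude row} for its share of the lines."""
    nproc = len(lon_blocks)
    per = len(lon_blocks[0]) // nproc
    out = []
    for p in range(nproc):
        mine = {}
        for line in range(p * per, (p + 1) * per):
            row = []
            for q in range(nproc):      # gather one piece from every processor in the row
                row.extend(lon_blocks[q][line])
            mine[line] = row
        out.append(mine)
    return out


def transpose_to_wavenumbers(spectra):
    """Reverse rotation: every processor ends up with its wavenumber block of every line."""
    nproc = len(spectra)
    out = [dict() for _ in range(nproc)]
    for owned in spectra:
        for line, spectrum in owned.items():
            per = len(spectrum) // nproc
            for q in range(nproc):
                out[q][line] = spectrum[q * per:(q + 1) * per]
    return out


def transform_row(lon_blocks):
    """Transpose, transform each owned line independently, then transpose back."""
    lines = transpose_to_lines(lon_blocks)
    spectra = [{line: dft(row) for line, row in owned.items()} for owned in lines]
    return transpose_to_wavenumbers(spectra)
```

A possible use, with two processors, four lines and eight longitudes, would build `lon_blocks[p][line]` by slicing each full row into two halves and then call `transform_row(lon_blocks)`. The middle step involves no communication at all, which is the point of the rotation: the cost is concentrated in the two transposes.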
The Legendre transform used in the spectral transform is approximated by Gauss quadrature for each spectral coefficient. Each processor computes its contributions to these integrals, and a collective summation of the contributions over each column of processors is used to complete the computation. The parallel summation algorithm used in the Legendre transform replicates the spectral coefficients assigned to a given column of processors over all processors in the column. This redundancy results in duplicate work in spectral space, but allows the inverse Legendre transform to be computed without further interprocessor communication. Given the relatively small amount of time spent in spectral space computation, this is often a cost-effective tradeoff.
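The structure of this computation can be sketched in a simplified setting. The example below is not the CCM/MP-2D algorithm: it assumes a single zonal wavenumber (m = 0), so ordinary Legendre polynomials replace the associated Legendre functions, it simulates one processor column with Python lists, and all function names are hypothetical. Each "processor" forms partial quadrature sums over its own Gaussian latitudes, a simulated collective sum replicates the totals, and the inverse transform is then evaluated locally.

```python
import math


def legendre(n_max, x):
    """Values P_0(x) .. P_n_max(x) from the three-term recurrence."""
    p = [1.0, x]
    for n in range(1, n_max):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[:n_max + 1]


def gauss_legendre(n):
    """Gauss-Legendre nodes and weights on [-1, 1], found by Newton iteration."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))   # standard initial guess
        for _ in range(50):
            p = legendre(n, x)
            dp = n * (x * p[n] - p[n - 1]) / (x * x - 1.0)
            dx = p[n] / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        p = legendre(n, x)
        dp = n * (x * p[n] - p[n - 1]) / (x * x - 1.0)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def partial_coeffs(f_vals, nodes, weights, n_max):
    """One processor's contribution to each spectral coefficient."""
    acc = [0.0] * (n_max + 1)
    for f, x, w in zip(f_vals, nodes, weights):
        p = legendre(n_max, x)
        for n in range(n_max + 1):
            acc[n] += 0.5 * (2 * n + 1) * w * f * p[n]
    return acc


def column_allreduce(partials):
    """Simulated collective sum: every processor in the column receives the full totals."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def inverse_local(coeffs, nodes):
    """Inverse transform at a processor's own latitudes, using only replicated coefficients."""
    return [sum(a * p for a, p in zip(coeffs, legendre(len(coeffs) - 1, x)))
            for x in nodes]
```

Since every processor in the column holds the complete set of coefficients after the summation, `inverse_local` needs nothing from its neighbors. This mirrors the tradeoff described above: spectral-space work is duplicated in exchange for a communication-free inverse transform.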
For more details on the parallel algorithms used in CCM/MP-2D see
Patrick H. Worley (worleyph@ornl.gov)