The forward and inverse Legendre transforms are
respectively. For the forward Legendre transform, each depends only on data associated with the same wavenumber m, and so depends only on data assigned to a single processor column. Each processor in that column can calculate independently its contribution to , using data associated with the latitudes assigned to that processor. To finish the calculation, these P contributions need to be summed, and the result needs to be rebroadcast to all P processors, since spectral coefficients are duplicated within the processor column. To minimize communication costs, local contributions to all spectral coefficients can be calculated first, leaving a P-way vector sum (made up of the local contributions to all of the spectral coefficients assigned to this processor column) and rebroadcast to be calculated. This motivates naming this approach the vector sum algorithm.
The column-wise vector sum is a separate step in the algorithm, and the communication is not overlapped with computation. But there are sophisticated techniques for calculating the vector sum that effectively minimize both the communication cost and the associated parallel computation cost. Currently we use a variant of the recursive halving algorithm .
For the inverse transform, calculation of requires only spectral coefficients associated with wavenumber m, all of which are local to every processor in the corresponding processor column. Thus, no interprocessor communication is required in the inverse transform.
In summary, using the vector sum algorithm to compute the Legendre transforms incurs no additional computational cost, is perfectly parallel with good load balance within a processor column, and requires interprocessor communication in only the forward transform. Moreover, this communication can be implemented very efficiently. Furthermore, few modifications to CCM2 were required to implement this algorithm in PCCM2.
The disadvantages of the vector sum algorithm are that all computations within the spectral domain must be calculated redundantly (in the processor column), the communication in the forward Legendre transform can not be overlapped with communication, and additional storage is required to hold the duplicated spectral coefficients. Since relatively little work is done in the spectral domain in CCM2, this redundant work has not proved to be an issue, and the vector sum has proved to be a viable parallel algorithm for PCCM2. For a more detailed discussion of these issues see [14,46,10,15,21].