Draft Draft Draft Draft Draft Draft Draft Draft Draft Draft Draft Draft
High Performance Atmospheric Model Software Design
February 12, 2001









Contents
1.0 Functionality Requirements
2.0 Software Engineering Design
Issues and Discussion
Appendix A:  Pseudo-Code Templates
Appendix B:  Class structure

This document describes the software requirements and design of the high performance atmospheric component (CCM4) for the Community Climate System Model (CCSM).

The major goal of the software design of the CCSM is to build a performance-portable model which can execute effectively on a range of platforms, in a variety of configurations (see Foster's DOE proposal).  The immediate target platforms for CCSM are the IBM SP, the SGI Origin, and networked clusters, but the design requirements must take into account the overall goal to maintain performance portability across scalable parallel supercomputers, commodity clusters, shared memory multiprocessors, and vector supercomputers.

To address the overall goals of the CCSM for the atmospheric component, the requirements have been separated into two broad categories: functionality requirements and software design.  Generally speaking, the  functionality describes what the atmospheric component must do; these are the design goals.  The scientific (and numerical) aspects of the atmospheric model will be described in the usual documentation of the CCM4.     The software design addresses how the design goals are to be met and the functionality requirements are to be provided.  The conventions on coding style and the specification of interfaces, user defined data types, constructs for performance on vector, shared memory and destributed memory computers are described.
The testing procedures required for maintence of code correctness in the repository, along with a description of the build scripts used for testing and running the code on the target platforms, are also described.

1  Functionality Requirements


1.1 Scalable Parallel Design and Performance Portability
A scalable parallel design using a mix of distributed- and shared-memory parallelism is important to maximize throughput and scientific productivity using the target platforms.  The design must allow for adequate granularity in the parallel decomposition to effectively utilize the scalable, high performance computers.  Of course, the design must pay special attention to effectiveness on machines emphasizing low-resolution ensemble runs for DOE applications.  Decomposing the spatial domains in two or more directions can yield the required granularity.  The current assumption of this programming model is that the "inner" directions will be handled in shared memory using the standard OpenMP multitasking directives and the "outer" directions of the decomposition will be implemented for distributed-memory message passing.  Thus a two-dimensional decomposition for the atmospheric code could be achieved by decomposing the latitudes in distributed memory and within each vertical level using shared memory.  Other decompositions may yield better performance, especially where load balancing is of concern.  The parallel decomposition can be different for different dynamical cores. A different decomposition for the calculation of the physical parameterizations (column physics) may also be advantageous.  Certainly, to achieve performance portability across the target architectures, a degree of flexibility is a necessary design goal.
     Requirement: The code will  be configurable for high throughput on a variety of high performance parallel platforms.
      Specific:

1.2 Swapable Dynamics
The need for "swappable dynamics" has been recognized and recommended by a grass roots movement to promote a national modeling infrastructure.  The design should share common goals with the Common Modeling Infrastructure Working Group (http://nsipp.gsfc.nasa.gov/infra), but will contribute as an example of the kind of design that can be done without sacrificing quality of simulation or performance.
     Requirement:  The software will implement a dynamical core interface that allows alternative dynamics packages to be incorporated.
     Specific: 1.3 Dynamical Cores
Three dycores will be implemented in CCM-4, the Eulerian-Spectral, the Semi-Lagrangian-Spectral, and the Lin-Rood Lagrangian-Finite Volume dynamical core.  The Lin-Rood method has been under development at NASA and will be extended for high performance parallel architectures.

1.3.1  Eulerian-Spectral Dynamical Core
The Eulerian-spectral dynamical core will be restructured to eliminate the strict dependence on a latitudinal parallel decomposition.  The parallel algorithm for the spectral transformations will exploit both a distributed-memory (MPI) parallelism as well as a shared-memory (OpenMP) parallelism.  First, the data transposition software will be customized and optimized to map distributed dynamics data to distributed physics data structures (see below for segmented physics data layout).  The transpose software will be adapted from the methods used in the 2D parallel decomposition of the PCCM-3 developed by Drake et. al. (1995), to a generalized decomposition.  These data transpositions will take account of the physical space data as well as Fourier and spectral coefficient data. The dynamics data decomposition will focus on a segmented implementation (same as physics) and a level-fields decomposition that maintains contiguous memory for the transforms. The segmented implementation will transpose and decompose data to calculate the Fourier and Legendre transforms.  That way the FFTs can effectively use library software on a computational node and take advantage of the peak shared-memory performance of the platform.  The segment size will be adjustable to hit maximum cache performance as well as to expose parallelism on a reduced grid for multitasking and message passing.  The standard 1D parallel decomposition of CCM-3 will thus be a special case and vector lengths that support good vectorization will be a runtime option. The minimum design point for the CCM-4 is 26 levels and several more fields.  Shared-memory nodes with 4-16 processors are the order of the day, so we expect a 2-D lat-lon decomposition to be efficient for the parallel spectral transform. Both the physics (segmented) and the dynamics data decompositions work well with a "reduced grid" for the spectral dynamics.  In fact, the generalization from a simple 1-D or 2D parallel decomposition is necessary for proper load balancing of the computation on a reduced grid.  In both decompositions, the number of gridpoints calculated by a compute node can be evenly distributed.  The segmented decomposition can use segment size to accommodate fewer grid points per latitude as the poles are approached.  With a segmented decomposition the number of points can be approximately balanced. After the transpose and new parallel spectral algorithms are in place in the dynamical core, the semi-Lagrangian transport of the tracer fields will be addressed. This involves the same considerations and strategies as for the semi-Lagrangian-Spectral dynamical core.  Exploiting this overlap, we hope to implement two dynamical cores for the effort of something less than two.
     Requirement:  An Eulerian-Spectral dynamics package will implement a scalable parallel algorithm in accordance with the dynamical core interface rules.
     Specific:

1.3.2 Semi-Lagrangian-Spectral Dynamical Core
The semi-Lagrangian parallel algorithms are shared between all three dynamical cores.  We will coordinate the development and implementation of supporting routines and libraries.  For horizontally decomposed data, like the segmented physics structure, a set of halo update routines will be adapted from the PCCM-3 and the Lin-Rood developments.  The halo region contains data from "neighboring"nodes that is required to complete the particle-tracking step of the semi-Lagrangian algorithm and for interpolation of advected fields.  On a reduced grid, the halo region update requires a level of indirect addressing since the neighboring points are no longer identified simply by their indices. But the assumption of tensor product grids and interpolations based on these grids makes this tractable.  The halo region must be updated every timestep, but by dynamically monitoring flow directions and magnitudes, the amount of communication required can be minimized. The decomposition by lat and lon described in the Eulerian-Spectral dynamical core, will also work well for the semi-Lagrangian aspects of the dynamics.  Halo regions will  be required for the horizontal directions and communication costs of vertical advection are minimized .  The polar region likewise requires special treatment, and shared-memory parallelism can be easily implemented for the calculation within a layer or (local)collection of layers.
   Requirement:  A semi-Lagrangian-Spectral dynamics package will implement a scalable parallel algorithm in accordance with the dynamical core interface rules.
   Specific:


1.3.3  Lin-Rood Dynamical Core
The CCM-4 model now includes the recently developed Lin-Rood dynamical core (Lin and Rood, 1996; Lin and Rood, 1997; Lin, 1997; Lin and Rood, 1998; and Lin and Rood, 1999). The basic algorithms are derived and evolved from the modern high-resolution finite volume algorithms pioneered by van Leer (1977) and Colella and Woodward (1984), which are one-dimensional algorithms designed primarily for astrophysical and aerospace engineering applications requiring the resolution of sharp gradients (e.g., shocks). Details of the Lin-Rood dynamical core can be found in a recently produced algorithm theoretical basis document (DAO 2000).  In summary, its unique attributes are the following:

The semi-Lagrangian aspects are especially important in the east-west direction by enabling accurate simulations near the polar areas without the need for large reductions in the time step. The development of the Lin-Rood dynamical core will provide a scalable parallelization based on a 2D (latitude-vertical) domain decomposition of the physical space.  Given the flux form semi-Lagrangian formulation in the east-west direction and the floating Lagrangian vertical coordinate, blocks of latitude and altitude regions will be distributed to computational nodes.  The entire length of cells on a longitude will be contained in the shared memory of a computational node. Shared-memory (OpenMP) parallelism will be exploited along this longitudinal direction. Transposing the data for this decomposition has been shown to be efficient and software to achieve these tasks can be jointly developed with the other two dynamical cores.  The software for updating of the north-south direction halo regions can also be developed jointly with the other dynamical cores.
   Requirement:  A Lin-Rood dynamics package will implement a scalable parallel algorithm in accordance with the dynamical core interface rules.
   Specific:


1.4  Parallel Physics: Data Decomposition and Code Rework
The parallel decomposition of the parameterization package, the column physics, is a key to high performance of the atmospheric model.  Since each vertical column of the atmosphere can be computed independently, there is abundant parallelism for a scalable algorithm.  As chemistry and more comprehensive physical parameterizations are included in the model, this parallelism only becomes easier to exploit. Two considerations are very important, however, and will be addressed as part of this proposal.  First is the issue of node performance on the physics computation.  At an appropriate time in the development of CCM-4, we will rework the parameterization package to support a general segmented data decomposition.  The required rewrite will change the parameterization package so that it works on a collection of columns, with inner loop indexing running from an ibegin to iend.  Currently, the inner loops assume that an entire latitude slice is being processed with the ending index variable to accommodate a reduced grid. We will add the beginning index construction. The current choice is excellent for node performance on vector machines but requires an extremely large cache for the types of computer architectures prevalent within the DOE community. The segmented data decomposition is thus a generalization of the 1D and 2D parallel decompositions, containing both as a special case.  In addition, it is a simple decomposition in that all the details of the column location can be hidden.  It will thus simplify the coding style and readability of the code.  Cache utilization and local shared-memory multitasking will be supported with this decomposition of the physical space so that the scientists developing parameterizations will not need to multitask their parts of the code.  These basic code modifications will be done as soon as reasonably possible given the evolving parameterization code.  Since this portion of the code consumes a large fraction of the total compute resources it will be best to address it early in the planning and development process. The second consideration is load balancing.  Especially for reduced grids a general data decomposition is required.  Since the local node will call the parameterization package with an arbitrary collection of columns, the column assignment can be made to effectively balance the load among the computational nodes.  The short wave radiation calculation is an example of the type of load imbalance that occurs.  By assigning an equal number of day columns to the nodes, the computational load can be equalized.  This assignment will be accomplished in the context of the data transpose steps.  The inclusion of load-balancing transpose routines will be accomplished in the third step of our phased development. The design of the column physics could require that location (lon,lat) is passed to the physics rather than assumed based on index values. This may be necessary to support the general load balancing and segmentations we are proposing.
     Requirement:  The physical paramerization packages will conform with column physics interface rules and will effectively utilize target machine memory heirarchies.  Load balanced decompositions will be supported.
     Specific:


1.5  Atmosphere - Land Surface Interface
The proposal for the redesign of the flux coupler will have an impact on the design of the atmospheric model.  In particular, the option for including the land surface model with the atmospheric model, using the same grid (as in PCM), will have implications for the structure of the atmospheric model. Implementing efficient parallel algorithms for the land surface model is tied strongly to the parallel data decomposition of the atmospheric model.  The code structure is problematic: it allows a separate executable for the LSM or an interface as part of the CCM-4 executable; it is not clear how to construct this interface except with a large block of ifdefs.  The design will be an early priority.
     Requirement:  The land surface package will execute in parallel with the atmospheric model or with the CSM coupler.

1.6 Validation and Testing
Finally, the design must include model validation and code testing procedures.  By model validation, we mean validation between different platforms where it is expected not to produce exactly the same numerical results.  One example of such a validation procedure compares the rate of separation between results started from the same initial conditions (see Rosinski and Williamson 1997).  Code testing procedures must include checks to see that the code is working as expected.  These may include performance expectations, numerical results, reproducibility and reconfigurability.  The inclusion of such procedures in a design document is important so that everyone understands what the rules are and how much leeway programmers have in making code modifications.
     Requirement:  Code entered in the CCM repository will be tested and validated according to the configuration management procedures.
     Specific tests:   All platforms on the target platform list must be checked with tests appropriate for the modification made.  The documentation for the change and the test results will be entered in a repository.

     Verification that code conforms to minimal coding standards will be checked using emacs, flint and by monitoring performance and memory size during validation tests.  The following checks will be performed associated with code configuration and repository managment. A more extensive procedure will be developed for code validation, recognizing that algorithmic and model changes require careful peer review.  To facilitate this review a variety of tests and error growth plots will be automatically generated with model build.  Any changes to the parameterization will produce slow error growth, or, less satisfactorily,  an option to bypass parts of code which produce fast growth.
 
 

1.7 Design, Evaluation and Review
The design document for the atmospheric component of CCSM should be thought of as a dynamic work in progress.   A process will be set in place, in conjunction with the CCSM Software Engineering Working Group and NCAR management, that will allow for design reviews both internally and externally.   An evaluation of how well the software meets the stated design goals and requirements will be conducted. With DOE, NSF and NASA participation in this project, an "open" design process will result with the design document as the public deliverable.  As the design evolves, we expect to identify further encapsulation in the software design, which will require interfaces to be documented and implemented.
     Requirement:  A CCM design document will be kept up to date and made available to collaborators for comment and revision in accordance with CCSM policy.
 

2. Software Engineering Design


2.1 Language and Coding Conventions
The approach taken will be oriented toward a high performance implemention but with a software organization that is object oriented.  The object orientated design drives many of the decisions on data structures, control structures and procedure encapsulation.  The Fortran90 standard will be followed subject to style and conventions guidelines described in http://www.cgd.ucar.edu/cms/ccm4/codingstandard.shtml.  Use of other languages (eg. C and C++) is limited to the utility and library layers and must be Fortran90 callable.  Note in particular, the rules for the development of stand-alone packages included in the CCM.  Packages should be documented and tested in accordance with configuration and repository management procedures.

To facilitate portability, the machine dependencies will be minimized and model configurations will be specified with runtime parameters.  Special installation features and non-uniformities will be accomodated in the build mechanism using "cpp" with "ifdefs" and "defines" in the source code.   The definitions will be integrated into the build mechanism.

The following conventions for "cpp" defines have been adopted.

PVP - parallel vector constructs
BAC - large cache constructs SPMD  - denotes message passing, distributed memory parallelism constructs
COUP_CSM - configure for coupled execution mode with the CSM
PLON - horizontal grid resolution, number of longitudes (see notes on runtime parameters)
PLAT - horizontal grid resolution, number of latitudes
PLEV - vertical grid resolution, number of vertical levels
PCOLS - maximum number of columns in physics "chunked" data structure SHELL_MSS - mass store shell command FORTRANUNDERSCORE - C-Fortran interface convention.  If anyone writes C or C++ code to interface with the model they will likely need to check this kind of variable.  The names should be defined consistently.

2.2 Packages and  Interfaces
The definition of a package is that it can be built and tested independently. Interfaces will be testable for performance and correctness using "unit testers". The packages and the interface routines constitute the model layer of a three layered software design. The model layer calls routines from a utility layer which may call routines from a library layer.  The design isolates the interface to installation specific math libraries or other optimized routines in the utility layer.  The distinction between the utility layer and the library layer will be flexible as we expect to define this interface to fit the typical installation of the code and to clarify the interface with special libraries.  The utility layer will be distributed with the code, while it is expected that the library layer will only contain default or filler routines.

The atmospheric model package will consist of five sub-packages


In keeping with the object orientation for the specification of interfaces, packages will communicate with each other through state objects, consisting of state data and methods for moving the data between package data structures.  The required methods may include changing the data decomposition, regridding, calculation of derived variables and conversion between states, tendencies or fluxes.  Interface routines will use inquiry functions to obtain data from the state structure and will pass arrays to the working code.

  2.2.1 Atmospheric Driver
The atmospheric driver will implement the control function for the dynamics and physical parameterization packages.
The "dynamical core" actually performs 2 nearly separate functions

Although these functions are closely related, they are performed fairly separately, in most "dynamical cores". A chemical transport model needs to replace the time stepping of the dynamics with reading of datasets, time interpolation, and a variety
of consistency requirements. The advection of tracers is unchanged. The interfaces inside the "core" are specified
so that only the dynamics could be replaced, or the "dynamical core" could be an offline advection module.

The atmospheric driver is described in psuedo-code in Appendix A.  The basic flow control is shown here.

   BASIC ATM DRIVER FLOW  CONTROL

   BEGIN TIME LOOP


  2.2.2 Dynamical Core Interface

Dynamical cores will provide a public type called state which holds (or returns upon query) the current values of the prognostic and diagnositc fields.  An interface routine will convert between the dynamical state and the physical parameterization state.  The state will hold information on distributed data decomposition which can be used for transpose and redistribution functions.

The dynamic variables needed for the physical parameterizations are  u, v, T, ps, phi, omega, q.  These constitute the phys%state.  The dyn%state consists of the following variables for each of the dynamical cores.
 

Variables required Eulerian spectral Semi-Lagrangian spectral Lin-Rood
state variables zeta, div (u,v,), T, lnps,
q, (phi_s)
u, v, (div), T (delT), lnps, q,
(phi_s, del phi_s)
u, v, theta_v, delta p, q, (phi_s)
interface transformations ps, phi, omega ps, phi, omega ps, phi, omega, T

(notation: del = horizontal gradient, delta = vertical difference)

The dynamical core takes as input the current state (which must be consistent with the state history of the dynamics or be an initial or restart condition).  On output, the dynamical core returns the updated state (and has updated the internal state history).  Thus the time stepping method, leap frog explicit, semi-implicit, etc, are considered part of the dynamical core.

In addition, to initialization routines, the dynamical core must also interface with restart logic.  Depending on the time stepping method, this may require saving state history and time integration information to the restart file.

  2.2.3 Physical Parameterization Interface

The physical parameterization package takes as input, the physics state defined on the physics grid.  The package produces the tendencies for u, v, T, ps, phi, omega, q.  The physics driver and the individual parameterization packages are given in pseudo-code in Appendix A and outlined in the following.

    PHYSICS DRIVER

    Input    T,u,v,ps,q+tracers on parameterization grid/distribution, location (lat,lon), basic time information, flag if restart data set should be produced this time step.

 (NOTE: dynamics package may provide the specification of the physics grid.  There is no compeling need for different dynamics and physics grids and efficiency and stability may be significantly enhanced if they are identical.)

   Loops over parameterization sub-steps (if desired)

   Calculates (via call to library routine) required time. For example, if parameterization time step equals overall time step, time n for 3-time-level scheme, n+1/2 for 2-time-level scheme if parameterization is sub-stepped,  time appropriate for that  sub-step.

   Calculates and passes tendencies of individual parameterizations to history buffer.

   Physics conservation initialization (column-wise since parameterizations are currently all column only) - column-wise provides more information than global, but requires more space. If initial values are saved for total tendency calculation they can be used for energy conservation calculation.

   Physics conservation check, fixers applied as required.

   Output net T,u,v,ps,q+tracers tendencies for total time step from all parameterizations on parameterization grid/distribution

Each parameterization is described by the following.

  INDIVIDUAL PARAMETERIZATIONS

  input T,u,v,ps,q. If parameterization works on some other variable like theta, parameterization does conversion. This could be changed to driver does conversion before and after. Which makes more sense depends on whether tendencies or updated variables are returned.

  output(to physics driver) = updated values (this could be changed to tendencies since someone must compute them for the history buffer, but which ever is chosen, all parameterizations must do the same.

 output(to restart buffer on command) time evolving data needed for restart
      NOTE: parameterization initialization component will need a restart component. But then if it needs restart data, it will need to initialize it somehow. So initialization routine will need a restart flag.

  output directly to history buffer
      The parameterization component of history buffer is distributed as parameterizations, makes dynamical load balancing more cumbersome since either buffer would need to be redistributed or data to buffer transposed.

  initialization from individual file, should not require specific order with respect to other parameterization. This makes NAMELIST a problem or requires a "rewind" of the input file.

The physics parameterizations work on an arbitrary collection of atmospheric columns indexed from 1 to ncols.  It is not assumed that these columns constitute a continguous set of longitutes or that they are all on the same latitude.  However,
the latitude and longitude coordinates of any given set of columns are available to the parameterization.

  2.2.4 Coupler and Surface Model Interface (see CCSM Coupler Design)
The CCSM coupler and surface model interface will also use interface routines to handle conversions between dynamical states and physical paramterization states as well as the requisite data transpositions and redistributions for distributed memory data structures.

  2.2.5 History File Interface (see CCSM I/O Design)
The physics and dynamics packages will interface directly with the I/O package for history output. The code will retain current approach (OUTFLD) until new version is completed.  The distributed data structures for saving history output coincide with the decomposition of the calling routine.  ThusOUTFLD must distinguish sizes and lable values correctly.  For example, the
dynamics can call OUTFLD with a latitude slice and lat index.  When a physical parameterization routine calls OUTFLD the chunk index is supplied.  Buffers indexed by lat and chunk are kept in place in their appropriate distributed location until a gather is peformed from the call to WSHIST.

History accumulation buffers distributed as in physics and as in dynamics use a time variable for labels, etc which is  independent of time variable in parameterizions.
 

  2.2.6 Restart/Regeneration Interface
The driver, dynamics and physics packages will interface with the restart/regeneration I/O package.  In particular,
the dynamics will provide a structure that contains everything needed to initialize and to read/write to a restart file.  It would only be used by the package that defines it and would contain all time history information necessary to initialize/restart the package.  Since this structure is separate from the other model interface structures it may be possible, with little extra effort, to run multiple instances of the dynamics package (or physics) simultaneously.

The physics restart structure would include absorptivities and emissivities as well as radiative heating rates.
 

 2.3 Utility Layer
  Several modules and data types are common to multiple packages and functions are used by multiple packages.  These shared data types are particularly important for the interface routines linking the packages and constitute the utility layer.
Currently, the atmospheric model makes no attempt to use the utilities available from the Model Coupling Toolkit or other intercomponent utility routines.  The use of these routines for intracomponent interfaces, though desirable in promoting a unified CCSM design, is subject to performance consideations.  Similarly, the employment of CCSM wide user defined types is desirable in the long run, but not implemented in this software design.  The user defined types described here are only applied to the atmospheric models.
 

  2.3.1 User Defined Types

The grid data types are fundamental to the numerics of the dynamical cores as well as to the data structures employed for the calculation of physical parameterizations.  The DYN_GRID and PHYS_GRID modules also hold the information on the parallel decompositions used for physics and dynamics.  These two modules provide interface information to DP_COUPLING module to implement the data transformations and transpositions.

The Lin-Rood dynamical core uses an equally spaced  Latitude-Longitude grid with staggering of grid variables to support the finite volume discretization.  The vertical grid is defined by the "floating" Lagrangian control volumes. The parallel decompositions supported for Lin-Rood are a 1-D latitude decomposition and a 2-D latitude-level decomposition. Transposition between these decompositions occurs internally to facilitate vertical coupling when levels are decomposed in the 2-D.

The Eulerian Spectral and Semi-Lagrangian Spectral dynamical cores use a Gauss grid and a reduced Gauss grid in the horizontal direction.  A hybrid sigma- pressure coordinate is used in the vertical direction.  All grids can be defined in a tensor product formulation which supports efficient interpolation schemes. The parallel grid point decompositions supported are a 1-D latitude decomposition and a 2-D lat-lon decompostion.

The grid structure used in the physical parameterization package is the chunked data structure.  A grid chunk consists of an arbitrary collection of atmospheric columns. The column index is first, level index is second, in each chunk and the number of columns may vary between chunks.  Chunk dimensioning is as(pcols,plev) and indexing from 1,...,ncol, 1,...,nlev.  The number of chunks of the decomposition is determined by performance considerations.  Both parallel granularity and performance of the MPI decomposition are supported by the chunking as well as local memory/cache utilization in the serial and OpenMP decomposition.

It is assumed that fields for the dynamical cores are defined over decomposed grids.  Thus the parallel grid decomposition is the same for the field decomposition.  That is, a field is defined on a grid.  The user defined field types in the dynamical core relate to the prognostic and diagnostic fields as well as restart and history I/O functions.

Dynamics state - contains the prognostic variables and is indexed by block as (i,k,j,lblk).  The lat index and block index are bounded using beglat:endlat and begblk:endblk.  ASo the block index is almost analogous with the chunk index.  A similar grid structure is implied.

Parameter values for blocks are set in PMGRID along with the other indexing information.  The PROGNOSTICS module has the dynamics state variable allocations and declarations.  DYN_GRID contains the dynamics grid information.

Some, though not all, fields are subject to the spectral transform methods.  Associated with these fields is a set of spectral coefficients, referred to as the associated spectral space.  Spectral space is decomposed for the calculation of the semi-implicit (Helmholtz) solve.  Vertical levels are not decomposed for this part of the calculation.

The regridding of fields to couple with the physical parameterization package is accomplished through the DP_COUPLING module.

The physical parameterizations packages represent fields with a particular index ordering, (i,k,lchnk).  The collection of these columns defines the physics field type.
 Physics state  - the fields input to the physical parameterization package are distributed across the processors so as to load balance the physics calculation.  The indexing is f(pcols,plev,*,pchnk) where the distribution is on only
the last index matching the decomposed grid structure.  This is a generalized 2-D decompostion since pchnk may equal plon x plat.  For a single processor vector machine pchnk might more effieciently be 1 and pcols = plon x plat.
 Physics tendency - fields provided back to the dynamics as tendency information.

The physics package "owns" the data type (defined in PHYS_TYPES and the PHYS_GRID module) in the sense of controlling allocation and managing pointers. Similarly, the dynamics packages own the dynamics grids and fields (as defined in PROGNOSTICS and the DYN_GRID module).

  2.3.2 Distributed Data Structures
The user defined field types described above are distributed data structures. The processor maps and accounting for message passing operations using these fields are contained in decomposition modules specific to each dynamical core and to
the physics.  The transformation and data transposition required to couple the dynamics package with the physics package are accomplished in the DP_COUPLING module.  This module interfaces at the package level, providing state and tendency information --a bundled set of fields.

  2.3.3 Methods and Procedures
Interpolation for the semi-Lagrangian Dynamical cores requires halo update operators for the distributed fields.
Interpolation operators are provided by the dynamics packages using the grids and fields.  These are necessary for the semi-Lagrangian interpolation and (possibly different operators) for the interpolation to fields defined on the observational grid.
Interpolation is also required to move between non-uniform and uniform or centered and staggered grids.

 2.3.4 Use of CCSM Utilities
The utility layer defined in the atmospheric model is currently seperate from the intercomponent utility layer of the CCSM.  In particular, the model coupling toolkit has not been utilized or extended to support the three dimensional field transformations and regriddings required by the atmospheric models utility layer.  As performance may suffer from the unification of the utility layers, this step will be approached cautiously in the future.

2.4 Parallelism
To exploit clusters of shared memory multiprocessors a hybrid distributed/shared memory programming paradigm will be adopted.  In general, the distributed data decompositions will be among the nodes ofthe cluster and shared memory will be expoited within a multiprocessor node.  This implies that the distributed memory message passing constructs occur at a higher level in the call tree than the shared memory constructs.  For example, the multiple Fourier transforms in the spectral method are decomposed by latitude and level via a distributed data transpose.  A shared memory optimized FFT library performs the multiple (local) Fourier transforms on a node.
  2.4.1 Shared Memory Constructs
OpenMP directives will be used to specify shared memory parallel tasks.  Shared memory parallelism on loops should be specified at a high level in the call tree and isolated in driver, utility and library routines.
  2.4.2 Message Passing Constructs
MPI standard message passing calls will be used to implement distributed memory parallelism. The parallel data decomposition and associated processor mappings are associated with the data objects.  Message passing constructs should be used only in driver, utility or library routines with the data decompostion and distributed data transposition occuring at a high level in the call tree.
2.5 Library Layer
  2.5.1 Math Libraries
Library routines for single processor (or multitasked) spherical harmonic transforms, ordered, real Fast Fourier transforms, random number generation, solution of a general dense linear system with condition number estimation, scalar (dot) product of vectors, vector exponential, vector logrithm, vector trig functions, vector square root.
  2.5.2 I/O Libraries
The NetCDF library is required for reading initial conditions and for output of histories.
  2.5.3  DataTranspose and Transformation Libraries
The data transpose routines are used by the driver and interface routines by calls to shared module procedures.  Thus the driver or interface routines may require a data transpose of a dynamics field to a physics field.  The shared module procedures are available to effect the transposition calling these libraries.
 2.5.4 Simulation Calendar Routines
The PHYSICS DRIVER receives basic calendar and simulation time information (not sure how to do this if parameterizations sub-step).  History tape requires time and date information for header variables. This should be independent of time information for parameterization.

2.6 Utilities
  2.6.1 Vector routines
Vector masks for conditionals and index manipulations will include intmax, ismax, ismin, isrchfgt, isrchfle, wheneq, whenfgt, whenfle, whenflt, whenne.
  2.6.2 Timing and performance monitoring
Time of day stamps, a wall clock time, second timer will be included.   For performance monitoring in a MPI/OpenMP (hybrid) mode, these timers must be thread safe and provide some syncronization method.  More specific hardware performance monitoring routines and routines that diagnose parallel performance will be included in the utility layer so that instrumentation is properly encapsulated.

  2.6.3  Production managment
Routines that return the time remaining in a run and other allocation/queue parameters.

2.7 Build Mechanism
A portable build mechanism is used to configure, compile and test the code.  The runtime configurable parameters include the number of processors and parallel decomposition.  Compile time configurable parameters include the model resolution, the segment size, the floating point precision and processor characteristics.

2.8 Tests Before Code Checkin

Standard NAMELIST and *.h files and datasets will be created for these tests
 

DEFINITIONS
NCAR production machines = SGI, IBM, Compaq, or local versions
NCAR development machines = Sun, Linux PC
default model = 30-level T42 full physics (+NDENS = 1)
bit-for-bit = two day run with multiple advected and non-advected
              constituents

              Entire standard daily averaged history tape checked.
                 e.g. UNIX diff on the output from 2 ncdump commands,
                 or Rosinski's cprnc program on history tapes.

Basic difference growth test
         perform two runs plotting the point by point, mass weighted, rms difference of temperature as function of time
             1) one two day run writing out T,PS every timestep (nstep = 144)
                a) set PERGRO namelist flag to true
                b) #define PERGRO in misc.h
             2) second two day run writing out T,PS every timestep (nstep = 144)
                a) set PERGRO namelist flag to true
                b) #define PERGRO in misc.h
             3) run Rosinski's cprnc program on output history tapes.
             4) using output from cprnc plot the difference of the mass weighted rms value of temperature.
 

Perturbation growth on single machine:
         One run 'as is'
         Second run with
           set PERTLIM namelist variable to bit level perturbation
                   (1.e-14 on ieee machines)

Difference growth test on different machines:
         Two runs, each on a different machine.  For this test the PERTLIM namelist flag should not be set.  The perturbation is generated from running on different platforms using different libraries and compilers.
 

Difference growth test of changed code
         One run with previous TAG code
         Second run with new code
         For this test the PERTLIM namelist flag should not be set.  The perturbation arises from code differences.
 

Above two tests are successful if
         the difference grows no faster than a minimal perturbation on the single machine with smallest mantissa.  Curves should look very similar to plot produced by difference error growth test on  same machine.

Always enable the OpenMP directives when testing

Following tests are for 'code' changes, as opposed to 'algorithm'changes in the physics or dynamics. The correctness of the latter must be verified by the scientists somehow. There is no simple, safe strategy to test 'algorithm' changes. 'Code' changes and 'algorithm' changes should be isolated.

BASIC TESTS (default model)

o Compile default model on all NCAR production and development machines.
  (Compilers are still flaky enough that bugs are shaken out this way)

o Restart matches a continuous initial run bit-for-bit on all production and development machines. (Continuous 2 day run, restart day 1).   If the radiation code has been modified, also test restart from a non-radiation timestep.

o Different domain decompositions match bit-for-bit.

o Updated version compared against previous TAG
   1) bit-for-bit,
           a) one dynamical core on NCAR production and development machines
           b) three dynamical cores on one machine
   2) Difference growth test on failures of above.  Error Growth on same
      and two different production platforms.
   3) Perform basic timing and memory checks to make sure new code
      does not adversely affect performance or memory footprint.
      ( less than 10% increase over previous tag usually ok.)

BASIC TESTS (adiabatic version)

o Updated version compared against previous TAG
   1) bit-for-bit,
            a) one dynamical core on NCAR production machines (and occasionally on additional machines)
            b) three dynamical cores on one machine
   2) Difference growth test on failures of above.

ADDITIONAL TESTS

o If new fields are required on the initial dataset, test that model starts correctly from initial history tape written by the model
  (write_inithist.F90).

o If any mods were made to any dynamics code, the other two cores  should also be checked.  If appropriate, make sure the code changes  also percolate into the appropriate dynamics side of the CVS   directory tree (i.e. if you made a change to eul/inidat.F90, also make  the same change to sld/inidat.F90 and lr/inidat.F90).

o If any mods were made to a single branch of a cpp #define the other branch(s) should also be checked.  One example is making a change to code within a #define SPMD branch.  The non-SPMD branch should also be checked and modified accordingly.  Checking usually entails running a start/restart test exercising all branches of the cpp #define.

o Make sure code runs in both Hybrid MPI/OpenMP and just OpenMp (non spmd) mode.  The production configuration is hybrid mode but for development and debug purposes the non-spmd version of the model must work as well.

o Before code is checked in:
        1) Make sure code adheres to CCM coding standard.
        2) Check code with flint tool.
        3) Make note of any major timing or memory penalties and associated tests.
        4) Update documentation (Users Guide).
 

Issues and Discussion
 

From Byron:

I would like to raise 3 issues that we should address in the design
document.

----------------------------
1. The difference between a GCM with interactive chemistry and a
chemical transport model are relatively minor. We should encompass
both in the design.

 The "dynamical core" actually performs 2 nearly separate functions
   a. time stepping the dynamics
   b. advecting tracers

Although these functions are closely related, they are performed
fairly separately, in most "dynamical cores".

A chemical transport model needs to replace the time stepping of the
dynamics with reading of datasets, time interpolation, and a variety
of consistency requirements. The advection of tracers is unchanged

We could try to specify the interfaces inside the "core" more clearly,
so that only the dynamics could be replaced, or we could merely note
that the "dynamical core" could be an offline advection module.
 
 

----------------------------
2. Section 2. Package Interfaces

I think that we should more cleanly separate initializtion and restart
functions for the packages.

One way to do this would be to require each package to take an INOUT
state structure (object?) as an argument. This structure would contain
everything that the package needed to initialize and to read/write to
a restart file. It would be used only by the package
(dynamics/physics) which defines it, but would contain all time
history information.

        The input and output of the packages for use by
        other parts of the model would be contained in separate
        structures.

As in Suarez' GEMS package, multiple models could be
advanced in parallel by using multiple instances of the state
structures.

The physics state would include
    absorptivities and emissivities,
    radiative heating rates

----------------------------
3. The design (2.2.2) states that the time stepping method is considered
part of the dynamical core.

An interesting problem is posed by the cloud water. This
parameterization maintains global arrays of its outputs [T,q,cldwater]
to determine the change by all other processes between calls to cloud
water. The output values are passed back in on the next appropriate
time step. With leap frog dynamics, the appropriate step is 2 steps
later and 2 time levels of stored values are required.

Where should the storing and cycling of these arrays be handled? They
have nothing to do with the dynamics, except that knowledge of the
time stepping procedure is required.
 
 

  From:
       Shian-Jiann Lin <lin@dycore.gsfc.nasa.gov>

I agree with Byron's concept about a "dynamics module" for
CTM. Inside the lr dynamics, we actually have a pure "dynamics"
module, a "transport module", and finally a mapping module (to
go from Lagrangian to the Eulerian Eta coordinate). They are
cleanly seperated. However, to maintain transport consistency and
mass conservation, both winds and mass fluxes are needed as input to the
"transport module". So, the input to the "transport module" should be
"dynamics" (i.e., input met data) dependent.

I only have a comment to Byron's last issue.
I think we may eventually want to have cloud water advected (Phil
seems to agree). So cloud water will be carried and transported
by the dynamics. This needs to be taken into design consideration, in
addition to the [T,q,cldwater] values after cloud scheme.

Appendix A:  Pseudo Code Template
Code Template for the Atmospheric Model

--------------------------------------------------------------
------------------- Main Program -----------------------------
--------------------------------------------------------------
initialize

time loop

     physics(phys_state_phys_decomp, phys_tend_phys_decomp)

     p_d_decomp(phys_tend_phys_decomp, phys_tend_dyn_decomp)

     p_d_transformation(phys_tend_dyn_decomp, dyn_tend_dyn_decomp)

     dynamics(dyn_state_dyn_decomp)

     d_p_transformation(dyn_state_dyn_decomp, phys_state_dyn_decomp)

     d_p_decomp(phys_state_dyn_decomp, phys_state_phys_decomp)

end time loop
 
 

States:
dyn_state_dyn_decomp: us,vs,thetas,ps,qs on dynamics staggered grid
phys_state_dyn_decomp: u,v,T,p,q on unstaggered grid using dynamics decomposition
phys_state_phys_decomp: u,v,T,p,q on unstaggered grid using physics decomposition

Tendencies:
dyn_tend_dyn_decomp: xxx on dynamics staggered grid
phys_tend_phys_decomp: xxxx on unstaggered grid using physics decomposition
phys_tend_dyn_decomp: xxxx on unstaggered grid using dynamics decomposition

Notes:
 1)  In this template, the interface methods are
 d_p_transformation, d_p_decomp, p_d_decomp and p_d_transformation
These are only passed state information and not information on the mappings
between packages.  So the decomposition and mapping information along with
interface information constructed from these are hidden (private to the interface module).
 2)  The three states may or may not correspond to three copies of the data.
 3)  The data flow diagram for the time loop is
     phys_state_phys_decomp
            |
            V            (physics)
     phys_tend_phys_decomp
            |
            V            (p_d_decomp)
     phys_tend_dyn_decomp
            |
            V            (p_d_transformation)
     dyn_tend_dyn_decomp
            |
            V            (dynamics)
     dyn_state_dyn_decomp
            |
            V            (d_p_transformation)
     phys_state_dyn_decomp
            |
            V            (d_p_decomp)
     phys_state_phys_decomp
 

Questions:
 1) where are diagnositic state variables defined?
   theta, zm, zi, ln(pm), ln(pi), dp, ...
   in the state structure, or in a second structure?
 2) a similar transformation and decomposition will be required
for the history file and restart I/O.  Will need an
outfld(phys_state_phys_decomp) and outfld(dyn_state_dyn_decomp).
Then an wshist(h_state_h_decomp) with preceding transformations.
 3) the diagram does not show how state info is inout from physics

--------------------------------------------------------------
------------------- Package Interfaces------------------------
--------------------------------------------------------------
interface module

     p_d_decomp(phys_tend_phys_decomp, phys_tend_dyn_decomp)
     p_d_transformation(phys_tend_dyn_decomp, dyn_tend_dyn_decomp)
     d_p_transformation(dyn_state_dyn_decomp, phys_state_dyn_decomp)
     d_p_decomp(phys_state_dyn_decomp, phys_state_phys_decomp)

Notes:
 1)   The interface module holds routing information and the
map between the two index spaces.
 2)   The transformation method uses the information of how
to convert from one system to the other which is held by the
interface routine.
 3)   Interpolation routines for converting from staggered to non-
staggered grids should NOT be located here, rather they are methods
of the dynamics.  Should they be used and called from the interface?
 4)   The package interfaces all go through here.  What transformation
to perform could be identified from the type of input.
--------------------------------------------------------------
------------------- Decompositions ---------------------------
--------------------------------------------------------------
decomposition modules
     for each grid decomposition, define how the points are assigned
     to processes.  A different decomposition module is required for
     each package, eg. p_decomp and d_decomp_lr, d_decomp_eul, d_decomp_sld.
     These will be used
     by the interface module to determine how to pack and route state data.

Notes:
  1)  The current code uses commap and comspmd modules to hold this information.
The information is a lon() and lat() array with the global index as value
and reference from the local index, eg in p_decomp we would have
lat(ncol,lchnk).  The physics processor map will be using the lchnk
while the dynamics processor map is by latitude (for now).

--------------------------------------------------------------
------------------- Physics Package --------------------------
--------------------------------------------------------------
physics package
 call physics(state, tend)                         ----- this is physpkg
      call scan1bc(state(lchnk), tend(lchnk))    ----- state is inout, tend is out
    call linemsbc(state, tend)
         call tphysbc(state, tend)
       declare ptend                 ---- paramaterization tendency structure

       call param1_intr(state, ptend)
       call update(state, tend, ptend) --- modify state and cummulative tendency

       call param2_intr(state, ptend)
       call update(state, tend, ptend)

      call surf_or_coupler (state, fluxes)

      call scan1ac(state*(lchnk), fluxes(lchnk), tend(lchnk))
    call linemsac(state, tend)
         call tphysac(state, tend)
       declare ptend --- paramaterization tendency structure

       call param1_intr(state, ptend)
       call update(state, tend, ptend)

       call param2_intr(state, ptend)
       call update(state, tend, ptend)

      call final_tend(state(lchnk), state*(lchnk), tend(lchnk))
    if (time_split) tend = (state* - state) / dt