

March-June Progress Report

Here is the text of the quarterly progress report I sent to Fred. If you see anything important that I left out, either let me know or make sure it gets into the next progress report.

thanks,
Al


Scalable Systems Software Project Progress Report for March – June 2003

Build & Configuration Management Working Group

Communication infrastructure progress includes developing a common XML command syntax that is flexible enough for all components. It is called the restriction syntax. It describes commands in terms of the operation itself, including all arguments, and the set of elements to operate on. We have found this to be a flexible scheme for describing commands. We have written a specification document that describes the communication infrastructure software components, the service directory and event manager, and project-wide agreements about wire protocols used for inter-component communications. This draft specification was the first document to undergo the adoption process. It passed its first group vote nearly unanimously, with a few minor amendments. These amendments have been incorporated into the document and implemented in the software stack.
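As a rough illustration of the restriction syntax idea, the sketch below assembles a command whose operation and arguments are kept separate from the restriction that selects the set of elements to operate on. The element and attribute names are hypothetical and are not taken from the adopted schemas.

    # Hypothetical sketch of a restriction-syntax style command: the
    # operation (with its arguments) is kept separate from the restriction
    # that selects which elements it applies to. Names are illustrative only.
    import xml.etree.ElementTree as ET

    def build_command(operation, arguments, restrictions):
        cmd = ET.Element("command", {"op": operation})
        for name, value in arguments.items():
            ET.SubElement(cmd, "argument", {"name": name, "value": value})
        match = ET.SubElement(cmd, "match")
        for field, value in restrictions.items():
            ET.SubElement(match, field).text = value
        return ET.tostring(cmd, encoding="unicode")

    # e.g. "send SIGTERM to every process group owned by alice on node ccn12"
    print(build_command("kill", {"signal": "TERM"},
                        {"user": "alice", "node": "ccn12"}))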

All build and configuration management and process manager components use this new specification. We have augmented it to supply complete SQL-like semantics, and have rewritten the communication infrastructure component schemas to use this. The result is a set of interfaces that allow ready access to internal data structures, providing good system debugging facilities.
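To give a sense of the SQL-like query semantics mentioned above, the following sketch selects a few fields of a component's internal node table where the state is "down". The table, field, and element names are made up for the example and are not the rewritten component schemas.

    # Hypothetical sketch of an SQL-like query over a component's internal
    # data structures, in the spirit of the debugging interfaces described
    # above. Table, field, and element names are illustrative only.
    import xml.etree.ElementTree as ET

    query = ET.Element("query", {"table": "node"})
    for field in ("name", "state", "last-heartbeat"):
        ET.SubElement(query, "select", {"field": field})
    where = ET.SubElement(query, "where")
    ET.SubElement(where, "state").text = "down"
    print(ET.tostring(query, encoding="unicode"))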

Scalable Systems components have been in use for daily management operations on Chiba City for the last 6-9 months. These components have undergone continued refinement and augmentation to support testbed activities. Work is underway on an implementation of the Build and Configuration Management stack based on OSCAR. This is planned for inclusion in the SC software release.

We have designed interfaces for a state management component that provides cluster diagnostic capabilities. Debugging capabilities are noticeably lacking on large-scale clusters. We hope that the techniques developed will be useful in improving uptimes and the general user experience on large clusters.

Resource Management Working Group

Version 2 of the Scalable Systems Software Resource Management and Accounting Protocol (SSSRMAP) was completed and released. This version of the protocol greatly enhances the security specifications. All messages now include an outer envelope and a message body to facilitate the inclusion of security authentication and encryption. It was concluded at the last Face-to-Face meeting that the SSSRMAP specification should be broken into two separate parts, a Wire Protocol and a Message Format. This was done so that the SSSRMAP wire protocol could be cleanly included in ssslib as one of its supported wire protocols while allowing the message encoding to follow alternate formats. RMWG components are currently engaged in implementing SSSRMAP v2 (both wire protocol and message format). Gold, the accounting and allocation manager, has already implemented version 2 of the protocol, demonstrating proof of design.
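As a rough sketch of the envelope/body separation described above, the snippet below wraps a request body inside an outer envelope that leaves room for an authentication block. The tag names are placeholders and do not reproduce the SSSRMAP v2 schema.

    # Illustrative only: wrap a message body in an outer envelope with an
    # optional signature element, mirroring the envelope/body split above.
    import xml.etree.ElementTree as ET

    def wrap_in_envelope(body_xml, signature=None):
        env = ET.Element("Envelope")
        body = ET.SubElement(env, "Body")
        body.append(ET.fromstring(body_xml))
        if signature is not None:
            ET.SubElement(env, "Signature").text = signature
        return ET.tostring(env, encoding="unicode")

    request = "<Request action='Query' object='Job'/>"
    print(wrap_in_envelope(request, signature="...digest goes here..."))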

A Job Object Specification version 2 was also created during this time. It defines an XML structure for representing a job with all of its complexities that can be passed around and understood by all RMWG components. This job object takes into account all phases of a job’s lifecycle and includes support for job steps (dependencies), task groups, dynamic jobs, preferences, weighted requirements, charges, meta-scheduling, usage statistics, and an approach to handle disparate feature support.
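The sketch below gives the flavor of such a job object, with a single task group and one weighted requirement. The element and attribute names are placeholders for illustration and do not reproduce the SSS Job Object v2 schema.

    # Illustrative job object with one task group; element and attribute
    # names are made up for this example, not taken from the specification.
    import xml.etree.ElementTree as ET

    job = ET.Element("Job", {"id": "1042"})
    ET.SubElement(job, "User").text = "alice"
    ET.SubElement(job, "Project").text = "chemistry"
    group = ET.SubElement(job, "TaskGroup", {"name": "solver"})
    ET.SubElement(group, "TaskCount").text = "64"
    req = ET.SubElement(group, "Requirement", {"name": "memory", "weight": "2"})
    req.text = "2048"
    print(ET.tostring(job, encoding="unicode"))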

We are beginning to see some external adoption of the SSSRMAP standard. We have a commitment from SLURM (LLNL/Linux Networx) to write an interface to SSSRMAP. Clemson University wants to use the SSSRMAP interface to the queue manager with its bproc-based scheduler. The CLUBMask resource manager being developed by Penn State will interface with the scheduler via SSSRMAP. CERN has expressed interest in a data manager interface to the scheduler using SSSRMAP.

Basic security choices were narrowed to six security token types. These security mechanisms give a site a variety of ways to integrate a component into its security infrastructure policy. The security tokens are 1) Symmetric (a shared secret key between client and server), 2) Asymmetric (public/private key pairs for each of client and server), 3) Password (assumes user passwords are known by client and server), 4) Cleartext (relies on the use of transport-level security such as SSL or IPSec), 5) Kerberos (uses Kerberos v5 tokens), and 6) GSI (utilizes a Public Key Infrastructure (PKI) in which users have their own certificates). The default security token, which must be implemented by all components with security, is Symmetric. Gold, the accounting and allocation manager, has already implemented authorization and encryption using the default security mechanisms defined in SSSRMAP v2. The other RMWG components are soon to follow.
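A minimal sketch of the default Symmetric mechanism follows: the client signs the message body with a shared secret and the server verifies the digest. The exact digest construction shown here is an assumption for illustration, not the rule defined in SSSRMAP v2.

    # Sketch of shared-secret (Symmetric) authentication: sign the message
    # body with an HMAC and verify it on the receiving side. Illustrative
    # only; the real token construction is defined by the specification.
    import hashlib
    import hmac

    SHARED_SECRET = b"site-configured-secret"   # distributed out of band

    def sign(body: bytes) -> str:
        return hmac.new(SHARED_SECRET, body, hashlib.sha1).hexdigest()

    def verify(body: bytes, digest: str) -> bool:
        return hmac.compare_digest(sign(body), digest)

    body = b"<Request action='Query' object='User'/>"
    token = sign(body)
    assert verify(body, token)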

Scheduler Progress:

The scheduler has now implemented the SSSRMAP XML client-server interface internally for 40% of its clients. New interfaces with documentation have been created to support generic resource loads (paging space, I/O, processor load, etc.) for resource limit enforcement and tracking. Support has been added for multi-task-group jobs as well as dynamic reservations (growing and shrinking to support MPI dynamic jobs). On the security front, support has been added for a user-specified keyfile containing the security token. Continued memory-footprint reduction efforts have improved performance. Fault tolerance has been improved substantially by the creation of a fallback server. An initial web interface is being developed that communicates directly with the Maui server.

Queue Manager Progress:

The queue manager has updated the service directory and event manager interfaces. It has implemented caching of service directory lookups and prioritized the wire protocol types returned. It has made some progress on implementing the standard interfaces and protocols in SSSRMAP v2 and SSS Job Object v2.
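The caching and protocol prioritization can be pictured with the small sketch below; the directory_lookup callable, the protocol names, and the ordering are assumptions for illustration, not the actual queue manager code.

    # Sketch of caching service-directory lookups and preferring wire
    # protocol types in a fixed priority order.
    PROTOCOL_PRIORITY = ["challenge", "basic"]   # hypothetical ordering

    _cache = {}

    def locate(component, directory_lookup):
        """Return (host, port, protocol) for a component, caching the answer."""
        if component not in _cache:
            offers = directory_lookup(component)   # e.g. list of dicts from the SD
            def rank(offer):
                proto = offer["protocol"]
                if proto in PROTOCOL_PRIORITY:
                    return PROTOCOL_PRIORITY.index(proto)
                return len(PROTOCOL_PRIORITY)     # unknown protocols rank last
            best = min(offers, key=rank)
            _cache[component] = (best["host"], best["port"], best["protocol"])
        return _cache[component]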

Accounting and Allocation Manager Progress:

Gold has added support for 95% of the functionality from QBank. Support was added for deposits, refunds, guaranteed quotes, transfers, debit and credit allocations, activation and expiration of allocations, etc. There have been many allocation design enhancements: allocations are now shareable by users, projects and machines (also supports exclusions), and support has been added for special wildcard types (ANY, NONE, MEMBER, DEFINED). Support has been added for hierarchical accounts (projects) including recursive trickle-down deposits, trickle-up withdrawals, queries, reservations, balance checks, etc.
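The hierarchical trickle-down behavior can be illustrated with the toy sketch below, which splits a deposit among child projects in proportion to their relative shares. This illustrates the idea only; the function and its arguments are not Gold's actual interface.

    # Toy sketch of a recursive trickle-down deposit: a deposit to a parent
    # project is split among its children in proportion to their shares.
    def trickle_down_deposit(account, amount, children, shares):
        """children: dict mapping a project to its child projects
           shares:   dict mapping a project to its relative share (default 1)"""
        kids = children.get(account, [])
        if not kids:
            return {account: amount}          # leaf project receives the deposit
        total_share = sum(shares.get(k, 1) for k in kids)
        result = {}
        for k in kids:
            portion = amount * shares.get(k, 1) / total_share
            result.update(trickle_down_deposit(k, portion, children, shares))
        return result

    # Example: deposit 1000 credits to "physics", split across sub-projects.
    print(trickle_down_deposit("physics", 1000,
                               {"physics": ["lattice", "fusion"]},
                               {"lattice": 3, "fusion": 1}))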

Support has been added for operations (aggregate functions) on returned query fields (sort, sum, max, unique, count, group by, etc.), for the negation of options, and for association metadata to aid in GUI object navigation. Support has been enhanced for transaction logging, journaling, undo, and redo, and a more flexible charging algorithm has been implemented. Both SSSRMAP version 2.0 and the SSS Job Object version 2.0 have been implemented in Gold. Infrastructure was added for Role-Based Access Control as well as for method overriding and method scope resolution. There has been some progress on the open source front (Gold and sss_xml front-ends): we have obtained approval from PNNL IP to apply a BSD open source license and have sought DOE approval to assert copyright. An Accounting and Allocation Manager Binding document has been created describing the use of the SSSRMAP protocol. We are beginning an effort to develop a Web-based GUI based on Java Server Pages (JSP). Gold has newly implemented Symmetric key authentication and encryption as per the SSSRMAP v2 specification.
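As a small illustration of the aggregate operations mentioned above (for example, group by combined with sum), the sketch below totals charges per project over a set of returned rows. The field names are made up for the example.

    # Group-by plus sum over returned query fields, in the spirit of the
    # aggregate options described above. Field names are illustrative.
    from collections import defaultdict

    rows = [{"project": "lattice", "charge": 120},
            {"project": "lattice", "charge": 30},
            {"project": "fusion",  "charge": 75}]

    totals = defaultdict(float)
    for row in rows:                      # GroupBy=project, Sum=charge
        totals[row["project"]] += row["charge"]
    print(dict(totals))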

Meta-Scheduler Progress:

A very significant accomplishment is the new support for basic data scheduling. This has also been initially tested within the Globus environment. A new interface for data-cache scheduling has been created. The job queue has been made persistent, and Silver can now recover from network failure, system failure, or loss of checkpoint files. Significant documentation work has been completed in all areas.

Process Management and Monitoring Working Group

This quarter the Process Manager team overhauled the XML interface to the Process Manager. The biggest change was in the format of querying for information about a job. This was defined in the "restriction syntax" being proposed by those providing the communication infrastructure. This interface was put forward in the notebooks and then presented at the June meeting.

This new XML was implemented in the Process Manager component as a test of the new syntax. Further, the MPD-based prototype implementation of the Process Manager was extended to support new features in the interface. Other changes were made to the MPD system, including complete support for the MPI standard “mpiexec” and some convenience extensions. Bugs found during testing of MPD with MPICH-2 were fixed.

A paper, "The Process Management Component of a Scalable Systems Software Environment", was written and submitted to the Cluster2003 Conference. This paper includes an overview of the Scalable Systems Software component-based environment, motivation for the definition of a Process Management component, what the current proposed XML interface is for the Process Management component, and the results of some scalability experiments for process startup on the Chiba City testbed.

This quarter the monitoring work has focused on a rewrite of the monitoring infrastructure implementation. This rewrite will be both more scalable and more flexible than the proof-of-concept implementation demonstrated at SC2002.

The new monitoring infrastructure is based on the idea of a data “warehouse”, a software entity that can act as both a sink and a source of monitoring data. The warehouses can be linked together in a tree topology to provide for scalable management of the data to be monitored. Rather than simply forwarding all information up the tree as it is received, each warehouse is capable of intelligent management of the monitoring data it receives. Among these capabilities is the generation of requested data on demand, rather than generating a continuous flow of periodic data that may not be used. Complementing the on-demand delivery of data is the ability of a warehouse to cache data received from its children, eliminating the need for global synchronization in the delivery of data. Additionally the caching of data allows for differential updates to be propagated through the tree rather than full updates.
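A condensed sketch of the warehouse idea follows: each warehouse caches the last values it has seen, forwards only the differences to its parent, and answers queries from its cache rather than polling its children. The class and method names are assumptions for illustration, not the actual implementation.

    # Each warehouse caches child data and propagates only differential
    # updates up the tree; queries are answered from the cache on demand.
    class Warehouse:
        def __init__(self, parent=None):
            self.parent = parent
            self.cache = {}               # metric name -> last cached value

        def receive(self, updates):
            """Accept monitoring data from a child; forward only the changes."""
            delta = {k: v for k, v in updates.items() if self.cache.get(k) != v}
            self.cache.update(delta)
            if self.parent is not None and delta:
                self.parent.receive(delta)

        def query(self, name):
            """Answer a request from the cache instead of polling children."""
            return self.cache.get(name)

    root = Warehouse()
    leaf = Warehouse(parent=root)
    leaf.receive({"ccn12.load": 0.42, "ccn12.temp": 61})
    leaf.receive({"ccn12.load": 0.42, "ccn12.temp": 63})   # only temp is forwarded
    print(root.query("ccn12.temp"))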

At the present time the core infrastructure has been written, and the basic functionality has been tested with synthetic data. The next step is to begin testing with live data sources.

In the Checkpoint Manager team, the progress this quarter has been slower than in the past. The people normally working on the Checkpoint component have been redirected, for the last several months, to other projects for SC2003 preparations.

Through the end of May, most of the work on this project consisted of internal discussions of the XML interfaces, required documentation, and plans to implement additional required features in the checkpoint runtime support. Most notable among these features is basic support for checkpointing and restarting processes with open files.

In June the members of the Checkpoint Manager team were able to resume active development and anticipate completing the documentation and XML efforts, as well as the basic open file support, for a late summer beta software release. Despite the recent slowdown in work, the Checkpoint Manager team remains on schedule with respect to all technical issues, with the exception of the XML interface for the CM, which was to have been completed this quarter.

Validation and Testing Working Group

The Validation and Testing Working Group has developed a system to enable efficient testing of supercomputers to determine their ability to run applications reliably and to evaluate their performance by standard metrics. In particular, the system assures that a parallel supercomputer correctly and efficiently implements the MPI or SciDAC Scalable Systems standards. In the past, supercomputer makers tested their products through their own test suites as well as codes obtained from users. This process has not been efficient: vendors are not completely consistent in constructing their own test suites. Furthermore, vendors get applications from users in an ad hoc manner, typically ending up with different applications or different versions of those applications.

Evaluating the performance of a supercomputer (benchmarking) faces a similar set of issues: to evaluate performance, one runs a standard job or jobs with standard input data and measures the running time required to produce the (presumably correct) result.

The testing framework understands how to run tests on a parallel supercomputer at different levels, from low levels such as message passing up through high levels such as batches of large applications running over a period of hours or days. The testing framework also understands how to sequence tests and evaluate the output from the tests that it runs.

For example, Sandia has an application (named CTH) that simulates the mixing of supersonic gas streams. This application can be used as a test by starting it with a known configuration of gas flows and simulating the mixing over a fraction of a second. If the underlying supercomputer is working properly, the final temperature and pressure should be very nearly the same every time this problem is run. When the test framework runs this as a test, it samples the temperature and pressure at the end of the run and makes a decision on whether to proceed with testing based on whether these numbers are within a specified percentage of the “correct” value.
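The pass/fail decision described above might look something like the sketch below, which compares sampled end-of-run values against reference values within a tolerance. The numbers and quantity names are placeholders, not actual CTH output.

    # Compare sampled end-of-run values against reference values within a
    # percentage tolerance and decide whether testing should proceed.
    def within_tolerance(measured, expected, percent):
        return abs(measured - expected) <= abs(expected) * percent / 100.0

    def evaluate_run(samples, reference, percent=1.0):
        """samples/reference: dicts mapping a quantity name to its value."""
        return all(within_tolerance(samples[q], reference[q], percent)
                   for q in reference)

    ok = evaluate_run({"temperature": 1498.2, "pressure": 2.03e5},
                      {"temperature": 1500.0, "pressure": 2.00e5},
                      percent=2.0)
    print("proceed with testing" if ok else "stop and flag the machine")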

In another example, the value of a supercomputer is determined by much more than its peak floating point rate (known as peak FLOPS): if a supercomputer crashes frequently and requires hours to restart, it will be less useful in running user jobs. To capture these effects, the supercomputer industry has standard supercomputer "loads" consisting of a series of user jobs of certain types and sizes, along with some rebooting. The value of a supercomputer can be represented fairly well by measuring the running time of one of these standard loads. To measure this running time in a consistent way, the testing framework can submit the necessary applications to the batch scheduler and, when everything completes, compute the overall running time.
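The overall timing step can be pictured with the sketch below; submit_job and wait_for stand in for whatever batch scheduler interface a site provides and are not part of the framework's actual API.

    # Time a standard "load": submit every job in the load to the batch
    # scheduler, wait for all of them to finish, and report wall-clock time.
    import time

    def run_standard_load(jobs, submit_job, wait_for):
        start = time.time()
        handles = [submit_job(j) for j in jobs]   # queue the whole load
        for h in handles:
            wait_for(h)                           # block until each job completes
        return time.time() - start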