Resource Management and Accounting Notebook - page 94 of 150
RMWG Progress Summary for Phase 1
1 The Resource Management and Accounting Working Group
The Resource Management and Accounting Working Group (RMWG) encompasses the area of resource scheduling and accounting and is responsible for research and development revolving around four components: the Scheduler, the Job Queue Manager, the Accounting and Allocation Manager and the Meta-Scheduler. The Scalable Systems Software resource management system integrates with components being developed by other SSS working groups such as the Process Manager, the System Monitor and other infrastructure components.
The working group is chaired by Scott Jackson of PNNL. Development of the Scheduler and Meta-Scheduler is led by David Jackson of Ames Laboratory. The Job Queue Manager development is being led by Brett Bode of Ames Laboratory. Development of the Accounting and Allocation Manager component is led by Scott Jackson of PNNL. Enhancements to the PBS Resource Manager are also being done by PNNL.
Members of the working group meet weekly in a telephone conference call to coordinate activities and discuss design issues.
1.1 The Local Scheduler
Maui is a high performance optimizing cluster scheduler with extensive advance reservation support and policy control.
Under SSS, significant progress has been made towards virtually all of the specified scheduling line-item deliverables. Some items are in early phases of development, others in final phases of completion. The progress is summarized below.
1.1.1 Internal Design
Maui has been placed under revision control and SSS specification documentation has been completed. XML-based state checkpointing has been enabled and an object-based internal design implemented.
Maui has been enhanced to support the SSS-based socket protocols. The HTTP protocol is supported for the allocation manager and resource manager interfaces. New SSS interfaces added to Maui include the allocation manager interface (to query/modify account allocation state), the queue manager interface (to query/manage batch jobs), the system monitor interface (to query performance and configuration information for compute resources), the event manager interface (to allow registration of and subscription to various batch system events), the service directory interface (to allow interface registration and automatic detection of peer component services), and a scheduling extension interface (to allow scheduling plug-ins that enable new scheduling algorithms and capabilities). Select scheduler client commands have been rewritten to use XML-based data, allowing easy development of GUI interfaces that display batch system state. Additionally, native support for LoadLeveler, PBS, SGE, LSF, and BProc based systems has been enhanced.
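To illustrate the style of these XML interfaces, the following sketch builds a queue manager job query and parses a response using Python's standard library. The element and attribute names here are hypothetical illustrations, not the actual SSS wire format:

```python
import xml.etree.ElementTree as ET

def build_job_query(job_id):
    """Build a hypothetical SSS-style XML request querying one batch job."""
    req = ET.Element("Request", action="Query", object="Job")
    ET.SubElement(req, "Get", name="JobId")
    ET.SubElement(req, "Get", name="State")
    where = ET.SubElement(req, "Where", name="JobId")
    where.text = job_id
    return ET.tostring(req, encoding="unicode")

def parse_job_response(xml_text):
    """Extract (job id, state) pairs from a hypothetical XML response."""
    root = ET.fromstring(xml_text)
    return [(j.findtext("JobId"), j.findtext("State"))
            for j in root.iter("Job")]

# Example response a queue manager might return for the query above:
response = "<Response><Job><JobId>42</JobId><State>Running</State></Job></Response>"
```

Because the payload is structured XML rather than a command-specific text format, a GUI client can consume the same data the command-line tools display.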
Improvements for usability include significantly enhanced web based scheduler documentation, additional scheduler command man pages for select commands, and standardization of scheduler client command line flags.
Support for DES, HMAC, MD5, and external-source secret-key-based algorithms has been implemented for client/server authentication. Improved buffer overflow protection has been added to critical scheduler interfaces. A generalized secret key management facility has been implemented for secure multi-party communication.
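The HMAC-based variant of this client/server authentication can be sketched as follows. This is a minimal illustration of the general technique using Python's standard library, not the scheduler's actual implementation:

```python
import hmac
import hashlib

def sign(secret: bytes, message: bytes) -> str:
    """Produce an HMAC-MD5 digest of a request using a shared secret key."""
    return hmac.new(secret, message, hashlib.md5).hexdigest()

def verify(secret: bytes, message: bytes, digest: str) -> bool:
    """Server-side check: recompute the digest and compare.

    compare_digest performs a constant-time comparison, avoiding
    timing side channels.
    """
    return hmac.compare_digest(sign(secret, message), digest)
```

A client appends the digest to each request; the server recomputes it with its copy of the secret and rejects messages whose digest does not match, detecting both tampering and unauthenticated senders.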
Scalability improvements to the Maui Scheduler include decreasing memory consumption by over 80%, enabling support for up to 8,000 nodes, enabling support for up to 32,000 processors, enabling support for up to 2,000 simultaneous active jobs, and enabling support for jobs requesting up to 16,000 hosts.
1.1.2 Fault Tolerance
Progress on the fault tolerance front includes the migration of all Resource Manager calls to a threaded Resource Manager interface (enabling scheduler survival of interface hangs and crashes), the incorporation of Resource Manager and Allocation Manager diagnostics and failure tracking statistics, as well as the implementation of improved data checking and handling routines to detect and correct corrupt Resource Manager data.
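The threaded interface pattern can be sketched as follows: the blocking Resource Manager call runs in a worker thread with a timeout, so a hung or crashed interface cannot stall the scheduling loop. This is an illustrative sketch of the technique, not Maui's actual code:

```python
import threading
import queue

def call_with_timeout(fn, timeout, *args):
    """Run a (possibly hanging) resource manager call in a worker thread.

    Returns (True, result) on success, or (False, None) if the call did
    not complete within the timeout -- in which case the scheduler can
    record the failure and continue rather than blocking indefinitely.
    """
    out = queue.Queue()
    worker = threading.Thread(target=lambda: out.put(fn(*args)), daemon=True)
    worker.start()
    try:
        return True, out.get(timeout=timeout)
    except queue.Empty:
        return False, None  # interface hung; scheduler survives
```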
There have been significant improvements in functionality as well. Dynamic job support interfaces have been designed. Support has been added for job suspend/resume Resource Manager interfaces and support algorithms (PBS/LL). Support has been added for checkpoint/restart interfaces with the Resource Manager and support algorithms (PBS/LL). Resource utilization tracking interfaces and associated statistics collection/reporting facilities have been added. Support has been enabled for initial resource utilization based limits and violation policies. Extended QOS facilities allow improved control over resource and functionality access and targeted service delivery. A generalized hierarchical prioritization infrastructure has been enabled. A generalized throttling policy infrastructure has been added to allow low-level control of jobs' real-time resource access. Limited support for generic resources (e.g., software licenses, network bandwidth, global disk caches) has been enabled.
Maui is actively used at many of the world's largest and most advanced high performance computing sites. It is estimated that Maui is now in use on more than 600 systems. Below is a select list of sites actively using Maui in production and/or involved in Maui-based scheduling research:
Ames Lab, Argonne National Lab, ASC, Brookhaven National Lab, Fermilab National Accelerator Laboratory, LBNL/NERSC, Los Alamos National Lab, MHPCC, NASA/JPL, National Institute of Standards and Technology, Naval Research Laboratory, NCSA, NOAA, ORNL, PNNL, Sandia National Lab, SDSC, Wright-Patterson AFB
1.1.3 Future Work
While phase I consisted of heavy infrastructure changes required to support key SSS capabilities, phase II is where this effort really bears fruit. This phase focuses on capitalizing on this new infrastructure to bring mature functionality to production sites. This phase will bring about virtual partitioning through resource limit enforcement and tracking, advanced support of preemption capabilities utilizing suspend/resume and checkpoint/restart, and will add support for malleable or dynamic jobs. Further, this phase will bring about further enhancements to a site's ability to manage cycle delivery through quality of service support for completion time guarantees, minimum service levels, and increased service access. Support for interactive job steering will be added and intelligent data pre-staging will be enabled to improve overall system utilization. In addition to these features, progress will continue in the realm of refining and extending interfaces to accommodate realized and anticipated enhancements in cluster and resource manager capabilities. Along these same lines, intra- and inter-component security features will be matured and made more flexible.
Phase III will continue to deliver production quality algorithms and features within both Maui and Silver and will continue to enhance inter-component interfaces to address evolving system capabilities and requirements. Feedback received from partner sites will be rolled into existing capabilities, and packaging and documentation will be matured. During phase III, earlier projects will be finalized and a number of very promising research projects will begin. These projects include efforts in the areas of peer-to-peer scheduling, incorporation of network topology into scheduling optimization algorithms, and resource reservation of bandwidth and data cache space. This phase will evaluate extending these newly developed capabilities into new realms and will develop intelligent fault tolerance features to proactively handle anticipated issues and maintain the highest possible level of service availability.
1.2 The Job Queue Manager
1.2.1 Component Design
As the resource management system's design evolved, it was broken down into constituent components. Acting as a central job information hub, the Queue Manager maintains a complete database of information about jobs, both present and past. The Queue Manager also acts as the entry point for jobs into the system. As such, it must support a robust and full-featured job submission interface, providing ease of use to the user while also exposing all of the features available in the system, including special scheduler instructions or other specialized component directives. Once a job start request is received, the Job Manager turns it into a series of process start requests and sends them to the Process Manager. The Job Manager also manages node setup and teardown for jobs and other tasks related to running jobs, such as output delivery and job termination notification. However, an important feature of the design is that the Queue/Job Manager does not make policy decisions. Rather, it relies on the Scheduler to decide when a job is started and when to terminate a job that exceeds its resource allocation. Once the Scheduler makes such a decision, the Queue/Job Manager implements it.
The Queue/Job Manager interacts with many different components from each of the working groups. The most important interactions are with the scheduler and the process manager. The scheduler queries the Queue Manager for job information and instructs the Job Manager to start or stop jobs. The Job Manager then issues process start and stop requests to the Process Manager. Process termination notification is received via the event manager. In addition the Service Directory is used to look up the host and port information for the other components and in the future the Job Manager will retrieve resource usage information on running jobs from the Node Monitor and may request special node OS installation on behalf of a job from the Node Manager. The Job Manager
may also be required to manage data staging for users as well as for the checkpoint/restart system once job migration is supported.
A basic Queue/Job Manager has been created from scratch to meet the needs of the overall RMS design. The current focus has been on providing a basic functional component that meets the immediate needs of the other SSS components. To this end we have created a component that handles job submission, signaling, monitoring, and deletion. To ease the transition for users, the initial user command interface has been modeled after the corresponding interface in the Portable Batch System (PBS); thus existing PBS commands and job scripts are directly accepted by the new system. The current component has interfaces in place with the Scheduler, Service Directory, Event Manager, and Process Manager components. These interfaces, and indeed the entire component infrastructure, have been installed on several systems, including Chiba City and XTorc, and tested on several occasions with good results. The tests have included full job lifecycle tests from submission through termination.
The current Queue Manager implements persistence by saving job data to flat files, with a database interface currently in progress. Active job information is maintained in memory, so most queries are very fast. The Job Manager currently handles single-step jobs from startup to completion. The system has been designed such that there are no compile-time limits imposed on the number of hosts, etc.; rather, they are limited only by available system resources. The capabilities of the Job Manager are being enhanced with new features as they become available in the Process Manager.
1.2.4 Future Work
The current Queue/Job Manager is being extended in several ways. First, the data archiving capabilities are being enhanced such that all job data will be archived, with the install-time option of using either flat files or an SQL database. This will allow queries to the system for job data on any job, past or present. To enhance performance, data on current jobs will be cached in memory while old jobs will be stored to the backend files or database and looked up on demand. One capability we plan to implement is a database information search function to allow structured searches of the jobs archive. The search capability will not only be useful to users, but will provide a valuable tool to administrators. For example, an administrator might wish to query usage by a specific user, or the jobs that ran on a specific node prior to a failure. This type of information is often difficult or impossible to retrieve from many current resource managers.
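The node-failure example above can be sketched against an SQL backend as follows. The schema and field names are hypothetical placeholders for whatever the archive ultimately stores:

```python
import sqlite3

# In-memory stand-in for the planned job archive database.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE jobs (jobid TEXT, user TEXT, node TEXT, walltime INTEGER)"
)
con.executemany("INSERT INTO jobs VALUES (?,?,?,?)", [
    ("1", "alice", "node04", 3600),
    ("2", "bob",   "node07", 120),
    ("3", "alice", "node07", 900),
])

def jobs_on_node(node):
    """E.g., 'which jobs ran on a specific node prior to a failure?'"""
    rows = con.execute(
        "SELECT jobid, user FROM jobs WHERE node=? ORDER BY jobid", (node,)
    )
    return rows.fetchall()

def usage_by_user(user):
    """E.g., total walltime consumed by a specific user."""
    (total,) = con.execute(
        "SELECT COALESCE(SUM(walltime), 0) FROM jobs WHERE user=?", (user,)
    ).fetchone()
    return total
```

Once job records land in a relational store, both administrator queries reduce to one-line SQL statements, which is exactly the structured-search capability described above.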
Support for multi-step jobs is currently being added to the job manager. The system design will encompass two types of multi-step jobs. The first and most common are multi-step jobs that can be scheduled as a single entity. Since site-dependent node setup and teardown steps will be added to all jobs this type of job will be ubiquitous. The second type of multi-step job will include separately-scheduled, possibly dependent, job steps that may have completely orthogonal resource requirements. An example of this type of job might be a long parallel computation, followed by a data staging step, followed by a visualization step. Support for this type of job in the Queue/Job Manager requires support in the submission interface for a description of the job steps and their dependencies and a correct handling of job step transitions where data must be transferred from one set of nodes to another before node cleanup is performed.
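The second type of multi-step job amounts to a small dependency graph of steps, each with its own resource requirements, that must be dispatched in dependency order. A minimal sketch (step names and resource fields are hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# The compute -> stage-out -> visualize example from the text, with
# orthogonal resource requirements per step.
steps = {
    "compute":   {"needs": [],            "resources": {"nodes": 128}},
    "stage-out": {"needs": ["compute"],   "resources": {"bandwidth": "1Gb"}},
    "visualize": {"needs": ["stage-out"], "resources": {"nodes": 1}},
}

def schedule_order(steps):
    """Return a dispatch order for the steps that respects dependencies."""
    ts = TopologicalSorter({name: s["needs"] for name, s in steps.items()})
    return list(ts.static_order())
```

Each step can then be handed to the Scheduler independently; a step is eligible only once its predecessors have completed and any data transfer between node sets has finished.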
Finally the XML interface continues to evolve to allow for the inclusion of advanced features such as multi-step and dynamic jobs. In addition the current relatively flat XML structure is being enhanced to a hierarchical design to better distinguish between groups of similar information such as requested resources versus consumed resources. This sort of design will also facilitate the addition of multi-step, and thus multi-requirement jobs. In addition the user interface will be extended with the addition of front ends to directly support job scripts written for other third party resource managers such as LoadLeveler and LSF.
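The requested-versus-consumed distinction mentioned above might look like the following in a hierarchical job description. Element names here are illustrative assumptions, not the finalized schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchical job record: requested resources are grouped
# separately from consumed resources instead of being flattened together.
job_xml = """
<Job id="1042">
  <Requested><Processors>64</Processors><WallTime>3600</WallTime></Requested>
  <Consumed><Processors>64</Processors><WallTime>2710</WallTime></Consumed>
</Job>
"""

def remaining_walltime(xml_text):
    """With grouped elements, comparing request to consumption is a
    simple pair of path lookups rather than name-mangled flat fields."""
    job = ET.fromstring(xml_text)
    requested = int(job.findtext("Requested/WallTime"))
    consumed = int(job.findtext("Consumed/WallTime"))
    return requested - consumed
```

The same grouping generalizes naturally to multi-step jobs: each step carries its own Requested/Consumed subtree.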
1.3 Accounting and Allocation Manager
In order to efficiently use DOE's high performance computers, a site must be able to allocate resources to the users and projects that most need them, in a manner that is fair and according to mission objectives. Tracking historical resource usage allows for insightful capacity planning and informs decisions on how best to mete out these resources. It gives the funding sources that have invested heavily in a supercomputing resource a means to show that it is being utilized efficiently. Additionally, accounting and allocation management are critical to being able to take advantage of the tremendous utilization gains afforded by meta-scheduling.
The accounting and allocation manager tracks and manages job and resource usage. Much like a bank, an allocation manager associates a cost to computing resources and allows resource credits to be allocated to users and projects and meted out in a fair and judicious manner. As jobs complete or as resources are utilized, projects are dynamically charged and resource usage recorded.
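The bank analogy can be made concrete with a toy sketch. The charge formula and data layout below are illustrative assumptions, not the component's actual charging policy:

```python
# Projects hold allocations (credits) that completed jobs are debited from.
allocations = {"projectA": 10000}

def charge(project, processors, walltime_hours, rate=1):
    """Debit a completed job: processors * hours * rate credits.

    Raises if the project's remaining allocation cannot cover the cost,
    mirroring how an allocation manager refuses to overdraw an account.
    """
    cost = processors * walltime_hours * rate
    if allocations[project] < cost:
        raise ValueError("insufficient allocation")
    allocations[project] -= cost
    return cost
```

In practice the scheduler reserves the estimated cost before the job starts and the allocation manager reconciles the actual charge at completion, so overdrafts are caught up front rather than after the fact.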
An accounting and allocation management component is being developed and integrated into the scalable resource management system to provide accounting and dynamic project charging. A flexible GUI is being developed to simplify use and the management of project and accounting data. QBank, a dynamic reservation-based allocation manager created at PNNL, is being enhanced and used in the initial SSS software offering. This will later be replaced by Gold, a redesigned accounting and allocation information system which is currently in an advanced prototype stage.
Among the first accomplishments, a software requirements specification document and a survey were created, circulated, and reviewed by over a dozen DOE and government sites with the largest high performance computational facilities. Feedback was collected and integrated into the next generation design.
A very powerful XML-based resource management interface standard was designed and produced to allow alternative components to easily swap into the resource management system. This interface was developed as a wire-level request-response protocol supporting complex extensible objects and a very powerful query syntax. This design allows software components to be interchanged for others without the need for modifying or recompiling the software. The SSSRMAP protocol addresses important protocol mechanisms such as framing, encoding, error handling, parallelism, authentication, and transport security. An XML schema was developed to validate the standard interface. An allocation manager binding was generated and the interfaces tested with the scheduler component.
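One common approach to the framing problem a wire-level protocol must solve is length-prefixed messages, sketched below. This illustrates the general technique only; the actual SSSRMAP framing may differ:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix an XML payload with a 4-byte big-endian length."""
    return struct.pack(">I", len(payload)) + payload

def unframe(buf: bytes):
    """Split one framed message off the front of a byte stream.

    Returns (payload, remainder); payload is None if the stream does not
    yet contain a complete message, so partial reads are handled cleanly.
    """
    if len(buf) < 4:
        return None, buf
    (n,) = struct.unpack(">I", buf[:4])
    if len(buf) < 4 + n:
        return None, buf  # message not fully received yet
    return buf[4:4 + n], buf[4 + n:]
```

Explicit framing lets the receiver delimit each XML document on a persistent socket without sniffing for closing tags, and makes pipelined (parallel) requests straightforward.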
QBank, an existing allocation manager, was enhanced and packaged for large-scale use at other sites. QBank was placed under revision control. A test harness was installed, test suites created and bugs fixed. Security was strengthened. The install process was streamlined and QBank was packaged in RPMs and tarballs for Linux. Documentation was significantly improved including the creation of a user guide, a deployment guide, man pages, and updated online documentation. A support queue and mailing list have been created. QBank is currently being used or evaluated by dozens of sites including PNNL, ANL, ORNL, NCSA, MHPCC, and several universities.
Gold, the next generation allocation system, incorporates the design features collected from the industry survey. It is in an advanced prototype stage and has been built up to include core functionality and design features to the point of being included in the live demonstration of the SSS resource management suite at SuperComputing 2002. Gold implements the SSS resource management interface standard. It currently supports management features for accounts, users, machines, allocations, jobs, resources, usage and charging. It implements a powerful query/update interface including create, query, modify, delete and undelete actions, as well as support for operators, conjunctive expression combinations and object joined queries. It allows new object/record types and their fields to be dynamically created/modified through the regular query language (command line or GUI). This capability turns this system into a generalized information service. This capability is extremely powerful and can be used to manage all varieties of custom accounting data, to provide meta-scheduling resource mapping, or function as a persistence interface for other components. Gold implements a powerful journaling mechanism which preserves the indefinite historical state of all objects and records. This powerful mechanism allows bank statements to show balances for any arbitrary time in the past, provides an undo/redo mechanism for administrative mistakes, and allows any command to run as if it were an arbitrary time in the past.
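The journaling mechanism can be sketched as follows: modifications never overwrite a record; they close the old row and open a new one, so any past state remains reconstructible. Field names and the integer clock are illustrative assumptions, not Gold's actual schema:

```python
import itertools

_clock = itertools.count(1)          # stand-in for real timestamps
journal = []                         # rows: object, value, start, end

def modify(obj, value):
    """Record a new value, closing (not deleting) the superseded row."""
    now = next(_clock)
    for row in journal:
        if row["object"] == obj and row["end"] is None:
            row["end"] = now
    journal.append({"object": obj, "value": value, "start": now, "end": None})

def value_at(obj, t):
    """Reconstruct an object's value as of time t -- the mechanism behind
    'bank statements for any arbitrary time in the past'."""
    for row in journal:
        if row["object"] == obj and row["start"] <= t and \
           (row["end"] is None or t < row["end"]):
            return row["value"]
    return None
```

Because superseded rows survive with their validity interval, the same structure supports undo/redo (reopen an old row) and running any query as of a past time.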
Scalability testing was performed on both the QBank and Gold components. These scalability tests were carried out in three levels. Component-level testing was done to test timings to perform barrages of common accounting and allocation operations (charges, reservations, balance checks, etc.) Simulations were performed with the Maui scheduler to test transaction times with the allocation manager interface. And system tests were carried out with various combinations of the SSS resource management components.
QBank was packaged and released in the initial version (1.0) of the SSS resource management suite, which consists of the enhanced pre-existing components. It has been made available for download from a link off of the Scalable Systems Software Center main page. This accomplishment represents a great deal of work, including effort on the licensing and distribution mechanisms.
1.3.3 Future Plans
The accomplishments listed above fully meet the relevant deliverables targeted for the first of three phases of the Scalable Systems Software Center.
In phase two, we will focus on the development of Gold into a production quality software component. We will fully implement the set of interfaces determined during phase 1, and we will release version 2 of the SSS resource management interface specifications. Flexible charging algorithms will be implemented. Additional features such as quotations, hierarchical accounts, debit and credit allocations, automatically expiring allocations and reservations, etc. will be incorporated into Gold. The allocation manager will be enhanced to track, allocate and charge for SMP resources such as memory, network and disk. Mechanisms will be researched and implemented to provide the scalability improvements necessary to support thousands of processors. Security mechanisms and role-based authentication will be implemented. During this period, portability will be emphasized, with QBank and Gold being ported to the architectures used at DOE’s largest scale computing facilities (AIX, Tru64, possibly Cray). Fault tolerance will receive some attention, supporting up to a 25% cluster loss. Gold will be released in a second SSS distribution which includes all components. A user oriented problem response system will be established for both QBank and Gold.
Phase three will be marked by advanced scalability research and polished and supported software. Forms of parallelism and multiplexing will be implemented to improve average transaction times. Fault tolerance will be heavily researched and incorporated into Gold -- with a possible solution of hot backup applications running on separate servers providing high availability. Peer to peer communication will be implemented to support accounting and allocation within meta-scheduled environments. User feedback from Phase 2 will be incorporated into the application and validated on the largest DOE systems where scalability tests will also be carried out. The user interfaces will receive significant attention to make them more secure, easy to use, and error averse. A third software release will implement version 2 of the resource management interface specifications. A problem reporting system and lifecycle software engineering will be actively maintained.
1.3.4 References and supporting docs
Allocation Manager Requirements
Allocation Management with QBank (technical paper)
Resource Management Interface Specs
Accounting and Allocation Manager Binding Doc
QBank and Gold Scalability Test Results
Allocation and Accounting Survey Results
1.4 The Meta-Scheduler
A meta-scheduler is able to intelligently load balance across resources spanning geographic and administrative domains, providing better average response times for user jobs and better system utilization on the participating systems. The meta-scheduler is also capable of adding new services such as co-allocation of resources and support for massive jobs spanning multiple sites.
Silver is an advance reservation based meta-scheduler designed to effectively distribute workload within a campus grid or similar sized collection of HPC systems.
1.4.1 Internal Design
A requirement specification document was developed for Silver and it was placed under CVS revision control. The object-based internal design has been partially implemented.
Support has been added for Globus 2.0 and 2.2 based job staging. The initial information service interface has been designed.
The sqsub client was created enabling PBS style submission of meta jobs.
Silver security has been enhanced by adding Globus credential caching and enabling generalized secret session key management.
Initial resource feasibility checking has been added to minimize unusable resource queries.
1.4.2 Fault Tolerance
Support has been added for retrying resources.
Additional functionality includes the basic data management interface and an initial file staging capability.
Silver is involved in production meta-scheduling or grid research at a number of facilities. While still at an early phase in development, interest in immediate use has been pronounced. Below is a list of some of the sites/grids currently using Silver or involved in Silver-based research:
NCSA (TeraGrid), ORNL, University of Utah, Princeton University, Clemson University, University of Buffalo, Ohio Supercomputing Center (Cluster Ohio Grid), University of Indiana
1.4.3 Future Work
Phase II will see significant development of the Silver meta-scheduler, allowing resources within a facility to be effectively shared and efficiently utilized. Silver will be modified to communicate using version 2 of the SSS Resource Management XML interface. The meta-scheduler will employ intelligent data pre-staging, a critical element for the efficient use of distributed systems. Also, peer-to-peer interaction between meta-schedulers will enhance the effectiveness of meta-scheduling by extending the view and reach of each meta-scheduler, allowing greater resource access, a larger selection of jobs to choose from, and improved load balancing across systems.
In phase III the meta-scheduler will receive heavy emphasis. This is a period in which enhanced scheduling algorithms for the scheduler and the meta-scheduler will be investigated and implemented. Every scheduling decision will consider the cost on underlying network bandwidth, available data staging space, computational throughput, and start time. The scheduler and meta-scheduler will be enhanced to support co-allocation of resources (network, data, licenses) and locality scheduling. Locally submitted jobs will be able to migrate to systems where they could run sooner. The meta-scheduler and allocation manager will be enhanced to allow for monitoring, tracking, allocation and accounting across site and system boundaries.
1.5 The Resource Manager
In addition to creating a from-scratch resource management system, the Scalable Systems Software Center is also making enhancements to an existing Resource Manager (PBS) as an intermediate or alternative solution. This PBS Resource Management System is essentially a swappable replacement for the Job Queue Manager, the Cluster Monitor, and the Process Manager, which can fill short-term resource management needs while other components mature from prototypes. We feel there are also long-term benefits to enhancing PBS to fit within the Scalable Systems Software model for sites which need some of the functionality only available through PBS.
The Portable Batch System (PBS) is a flexible batch queueing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multi-platform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems.
1.5.1 PBS Enhancements
Numerous scalability improvements created by the SSS team and from other sources have been rolled into a new SSS PBS distribution. These enhancements are packaged in the form of a patch file, as well as RPMs and a tarball. The PBS server was modified so that it does not need to poll the pbs_moms; instead, the moms push their information to the server, a much more scalable approach. The server was modified to cache the information received from the client hosts (moms), so that when the scheduler asks for this information, the server can provide it immediately instead of having to request it each time from each of the nodes. A great deal of additional information can now be provided by the PBS server to aid in scheduling. Other patches include support for ia-64 boxes, scaling patches from NCSA, communication patches from ANL, etc. This new enhanced PBS distribution is being made available to the public in beta form.
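The push-and-cache model can be sketched as follows. Function and field names are hypothetical; the point is only the shape of the design, in which scheduler queries never trigger per-node round trips:

```python
import time

# Server-side cache of node status, keyed by node name.
node_cache = {}

def mom_push(node, status):
    """Called when a pbs_mom reports in: overwrite the cached entry.

    The server does no polling; each mom pushes on its own schedule,
    so total traffic scales with node count rather than query count.
    """
    node_cache[node] = {"status": status, "updated": time.time()}

def scheduler_query():
    """Answer the scheduler entirely from cache, contacting no nodes."""
    return {node: entry["status"] for node, entry in node_cache.items()}
```

The trade-off is staleness bounded by the push interval, which for scheduling purposes is far cheaper than serializing a poll of every node on each scheduling iteration.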
1.5.2 PBS SSS XML Interface
As described above, the SSSRMAP protocol is an XML-based resource management interface standard, developed as a wire-level request-response protocol supporting complex extensible objects and a powerful query syntax. It allows software components to be interchanged without modifying or recompiling the software, addresses framing, encoding, error handling, parallelism, authentication, and transport security, and is validated by an XML schema.
The PBS/XML Server uses the SSSRMAP XML protocol to communicate with SSS clients and the PBS Application Programmer Interface (API) to communicate with PBS. Its main purpose is thus to translate between SSSRMAP XML and the native PBS interface.
The current version of the PBS/XML Server supports both HTTP and half-open socket line protocols. It supports the following operations: job, node, and queue queries; job starting; job cancellation; and limited job modification. Future versions will support XML security and advanced job modification.
1.5.3 Future Work
We are planning to continue to augment PBS with the scalability improvements necessary to meet the needs of the largest DOE computational facilities and continue to enhance it to work within the SSS infrastructure as a proof of concept of the interface standard.