Process Management and Monitoring Notebook - page 64 of 74
Review Doc - Draft System Monitor Text
System Monitoring Component
This component is part of the scalable systems software stack and provides real-time state data for the various components within a large-scale computational resource. It focuses not only on scalability but also on extensibility into new environments. It will provide a framework that serves as a unified source for data that is often collected redundantly by multiple subsystems within an existing compute resource.
Scalability is central to the design of this component. The number of devices in large installations has increased dramatically over the last few years. Concurrently, the availability of high-quality data at the component level has expanded significantly. Network switches, interconnect switches, power controllers, host adapters, storage systems, and many other devices now report not only performance information but also temperature, voltage, fan speed, and other data useful in predictive failure analysis.
In traditional systems, multiple separate and often overlapping infrastructures gather and interpret this information, with no common interface for obtaining the data. Resource managers, performance monitors, and administration/health monitors often use independent mechanisms to collect their own information without any aggregation or sharing. This project will define a common interface specification for the collection and distribution of device data in an extensible and scalable manner.
Development of this component is divided into three phases. The first phase has been completed and is in review, while the second is underway. Phase one involved designing a component prototype that collects the data needed by the components it must interact with, and defining an extensible XML interface. This interface was designed and tested to provide a framework that accommodates new types of data and new devices to be monitored. Existing software for the collection and visualization of system performance data was adapted to use this new communication mechanism, and new applications are being tested to demonstrate the flexibility of the component interface and to present the data graphically. In addition to the collection of system performance data, an application has been developed to view the registration and communications between the software systems contained within the overall project. This application is helpful not only during the debugging of component interactions, but also as a visual aid in demonstrating the communication paths within the entire scalable systems software stack.
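To make the idea of an extensible XML interface concrete, the following is a minimal Python sketch, not the project's actual schema. The element names (`device`, `metric`), the attribute names, and the helper functions `build_report` and `parse_report` are all assumptions chosen for illustration. The point it shows is that a consumer can read metrics it has never seen before, because metric names travel as data rather than as fixed element names.

```python
import xml.etree.ElementTree as ET

def build_report(device_id, device_type, metrics):
    """Serialize one device's readings; metrics maps name -> (value, unit).

    The element and attribute names are hypothetical, not a defined schema.
    """
    root = ET.Element("device", id=device_id, type=device_type)
    for name, (value, unit) in metrics.items():
        ET.SubElement(root, "metric", name=name, unit=unit).text = str(value)
    return ET.tostring(root, encoding="unicode")

def parse_report(xml_text):
    """Return (device_id, {metric name: float}) for any set of metric names."""
    root = ET.fromstring(xml_text)
    values = {m.get("name"): float(m.text) for m in root.iter("metric")}
    return root.get("id"), values

# A new device type or metric needs no change to the parser.
report = build_report("switch07", "network_switch",
                      {"temperature": (41.5, "C"), "fan_speed": (5200, "rpm")})
device_id, values = parse_report(report)
```

Under this assumed layout, adding a new kind of reading only means emitting another `metric` element; existing consumers keep working and simply see one more key.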
Current development in phase two covers the design and implementation of the software foundation for a scalable, extensible monitoring hierarchy. This begins with daemons that monitor hardware and software systems. The design uses an abstraction layer that partitions the software into platform- or system-specific sections and infrastructure object software. This is accomplished in part by developing portable software objects for the internal data storage and communication mechanisms, while defining a functional interface for the collection and querying of the data. The implementation of this functional interface is provided in dynamically linked shared libraries, which allows a single daemon to collect and export different or new information depending on the libraries linked at run time. This model requires internal data stores that are flexible in the manner they retain their data.
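The run-time linking model described above can be sketched in Python, with importable modules standing in for the dynamically linked shared libraries. This is an illustration under stated assumptions, not the daemon's design: the interface is reduced to a single hypothetical `collect()` function, and the names `load_collector`, `Daemon`, and `demo_temperature` are invented for the example.

```python
import importlib
import sys
import types

def load_collector(module_name):
    """Import a collector by name and check it honors the assumed interface."""
    module = importlib.import_module(module_name)
    if not callable(getattr(module, "collect", None)):
        raise TypeError(f"{module_name} does not define collect()")
    return module

class Daemon:
    """Generic daemon: what it monitors depends only on the modules it loads."""

    def __init__(self, module_names):
        self.collectors = [load_collector(name) for name in module_names]
        self.store = {}  # flexible store: metric name -> latest value

    def poll(self):
        """Call every loaded collector once and merge the results."""
        for collector in self.collectors:
            self.store.update(collector.collect())
        return dict(self.store)

# Stand-in for a site-specific library: a module fabricated in memory so the
# sketch is self-contained. A real site would install a module or library.
demo = types.ModuleType("demo_temperature")
demo.collect = lambda: {"temperature_c": 41.5}
sys.modules["demo_temperature"] = demo

daemon = Daemon(["demo_temperature"])
snapshot = daemon.poll()
```

The daemon code never names a metric; changing the list of module names passed to `Daemon` changes what is collected without touching the daemon itself.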
This design also requires additional isolation of the data content from the processing at the middle layers of the collection hierarchy. In this phase, research into the tradeoffs among scalability, data quality, and extensibility will be required to tune the software layer that aggregates the data and exports it via the XML interface. The lightweight, semi-intelligent protocol used between devices producing performance/state data and the aggregation components will also be finalized during this phase of development.
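One way to picture the isolation of data content from middle-layer processing is an aggregation node that stores and forwards metrics without interpreting them. The sketch below is an assumption-laden illustration: the `Aggregator` class, its methods, and the XML element names are hypothetical, and it does not model the device protocol, which the text leaves to be finalized.

```python
import xml.etree.ElementTree as ET

class Aggregator:
    """Middle-layer node that treats metric names and values as opaque data."""

    def __init__(self, name):
        self.name = name
        self.latest = {}  # source id -> {metric name: latest value}

    def receive(self, source_id, metrics):
        """Merge a report from a child; newer values replace older ones."""
        self.latest.setdefault(source_id, {}).update(metrics)

    def export_xml(self):
        """Export the current state upward; element names are hypothetical."""
        root = ET.Element("aggregate", source=self.name)
        for source_id in sorted(self.latest):
            dev = ET.SubElement(root, "device", id=source_id)
            for name, value in sorted(self.latest[source_id].items()):
                ET.SubElement(dev, "metric", name=name).text = str(value)
        return ET.tostring(root, encoding="unicode")

agg = Aggregator("rack03")
agg.receive("node01", {"load": 0.7})
agg.receive("node01", {"load": 0.9, "temp_c": 40})  # a new metric appears
agg.receive("node02", {"load": 0.1})
exported = agg.export_xml()
```

Because the aggregator keeps only the latest value per metric, it trades data quality (history is dropped) for scalability; a different retention policy would shift that balance.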
In the third phase, new visualization software, performance archiving, and meta-monitoring components will be developed to demonstrate and test the functionality of this system. As an example of a meta-monitor, a software component could gather job information from the queue manager and process information from the process manager, and correlate them with the real-time performance data from the system monitor to provide job-based monitoring. This could use standardized metrics such as actual memory or CPU usage at the job level, or provide new types of job-based data through site-specific implementations of the data collection shared libraries. On a cluster, for example, this might be the current bytes/sec on the Myrinet [1] network for a specific job.
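The meta-monitor correlation can be illustrated with a short Python sketch. The three inputs stand in for what the queue manager, the process manager, and the system monitor might report; their shapes, the field names (`cpu_pct`, `mem_mb`), and the function `job_metrics` are assumptions made for the example, not interfaces defined by the project.

```python
def job_metrics(jobs, processes, samples):
    """Sum per-process samples into per-job totals.

    jobs:      {job_id: job name}, as a queue manager might report
    processes: {(host, pid): job_id}, as a process manager might report
    samples:   dicts with host, pid, cpu_pct, mem_mb from the system monitor
    """
    totals = {job_id: {"name": name, "cpu_pct": 0.0, "mem_mb": 0.0}
              for job_id, name in jobs.items()}
    for s in samples:
        job_id = processes.get((s["host"], s["pid"]))
        if job_id not in totals:
            continue  # process does not belong to a known job
        totals[job_id]["cpu_pct"] += s["cpu_pct"]
        totals[job_id]["mem_mb"] += s["mem_mb"]
    return totals

jobs = {"job42": "cfd_run"}
processes = {("node01", 1234): "job42", ("node02", 987): "job42"}
samples = [
    {"host": "node01", "pid": 1234, "cpu_pct": 50.0, "mem_mb": 512.0},
    {"host": "node02", "pid": 987, "cpu_pct": 25.0, "mem_mb": 256.0},
    {"host": "node02", "pid": 1, "cpu_pct": 1.0, "mem_mb": 8.0},  # system process
]
per_job = job_metrics(jobs, processes, samples)
```

A site-specific metric, such as interconnect bytes/sec, would simply be another field summed in the same loop.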
Also in the third phase, a query engine for the performance database would add new functionality to improve administration and scheduling capabilities. It could return an XML response listing the devices that meet certain criteria. A scheduler could query for online hosts with available CPU, memory, and disk, rather than gathering all of the scheduling parameters from both suitable and unsuitable hosts to arrive at the same conclusion. We will also be looking at visualization techniques for choosing and displaying meaningful data for very large-scale systems. This involves both dense data representation and the creation of metrics that provide sufficient contrast to illuminate the important differences between devices and jobs. An early example of functional job/system monitoring is NCSA’s Clumon project [2].
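The scheduler query described in this paragraph might look like the following Python sketch. It is only an illustration: the in-memory host list stands in for the performance database, and the function `query_hosts`, its parameters, and the XML element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

def query_hosts(hosts, min_idle_cpus=1, min_free_mem_mb=0, min_free_disk_gb=0):
    """Return an XML document listing only the online hosts that qualify."""
    root = ET.Element("hosts")
    for h in hosts:
        if (h["online"]
                and h["idle_cpus"] >= min_idle_cpus
                and h["free_mem_mb"] >= min_free_mem_mb
                and h["free_disk_gb"] >= min_free_disk_gb):
            ET.SubElement(root, "host", name=h["name"],
                          idle_cpus=str(h["idle_cpus"]),
                          free_mem_mb=str(h["free_mem_mb"]),
                          free_disk_gb=str(h["free_disk_gb"]))
    return ET.tostring(root, encoding="unicode")

# Hypothetical snapshot of the performance database.
HOSTS = [
    {"name": "node01", "online": True, "idle_cpus": 0,
     "free_mem_mb": 8192, "free_disk_gb": 200},
    {"name": "node02", "online": True, "idle_cpus": 2,
     "free_mem_mb": 4096, "free_disk_gb": 100},
    {"name": "node03", "online": False, "idle_cpus": 4,
     "free_mem_mb": 8192, "free_disk_gb": 300},
]

response = query_hosts(HOSTS, min_idle_cpus=2,
                       min_free_mem_mb=2048, min_free_disk_gb=10)
```

The filtering happens where the data lives, so the scheduler receives only the candidates instead of the full parameter set for every host.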
The primary areas of focus for this infrastructure design are defining an inter-component communication protocol that minimizes the latencies of a hierarchical structure, and balancing the performance tradeoffs between extensibility and data quality at new scales. Secondary interests are the creation of a reasonably portable reference implementation at all layers of the hierarchy, and the definition of a simple functional interface, with site-specific shared library examples.
1. Myricom’s Myrinet interconnect: http://www.myri.com
2. Clumon: cluster monitor used on NCSA’s production clusters