Process Management and Monitoring Notebook - page 40 of 74


Date and Author(s)

System monitoring component

Sorry about the lateness of this info.

A copy of this info is also available at http://pitts.ncsa.uiuc.edu/scidac

This should serve as a starting point for discussion of communication methods.

NCSA SCIDAC Software Development Project

Process Monitor Architecture

Initial implementation:

·        Goals              

o       Develop compute node monitor daemons

o       Add a single middle layer for data collection

o       Provide an XML interface to the middle layer

·        Functionality

o       Node Monitor Daemons

§         Collect statically defined datasets

§         Linux support only

§         Uses SGI’s Performance Co-Pilot (PCP)

§         Collects data required to replace PBS mom functionality

§         Maintain process history statistics

·        Can be queried for old process information

o       Middle Layer

§         Acts as a data cache and collection point for data.

§         Statically defined datasets

o       Data aggregation component

§         Requires job and process information

§         Produces aggregate per-job statistics

·        Total memory usage

·        Total CPU time

o       Overall

§         Simple ASCII interfaces between the node monitor daemons and the middle layer

§         Single Middle layer

 

Long term Goals:

·        Lightweight communication protocols

o       Differential data plus heartbeat

·        Collect data statistics only while they are being monitored

·        Hierarchy in middle layer

·        Middle layer data caches are dataset-independent

o       Support new devices and data without middle layer changes

·        Add support for monitoring different devices

o       Node monitor daemons

o       Network devices

o       Storage servers

·        Administrative applications

o       Scalable viewers

o       Archive databases for trend info
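The "differential data plus heartbeat" goal listed above could look roughly like the following. This is only a sketch for discussion, not a proposed implementation: the class name DifferentialSender, the message dictionaries, and the 30-second heartbeat interval are all assumptions and do not come from the design.

```python
import time


class DifferentialSender:
    """Send only metrics whose values changed, plus a periodic heartbeat.

    Hypothetical helper: the message shapes and default interval are assumed.
    """

    def __init__(self, heartbeat_interval=30.0, clock=time.monotonic):
        self.heartbeat_interval = heartbeat_interval  # seconds (assumed value)
        self.clock = clock                            # injectable for testing
        self.last_sent = {}                           # last value sent per metric
        self.last_message_at = clock()

    def update(self, snapshot):
        """Return the message to transmit for this snapshot, or None."""
        now = self.clock()
        # Keep only metrics that are new or whose value differs from what was sent.
        changed = {
            name: value
            for name, value in snapshot.items()
            if name not in self.last_sent or self.last_sent[name] != value
        }
        if changed:
            self.last_sent.update(changed)
            self.last_message_at = now
            return {"type": "diff", "data": changed}
        # Nothing changed: send a heartbeat only if the link has been quiet too long.
        if now - self.last_message_at >= self.heartbeat_interval:
            self.last_message_at = now
            return {"type": "heartbeat"}
        return None
```

The heartbeat lets the receiver tell "nothing changed" apart from "the node is down", which a purely differential stream cannot do.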

Example Communications

Requests

Event

Metric                      Rate   Threshold   Extended Data

Host10:Process.id.finish    0      0           1530

Streaming

Metric                      Rate   Threshold   Extended Data

Host01:Cpu.idle             15     2           none

            Meaning: send me the CPU idle time every 15 seconds, unless the value has changed by less than 2
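One possible reading of the streaming request above, sketched in code. The class StreamingSubscription and its offer method are hypothetical names. The sketch also assumes that the first sample is always sent and that the change is measured against the last value actually sent; the notes do not settle either point.

```python
class StreamingSubscription:
    """Decide when a streamed metric value should be sent (illustrative only)."""

    def __init__(self, metric, rate, threshold):
        self.metric = metric          # e.g. "Host01:Cpu.idle"
        self.rate = rate              # seconds between candidate sends
        self.threshold = threshold    # minimum change worth reporting
        self.last_sent_value = None   # nothing sent yet
        self.next_due = 0.0           # time of the next candidate send

    def offer(self, now, value):
        """Return value if it should be sent at time `now`, otherwise None."""
        if now < self.next_due:
            return None               # rate interval has not elapsed yet
        self.next_due = now + self.rate
        if (self.last_sent_value is not None
                and abs(value - self.last_sent_value) < self.threshold):
            return None               # changed by less than the threshold
        self.last_sent_value = value
        return value
```

For the example request this would be created as StreamingSubscription("Host01:Cpu.idle", 15, 2) and offered a new sample whenever the daemon reads the metric.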

Polling

Metric                      Rate   Threshold   Extended Data

Host*:Cpu.idle              -1     -1          none

            Meaning: return the CPU idle value once for each machine whose name matches Host*

Query

Metric                      Rate   Threshold   Extended Data

Host5:*                     -1     -1          list

            Something like this could return the list of available metrics
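The request lines above all share the same four whitespace-separated fields, so a parser for them could be quite small. The sketch below is an assumption-laden illustration: the names Request, parse_request and matching_hosts are made up here, host matching is assumed to be case-sensitive, and Python's fnmatch patterns also treat "?" and "[...]" as wildcards, which the notes do not ask for.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Request:
    host: str        # may contain a wildcard, e.g. "Host*"
    metric: str      # e.g. "Cpu.idle", or "*" in a query
    rate: int
    threshold: int
    extended: str    # extended data, kept as text (e.g. "none", "list", "1530")


def parse_request(line):
    """Parse one 'Host:Metric rate threshold extended' request line."""
    target, rate, threshold, extended = line.split()
    host, metric = target.split(":", 1)
    return Request(host, metric, int(rate), int(threshold), extended)


def matching_hosts(request, known_hosts):
    """Return the known host names that match the request's host pattern."""
    return [h for h in known_hosts if fnmatchcase(h, request.host)]
```

For example, parse_request("Host*:Cpu.idle -1 -1 none") yields a request whose host pattern can then be expanded against the list of nodes the middle layer knows about.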

 

System Structure

 

Node Monitor Daemon


(Figure: Node Monitor Daemon structure)

Communication Structure


(Figure: Communication Structure)

The job monitor gets running-job information from the job manager and process ID and process group information from the process manager. It then aggregates the data for the individual processes into job-based statistics.
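The aggregation step described above might be sketched as follows. The function name aggregate_by_job, the (host, pid) keys, and the field names memory_kb and cpu_seconds are assumptions chosen for illustration; the actual data layout is still open.

```python
def aggregate_by_job(job_processes, process_stats):
    """Combine per-process statistics into per-job totals.

    job_processes: {job_id: [(host, pid), ...]}, as it might come from the
                   job manager and process manager (assumed layout).
    process_stats: {(host, pid): {"memory_kb": int, "cpu_seconds": float}},
                   as it might come from the node monitor daemons (assumed).
    """
    totals = {}
    for job_id, processes in job_processes.items():
        total_memory_kb = 0
        total_cpu_seconds = 0.0
        for key in processes:
            stats = process_stats.get(key)
            if stats is None:
                continue  # no sample for this process (exited or not yet seen)
            total_memory_kb += stats["memory_kb"]
            total_cpu_seconds += stats["cpu_seconds"]
        totals[job_id] = {
            "total_memory_kb": total_memory_kb,
            "total_cpu_seconds": total_cpu_seconds,
        }
    return totals
```

Keying processes by (host, pid) rather than pid alone avoids collisions when one job runs on several nodes.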