|Date and Author(s)|
This component has two purposes: creating new data and aggregating existing data.
This is my proposal for what that data is:
Collected - Performance metrics collected at compute nodes and produced for appliances (possibly a Myrinet switch, for example). This includes process data (CPU time, memory usage) and system data (memory, processor, and disk information, plus temperature sensor data). There will also be some indicator of machine availability state (am I unable to connect to the machine and collect information?).
Aggregated - Given job and process information, use the collected data above to produce per-job figures, such as the memory usage of all processes in a job or the CPU time used by the job.
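To make the aggregation step concrete, here is a minimal sketch of how per-process samples might be rolled up into per-job totals. Everything in it is an assumption for illustration only: the `ProcessSample` record, its field names and units, the `aggregate_job` function, and the host and PID values are hypothetical and not part of this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """One collected process measurement (hypothetical record layout)."""
    host: str
    pid: int
    cpu_seconds: float
    memory_kb: int


def aggregate_job(samples, job_processes):
    """Sum CPU time and memory over the processes that belong to one job.

    job_processes maps a host name to the set of PIDs on that host that
    are associated with the job (assumed to come from job information).
    """
    total_cpu = 0.0
    total_mem = 0
    for sample in samples:
        # Skip processes that are not part of this job on this host.
        if sample.pid in job_processes.get(sample.host, ()):
            total_cpu += sample.cpu_seconds
            total_mem += sample.memory_kb
    return {"cpu_seconds": total_cpu, "memory_kb": total_mem}


# Illustrative data only.
samples = [
    ProcessSample("node01", 4101, 12.5, 2048),
    ProcessSample("node01", 4102, 7.5, 1024),
    ProcessSample("node02", 977, 30.0, 4096),
    ProcessSample("node02", 1200, 99.0, 512),  # not part of the job
]
job_processes = {"node01": {4101, 4102}, "node02": {977}}
totals = aggregate_job(samples, job_processes)
```

Keying the lookup on host and PID together matters because PIDs are only unique per host, so the same PID on two nodes can belong to different jobs.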
Machines / devices to be monitored…
Which hosts each job is running on
Which processes on a host are associated with that job
Possible conflicts: job state, node state (reachable/unreachable), and possibly configuration data. I will have configured values for the number of CPUs and total memory, and these may be redundant with the information service.
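One way to surface that redundancy would be to compare the configured values against whatever the information service reports and flag disagreements. The sketch below assumes both sides can be represented as plain dictionaries. The key names, the values, and the `find_conflicts` helper are hypothetical, and the actual interface of the information service is not specified here.

```python
def find_conflicts(configured, reported):
    """Return {field: (configured_value, reported_value)} for every field
    that is present on both sides but holds different values."""
    return {
        field: (configured[field], reported[field])
        for field in configured
        if field in reported and configured[field] != reported[field]
    }


# Illustrative values only.
configured = {"cpus": 4, "total_memory_mb": 8192}
reported = {"cpus": 4, "total_memory_mb": 16384, "state": "reachable"}
conflicts = find_conflicts(configured, reported)
```

Fields that only one side knows about are ignored here. Whether those should also count as conflicts is an open design choice.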