Process Management and Monitoring Notebook - page 34 of 74

Minutes from 1/3/2 meeting

 
Scalable Systems Software - Process Management Working Group 
Teleconference 1/3/2 10:00 AM (PDT) 
------------------------------------------------------------------------------- 
 
Working Group Chair:                    Paul Hargrove 
Meeting Minutes:                        Eric Roman 
 
Attendees 
--------- 
Eric Roman          LBNL 
Al Geist            ORNL 
Brett Bode          Ames 
Rusty Lusk          ANL 
Narayan Desai       ANL 
Scott Jackson       PNNL 
 
Agenda 
------ 
  Discuss direction for next few calls 
  Checkpoint/restart status 
  Overview of RMWG status 
 
Action Items Assigned 
--------------------- 
Num  Activity                                               Who       Status 
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^ 
 
Issues Discussed 
---------------- 
Num  Issue                                                            Status 
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^ 
I18  Can the PMWG monitors be used to replace pbs_mom?                Open 
I19  What does PBS mom provide?  What does Maui scheduler need?       Open 
I20  1 page requirements overview for components is needed.           Open 
 
Discussion 
------------------------------------------------------------------------------- 
 
RM WRAP UP: 
Brett Bode gave a summary of the previous resource management working group 
meeting. 
* Time allocation and goals for the rest of the year 
* Development of a working job server 
* Dave & Scott are working on getting Maui and QBank going 
* Looks like the RMWG will be able to have something working for Fred by 
  the next scalable systems meeting 
 
PROCESS MANAGER STATUS: 
Rusty stated that his group may have something working by the next call. 
They are close to having a complete list of process manager requests ready. 
The list includes: 
* suspend and resume definitions 
* status request definitions 
* job start definitions 
They plan to implement the process manager concurrently with the specification. 
Right now they are trying to receive XML requests and process them. 
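
As a rough illustration of what receiving and processing XML requests could
look like, the Python sketch below parses one request and dispatches on its
root element.  The element names (start-job, suspend-job, resume-job,
job-status), the jobid attribute, the reply format, and the in-memory job
table are all hypothetical; the minutes do not record the actual request
definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical request names mapped to the job state they produce.
TRANSITIONS = {
    "start-job": "running",
    "suspend-job": "suspended",
    "resume-job": "running",
}

jobs = {}  # job id -> current state (in-memory stand-in for real processes)


def reply(tag, **attrs):
    """Serialize a one-element XML reply."""
    return ET.tostring(ET.Element(tag, attrs), encoding="unicode")


def handle_request(xml_text):
    """Parse one XML request and return an XML reply string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return reply("error", reason="malformed-xml")

    jobid = root.get("jobid")
    if jobid is None:
        return reply("error", reason="missing-jobid")

    if root.tag in TRANSITIONS:
        # Only start-job may introduce a job id that is not yet known.
        if root.tag != "start-job" and jobid not in jobs:
            return reply("error", reason="unknown-job")
        jobs[jobid] = TRANSITIONS[root.tag]
    elif root.tag == "job-status":
        if jobid not in jobs:
            return reply("error", reason="unknown-job")
    else:
        return reply("error", reason="unknown-request")

    return reply("ok", jobid=jobid, state=jobs[jobid])
```

Under these assumptions, handle_request('<start-job jobid="7"/>') records job
7 as running, and a later job-status request for the same id reports that
state back.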
 
PROCESS MONITOR REQUIREMENTS: 
Brett reported that his group was starting with the PBS server, and writing 
modules to send XML.  He asked what the schedule was for a node monitoring 
API.  He also asked whether the PMWG had anything that could be used in place 
of PBS MOM. 
 
Q [Eric]:  Eric asked whether the RMWG needed more info than mom provides. 
A [Brett]:  Brett stated that Maui requires information that the moms cannot 
provide.  Dave may have documentation on this. 
 
NODE FAILURE: 
Q [Eric]:  Eric asked whether the process manager should be queried to find 
out when nodes fail. 
A [Rusty]:  Rusty stated that the process monitor should hold this information. 
 
Q [Eric]:  Can we query the process manager for anything at all? 
A [Rusty]:  Rusty stated that the process manager should be responsible for 
starting and stopping jobs only.  All status information needed about jobs 
should be queried from a separate process monitor. 
 
Q [Eric]:  Should we query the process manager or the process monitor when 
we want to find out whether a process manager job has finished?  (i.e. wait()) 
A [Rusty]:  Rusty stated that the process manager should track the state 
of its own jobs. 
D [Brett]:  Brett believes that there should be a separate node monitor 
component that can be queried for information on disks, filesystems, free 
memory, etc. 
D [Rusty]:  Rusty believes that the node monitor should export information 
specific to a node.  The job monitor should export information about a job. 
There should be 3 components: a node monitor, a job monitor, and a process 
manager.  A process manager daemon will exist on each node. 
 
The group agreed that it would be best to design the interfaces separately, 
even if they were merged together during implementation. 
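
One way to picture separately designed interfaces that may later be merged
in one implementation is the Python sketch below.  It is only an
illustration: the class names, method names, and returned fields are
hypothetical and were not specified in the discussion.

```python
from abc import ABC, abstractmethod


class ProcessManager(ABC):
    """Starts and stops jobs, and tracks the state of its own jobs."""

    @abstractmethod
    def start_job(self, job_id, command): ...

    @abstractmethod
    def stop_job(self, job_id): ...

    @abstractmethod
    def job_finished(self, job_id): ...


class JobMonitor(ABC):
    """Exports information about a job."""

    @abstractmethod
    def job_info(self, job_id): ...


class NodeMonitor(ABC):
    """Exports information specific to a node (disks, memory, etc.)."""

    @abstractmethod
    def node_info(self): ...


class NodeDaemon(ProcessManager, JobMonitor, NodeMonitor):
    """Hypothetical single daemon that implements all three interfaces."""

    def __init__(self):
        self._jobs = {}

    def start_job(self, job_id, command):
        self._jobs[job_id] = {"command": command, "state": "running"}

    def stop_job(self, job_id):
        self._jobs[job_id]["state"] = "finished"

    def job_finished(self, job_id):
        return self._jobs[job_id]["state"] == "finished"

    def job_info(self, job_id):
        return dict(self._jobs[job_id])

    def node_info(self):
        # Placeholder values; a real monitor would read these from the OS.
        return {"free_memory_kb": None, "filesystems": []}
```

A caller that only needs node data can depend on NodeMonitor alone, so the
three pieces could be split into separate daemons later without changing
that caller.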
 
CHECKPOINT RESTART STATUS: 
Eric stated that the checkpoint/restart project should have its requirements 
ready in a few weeks for discussion. 
 
REQUIREMENTS: 
Scott asked whether the group should start looking at requirements documents. 
He would like to use requirements as a means to scope out the work for the 
groups, and to show to users on the outside.  He is hoping for a one page 
description of the components. 
 
The issue was tabled for discussion next week. 
 
QUARTERLY REPORT & NEXT MEETING: 
Al reported that a draft of the next quarterly report would be on the main 
notebook soon. 
 
The next meeting is scheduled at the airport Marriott in Houston for 2/21 
and 2/22.  Al would like future Scalable Systems meetings to settle in Dallas. 
 
Schedule next call 
------------------ 
The next call is scheduled for 1/10/2 at 10:00 AM (PDT). 
 
 To Attend: 
  Long Distance users call 1-877-252-5250, 
  Local users call 510-647-3480, 
 
press 1, enter 160910# and follow the instructions.