Process Management and Monitoring Notebook - page 41 of 74


Date and Author(s)

Minutes of 02/07/2002

Scalable Systems Software - Process Management Working Group  
Teleconference 2/7/2002 10:00 AM (PST)   
Working Group Chair:                    Paul Hargrove  
Meeting Minutes:                        Paul Hargrove  
Attendees:  
Paul Hargrove       LBNL  
Jason Duell         LBNL  
Al Geist            ORNL  
John Muggler        ORNL  
Brett Bode          Ames  
Mike Showerman      NCSA  
Rusty Lusk          ANL  
Scott Jackson       PNNL  
Agenda:  
  Discuss RMWG requirements (notebook page 38) 
  Discuss latest PMan interfaces from ANL (notebook pages 35 and 36) 
Action Items Assigned  
Num  Activity                                               Who       Status  
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^  
A32  Post references to XML authoring and validation tools  Rusty     New 
Issues Discussed  
Num  Issue                                                            Status  
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^  
We began by going through Scott's "Node Manager Spec v0.001", even though Scott 
did not join the call until later. 
We identified the first two or three items as "static configuration" and thus 
within the scope of the Node Build & Config WG.  Narayan was not present to 
refute this. 
Tracking and aggregation of usage is consistent with Mike's view of the 
Job and Node Monitoring components. 
Node health is likely to be reported by a site-dependent script, run as a 
"probe" linked into the monitors, that provides a string as its "value". 
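As a minimal sketch of the "probe" idea, assuming the site script reports health on stdout and that any failure should still yield a string, a monitor might wrap the probe as follows. The function name `run_health_probe` and the 10-second timeout are hypothetical, not part of any agreed interface.

```python
import subprocess
import sys


def run_health_probe(argv, timeout=10.0):
    """Run a site-supplied health probe and return its output as the value string.

    Assumption: the probe reports on stdout.  A missing script, a timeout or a
    non-zero exit is also mapped to a string, so the monitor always has a value.
    """
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"probe-error: {type(exc).__name__}"
    if result.returncode != 0:
        return f"probe-error: exit {result.returncode}"
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in probe: a one-line Python script that prints "OK".
    print(run_health_probe([sys.executable, "-c", "print('OK')"]))
```

Keeping the value a free-form string leaves each site free to decide what "healthy" means.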
We were unsure about the item "Node manager should notify job manager via an 
event when no statistics are available for a requested session ID (i.e. job 
has terminated)."  This seemed to contradict Scott's assertion of the previous 
week.  (Scott joined the call later and believes this item is in error; we 
will disregard it.) 
Discussion of this (later disregarded) item led to a discussion of a simple 
event service for job terminations, which in turn led to the following 
decision: 
The PM-Start-Processes request will take an argument to specify who to contact 
with a job completion event.  The form of this argument and the event are TBD. 
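Since the form of the contact argument is TBD, the following only illustrates one possibility: a start request carrying an optional `notify` attribute. The element and attribute names and the "host:port" contact format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def build_start_request(executable: str, hosts: List[str], notify: Optional[str] = None) -> str:
    """Serialize a hypothetical start-processes request.

    `notify` names whoever should receive the job completion event.  Its
    form is TBD; a "host:port" string is assumed here.
    """
    request = ET.Element("start-processes")
    if notify is not None:
        request.set("notify", notify)
    for host in hosts:
        ET.SubElement(request, "process", host=host, executable=executable)
    return ET.tostring(request, encoding="unicode")


print(build_start_request("/bin/hostname", ["node01", "node02"], notify="qm.example.org:5000"))
```

Making the argument optional lets callers that do not care about completion events omit it entirely.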
A discussion of the laundry list of attributes to monitor led to the 
observation that some of these are easy to gather per-node but very hard or 
impractical to gather per-session.  We agreed the specification should leave 
room for all these attributes to be reported per-process and per-session, and 
should also provide a way to report any of them as "not available", or similar. 
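One way to satisfy this agreement is to report every attribute at every scope and mark the ones that cannot be gathered. In this sketch the attribute list, the `format_report` helper and the "not-available" marker are all hypothetical.

```python
from typing import Dict, Mapping, Optional

# Hypothetical attribute list and marker; the real list is still being drafted.
ATTRIBUTES = ("cpu_seconds", "resident_mb", "open_files")
NOT_AVAILABLE = "not-available"


def format_report(scope: str, values: Mapping[str, Optional[float]]) -> Dict[str, str]:
    """Report every attribute at the given scope (node, process or session).

    Attributes the monitor could not gather at this scope (missing or None)
    are reported as NOT_AVAILABLE rather than silently omitted.
    """
    report = {"scope": scope}
    for name in ATTRIBUTES:
        value = values.get(name)
        report[name] = NOT_AVAILABLE if value is None else str(value)
    return report


# CPU time is easy to gather per node but may be impractical per session.
print(format_report("node", {"cpu_seconds": 812.5, "resident_mb": 2048, "open_files": 37}))
print(format_report("session", {"resident_mb": 310}))
```

Because the set of keys is the same at every scope, a consumer can tell "not gathered here" apart from "not part of the specification".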
Before shifting discussion to the ANL schemas, Rusty mentioned he has become 
aware of a schema editor and a validating XML document editor. 
AI: Rusty to post info on these tools and/or advertise them in Houston. 
Rusty described the status of the XML for the Process Manager.  He said the 
current versions exist so that there is something concrete to code against 
and to use when interacting with other components.  The XML details are still 
subject to change. 
Rusty began a walk-through of page 36 of the notebook (PM messages sent and 
received).  The name "process-group-id" was revisited.  New names suggested 
include "session-group-id" and "process-manager-object-id".  This is a 
low-priority issue, but it is desirable to have a name which avoids ... 
Scott resurrected the issue of who provides the identifier.  Thus far, the 
RMWG and PMWG had each assumed that they would specify the identifier at the 
time a "process group" is started.  Rusty said that if the start request 
includes an identifier, the Process Manager can store it.  Subsequent 
requests that query or manipulate the "process group" may then use either 
identifier, with different XML attributes to avoid confusion. 
The Process Manager will accept an optional identifier in a start request. 
Subsequent requests may use either this identifier or the one returned in 
the response to the start request, with distinct XML attributes. 
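The identifier decision can be sketched as a Process Manager table that indexes each process group both by the identifier it generates and by the optional caller-supplied one. Keyword arguments stand in for the distinct XML attributes; the class, method and identifier formats are hypothetical.

```python
import itertools
from typing import Dict, Optional


class ProcessGroupTable:
    """Hypothetical PM bookkeeping for started process groups."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._groups: Dict[str, dict] = {}       # keyed by PM-generated id
        self._by_caller_id: Dict[str, str] = {}  # caller id -> PM id

    def start(self, caller_id: Optional[str] = None) -> str:
        """Record a new group; the PM id is what the start response returns."""
        if caller_id is not None and caller_id in self._by_caller_id:
            raise ValueError(f"duplicate caller id {caller_id!r}")
        pm_id = f"pg-{next(self._ids)}"
        self._groups[pm_id] = {"pm_id": pm_id, "caller_id": caller_id}
        if caller_id is not None:
            self._by_caller_id[caller_id] = pm_id
        return pm_id

    def lookup(self, *, pm_id: Optional[str] = None, caller_id: Optional[str] = None) -> dict:
        """Find a group by exactly one of the two identifiers."""
        if (pm_id is None) == (caller_id is None):
            raise ValueError("specify exactly one of pm_id or caller_id")
        if caller_id is not None:
            pm_id = self._by_caller_id[caller_id]
        return self._groups[pm_id]


table = ProcessGroupTable()
first = table.start(caller_id="rm-job-17")  # RM supplied its own identifier
second = table.start()                      # no identifier supplied
assert table.lookup(caller_id="rm-job-17") is table.lookup(pm_id=first)
print(first, second, table.lookup(pm_id=second))
```

Requiring exactly one identifier per lookup mirrors the intent of using distinct attributes: a request can never be ambiguous about which namespace it refers to.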
Scott said the RM folks expected the (node,sid) pairs in the response to 
the start request.  Rusty advocated a separate call, get-process-info, to 
gather this info.  Rusty envisions his implementation having a non-blocking 
start and a blocking get-process-info.  Separating the calls means the 
interface does not depend on which calls are blocking and which are 
non-blocking. 
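The split Rusty described can be sketched with a background thread: `start_processes` returns an identifier immediately, while `get_process_info` waits until the (node,sid) pairs are known. The launch is simulated and the session IDs are fabricated stand-ins; the class and method names are hypothetical.

```python
import threading
import time
from typing import Dict, List, Optional, Tuple


class ProcessManagerSketch:
    """Hypothetical PM with a non-blocking start and a blocking info query."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._groups: Dict[str, dict] = {}
        self._counter = 0

    def start_processes(self, nodes: List[str]) -> str:
        """Non-blocking: register the group, launch in the background, return its id."""
        with self._lock:
            self._counter += 1
            group_id = f"pg-{self._counter}"
            entry = {"ready": threading.Event(), "info": []}
            self._groups[group_id] = entry
        threading.Thread(target=self._launch, args=(entry, nodes), daemon=True).start()
        return group_id

    def _launch(self, entry: dict, nodes: List[str]) -> None:
        # Simulated remote launch: pretend starting takes a moment, then
        # assign a fabricated session ID per node.
        time.sleep(0.05)
        entry["info"] = [(node, 1000 + i) for i, node in enumerate(nodes)]
        entry["ready"].set()

    def get_process_info(self, group_id: str, timeout: Optional[float] = None) -> List[Tuple[str, int]]:
        """Blocking: wait until the group's (node, sid) pairs are available."""
        with self._lock:
            entry = self._groups[group_id]
        if not entry["ready"].wait(timeout):
            raise TimeoutError(f"process info for {group_id} not ready")
        return list(entry["info"])


pm = ProcessManagerSketch()
gid = pm.start_processes(["node01", "node02"])     # returns immediately
print(gid, pm.get_process_info(gid, timeout=5.0))  # waits for the launch
```

A caller that wants the old "pairs in the start response" behavior simply issues both calls back to back.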
The question came up as to how an executable (binary or job script) arrives 
on the nodes.  We discussed the way PBS does it, and want to move away from 
that approach.  Instead we envision "staging" of the job script and any 
executables as another "job step", which might be done as a 
pm-start-processes request. 
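Treating staging as an ordinary job step might look like the following, where a job is an ordered list of start requests and the first one copies the files into place. This assumes each node can read the source paths and has `cp` available; `StartRequest` and `job_steps` are hypothetical names, and running the script on the first node only is also an assumption.

```python
import posixpath
from dataclasses import dataclass
from typing import List


@dataclass
class StartRequest:
    """One hypothetical pm-start-processes request: run argv on each listed node."""
    argv: List[str]
    nodes: List[str]


def job_steps(script: str, executables: List[str], stage_dir: str, nodes: List[str]) -> List[StartRequest]:
    """Return the job as ordered steps, with staging as the first step.

    Assumes every node can read the source paths and that stage_dir already
    exists on each node; a real staging step might use a remote copy instead.
    """
    staging = StartRequest(argv=["cp", script, *executables, stage_dir], nodes=list(nodes))
    staged_script = posixpath.join(stage_dir, posixpath.basename(script))
    run = StartRequest(argv=[staged_script], nodes=nodes[:1])
    return [staging, run]


for step in job_steps("/home/user/job.sh", ["/home/user/a.out"], "/tmp/stage", ["node01", "node02"]):
    print(step)
```

The point of the design is that the Process Manager needs no special file-transfer mechanism: staging is just another set of processes it starts.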
Based on earlier discussions, we (verbally) added a job-termination-event to 
the list of sent messages on notebook page 36. 
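The contents of the new job-termination-event were not specified. Purely as an illustration, such a message could carry the process group identifier and per-process exit codes; the element and attribute names below (including the still-debated `process-group-id`) are placeholders.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_termination_event(group_id: str, exit_codes: Dict[str, int]) -> str:
    """Serialize a hypothetical job-termination-event message.

    `exit_codes` maps each node to the exit status of its process; nodes
    are emitted in sorted order so the message is deterministic.
    """
    event = ET.Element("job-termination-event", {"process-group-id": group_id})
    for node, code in sorted(exit_codes.items()):
        ET.SubElement(event, "process", {"node": node, "exit-code": str(code)})
    return ET.tostring(event, encoding="unicode")


print(build_termination_event("pg-1", {"node02": 1, "node01": 0}))
```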
Schedule next call  
The next call is scheduled for 2/14/2002 at 10:00 Pacific Time.  
 To Attend:   
  Long Distance users call 1-877-252-5250,      
  Local users call 510-647-3480,     
press 1, enter 160910# and follow the instructions.