Process Management and Monitoring Notebook - page 20 of 74

Minutes of 2001.10.11 conf call

Scalable Systems Software - Process Management Working Group
Teleconference 10/11/01 10:00 AM (PDT) 

Working Group Chair:                    Paul Hargrove
Meeting Minutes:                        Eric Roman

Jason Duell         LBNL
Paul Hargrove       LBNL
Eric Roman          LBNL
Mike Welcome        LBNL
Scott Jackson       PNNL
Brett Bode          Ames
Erik DeBenedictis   SNL

Issues Discussed
Num  Issue                                                            Status
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^
I4   What happens when two jobs join together through MPI_Connect?    Open
I6   How do we specify that one job runs after another one?           Open
I7   How do we specify that one job runs at the same time as another? Open
I8   What happens if two create-process requests are issued with the  Open
     same set of nodes?  Queue?  Kill?  Error?  Preempt?  Undefined?
I9   How do we stop an epilogue script from blowing away a temporary  Open

Job dependencies
Q [Paul]:  How do dependencies between jobs work?  How do you say "this job
must run after that one" or "this job must run at the same time as that
one"?  (I6, I7)

D:  This scenario may occur during coscheduling.

Q [Jason]:  How do you run one job after another and say that the
filesystem must be preserved between jobs (i.e., don't delete /scratch)?

Q [Paul]:  Paul restated the question as "What type of support does the
process manager need to provide for dependencies between jobs?"

D:  Erik suggests specifying the input and output files of the job inside
the request through an object similar to a file descriptor.  He mentioned
possible utilization problems when running shell scripts on a system.

D:  Scott asked whether the scheduler and resource manager would need to
look inside the job script to determine where the job's standard output files
need to go.  He suggested using a "preserves" list inside the process manager
request to ask it not to blow away certain files.  (I9)

D:  Tentative solution.  The process manager would accept a list of files
to leave on the node.  "Please leave these files".  The process manager 
would also accept a field that said: "Please make sure that these files
are on the node when I start."
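
The two lists in this tentative solution could be carried in the request
itself.  The following is only a minimal sketch: the class name, field names,
and helper functions are hypothetical and were not specified on the call.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ProcessRequest:
    """Hypothetical create-process request; no name here comes from the call."""
    executable: str
    nodes: List[str]
    # "Please make sure that these files are on the node when I start."
    require_files: List[str] = field(default_factory=list)
    # "Please leave these files."
    preserve_files: List[str] = field(default_factory=list)


def missing_inputs(request: ProcessRequest, files_on_node: Set[str]) -> List[str]:
    """Return the required files that are absent from the node."""
    return [f for f in request.require_files if f not in files_on_node]


def files_to_clean(request: ProcessRequest, files_on_node: Set[str]) -> List[str]:
    """Return the files an epilogue may delete: all except the preserve list."""
    keep = set(request.preserve_files)
    return sorted(f for f in files_on_node if f not in keep)
```

Under these assumptions, a first job would list its output under
preserve_files, and a dependent job would list the same path under
require_files, so the epilogue of the first job leaves that file in /scratch.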

D:  Another tentative solution.  The job could request a "stable storage"
resource in some temporary file space.  This storage would be considered
persistent.

Q:  How do we let 2 "connecting" MPI jobs know how to talk to each other?  (I4)

D:  An intercommunicator object could store that information.  A type of
rendezvous object would be instantiated.  The rendezvous object would
hold the information that the job needed to know to speak to its other half.
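
One way to picture such a rendezvous object is a small shared table in which
each half publishes its contact information and waits for its peer's.  This
is a sketch only: the class, its methods, and the "host:port" contact format
are assumptions made for illustration, not a design agreed on the call.

```python
import threading
from typing import Dict, Optional


class Rendezvous:
    """Hypothetical rendezvous object shared by two connecting jobs."""

    def __init__(self) -> None:
        self._contacts: Dict[str, str] = {}
        self._cond = threading.Condition()

    def publish(self, job_id: str, contact: str) -> None:
        """Record how to reach job_id and wake any waiting peer."""
        with self._cond:
            self._contacts[job_id] = contact
            self._cond.notify_all()

    def lookup(self, job_id: str, timeout: Optional[float] = None) -> Optional[str]:
        """Wait until job_id has published; return None if the wait times out."""
        with self._cond:
            self._cond.wait_for(lambda: job_id in self._contacts, timeout=timeout)
            return self._contacts.get(job_id)
```

Each job would call publish() with its own identifier and lookup() with its
partner's, so neither side needs to know in advance where the other runs.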

Q [Erik]:  How does the resource management system deal with deadlocks?
What if there is a cyclic dependency between two jobs?

A:  This is probably a programming error.  It shouldn't be possible.
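
If cyclic dependencies are treated as an error, the request could be checked
before any job is started.  The sketch below uses a depth-first search over a
"runs after" table; the table layout and function name are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(run_after: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of job ids, or None.

    run_after[j] lists the jobs that job j must run after.
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(job: str) -> Optional[List[str]]:
        state[job] = ACTIVE
        path.append(job)
        for dep in run_after.get(job, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == ACTIVE:
                # dep is already on the current path, so the path loops back.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[job] = DONE
        return None

    for job in list(run_after):
        if state.get(job, UNSEEN) == UNSEEN:
            cycle = visit(job)
            if cycle:
                return cycle
    return None
```

For example, find_cycle({"A": ["B"], "B": ["A"]}) returns ["A", "B", "A"],
which could be reported back to the submitter as the offending dependency.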

Q [Eric]:  What happens if the process manager is sent two requests, where
both requests ask to run an application on the same node(s)?  Does the PM queue
the second request, return an error, or run it?  How do I run concurrent
jobs on the same machine?  (I8)

D:  This needs to be defined.
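
Until the behavior is defined, the candidate answers can at least be
enumerated.  The sketch below models three of them (queue, return an error,
run concurrently) as a configurable policy.  All names are hypothetical, and
the sketch does not recommend one policy over another.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, List, Tuple


class ConflictPolicy(Enum):
    QUEUE = "queue"   # hold the second request until the nodes are free
    ERROR = "error"   # reject the second request
    SHARE = "share"   # run both jobs on the same node(s)


class NodeConflictError(Exception):
    """Raised under the ERROR policy when a requested node is in use."""


class ProcessManager:
    """Hypothetical process manager that tracks which jobs occupy which nodes."""

    def __init__(self, policy: ConflictPolicy) -> None:
        self.policy = policy
        self.busy: Dict[str, List[str]] = {}
        self.pending: Deque[Tuple[str, List[str]]] = deque()

    def submit(self, job_id: str, nodes: List[str]) -> str:
        """Apply the conflict policy; return "running" or "queued"."""
        conflict = any(self.busy.get(n) for n in nodes)
        if conflict and self.policy is ConflictPolicy.ERROR:
            raise NodeConflictError(f"{job_id}: requested node already in use")
        if conflict and self.policy is ConflictPolicy.QUEUE:
            self.pending.append((job_id, nodes))
            return "queued"
        for n in nodes:
            self.busy.setdefault(n, []).append(job_id)
        return "running"
```

Kill and preempt, the other options raised under I8, would need the process
manager to act on the job already running and are left out of this sketch.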

Status of action items
The discussion was tabled until the 10/18/01 conference call.

Schedule next call
The next call is scheduled for 10/18/01 at 10:00 PDT.

To attend:
  Long-distance users call 1-877-252-5250.
  Local users call 510-647-3480.

Press 1, enter 160910#, and follow the instructions.