Process Management and Monitoring Notebook - page 65 of 74

Date and Author(s)

New XML draft for Process Manager

Proposed Change to the XML Interface to the Process Manager  
This is an informal specification of part of the XML interface to the  
Process Manager Component.  It is a substantial revision of and addition  
to what has been used up to now.  The main addition is to the query  
message (what jobs are running on what hosts, etc.), and this has been  
made consistent with the "restriction syntax" for information queries in  
general originally developed by Narayan and Andrew, and tweaked by  
Narayan and Rusty.  The format of a query request is here defined for  
the first time.  The process group submission request currently in  
use has been changed to be consistent with this format.  It is proposed  
that the signal-process-group and kill-process-group messages be  
generalized to take advantage of the restriction format, so that one can  
signal and kill jobs based on their properties without first retrieving  
their process-group ids.  The return messages then contain descriptions  
of the process groups and processes signalled or killed.  
The component of a parallel job as seen by the Process Manager is called  
a process-group.  It is submitted by a single user, identified by a  
single process-group id, or pgid, and consists of some (original) number  
of processes.  How stdio is handled is specified for the group as a  
whole.  Each process is described by an executable filename, a  
path in which to search for it, arguments and environment variables, a  
user name, and a working directory.  All of these may vary from process to  
process within the process group.  
To make the specification scalable, we use the "range" attribute rather  
than duplicating entries.  Ranges are in the "squash" format, e.g.  
" range='1-10,15,17,20-25' ".  Ranges start at 1.  
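As an illustration of the "squash" format, here is a minimal Python sketch that expands a range string into the process numbers it names.  The function name expand_range is hypothetical and not part of the interface; the sketch assumes comma-separated items that are either a single number or an inclusive low-high pair.

```python
def expand_range(spec: str) -> list[int]:
    """Expand a squash-format range such as '1-10,15,17,20-25' into a list of ints."""
    ranks: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and surrounding whitespace
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            ranks.extend(range(lo, hi + 1))  # both ends are inclusive
        else:
            ranks.append(int(part))
    if any(r < 1 for r in ranks):
        raise ValueError("ranges start at 1")
    return ranks
```

For the example above, expand_range('1-10,15,17,20-25') names 18 processes.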
The specific names of entities and attributes are only suggestions.  
Surely improvements can be made in the interests of clarity and consistency.  
To request that an attribute value be returned in a query, we use '*' as  
a value for the attribute in the query or implied query.  
To create a process group, send the Process Manager a command like the  
one below.  We separate specifications for the processes from the  
specifications for the hosts because the specifications for the  
processes are likely to be the same for all processes or, as here,  
consist of only a small number of different specifications, while the  
host specifications are likely to be complete (one for each process) as  
selected by a scheduler, or else missing altogether, to allow the  
process manager to choose hosts.  For the host specification, we use   
a text node to specify an ordered list of host names.  
    <arg idx='1' val='-loops' />  
    <arg idx='2' val='1000'   />  
    <env name='TV_LICENSE' val='23416784' />  
    <env name='TV_LICENSE' val='23416784' />  
For multiple processes on one host, one can repeat the host name.  
An alternative is to use the "squash" format in host names, e.g.  
<hostspec name='ccn-%d:64-73' />  
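A possible reading of this host name format, sketched in Python: the text before the last ':' is taken as a printf-style template and the text after it as a squash range.  That interpretation, and the name expand_hosts, are assumptions made for illustration; the specification above gives only the one example.

```python
def expand_hosts(name: str) -> list[str]:
    """Expand a name such as 'ccn-%d:64-73' into individual host names.

    Assumed reading: '<template>:<squash range>', where the template
    contains a single %d.  A name without ':' is returned unchanged.
    """
    if ":" not in name:
        return [name]
    template, ranges = name.rsplit(":", 1)
    hosts: list[str] = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        # A single number such as '70' has no upper bound, so reuse lo.
        for n in range(int(lo), int(hi or lo) + 1):
            hosts.append(template % n)
    return hosts
```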
To find out about running jobs, send the process manager a message in  
the following format.  The idea is to permit logically complex queries:  
"What are the process groups with processes running on either ccn-20 or  
ccn-21 with executable b.out?"  To avoid clutter, the process  
restriction entities are assumed to be and'ed together, and the  
process-group entities are assumed to be or'ed together.  Since any  
logical formula can be put in disjunctive normal form (disjunction of  
conjunctions) this is sufficient, if not always most convenient.  
The goal is to avoid clutter for simple queries.  A value of '*' for an  
attribute means that one wants the value returned (see format of  
returned process groups below).  If an attribute is missing then the  
value is ignored (neither used as a restriction nor returned).  
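The matching rule described above can be sketched in Python as follows.  This is only an illustration of the and/or semantics, not the Process Manager's implementation: the dictionary layout and all function names are hypothetical, and the sketch assumes that each restriction must be satisfied by at least one process in the group.

```python
def attrs_match(pattern, actual):
    """True if every restricting attribute in pattern equals the actual value.

    A value of '*' only asks for the attribute to be returned, so it never
    restricts; attributes missing from the pattern are ignored entirely.
    """
    return all(actual.get(k) == v for k, v in pattern.items() if v != "*")


def group_matches(query_group, group):
    """One query entity: group attributes plus and'ed process restrictions."""
    if not attrs_match(query_group["attrs"], group["attrs"]):
        return False
    # Assumption: each restriction must be met by at least one process.
    return all(
        any(attrs_match(restriction, proc) for proc in group["processes"])
        for restriction in query_group["restrictions"]
    )


def select_groups(query, groups):
    """The query entities are or'ed: a group is returned if any one matches."""
    return [g for g in groups if any(group_matches(q, g) for q in query)]


# "Process groups with processes running on either ccn-20 or ccn-21
# with executable b.out", written as a disjunction of two conjunctions.
QUERY = [
    {"attrs": {"pgid": "*"}, "restrictions": [{"host": "ccn-20", "exec": "b.out"}]},
    {"attrs": {"pgid": "*"}, "restrictions": [{"host": "ccn-21", "exec": "b.out"}]},
]
GROUPS = [  # made-up sample data
    {"attrs": {"pgid": "1"}, "processes": [{"host": "ccn-20", "exec": "b.out"}]},
    {"attrs": {"pgid": "2"}, "processes": [{"host": "ccn-21", "exec": "a.out"}]},
    {"attrs": {"pgid": "3"}, "processes": [{"host": "ccn-21", "exec": "b.out"}]},
]
selected = [g["attrs"]["pgid"] for g in select_groups(QUERY, GROUPS)]
```

With this sample data, groups 1 and 3 are selected; group 2 runs on ccn-21 but with a different executable.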
The following example retrieves the pgids of process groups that were  
submitted by lusk or desai; in lusk's case, it returns only the process  
groups that have processes running on two specific hosts.  The  
restrictions select process groups; all the processes in a selected  
process group are always returned.  
  <process-group submitter='lusk' pgid='*' totalprocs='*' >  
    <process-group-restriction pid='*' exec='*' host='ccn-70' />  
    <process-group-restriction pid='*' exec='*' host='ccn-230' />  
  </process-group>  
  <process-group submitter='desai' pgid='*' />  
The message returned by such a query is a set of process groups, with  
details on their processes filled in as requested by the query.   
  <process-group submitter='lusk' pgid='4521' totalprocs='10'>  
    <process pid='3456' exec='cpi_master' host='ccn-64' />  
    <process pid='1324' exec='cpi_slave' host='ccn-65' />  
    <process pid='7654' exec='cpi_slave' host='ccn-66' />  
    <process pid='6758' exec='cpi_slave' host='ccn-67' />  
    <process pid='9601' exec='cpi_slave' host='ccn-68' />  
    <process pid='5634' exec='cpi_slave' host='ccn-69' />  
    <process pid='7865' exec='cpi_slave' host='ccn-70' />  
    <process pid='9876' exec='cpi_slave' host='ccn-71' />  
    <process pid='6524' exec='cpi_slave' host='ccn-72' />  
    <process pid='3452' exec='cpi_slave' host='ccn-73' />  
  </process-group>  
  <process-group submitter='lusk' pgid='23' totalprocs='1'>  
    <process pid='5554' exec='mpd' host='ccn-230' />  
  </process-group>  
  <process-group submitter='desai' pgid='244' />  
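A client might read such a reply with a few lines of Python, sketched here with the standard xml.etree.ElementTree module.  The enclosing element name process-groups is an assumption made only so that the sample is a well-formed document, the function name hosts_by_pgid is hypothetical, and the sample reply is trimmed to two processes.

```python
import xml.etree.ElementTree as ET

# Trimmed sample reply; the <process-groups> wrapper is assumed, not specified.
REPLY = """
<process-groups>
  <process-group submitter='lusk' pgid='4521'>
    <process pid='3456' exec='cpi_master' host='ccn-64' />
    <process pid='1324' exec='cpi_slave' host='ccn-65' />
  </process-group>
  <process-group submitter='desai' pgid='244' />
</process-groups>
"""


def hosts_by_pgid(xml_text):
    """Map each returned pgid to the hosts of the processes listed for it."""
    root = ET.fromstring(xml_text)
    return {
        group.get("pgid"): [proc.get("host") for proc in group.findall("process")]
        for group in root.findall("process-group")
    }
```

A process group returned without process details, like the one for desai, maps to an empty list.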
The signal-process-group (deliver a specified signal to a process group) and  
kill-process-group (completely clean up a process group) messages are  
extended so that the process groups being signalled or killed can be  
specified with the same syntax as get-process-group.  They return the  
same format, defined above, to indicate which  
processes they acted on.  
Signals can be specified by either name or number.  
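One way a receiver could accept both forms is sketched below in Python, using the standard signal module.  The function name resolve_signal is hypothetical, and accepting names with or without the SIG prefix is an assumption; the specification above does not say which spelling of signal names is expected.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Turn a signal attribute value (name or number) into a signal.Signals member."""
    text = value.strip()
    if text.isdigit():
        # Numeric form, e.g. signal='15'; raises ValueError if unknown.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name  # assumed convenience: 'term' means 'SIGTERM'
    # Name form, e.g. signal='SIGTERM'; raises KeyError if unknown.
    return signal.Signals[name]
```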
The following command sends signal 3 to all the processes of all jobs  
submitted by lusk, and returns the details of which process groups  
they were.  
<signal-process-group  signal='3'>  
  <process-group submitter='lusk' pgid='*' />  
</signal-process-group>  
<kill-process-group>  
  <process-group submitter='*'>  
    <process-group-restriction host='ccn-56' />  
  </process-group>  
</kill-process-group>  
The above command kills all process groups with processes running on  
ccn-56, and returns their submitters, so that they can be told the sad news.