Proposed Change XML Interface to the Process Manager 2/1/03

This is an informal specification of part of the XML interface to the Process Manager component. It substantially revises and extends what has been used up to now. The main addition is the query message (what jobs are running on what hosts, etc.), which has been made consistent with the "restriction syntax" for information queries in general, originally developed by Narayan and Andrew and tweaked by Narayan and Rusty. The format of a query request is defined here for the first time. The process group submission request currently in use has been changed to be consistent with this format.

It is proposed that the signal-process-group and kill-process-group messages be generalized to take advantage of the restriction format, so that one can signal and kill jobs based on their properties without first retrieving their process-group ids. The return messages then contain descriptions of the process groups and processes signalled or killed.

The component of a parallel job as seen by the Process Manager is called a process group. It is submitted by a single user, is identified by a single process-group id, or pgid, and consists of some (original) number of processes. How stdio is handled is specified for the group as a whole. Each process consists of an executable filename, a path in which to search for it, arguments and environment variables, a user name, and a working directory. All of these may vary from process to process within the process group.

To make the specification scalable, we use the "range" attribute rather than duplicating entries. Ranges are in the "squash" format, e.g. range='1-10,15,17,20-25'. Ranges start at 1.

The specific names of entities and attributes are only suggestions. Surely improvements can be made in the interests of clarity and consistency.
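As an illustration of the "squash" format used by the range attribute, the following Python sketch expands a range string into the list of process ranks it denotes. The function name expand_squash is hypothetical and not part of the proposal, and the error handling reflects only one possible reading of "ranges start at 1".

```python
def expand_squash(spec):
    """Expand a squash-format range such as '1-10,15,17,20-25' into a list of ranks.

    Hypothetical helper for illustration; not part of the proposed interface.
    """
    ranks = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or surrounding blanks
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError("descending range: %r" % part)
            ranks.extend(range(low, high + 1))  # both ends are inclusive
        else:
            ranks.append(int(part))
    if any(rank < 1 for rank in ranks):
        raise ValueError("ranges start at 1")
    return ranks


print(expand_squash("2-10"))  # ranks covered by the second process-spec in the example
```

A process manager could use such an expansion to check that the process-spec ranges of a request together cover ranks 1 through totalprocs exactly once.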
To request that an attribute value be returned in a query, we use '*' as the value for the attribute in the query or implied query.

To create a process group, send the Process Manager a command like the one below. We separate the specifications for the processes from the specifications for the hosts because the process specifications are likely to be the same for all processes or, as here, to consist of only a small number of different specifications, while the host specifications are likely either to be complete (one for each process), as selected by a scheduler, or to be missing altogether, to allow the process manager to choose hosts. For the host specification, we use a text node to specify an ordered list of host names.
    <create-process-group pgid='job23' submitter='lusk' totalprocs='10' output='discard' >
      <process-spec range='1' exec='cpi_master' user='ell' cwd='/home/ell/rundir'
                    path='/home/ell/progs' coprocess='tvdebuggersrv' >
        <arg idx='1' val='-loops' />
        <arg idx='2' val='1000' />
        <env name='TV_LICENSE' val='23416784' />
      </process-spec>
      <process-spec range='2-10' exec='cpi_slave' user='ell' cwd='/home/ell/rundir'
                    path='/home/ell/progs' coprocess='tvdebuggersrv' >
        <env name='TV_LICENSE' val='23416784' />
      </process-spec>
      <host-spec>
        ccn-64 ccn-65 ccn-66 ccn-67 ccn-68 ccn-69 ccn-70 ccn-71 ccn-72 ccn-73
      </host-spec>
    </create-process-group>

For multiple processes on one host, one can repeat the host name. An alternative is to use the "squash" format in host names, e.g.

    <hostspec name='ccn-%d:64-73' />

To find out about running jobs, send the process manager a message in the following format. The idea is to permit logically complex queries: "What are the process groups with processes running on either ccn-20 or ccn-21 with executable b.out?" To avoid clutter, the process restriction entities are assumed to be and'ed together, and the process-group entities are assumed to be or'ed together. Since any logical formula can be put in disjunctive normal form (a disjunction of conjunctions), this is sufficient, if not always most convenient. The goal is to avoid clutter for simple queries.

A value of '*' for an attribute means that one wants the value returned (see the format of returned process groups below). If an attribute is missing, it is ignored (neither used as a restriction nor returned).

The following example retrieves the pgids of the process groups that were submitted by lusk or desai and, in lusk's case, returns only the process groups that have processes running on two specific hosts. The restrictions are on the process groups; we always return all the processes in a process group.
    <get-process-groups>
      <process-group submitter='lusk' pgid='*' totalprocs='*' >
        <process-group-restriction pid='*' exec='*' host='ccn-70' />
        <process-group-restriction pid='*' exec='*' host='ccn-230' />
      </process-group>
      <process-group submitter='desai' pgid='*' >
      </process-group>
    </get-process-groups>

The message returned by such a query is a set of process groups, with details on their processes filled in as requested by the query.

    <process-groups>
      <process-group submitter='lusk' pgid='4521' totalprocs='10'>
        <process pid='3456' exec='cpi_master' host='ccn-64' />
        <process pid='1324' exec='cpi_slave' host='ccn-65' />
        <process pid='7654' exec='cpi_slave' host='ccn-66' />
        <process pid='6758' exec='cpi_slave' host='ccn-67' />
        <process pid='9601' exec='cpi_slave' host='ccn-68' />
        <process pid='5634' exec='cpi_slave' host='ccn-69' />
        <process pid='7865' exec='cpi_slave' host='ccn-70' />
        <process pid='9876' exec='cpi_slave' host='ccn-71' />
        <process pid='6524' exec='cpi_slave' host='ccn-72' />
        <process pid='3452' exec='cpi_slave' host='ccn-73' />
      </process-group>
      <process-group submitter='lusk' pgid='23' totalprocs='1'>
        <process pid='5554' exec='mpd' host='ccn-230' />
      </process-group>
      <process-group submitter='desai' pgid='244' >
      </process-group>
    </process-groups>

The signal-process-group (deliver a specified signal to a process group) and kill-process-group (completely clean up a process group) messages are extended so that the process groups to be signalled or killed can be described with the same syntax as get-process-groups. They return the same format (<process-groups>, defined above) to indicate which processes they acted on. Signals can be specified by either name or number.

The following command sends signal 3 to all the processes of all jobs submitted by lusk, and returns the details of which process groups they were.
    <signal-process-group signal='3'>
      <process-group submitter='lusk' pgid='*' />
    </signal-process-group>

    <kill-process-group>
      <process-group submitter='*'>
        <process-group-restriction host='ccn-56' />
      </process-group>
    </kill-process-group>

The kill-process-group command above kills all process groups with processes running on ccn-56, and returns their submitters, so that they can be told the sad news.
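The selection rule shared by get-process-groups, signal-process-group and kill-process-group can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the proposal: the function names, the GROUPS data and the dictionary layout are hypothetical, and the sketch takes the text literally in treating the restrictions inside one process-group element as and'ed and the process-group elements as or'ed.

```python
import xml.etree.ElementTree as ET


def attrs_match(wanted, actual):
    """True if every attribute with a value other than '*' equals the actual value."""
    return all(actual.get(name) == value
               for name, value in wanted.items() if value != "*")


def select_process_groups(request_xml, groups):
    """Return the process groups selected by a request (hypothetical helper)."""
    request = ET.fromstring(request_xml)
    selected = []
    for group in groups:
        # The process-group elements of a request are or'ed together.
        for clause in request.findall("process-group"):
            if not attrs_match(clause.attrib, group):
                continue
            restrictions = clause.findall("process-group-restriction")
            # Restrictions are and'ed: each one needs at least one matching process.
            if all(any(attrs_match(r.attrib, proc) for proc in group["processes"])
                   for r in restrictions):
                selected.append(group)
                break
    return selected


# Hypothetical in-memory state of a process manager, for illustration only.
GROUPS = [
    {"pgid": "1", "submitter": "lusk", "processes": [
        {"pid": "100", "exec": "a.out", "host": "ccn-56"},
        {"pid": "101", "exec": "a.out", "host": "ccn-57"},
    ]},
    {"pgid": "2", "submitter": "desai", "processes": [
        {"pid": "200", "exec": "b.out", "host": "ccn-60"},
    ]},
]

KILL_REQUEST = """
<kill-process-group>
  <process-group submitter='*'>
    <process-group-restriction host='ccn-56' />
  </process-group>
</kill-process-group>
"""

victims = select_process_groups(KILL_REQUEST, GROUPS)
print([group["submitter"] for group in victims])  # ['lusk']
```

The sketch covers only selection. Building the returned process-groups message would additionally copy into the reply the attributes whose value in the request was '*'.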