|Date and Author(s)|
Translate DAG, or sequence of steps within a job,
into multiple single-step jobs
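The translation described above can be sketched roughly as follows. This is only an illustration, not a proposed implementation: the `SingleStepJob` record, the `translate_dag` function, and the `<parent>.<step>` ID scheme are all hypothetical names, and the step dependencies are assumed to arrive as a simple name-to-prerequisites mapping.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class SingleStepJob:
    job_id: str                                     # hypothetical scheme: "<parent>.<step>"
    command: str
    depends_on: list = field(default_factory=list)  # job IDs that must finish first


def translate_dag(parent_id, steps, edges):
    """Split one multi-step job into single-step jobs with dependency constraints.

    steps: dict mapping step name -> command line
    edges: dict mapping step name -> set of step names it depends on
    """
    graph = {name: set(edges.get(name, ())) for name in steps}

    # Sanity check: every prerequisite must itself be a declared step.
    unknown = {dep for deps in graph.values() for dep in deps} - set(steps)
    if unknown:
        raise ValueError(f"undeclared steps referenced: {sorted(unknown)}")

    # static_order() yields prerequisites first and raises CycleError
    # (once iterated) if the steps do not form a DAG.
    order = TopologicalSorter(graph).static_order()
    return [
        SingleStepJob(
            job_id=f"{parent_id}.{name}",
            command=steps[name],
            depends_on=sorted(f"{parent_id}.{dep}" for dep in graph[name]),
        )
        for name in order
    ]
```

Each resulting job carries only the IDs of its prerequisites, so the scheduler can treat the steps as independent jobs with constraints.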
Must translate/sanity check job
Multi-step jobs must be supported
Should allow specification of required resources for both active and suspended jobs
Should accept multi-step job script and translate it into multiple job steps with peer job constraints.
Should manage unique job IDs
Should determine/track job state and failures
Should collect/process job performance statistics.
Must make job performance statistics available via API and user command client.
Should handle routing of interactive STDIN/STDOUT/STDERR between submission client and job executable
Should capture required aspects of the user's environment via the submission client (i.e., environment variables; credentials: UID, group list, DCE, AFS, K5, etc.; default working directory?; default shell)
Should launch prolog, job task, epilog via process manager (admin and user level prolog/epilog support)
Must hold and manage credentials (credential refresh for queued jobs, etc.)
Maintains available state information on all active and past jobs
Allow dynamic job modification
Provide pass through support for job suspend/resume, checkpoint/restart, signalling, and cancelling
Should request process manager launch job on a specific set of nodes
Should contact node manager to handle job requested node modification (i.e., OS rebuild)
Should manage data migration (recovery, continuation, etc.)
Should determine/maintain job state, including prestart, start, poststart, data staged, running, completed, etc.
Must track job checkpoint state (and all associated constraints including node allocations requirements, checkpoint file staging requirements, etc.)
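The state tracking requirements above could be met with a small state machine per job. The sketch below is illustrative only. The state names come from the list; the linear ordering (taken from the order in which the states are listed), the extra `FAILED` state, and the `JobRecord` class are assumptions.

```python
from enum import Enum


class JobState(Enum):
    PRESTART = "prestart"
    START = "start"
    POSTSTART = "poststart"
    DATA_STAGED = "data staged"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"          # assumption: not named in the list above


# Assumed linear progression, in the order the states are listed.
_ORDER = [JobState.PRESTART, JobState.START, JobState.POSTSTART,
          JobState.DATA_STAGED, JobState.RUNNING, JobState.COMPLETED]

# Every non-final state may move to its successor or to FAILED.
ALLOWED = {cur: {nxt, JobState.FAILED} for cur, nxt in zip(_ORDER, _ORDER[1:])}


class JobRecord:
    """Keeps the full state history so past jobs remain queryable."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.history = [JobState.PRESTART]

    @property
    def state(self):
        return self.history[-1]

    def advance(self, new_state):
        # COMPLETED and FAILED have no entry in ALLOWED, so they are final.
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(
                f"job {self.job_id}: {self.state.value} -> {new_state.value} not allowed"
            )
        self.history.append(new_state)
```

Keeping the history rather than only the current state is one way to retain information on both active and past jobs.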
Phase 1: Target: April 1, 2002
Base feature list:
accept job submission
support command line only
|create submission client|
|maintain queue of persistent job information|
|create polling 'job query' interface|
feed scheduler/node monitor
|create 'job launch' interface (receive request)|
|create 'job launch' interface
|assign unique job id to submitted jobs|
|store job state, session ID|
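A minimal sketch of how the Phase 1 pieces might fit together is given below. Nothing here is specified by the feature list beyond the behaviors themselves: the `JobQueue` class, its method names, the JSON file used for persistence, and the `queued`/`running` state labels are all assumptions made for illustration.

```python
import itertools
import json
from pathlib import Path


class JobQueue:
    """Persistent queue of job information backed by a JSON file (assumed format)."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload persisted jobs so the queue survives a restart.
        self.jobs = json.loads(self.path.read_text()) if self.path.exists() else {}
        # Continue numbering after the highest ID already issued.
        next_id = max((int(jid) for jid in self.jobs), default=0) + 1
        self._ids = itertools.count(next_id)

    def submit(self, command, user):
        """Accept a submission and assign it a unique job ID."""
        job_id = str(next(self._ids))
        self.jobs[job_id] = {"command": command, "user": user,
                             "state": "queued", "session_id": None}
        self._save()
        return job_id

    def query(self, state=None):
        """Polling 'job query' interface; optionally filter by state."""
        return {jid: dict(info) for jid, info in self.jobs.items()
                if state is None or info["state"] == state}

    def record_launch(self, job_id, session_id):
        """Store the job state and session ID once the job has been launched."""
        self.jobs[job_id].update(state="running", session_id=session_id)
        self._save()

    def _save(self):
        self.path.write_text(json.dumps(self.jobs))
```

A scheduler or node monitor would call `query()` periodically, and the launch interface would call `record_launch()` with the session ID it receives back.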
Phase 2: Target: N/A
Base feature list:
What credentials will be managed/passed via the job
manager/process manager interface?
Group list? ???
Delegated global credentials? ???
The scheduler need only be aware of credentials; it does not need to possess or manage them (outside of potentially refreshing them).
Should the concept of a 'job group' be maintained? An epilog/prolog need only be run once for a collection of jobs within the same job group on the same set of nodes. Job steps must explicitly request constant node allocation if they should be run on identical nodes.
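If the job group concept is kept, the run-once rule for prolog/epilog might be tracked as sketched below. This is only one possible reading: the `JobGroupHooks` class and its methods are hypothetical, and it assumes that every job in a group is registered before the first one starts.

```python
from collections import Counter


class JobGroupHooks:
    """Decides when prolog/epilog must run for a (job group, node set) pair."""

    def __init__(self):
        self._remaining = Counter()   # registered jobs not yet finished, per key
        self._prolog_done = set()     # keys whose prolog has already run

    @staticmethod
    def _key(group, nodes):
        # The node set is order-insensitive, so use a frozenset.
        return (group, frozenset(nodes))

    def register(self, group, nodes):
        """Declare one job of the group on this node set."""
        self._remaining[self._key(group, nodes)] += 1

    def needs_prolog(self, group, nodes):
        """True only for the first job of the group started on this node set."""
        key = self._key(group, nodes)
        if key in self._prolog_done:
            return False
        self._prolog_done.add(key)
        return True

    def needs_epilog(self, group, nodes):
        """Call when a job finishes; True only for the last job of the group."""
        key = self._key(group, nodes)
        self._remaining[key] -= 1
        if self._remaining[key] > 0:
            return False
        del self._remaining[key]
        self._prolog_done.discard(key)
        return True
```

Jobs in the same group that land on a different node set get their own prolog and epilog, which matches the requirement that steps explicitly request a constant node allocation to share nodes.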
Both scheduler and job manager should provide user
interfaces to initiate checkpoint requests.
The job manager does NOT have a direct point of presence on the nodes and thus must route all requests requiring direct node information or control through a service possessing a direct point of presence on all nodes, such as the process manager daemon.
Low level checkpoint requests will be sent to the process manager because of the requirement for direct process contact.
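The routing described above might look roughly like this. The sketch is hypothetical throughout: the `ProcessManager` interface, its `checkpoint` method, and the `JobManager` fields are invented for illustration and do not describe an agreed interface between the two components.

```python
from typing import Protocol


class ProcessManager(Protocol):
    """Assumed interface of a service with a direct presence on every node."""

    def checkpoint(self, job_id: str, nodes: list) -> bool:
        ...


class JobManager:
    """Has no presence on the nodes, so it delegates low-level requests."""

    def __init__(self, process_manager):
        self.pm = process_manager
        self.allocations = {}        # job_id -> list of allocated nodes
        self.checkpoint_state = {}   # job_id -> "checkpointed" or "failed"

    def request_checkpoint(self, job_id):
        # The job manager only knows which nodes the job holds; the actual
        # process contact happens inside the process manager.
        nodes = self.allocations[job_id]
        ok = self.pm.checkpoint(job_id, nodes)
        self.checkpoint_state[job_id] = "checkpointed" if ok else "failed"
        return ok
```

Both the scheduler and a user command could call `request_checkpoint`, leaving the job manager as the single place where checkpoint state is recorded.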
Can the process manager launch a job without a hostlist?