Resource Management and Accounting Notebook



Job Manager Spec v1.0


Purpose:

    Translate a DAG, or sequence of steps within a job, into multiple single-step jobs (see the sketch after this list)
    Must translate and sanity-check each submitted job
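
    As a rough illustration of this translation, the sketch below (in C, with hypothetical type and field names such as single_step_job and depends_on) expands a linear sequence of steps into independent single-step jobs chained by peer job constraints; a full DAG would carry arbitrary dependency edges rather than a single predecessor.

        /* Minimal sketch (hypothetical types/names): translating a sequence of
         * job steps into independent single-step jobs chained by peer job
         * constraints.  A linear sequence is shown for brevity. */
        #include <stdio.h>
        #include <string.h>

        #define MAX_CMD 256

        struct single_step_job {
            int  job_id;            /* unique ID assigned by the job manager         */
            int  depends_on;        /* peer job that must complete first (-1 = none) */
            char command[MAX_CMD];  /* the single step's executable/script           */
        };

        static int next_job_id = 1000;   /* assumed ID source (see Required Services) */

        /* Expand one multi-step job into 'nsteps' single-step jobs. */
        static int translate_steps(const char *steps[], int nsteps,
                                   struct single_step_job out[])
        {
            int prev = -1;
            for (int i = 0; i < nsteps; i++) {
                out[i].job_id = next_job_id++;
                out[i].depends_on = prev;            /* peer job constraint */
                strncpy(out[i].command, steps[i], MAX_CMD - 1);
                out[i].command[MAX_CMD - 1] = '\0';
                prev = out[i].job_id;
            }
            return nsteps;
        }

        int main(void)
        {
            const char *steps[] = { "stage_data.sh", "run_model", "archive_results.sh" };
            struct single_step_job jobs[3];
            int n = translate_steps(steps, 3, jobs);
            for (int i = 0; i < n; i++)
                printf("job %d: '%s' (after job %d)\n",
                       jobs[i].job_id, jobs[i].command, jobs[i].depends_on);
            return 0;
        }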

Required Services:

    Multi-step jobs must be supported
    Should allow specification of required resources for both active and suspended jobs
    Should accept multi-step job script and translate it into multiple job steps with peer job constraints.
    Should manage unique job IDs
    Should determine/track job state and failures
    Should collect/process job performance statistics.
    Must make job performance statistics available via API and user command client.
    Should handle routing of interactive STDIN/STDOUT/STDERR between submission client and job executable
    Should capture required aspects of the user's environment via the submission client (i.e., environment variables; credentials: UID, group list, DCE, AFS, K5, etc.; default working directory?; default shell)
    Should launch prolog, job task, epilog via process manager (admin and user level prolog/epilog support; see the launch sketch after this list)
    Must hold and manage credentials (credential refresh for queued jobs, etc.)
    Should maintain state information on all active and past jobs
    Should allow dynamic job modification
    Should provide pass-through support for job suspend/resume, checkpoint/restart, signalling, and cancelling
    Should request process manager launch job on a specific set of nodes
    Should contact node manager to handle job requested node modification (i.e., OS rebuild)
    Should manage data migration (recovery, continuation, etc.)
    Should determine/maintain job state, including prestart, start, poststart, data staged, running, completed, etc. (see the state sketch after this list)
    Must track job checkpoint state (and all associated constraints, including node allocation requirements, checkpoint file staging requirements, etc.)
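
    The state list above suggests a simple representation. The following is a minimal sketch with assumed names (job_state, job_record), not the actual data structures, showing one way the job manager could record state transitions along with the timestamps needed for performance statistics.

        /* Minimal sketch (assumed names): a job state enumeration and record the
         * job manager could use to determine/track state and failures.  The
         * states mirror the list above; real code would add more failure and
         * cancellation states and persist the record. */
        #include <stdio.h>
        #include <time.h>

        enum job_state {
            JOB_PRESTART,
            JOB_START,
            JOB_POSTSTART,
            JOB_DATA_STAGED,
            JOB_RUNNING,
            JOB_SUSPENDED,
            JOB_COMPLETED,
            JOB_FAILED
        };

        struct job_record {
            int            job_id;
            enum job_state state;
            time_t         state_change;   /* when the state last changed     */
            int            exit_code;      /* valid once COMPLETED or FAILED  */
        };

        /* Record a state transition and note the time for statistics. */
        static void job_set_state(struct job_record *job, enum job_state new_state)
        {
            job->state = new_state;
            job->state_change = time(NULL);
        }

        int main(void)
        {
            struct job_record job = { .job_id = 1000, .state = JOB_PRESTART };
            job_set_state(&job, JOB_RUNNING);
            job_set_state(&job, JOB_COMPLETED);
            printf("job %d final state %d\n", job.job_id, (int)job.state);
            return 0;
        }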
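
    For the prolog/job task/epilog launch requirement, the sketch below shows the fork/exec pattern a process manager daemon might use on a node on behalf of the job manager. The script paths are placeholders, and the admin vs. user prolog distinction and error handling are omitted; this is not a settled interface.

        /* Minimal sketch (assumed paths/names): launching prolog, job task and
         * epilog in order, as the process manager daemon might do on a node. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        /* Fork/exec one program and wait for it; returns its exit status. */
        static int run_stage(const char *path)
        {
            pid_t pid = fork();
            if (pid < 0)
                return -1;                    /* fork failed              */
            if (pid == 0) {                   /* child: become the stage  */
                execl(path, path, (char *)NULL);
                _exit(127);                   /* exec failed              */
            }
            int status = 0;
            waitpid(pid, &status, 0);
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        }

        int main(void)
        {
            /* Hypothetical script locations. */
            if (run_stage("/etc/jobmgr/prolog") != 0) {
                fprintf(stderr, "prolog failed, job not started\n");
                return 1;
            }
            int rc = run_stage("/tmp/job_task.sh");   /* the user's job step */
            run_stage("/etc/jobmgr/epilog");          /* always run the epilog */
            return rc;
        }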

Development Schedule:

Phase 1:    Target: April 1, 2002

    Base feature list:
        accept job submission
        support command line only 
 
    Activity                                          Timeframe      Dependencies
    create submission client
    maintain queue of persistent job information
    create polling 'job query' interface
    feed scheduler/node monitor                                      scheduler/node monitor
    create 'job launch' interface (receive request)
    create 'job launch' interface (send request)
    assign unique job ID to submitted jobs
    store job state, session ID
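
    For the "maintain queue of persistent job information" and "assign unique job ID" activities, a minimal sketch follows. The file path, record format, and function names are assumptions; a real implementation would restore the next job ID and the queue contents from the saved file on restart.

        /* Minimal sketch (assumed file name/format): the Phase 1 persistent job
         * queue -- append each submitted job's ID, state and session ID to a
         * flat file so the queue survives a job manager restart. */
        #include <stdio.h>

        struct queued_job {
            int job_id;      /* unique ID assigned at submission   */
            int state;       /* numeric job state (see state enum) */
            int session_id;  /* session ID stored with the job     */
        };

        static int next_job_id = 1000;   /* would be restored from the file on restart */

        /* Assign a unique ID and append the record to the persistent queue. */
        static int enqueue_job(struct queued_job *job, int session_id)
        {
            job->job_id = next_job_id++;
            job->state = 0;                  /* prestart */
            job->session_id = session_id;

            FILE *fp = fopen("/var/spool/jobmgr/queue.txt", "a");  /* assumed path */
            if (!fp)
                return -1;
            fprintf(fp, "%d %d %d\n", job->job_id, job->state, job->session_id);
            fclose(fp);
            return job->job_id;
        }

        int main(void)
        {
            struct queued_job job;
            int id = enqueue_job(&job, 4321);
            printf("queued job %d\n", id);
            return 0;
        }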

Phase 2:    Target:  N/A

    Base feature list:
 

Credential Management:

    What credentials will be managed/passed via the job manager/process manager interface?
    UID? yes
    GID? yes
    Group list?  ???
    DCE/K5?    ???
    Delegated global credentials?     ???
    The scheduler need only be aware of credentials; it does not need to possess or manage them (outside of potentially refreshing them).
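
    A possible shape for the credential block passed across the job manager/process manager interface is sketched below. Only UID and GID are settled above, so the group list and DCE/K5 fields are shown as optional placeholders; all names are assumptions.

        /* Minimal sketch (assumed field set): credentials the job manager might
         * hold and pass to the process manager.  Fields whose inclusion is
         * still an open question above are optional/NULL-able. */
        #include <stdio.h>
        #include <sys/types.h>
        #include <time.h>

        #define MAX_GROUPS 32

        struct job_credential {
            uid_t  uid;                    /* yes, per the list above            */
            gid_t  gid;                    /* yes, per the list above            */
            int    ngroups;                /* 0 if the group list is not passed  */
            gid_t  groups[MAX_GROUPS];
            char  *k5_ticket;              /* NULL unless DCE/K5 is carried      */
            time_t expiration;             /* drives refresh for queued jobs     */
        };

        /* The job manager would refresh credentials for long-queued jobs. */
        static int credential_expired(const struct job_credential *cred)
        {
            return cred->expiration != 0 && time(NULL) >= cred->expiration;
        }

        int main(void)
        {
            struct job_credential cred = { .uid = 500, .gid = 500, .expiration = 0 };
            printf("expired: %d\n", credential_expired(&cred));
            return 0;
        }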

Thoughts:

    Should the concept of a 'job group' be maintained?  An epilog/prolog need only be run once for a collection of jobs within the same job group on the same set of nodes.  Job steps must explicitly request a constant node allocation if they are to run on identical nodes.

    Both scheduler and job manager should provide user interfaces to initiate checkpoint requests.
    The job manager does NOT have a direct point of presence on the nodes and thus must route all requests requiring direct node information or control through a service possessing a direct point of presence on all nodes, such as the process manager daemon (see the sketch at the end of this section).
    Low-level checkpoint requests will be sent to the process manager because of the requirement for direct process contact.
    Can the process manager launch a job without a hostlist?
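
    To illustrate the routing described above, the sketch below wraps a checkpoint request in a message handed to the process manager daemon, which does have direct process contact. The message format, request types, and transport are hypothetical stand-ins for whatever protocol is eventually adopted.

        /* Minimal sketch (hypothetical message format): the job manager forms a
         * request and sends it to the process manager; the wire protocol
         * (socket/RPC) is stubbed out. */
        #include <stdio.h>

        enum pm_request_type { PM_CHECKPOINT, PM_SIGNAL, PM_CANCEL };

        struct pm_request {
            enum pm_request_type type;
            int  job_id;                  /* which job the request applies to */
            int  signal;                  /* used only for PM_SIGNAL          */
        };

        /* Stand-in for the wire protocol to the process manager daemon. */
        static int pm_send_request(const struct pm_request *req)
        {
            printf("-> process manager: type=%d job=%d\n", (int)req->type, req->job_id);
            return 0;   /* a real implementation would await an acknowledgement */
        }

        /* Job manager entry point for a user-initiated checkpoint. */
        static int job_checkpoint(int job_id)
        {
            struct pm_request req = { .type = PM_CHECKPOINT, .job_id = job_id };
            return pm_send_request(&req);
        }

        int main(void)
        {
            return job_checkpoint(1000);
        }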