Process Management and Monitoring Notebook - page 27 of 74
Interfaces for Checkpoint/Restart and related mechanisms
I would like to see calls which specify the INTENT in a way that "send SIGSTOP"
or "send SIGCHECKPOINT" cannot. Specifically I advocate the following:
SUSPEND(jobid, ?other options?)
This call is the first half of the suspend/resume pair used for preemption.
The intent is to temporarily place a running job into a state
where it ceases to consume (most) resources. The simplest implementation
sends SIGSTOP to the processes. A more ambitious implementation might
transfer the memory image to a disk file to free the physical memory and
swap space for the incoming job, or perform a full checkpoint to a local
disk. Because the intent is for a quick transition to a temporary suspended
state, there is no guarantee that this job would be restartable after a node
failure. Consequently no location is specified for persistent storage
of image files.
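The simplest implementation mentioned above can be sketched in a few lines. This is a toy POSIX sketch, not a real PM interface; the /proc check in the usage is Linux-specific and is only there to observe the stopped state:

```python
import os
import signal
import subprocess
import time

def suspend(pids):
    # Simplest SUSPEND: send SIGSTOP to every process in the job.
    # The job keeps its memory and file descriptors, so nothing is
    # reclaimed -- the "most basic version" discussed further below.
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)

def resume(pids):
    # Counterpart RESUME: SIGCONT returns the job to a running state.
    for pid in pids:
        os.kill(pid, signal.SIGCONT)

# Usage: stop a child, observe state 'T' (stopped, on Linux), then resume.
child = subprocess.Popen(["sleep", "30"])
suspend([child.pid])
time.sleep(0.2)  # give the kernel a moment to deliver the signal
with open(f"/proc/{child.pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]  # field 3: process state
print(state)
resume([child.pid])
child.terminate()
child.wait()
```

Note that this variant frees no memory or descriptors, which is exactly why the Scheduler later needs to know what a given node's SUSPEND actually reclaims.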
RESUME(jobid, ?other options?)
This call is intended to take a job previously idled by a SUSPEND call
and restore it to a fully running state.
CHECKPOINT(jobid, storage, ?other options?)
This call is intended to place a running job into a state where it consumes
no resources except storage; from which state it can be returned to a running
state at some later time. The types of system changes which can take
place (reboots, migration to different nodes, O/S upgrades, library upgrades,
etc.) and still allow a successful restart are implementation specific. The
stored file(s) will be placed in a specified location, presumably one where
they will survive any subsequent system changes.
RESTART(storage, nodeList, ?other options?)
This call takes a job stored by a CHECKPOINT call and restores it to a running
state. Conceptually this is a call to start a job where most of the
"arguments" are embedded in the checkpoint file(s). However, the list
of hosts to run on should probably be an explicit argument. I expect
a new jobid to be returned along with status information.
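To make the division of "arguments" concrete, here is a toy sketch; the file format, the requirement keys, and the jobid policy are all invented for illustration. The checkpoint file embeds everything except the host list, which RESTART takes explicitly, and a fresh jobid comes back:

```python
import json
import os
import tempfile

def checkpoint(jobid, storage, image, requirements):
    # Hypothetical CHECKPOINT: persist the job image together with the
    # metadata that RESTART will need, in the caller-specified location.
    path = os.path.join(storage, f"{jobid}.ckpt")
    with open(path, "w") as f:
        json.dump({"jobid": jobid, "image": image,
                   "requirements": requirements}, f)
    # Returning the requirements supports the node-matching idea below.
    return path, requirements

def restart(ckpt_path, node_list):
    # Hypothetical RESTART: most "arguments" come from the checkpoint
    # file; only the node list is explicit.  A new jobid is returned.
    with open(ckpt_path) as f:
        record = json.load(f)
    new_jobid = record["jobid"] + ".restarted"  # invented jobid policy
    return new_jobid, record["image"], node_list

# Usage: round-trip a fake job through checkpoint and restart.
with tempfile.TemporaryDirectory() as store:
    path, reqs = checkpoint("job42", store, {"pc": 7}, {"os": "linux"})
    new_id, image, nodes = restart(path, ["n3", "n4"])
    print(new_id, image, nodes)
```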
MIGRATE(jobid, nodeList, ?other options? )
Conceptually this call is a CHECKPOINT followed by a RESTART on a different
set of nodes. The value of having this as a separate call is that an
implementation could do something more intelligent. For instance the
checkpoint files might pass (in parallel) from the source nodes to the destination
nodes without passing through a common filesystem bottleneck. If only
a subset of the processes are actually changing nodes, there might be optimizations
with respect to the stationary processes.
There are some things not explicit in the above descriptions which we need
to consider. Please add any more you think of, or comment on the existing ones.
Validating a RESTART or MIGRATE call
PROBLEM: We desire a mechanism by which the Scheduler can determine
if a RESTART or MIGRATE will succeed with a given nodeList - a mechanism
which is lighter weight than trying and failing. In terms of implementation,
this could be done primarily by matching information pulled from the checkpoint
file(s) against the node configuration database.
SOLUTION 1 (deprecated): I suggest a "dryRun" option to both the RESTART and
MIGRATE calls (like -n in make) which would return status information telling
if the operation is certain to fail. This result is not intended to be a
guarantee that the call would succeed, but cases which are certain to fail
could be eliminated inexpensively.
SOLUTION 2: A solution which was more popular on the 2001.12.12 conf
was to have a mechanism for determining the dependencies/requirements of a
checkpoint, so the scheduler could select nodes that met these requirements.
Here is a suggestion:
The CHECKPOINT call returns the requirements for RESTARTing the job.
A new QUERYDEPS(storage) would extract the dependencies from a
checkpoint, for when the info returned by the CHECKPOINT call has been lost.
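The matching step of SOLUTION 2 can be sketched in a few lines; the requirement keys and the shape of the node configuration database are invented here. The scheduler filters its own database against the stored requirements instead of trying and failing:

```python
def nodes_meeting(requirements, node_db):
    # Return the nodes whose recorded configuration satisfies every
    # checkpoint requirement.  An empty result means RESTART/MIGRATE is
    # certain to fail on this cluster, caught cheaply up front.
    return sorted(name for name, config in node_db.items()
                  if all(config.get(key) == want
                         for key, want in requirements.items()))

# Usage with an invented node configuration database.
node_db = {
    "n1": {"os": "linux-2.4", "arch": "ia32"},
    "n2": {"os": "linux-2.4", "arch": "ia64"},
    "n3": {"os": "linux-2.2", "arch": "ia32"},
}
reqs = {"os": "linux-2.4", "arch": "ia32"}  # as returned by CHECKPOINT/QUERYDEPS
print(nodes_meeting(reqs, node_db))  # only n1 satisfies both keys
```

A real node database would hold more than exact-match keys (version ranges, library lists), but the scheduler-side filtering is the same idea.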
In the MIGRATE case we need to avoid the possibility that the requirements
of a running job become stricter between the time we query and the time we
make the MIGRATE call. Thus we make MIGRATE a two-step process, requiring two
calls to complete. This makes MIGRATE more like the other calls, which
come in pairs.
MIGRATE_PART1(jobid): conceptually this is like a CHECKPOINT
without specifying a storage location. The job stops and the call returns the requirements that must
be met by destination nodes.
MIGRATE_PART2(jobid, nodeList, ?other options?): this
performs the actual migration of the stopped job.
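The two-step protocol can be sketched as a tiny state machine; jobs, requirements, and the actual move are all stubbed out here. The point it illustrates is that the job is stopped in PART1, so its requirements cannot drift before PART2:

```python
class ProcessManager:
    # Toy two-phase MIGRATE.  Real job state, checkpoint transfer, and
    # failure handling are all omitted.
    def __init__(self, jobs):
        self.jobs = jobs        # jobid -> {"requirements": ..., "nodes": ...}
        self.stopped = set()

    def migrate_part1(self, jobid):
        # Like a CHECKPOINT without a storage location: stop the job and
        # report what the destination nodes must provide.
        self.stopped.add(jobid)
        return self.jobs[jobid]["requirements"]

    def migrate_part2(self, jobid, node_list):
        # The actual move; only legal once the job is stopped, which is
        # what keeps the requirements from changing in between.
        if jobid not in self.stopped:
            raise RuntimeError("MIGRATE_PART1 must run first")
        self.jobs[jobid]["nodes"] = node_list
        self.stopped.discard(jobid)
        return self.jobs[jobid]["nodes"]

# Usage: query, pick nodes meeting the requirements, then migrate.
pm = ProcessManager({"job7": {"requirements": {"os": "linux"},
                              "nodes": ["n1"]}})
reqs = pm.migrate_part1("job7")
new_nodes = pm.migrate_part2("job7", ["n5", "n6"])
print(reqs, new_nodes)
```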
Utilizing application-level support
PROBLEM: When the application implements its own checkpoint
files, these are typically much more efficient than any mechanism supplied
by the runtime environment. Therefore we wish to use the application's
own mechanisms when available.
SOLUTION: We need a way to communicate the application's capabilities
to the Process Manager (PM). I suggest that these capabilities be an
optional argument to the CreateProcess call. This information would
originate from the user's batch request. The argument(s) would also
need to express the mechanism, such as "send SIGUSR1".
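One way this could look; the argument name and the descriptor format are invented, and the real CreateProcess interface is not specified here. The batch request's capability descriptor rides along on CreateProcess, and the PM dispatches on it when a checkpoint is requested:

```python
import os
import signal
import subprocess

def create_process(argv, checkpoint_support=None):
    # Hypothetical CreateProcess with an optional capability descriptor
    # originating from the user's batch request, e.g.
    #   {"mechanism": "signal", "signal": "SIGUSR1"}
    proc = subprocess.Popen(argv)
    proc.checkpoint_support = checkpoint_support  # attached for the PM
    return proc

def request_checkpoint(proc):
    # The PM prefers the application's own mechanism when one was declared.
    support = getattr(proc, "checkpoint_support", None)
    if support and support.get("mechanism") == "signal":
        os.kill(proc.pid, getattr(signal, support["signal"]))
        return "application"
    return "system"  # fall back to the runtime's own checkpointer

# Usage: an application that treats SIGUSR1 as "write your own checkpoint".
p = create_process(["sleep", "5"],
                   checkpoint_support={"mechanism": "signal",
                                       "signal": "SIGUSR1"})
how = request_checkpoint(p)  # delivers SIGUSR1 to the child
print(how)
p.wait()
```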
Variable level of support present in runtime environments
PROBLEM: The Scheduler needs to know which of these calls a given
PM supports, on a per-node basis. It seems necessary that CHECKPOINT,
RESTART and MIGRATE should be made optional features to cover many existing
systems. We need a more polite way to let the Scheduler know this than
simply by failing the first time it requests one of these actions.
SOLUTION: We will need a call that lets the Scheduler query the
features supported by the PM, on a per-node basis.
Variable level of resource reclamation in SUSPEND implementations
PROBLEM: Almost any system can support the most basic version
of SUSPEND/RESUME using SIGSTOP and SIGCONT. However, these will not
free the resources consumed by the job, such as virtual memory space and
file descriptors. Because the job which is preempting the present job
may need these resources (especially the memory), it seems that the Scheduler
would need to know what resources will and will not be released when a job
is suspended. This could vary between nodes.
SOLUTION: Resource reclamation support could be advertised along
with the list of supported features, described above.
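The last two SOLUTIONs reduce to a single per-node query. A toy sketch, with the call name, field names, and advertisement data all invented:

```python
def query_features(node_db, node):
    # Hypothetical per-node feature query: which optional calls this
    # node's PM supports, plus what its SUSPEND actually reclaims.
    return node_db[node]

# Invented advertisements for two nodes.
node_db = {
    "n1": {"supports": {"SUSPEND", "RESUME", "CHECKPOINT", "RESTART"},
           "suspend_reclaims": {"memory"}},      # image pushed to disk
    "n2": {"supports": {"SUSPEND", "RESUME"},    # bare SIGSTOP/SIGCONT node
           "suspend_reclaims": set()},
}

# Usage: the Scheduler checks features before requesting an action, and
# only preempts onto n2 if the incoming job fits alongside the suspended
# job's still-resident memory.
can_ckpt = "CHECKPOINT" in query_features(node_db, "n1")["supports"]
freed_on_n2 = query_features(node_db, "n2")["suspend_reclaims"]
print(can_ckpt, freed_on_n2)
```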