LoadLeveler on the ORNL SP


Contents


Introduction

LoadLeveler is the batch-job scheduler for the ORNL IBM RS/6000 SP. It also allocates nodes for interactive parallel jobs on the SP. This document provides information for getting started with the batch facilities of LoadLeveler.


Classes

In the LoadLeveler parlance, the term "class" is analagous to the term "queue" for other batch systems. Different users may have access to different classes, and different classes may have different job limits or may target different nodes.

Use the "llclass" command to see the current list of classes.

Each "Slots" number represents the number of "job instances" that may be started in the given class. For MPI jobs, this is the number of MPI processes that may run under the given class. It is typically equivalent to the number of processors that allow the class. "Max Slots" represents the total number of slots configured on the system, and "Free Slots" represents the number of slots that are not currently occupied. This number is misleading, however. A four-processor node may have four slots each for five classes, for example. Because nodes are typically dedicated to a single job, only four of the node's 20 slots can be allocated at a time. The rest appear to be "free", although they are not useable.

"MaxJobCPU" and "MaxProcCPU" indicate the per-job and per-process aggregate CPU time limits. None of the classes listed here have CPU time limits; this is not particularly useful information because the classes do have wall-clock time limits.

You can get more information on a class, such as it's wall-clock time limit, using "llclass -l".

The most useful information here is the "Wall_clock_limit", which is set to two hours. This is a hard upper limit for any job submitted to the "ptest" class. The "-1" indicates there is no soft limit. You may wish to "grep" for useful nuggets like this from the full listing, like in the following example.


System status

Through the "Free Slots" entries, the "llclass" command can give some information about the status of the system and what your chances are for running jobs immediately. As mentioned above, however, this information is misleading. For more accurate information about the load on the system, use the "llstatus" command.

The "Schedd" column indicates whether the machine is able to schedule LoadLeveler jobs; "Avail" means it can. "InQ" gives the number of current jobs submitted from (not running on) the given machine, and "Act" gives the number of those jobs that are actually running (on other nodes). "Startd" indicates whether any jobs are running on the given machine, and "Run" indicates the number of job instances that are running.

For ORNL SP nodes, the "Run" number will be equal to or less than the number of processors in that node. The same job can have more than one instance running on a given node; for example, a four-processor node may have four MPI processes from the same job. SP nodes are typically dedicated, however, so a given node will only run instances of a single job at a time, though it may run multiple instances of that job.

"LdAvg" is the Berkely one-minute load average, and "Idle" is the time in seconds since the last keyboard or mouse activity on the machine. For SP nodes, "Idle" is often "9999".

The lines at the bottom of the output indicate that 65 machines are currently under the control of LoadLeveler, and two are disconnected. On the 65 machines, 4 jobs are running, and those 4 jobs consume 112 slots. Because one slot can represent a single-thread or multiple-thread process, slots are neither equivalent to processors nor nodes.

Some of the columns of default "llstatus" output are not particularly useful, and "llstatus" is capable of displaying useful information that is not shown by default. To remedy this, you can configure the output generated by "llstatus" on the command line. Here is an example configuration.

This example prunes out some of the default information and adds "MaxT" and "FreeVMemory". "MaxT" gives the maximum number of job instances (regardless of class) that may run on the given host at a time, and "FreeVMemory" gives the available swap space, in kilobytes. See "man llstatus" for more information on configuring output. You may want to create an "alias" for the "llstatus" configuration you prefer.


Job command files

To run a batch job under LoadLeveler, you first need to write a job command file. Here is an example file for a parallel MPI job.

The file has two components: LoadLeveler keyword statements and shell commands. The LoadLeveler keyword statements are preceded by "#@", making them appear as comments to a shell. The shell commands follow the "#@ queue" keyword statement and represent the executable content of the batch job.

Here is a description of each line.

#@ job_type = parallel

Use multiple nodes for parallel commands. This keyword is required for parallel jobs. The keywords "tasks_per_node", "node", "network", etc. won't work without it.

#@ output = $(host).$(jobid).out

Send standard output to the file "$(host).$(jobid).out". "$(host)" is a LoadLeveler variable that represents the host where the job was submitted. It is not necessarily related to where the job runs. "$(jobid)" is a number ID of the running job. Each "$(jobid)" is unique for a given job submitted from a particular host. Each "$(jobid)" is not necessarily unique across LoadLeveler; two jobs submitted from two different hosts can have the same value for "$(jobid)". The combination of "$(host).$(jobid)" is unique, however. Example: "131.out" and "131.out" versus "bearcat.131.out" and "bobcat.131.out".

Unless you specify a full path, the output file is stored in the directory from which you submitted the job. If you don't specify the "output" keyword, the standard output is not saved.

#@ error = $(host).$(jobid).err

Send standard error output to the file "$(host).$(jobid).err". See the information above for the "output" keyword. You can send standard output and standard error to the same file.

#@ network.MPI = css0,not_shared,US

For MPI communication, use the SP switch with the User Space protocol in non-shared mode. This line requests that parallel MPI programs use the fastest form of internode communication available on the SP, User Space (US) protocol over the SP switch (device "css0"). Specifying non-shared mode guarantees exclusive use of the switch adapter on each assigned node for this job.

A separate "network" keyword is allowed for IBM's Low-Level Application Programming Interface, "network.LAPI", and for implementations of PVM not built on top of IBM's MPI, "network.PVM". LAPI may use the US protocol, but generic PVM can only use IP.

#@ tasks_per_node = 4

Use four tasks per node for parallel jobs. A task is equivalent to a process, and a single task may have multiple threads. This line specifies that four tasks, four MPI processes in this case, should be started on each node.

#@ node = 8

Allocate 8 nodes for parallel commands. Yes, the keyword is "node", not "nodes". See below to specify what kind of nodes you want according to processor count.

#@ node_usage = not_shared

Do not allow any other LoadLeveler jobs on the allocated nodes. This line guarantees that LoadLeveler will schedule no other jobs on the nodes assigned to this job for the duration of the job. It is otherwise possible for LoadLeveler to share nodes between multiple jobs. For parallel jobs, this is usually not desirable.

#@ queue

Queue the job! This keyword is critical. Without it, no job is created. Each "queue" keyword uses the environment specified by the keywords listed before it, so make sure to put it after the other relevant keywords.


The remaining lines of the file specify the shell commands to be executed by the batch job. All sequential commands, such as the first three commands in this example, run on only the first node allocated to the job. Parallel commands start multiple processes spread across all allocated nodes.

pwd

Display the name of the current working directory. The job starts in the directory where the job was submitted. This behavior is different from some other batch systems, which always start jobs in the user's home directory.

echo $LOADL_PROCESSOR_LIST

Display the nodes allocated to this job. LoadLeveler automatically sets the value of the environment variable "LOADL_PROCESSOR_LIST" to a list of the nodes allocated for the given job. Printing this list in each job can help diagnose system problems.

export MP_SHARED_MEMORY=yes

Use shared memory for MPI. IBM's MPI can now implement communication within a node using shared memory. This implementation greatly improves the bandwidth and latency of on-node communication without affecting communication between nodes. Because it uses extra memory, this implementation is not used by default, however. Setting the "MP_SHARED_MEMORY" environment variable to "yes" turns it on. The above line does this under "ksh". For "csh", use the following command instead: setenv MP_SHARED_MEMORY yes

To take advantage of this shared-memory optimization, an MPI code must be compiled with the thread-safe version of the MPI library, i.e. using "mpxlf_r" or "mpcc_r".

poe a.out

Run 32 copies of "a.out" across 8 nodes. If "a.out" is not a parallel program, this command will run 32 identical copies on 8 different nodes. If "a.out" is parallel (compiled with "mpxlf", "mpcc", etc.), it will run as a single 32-process application across 8 nodes. Specifying "poe" is optional for programs compiled to be parallel. POE options specified through LoadLeveler keyword commands ("node", "tasks_per_node", "network", etc.) override options on the "poe" command line.


Submitting jobs

Use "llsubmit" to submit a job command file for batch execution.

The job shell will inherit the working directory from where you submitted the job. Also, unless you use full path names, the standard output and standard error files will be saved in this same directory.


Job status

Use "llq" to check the status of submitted jobs.

The first column is the name of each job step, the second column is the owner of the job, and the third column is the time when the job was first submitted to LoadLeveler. The "ST" column gives the status of each job. Here are some common status values.

The "PRI" column gives the priority of the job, and the "Class" column gives the class specified in the job command file. The final column, "Running On", gives the first node assigned to each running job. Only this first node appears, even for parallel jobs running on many nodes.

Some of the columns of default "llq" output are not particularly useful, and "llq" is capable of displaying useful information that is not shown by default. To remedy this, you can configure the output generated by "llq" on the command line. Here is an example configuration.

In addition to the owner, job name, and status, this format gives "NM", the number of nodes used by the job, "Disp. Date", the time the job was started, and "Queue Date", the time the job was queued. See "man llq" for more information on configuring output. You may want to create an "alias" for the "llq" configuration you prefer.

Why isn't my job running?

You can verify that your job is not running by checking the "ST" column of "llq" output. You can then use "llq -s" with the job name to find out why it isn't running. The output created by "llq -s" is long, so you may want to pick out the useful lines using "sed". The following example demonstrates how to display lines of "llq -s" output between the line "SUMMARY" and the line "ANALYSIS".

The LoadLeveler cluster may not have sufficient resources for a variety of reasons. Nodes may be busy with other jobs, for example. Unfortunately, LoadLeveler cannot distinquish between a temporary reduction of resources and permanent system limitations. Therefore, if a job requests more nodes than the system has, the job will wait, and "llq -s" will return the message above, despite the fact that the job will never be able to run.

What nodes is my job using?

You can use "llq -l" to display detailed information about LoadLeveler jobs, including a list of the nodes allocated for each job. You can use "grep" to isolate this node list. If the job command file is written to use only SP nodes, you need only "grep" for "eagle"; otherwise, you also need to search for "bearcat", "bobcat", and "morgan".

Notice that each line of this example has 4 "css0" entries. This indicates that 4 MPI processes are running on each node.


Stopping jobs

You can use "llcancel" with a list of job names to cancel those jobs. The command removes waiting jobs and aborts running jobs.

You can also keep a job from running without removing it from LoadLeveler using "llhold" with a list of job names. You can then use "llhold -r" to release held jobs and allow them to run.

The "llhold" command has no effect on running jobs.


Documentation

The compile-and-test servers have "man" pages for each of the LoadLeveler commands. In addition, the compile-and-test servers have local HTML documentation that can be read using "lynx" or "netscape". The documentation is available at the following location.

Full HTML and PDF documentation is also available from IBM's website.

The document entitled Using and Administering is particularly useful.


[neesc] [search] [sitemap] [getting started] [accounts]
[training] [research] [software] [hardware] [visualization] [about NEESC] [what's new]

URL http://www.csm.ornl.gov/neesc/LL.html
v2-4/25/2000
author | webmaster