LoadLeveler is the batch-job scheduler for the ORNL IBM RS/6000 SP. It also allocates nodes for interactive parallel jobs on the SP. This document provides information for getting started with the batch facilities of LoadLeveler.
In the LoadLeveler parlance, the term "class" is analagous to the term "queue" for other batch systems. Different users may have access to different classes, and different classes may have different job limits or may target different nodes.
Use the "llclass" command to see the current list of classes.
$ llclass
Name MaxJobCPU MaxProcCPU Free Max Description
d+hh:mm:ss d+hh:mm:ss Slots Slots
interactive -1 -1 254 270 Interactive runs
sys -1 -1 254 270 System administration
pprod8 -1 -1 32 32 Parallel production runs, 8-processor nodes
cats -1 -1 6 6 Compile and test servers
ptest8 -1 -1 32 32 Development and testing, 8-processor nodes
ptest -1 -1 216 232 Development and testing, 4-processor nodes
climate -1 -1 216 232 Climate Change Prediction Program
bio -1 -1 216 232 Human Genome Program
pprod -1 -1 216 232 Parallel production runs, 4-processor nodes
Each "Slots" number represents the number of "job instances" that may be started in the given class. For MPI jobs, this is the number of MPI processes that may run under the given class. It is typically equivalent to the number of processors that allow the class. "Max Slots" represents the total number of slots configured on the system, and "Free Slots" represents the number of slots that are not currently occupied. This number is misleading, however. A four-processor node may have four slots each for five classes, for example. Because nodes are typically dedicated to a single job, only four of the node's 20 slots can be allocated at a time. The rest appear to be "free", although they are not useable.
"MaxJobCPU" and "MaxProcCPU" indicate the per-job and per-process aggregate CPU time limits. None of the classes listed here have CPU time limits; this is not particularly useful information because the classes do have wall-clock time limits.
You can get more information on a class, such as it's wall-clock time limit, using "llclass -l".
$ llclass -l ptest
=============== Class ptest ===============
Name: ptest
Priority: 0
Admin:
NQS_class: F
NQS_submit:
NQS_query:
Max_processors: -1
Maxjobs: -1
Class_comment: Development and testing, 4-processor nodes
Wall_clock_limit: 0+02:00:00, -1
Job_cpu_limit: -1, -1
Cpu_limit: -1, -1
Data_limit: -1, -1
Core_limit: -1, -1
File_limit: -1, -1
Stack_limit: -1, -1
Rss_limit: -1, -1
Nice: 20
Free: 216
Maximum: 232
The most useful information here is the "Wall_clock_limit", which is set to two hours. This is a hard upper limit for any job submitted to the "ptest" class. The "-1" indicates there is no soft limit. You may wish to "grep" for useful nuggets like this from the full listing, like in the following example.
$ llclass -l | egrep "Name|Wall_clock_limit"
Name: interactive
Wall_clock_limit: 0+01:00:00, -1
Name: sys
Wall_clock_limit: 30+00:00:00, -1
Name: pprod8
Wall_clock_limit: 0+12:00:00, -1
Name: cats
Wall_clock_limit: 0+02:00:00, -1
Name: ptest8
Wall_clock_limit: 0+02:00:00, -1
Name: ptest
Wall_clock_limit: 0+02:00:00, -1
Name: climate
Wall_clock_limit: 30+00:00:00, -1
Name: bio
Wall_clock_limit: 1+00:00:00, -1
Name: pprod
Wall_clock_limit: 0+12:00:00, -1
Through the "Free Slots" entries, the "llclass" command can give some information about the status of the system and what your chances are for running jobs immediately. As mentioned above, however, this information is misleading. For more accurate information about the load on the system, use the "llstatus" command.
$ llstatus Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys bearcat.ccs.ornl.gov Avail 1 1 Idle 0 0.05 27 RS6000 AIX43 bobcat.ccs.ornl.gov Avail 0 0 Idle 0 0.00 8311 RS6000 AIX43 eagle05s.ccs.ornl.gov Avail 0 0 Idle 0 0.00 9999 RS6000 AIX43 ... eagle77s.ccs.ornl.gov Avail 0 0 Idle 0 0.00 9999 RS6000 AIX43 morgan.ccs.ornl.gov Avail 0 0 Idle 0 0.00 1832 RS6000 AIX43 RS6000/AIX43 65 machines 4 jobs 112 running Total Machines 65 machines 4 jobs 112 running The Central Manager is defined on morgan.ccs.ornl.gov The following 2 machines are marked absent eagle23s.ccs.ornl.gov eagle29s.ccs.ornl.gov
The "Schedd" column indicates whether the machine is able to schedule LoadLeveler jobs; "Avail" means it can. "InQ" gives the number of current jobs submitted from (not running on) the given machine, and "Act" gives the number of those jobs that are actually running (on other nodes). "Startd" indicates whether any jobs are running on the given machine, and "Run" indicates the number of job instances that are running.
For ORNL SP nodes, the "Run" number will be equal to or less than the number of processors in that node. The same job can have more than one instance running on a given node; for example, a four-processor node may have four MPI processes from the same job. SP nodes are typically dedicated, however, so a given node will only run instances of a single job at a time, though it may run multiple instances of that job.
"LdAvg" is the Berkely one-minute load average, and "Idle" is the time in seconds since the last keyboard or mouse activity on the machine. For SP nodes, "Idle" is often "9999".
The lines at the bottom of the output indicate that 65 machines are currently under the control of LoadLeveler, and two are disconnected. On the 65 machines, 4 jobs are running, and those 4 jobs consume 112 slots. Because one slot can represent a single-thread or multiple-thread process, slots are neither equivalent to processors nor nodes.
Some of the columns of default "llstatus" output are not particularly useful, and "llstatus" is capable of displaying useful information that is not shown by default. To remedy this, you can configure the output generated by "llstatus" on the command line. Here is an example configuration.
$ llstatus -f %n %mt %r %l %v %scs %sts Name MaxT Run LdAvg FreeVMemory Schedd Startd bearcat.ccs.ornl.gov 2 0 0.09 2211996 Avail Idle bobcat.ccs.ornl.gov 2 0 0.01 2220648 Avail Idle eagle05s.ccs.ornl.gov 4 0 0.00 2166544 Avail Idle ... eagle77s.ccs.ornl.gov 8 0 0.00 2226084 Avail Idle morgan.ccs.ornl.gov 2 0 0.02 887824 Avail Idle RS6000/AIX43 65 machines 4 jobs 112 running Total Machines 65 machines 4 jobs 112 running The Central Manager is defined on morgan.ccs.ornl.gov The following 2 machines are marked absent eagle23s.ccs.ornl.gov eagle29s.ccs.ornl.gov
This example prunes out some of the default information and adds "MaxT" and "FreeVMemory". "MaxT" gives the maximum number of job instances (regardless of class) that may run on the given host at a time, and "FreeVMemory" gives the available swap space, in kilobytes. See "man llstatus" for more information on configuring output. You may want to create an "alias" for the "llstatus" configuration you prefer.
To run a batch job under LoadLeveler, you first need to write a job command file. Here is an example file for a parallel MPI job.
#@ job_type = parallel #@ output = $(host).$(jobid).out #@ error = $(host).$(jobid).err #@ network.MPI = css0,shared,US #@ tasks_per_node = 4 #@ node = 8 #@ node_usage = not_shared #@ queue pwd echo $LOADL_PROCESSOR_LIST export MP_SHARED_MEMORY=yes poe a.out
The file has two components: LoadLeveler keyword statements and shell commands. The LoadLeveler keyword statements are preceded by "#@", making them appear as comments to a shell. The shell commands follow the "#@ queue" keyword statement and represent the executable content of the batch job.
Here is a description of each line.
#@ job_type = parallel
Use multiple nodes for parallel commands. This keyword is required for parallel jobs. The keywords "tasks_per_node", "node", "network", etc. won't work without it.
#@ output = $(host).$(jobid).out
Send standard output to the file "$(host).$(jobid).out". "$(host)" is a LoadLeveler variable that represents the host where the job was submitted. It is not necessarily related to where the job runs. "$(jobid)" is a number ID of the running job. Each "$(jobid)" is unique for a given job submitted from a particular host. Each "$(jobid)" is not necessarily unique across LoadLeveler; two jobs submitted from two different hosts can have the same value for "$(jobid)". The combination of "$(host).$(jobid)" is unique, however. Example: "131.out" and "131.out" versus "bearcat.131.out" and "bobcat.131.out".
Unless you specify a full path, the output file is stored in the directory from which you submitted the job. If you don't specify the "output" keyword, the standard output is not saved.
#@ error = $(host).$(jobid).err
Send standard error output to the file "$(host).$(jobid).err". See the information above for the "output" keyword. You can send standard output and standard error to the same file.
#@ network.MPI = css0,not_shared,US
For MPI communication, use the SP switch with the User Space protocol in non-shared mode. This line requests that parallel MPI programs use the fastest form of internode communication available on the SP, User Space (US) protocol over the SP switch (device "css0"). Specifying non-shared mode guarantees exclusive use of the switch adapter on each assigned node for this job.
A separate "network" keyword is allowed for IBM's Low-Level Application Programming Interface, "network.LAPI", and for implementations of PVM not built on top of IBM's MPI, "network.PVM". LAPI may use the US protocol, but generic PVM can only use IP.
#@ tasks_per_node = 4
Use four tasks per node for parallel jobs. A task is equivalent to a process, and a single task may have multiple threads. This line specifies that four tasks, four MPI processes in this case, should be started on each node.
#@ node = 8
Allocate 8 nodes for parallel commands. Yes, the keyword is "node", not "nodes". See below to specify what kind of nodes you want according to processor count.
#@ node_usage = not_shared
Do not allow any other LoadLeveler jobs on the allocated nodes. This line guarantees that LoadLeveler will schedule no other jobs on the nodes assigned to this job for the duration of the job. It is otherwise possible for LoadLeveler to share nodes between multiple jobs. For parallel jobs, this is usually not desirable.
#@ queue
Queue the job! This keyword is critical. Without it, no job is created. Each "queue" keyword uses the environment specified by the keywords listed before it, so make sure to put it after the other relevant keywords.
The remaining lines of the file specify the shell commands to be executed by the batch job. All sequential commands, such as the first three commands in this example, run on only the first node allocated to the job. Parallel commands start multiple processes spread across all allocated nodes.
pwd
Display the name of the current working directory. The job starts in the directory where the job was submitted. This behavior is different from some other batch systems, which always start jobs in the user's home directory.
echo $LOADL_PROCESSOR_LIST
Display the nodes allocated to this job. LoadLeveler automatically sets the value of the environment variable "LOADL_PROCESSOR_LIST" to a list of the nodes allocated for the given job. Printing this list in each job can help diagnose system problems.
export MP_SHARED_MEMORY=yes
Use shared memory for MPI. IBM's MPI can now implement communication within a node using shared memory. This implementation greatly improves the bandwidth and latency of on-node communication without affecting communication between nodes. Because it uses extra memory, this implementation is not used by default, however. Setting the "MP_SHARED_MEMORY" environment variable to "yes" turns it on. The above line does this under "ksh". For "csh", use the following command instead: setenv MP_SHARED_MEMORY yes
To take advantage of this shared-memory optimization, an MPI code must be compiled with the thread-safe version of the MPI library, i.e. using "mpxlf_r" or "mpcc_r".
poe a.out
Run 32 copies of "a.out" across 8 nodes. If "a.out" is not a parallel program, this command will run 32 identical copies on 8 different nodes. If "a.out" is parallel (compiled with "mpxlf", "mpcc", etc.), it will run as a single 32-process application across 8 nodes. Specifying "poe" is optional for programs compiled to be parallel. POE options specified through LoadLeveler keyword commands ("node", "tasks_per_node", "network", etc.) override options on the "poe" command line.
Use "llsubmit" to submit a job command file for batch execution.
$ llsubmit command_file
The job shell will inherit the working directory from where you submitted the job. Also, unless you use full path names, the standard output and standard error files will be saved in this same directory.
Use "llq" to check the status of submitted jobs.
$ llq Id Owner Submitted ST PRI Class Running On ------------------------ ---------- ----------- -- --- ------------ ----------- bearcat.123.0 richardson 1/01 08:42 R 50 pprod eagle08s bearcat.124.0 clinton 1/01 08:45 R 50 bio eagle33s bearcat.126.0 gore 1/01 09:32 ST 50 climate eagle52s bearcat.127.0 madia 1/01 09:35 I 50 sys 4 job steps in queue, 1 waiting, 1 pending, 2 running, 0 held
The first column is the name of each job step, the second column is the owner of the job, and the third column is the time when the job was first submitted to LoadLeveler. The "ST" column gives the status of each job. Here are some common status values.
| R | Running |
| ST | STarting |
| I | Idle, waiting for resources |
| H | Held by the user |
| S | held by the System |
| RP | Remove Pending, being removed |
The "PRI" column gives the priority of the job, and the "Class" column gives the class specified in the job command file. The final column, "Running On", gives the first node assigned to each running job. Only this first node appears, even for parallel jobs running on many nodes.
Some of the columns of default "llq" output are not particularly useful, and "llq" is capable of displaying useful information that is not shown by default. To remedy this, you can configure the output generated by "llq" on the command line. Here is an example configuration.
$ llq -f %o %id %nh %st %dd %dq Owner Step Id NM ST Disp. Date Queue Date ----------- ------------------------ ---- -- ----------- ----------- richardson bearcat.123.0 32 R 01/01 08:43 01/01 08:42 clinton bearcat.124.0 8 R 01/01 08:45 01/01 08:45 gore bearcat.126.0 20 ST 01/01 09:33 01/01 09:32 madia bearcat.127.0 16 I 01/01 09:35 4 job steps in queue, 1 waiting, 1 pending, 2 running, 0 held
In addition to the owner, job name, and status, this format gives "NM", the number of nodes used by the job, "Disp. Date", the time the job was started, and "Queue Date", the time the job was queued. See "man llq" for more information on configuring output. You may want to create an "alias" for the "llq" configuration you prefer.
$ llq -s bearcat.127.0 ... (pages of information) ... $ llq -s bearcat.127.0 | sed -n '/SUMMARY/,/ANALYSIS/p' SUMMARY This LoadLeveler cluster does not have sufficient resources at the present time to run this job step. ANALYSIS
The LoadLeveler cluster may not have sufficient resources for a variety of reasons. Nodes may be busy with other jobs, for example. Unfortunately, LoadLeveler cannot distinquish between a temporary reduction of resources and permanent system limitations. Therefore, if a job requests more nodes than the system has, the job will wait, and "llq -s" will return the message above, despite the fact that the job will never be able to run.
You can use "llq -l" to display detailed information about LoadLeveler jobs, including a list of the nodes allocated for each job. You can use "grep" to isolate this node list. If the job command file is written to use only SP nodes, you need only "grep" for "eagle"; otherwise, you also need to search for "bearcat", "bobcat", and "morgan".
$ llq -l bearcat.129.0 | grep eagle
Allocated Hosts : eagle06s.ccs.ornl.gov::css0(0,MPI,us),css0(1,MPI,us),css0(2,MPI,us),css0(3,MPI,us)
+ eagle08s.ccs.ornl.gov::css0(0,MPI,us),css0(1,MPI,us),css0(2,MPI,us),css0(3,MPI,us)
+ eagle10s.ccs.ornl.gov::css0(0,MPI,us),css0(1,MPI,us),css0(2,MPI,us),css0(3,MPI,us)
+ eagle12s.ccs.ornl.gov::css0(0,MPI,us),css0(1,MPI,us),css0(2,MPI,us),css0(3,MPI,us)
Notice that each line of this example has 4 "css0" entries. This indicates that 4 MPI processes are running on each node.
You can use "llcancel" with a list of job names to cancel those jobs. The command removes waiting jobs and aborts running jobs.
$ llcancel bearcat.131.0 llcancel: Cancel command has been sent to the central manager.
You can also keep a job from running without removing it from LoadLeveler using "llhold" with a list of job names. You can then use "llhold -r" to release held jobs and allow them to run.
$ llhold bearcat.132.0 llhold: Hold command has been sent to the central manager. $ ... ... $ llhold -r bearcat.132.0 llhold: Hold command has been sent to the central manager.
The "llhold" command has no effect on running jobs.
The compile-and-test servers have "man" pages for each of the LoadLeveler commands. In addition, the compile-and-test servers have local HTML documentation that can be read using "lynx" or "netscape". The documentation is available at the following location.
Full HTML and PDF documentation is also available from IBM's website.
The document entitled Using and Administering is particularly useful.