Interactive Parallel Jobs on the ORNL SP


Using the Parallel Operating Environment

If resources are available on the ORNL SP, you can run parallel jobs interactively using the Parallel Operating Environment (POE). To run a program in parallel, you specify the number of processors and/or nodes, the communication library, and a particular ``pool'' of nodes. POE then uses LoadLeveler to acquire a set of nodes in the specified pool. If nodes are not available, the command fails.

Use the command ``showbf'' to determine if nodes are available.

LoadLeveler uses the class ``interactive'' for all interactive jobs. Use the following command for information on this class, including wall-clock run-time limits.
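A minimal sketch of such a query, assuming the standard LoadLeveler ``llclass'' utility is installed and on your path. The choice of ``llclass -l'' is an assumption based on stock LoadLeveler, not necessarily the command this page originally listed.

```shell
# Class used by LoadLeveler for all interactive jobs (from the text above).
CLASS=interactive

# Assumption: the stock LoadLeveler "llclass" command is available.
# "-l" requests the long listing, which includes the wall-clock limits.
if command -v llclass >/dev/null 2>&1; then
    llclass -l "$CLASS"
else
    echo "llclass not found; would run: llclass -l $CLASS"
fi
```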

If an executable is compiled for parallel execution, it will run under POE as a single program with multiple processes. If the executable is sequential, POE will start multiple copies of it across the acquired nodes.

You can specify values for number of processors, communication library, pool, etc. using environment variables or command-line arguments to the ``poe'' command. Command-line arguments override environment variables. The following table summarizes the important options.

``poe'' option | Environment variable | Description
-procs n | MP_PROCS=n | The number (``n'') of parallel processes. Use with either ``-tasks_per_node'' or ``-nodes''.
-nodes n | MP_NODES=n | The number (``n'') of nodes. Use with either ``-procs'' or ``-tasks_per_node''.
-tasks_per_node n | MP_TASKS_PER_NODE=n | The number (``n'') of parallel processes per node. Use with either ``-procs'' or ``-nodes''.
-rmpool 1 | MP_RMPOOL=1 | The resource-manager pool that LoadLeveler will use to allocate nodes. The compute nodes of the ORNL SP are in pool ``1''.
-euilib xx | MP_EUILIB=xx | Communication library. Valid values for ``xx'' are ``ip'' for Internet Protocol and ``us'' for User Space. The recommended value is ``us'', although ``ip'' is the default.
(none) | MP_SHARED_MEMORY=yes | Use shared memory for MPI communication within a node. Requires compilation with the thread-safe MPI library (i.e. using ``mpxlf_r'', ``mpcc_r'', etc.).
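As an illustration of the environment-variable form, the settings in the table could be placed in a Bourne-style shell (sh, ksh, bash) before invoking ``poe''. This is a sketch under assumptions: the values 8 and 2 are example numbers, ``a.out'' stands in for your executable, and csh users would use ``setenv'' instead of ``export''.

```shell
# Example values only; adjust the counts for your own job.
export MP_PROCS=8            # total number of parallel processes
export MP_NODES=2            # number of nodes
export MP_RMPOOL=1           # compute-node pool on the ORNL SP
export MP_EUILIB=us          # User Space communication library
export MP_SHARED_MEMORY=yes  # needs a thread-safe build (mpxlf_r, mpcc_r, ...)

# With the variables set, no command-line options are required.
# Launch only where poe is installed (i.e. on the SP itself).
if command -v poe >/dev/null 2>&1; then
    poe a.out
else
    echo "would run: poe a.out (MP_PROCS=$MP_PROCS, MP_NODES=$MP_NODES)"
fi
```

Any option given on the ``poe'' command line would take precedence over the corresponding variable set here.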

The following example runs ``a.out'' on 8 processors across 2 compute nodes using US over the SP switch.
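A command of this shape, assembled from the options in the table, might look like the sketch below. The option values are taken from the description above; the guard around the launch is an addition so that the snippet is harmless on machines without ``poe''.

```shell
# 8 processes on 2 nodes in pool 1, using the User Space library.
POE_OPTS="-procs 8 -nodes 2 -rmpool 1 -euilib us"

# $POE_OPTS is left unquoted on purpose so the shell splits it into
# separate arguments (sh, ksh, and bash behavior).
if command -v poe >/dev/null 2>&1; then
    poe a.out $POE_OPTS
else
    echo "would run: poe a.out $POE_OPTS"
fi
```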

For more information on ``poe'' options, see ``man poe''. Online documentation for IBM's Parallel Environment, including POE, is available at the following URL.

Using the TotalView Parallel Debugger

Etnus TotalView is a debugger for sequential, parallel, and threaded programs, and it has a powerful graphical interface. On the ORNL SP, it works with MPI and pthreads programs, but not yet with OpenMP. As of the most recent release, TotalView also has a command-line interface.

The ability to run parallel programs interactively simplifies the use of TotalView. Still, starting TotalView is nontrivial because of current limitations of ``rsh'' under DCE on the SP. To simplify the procedure, add the following line to the file ``.Xdefaults'' in your home directory.

This line resets the string TotalView uses to start the debugger daemons that monitor the tasks in a parallel job. The ``-F'' causes ``rsh'' to forward your DCE credentials to the remote shells that will execute the daemons.

For ``rsh'' to forward your credentials, however, they must be forwardable, which they are not by default. Therefore, before running TotalView, you must issue ``kinit -f'' and enter your password.

Your credentials will be forwardable for the remainder of the session.

Since TotalView is an X-Window application, you must have the ``DISPLAY'' environment variable set to point to your local display. You may also need to issue an ``xhost eagle'' on your machine to allow the SP login node to display there.
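For a Bourne-style shell on the SP login node, setting the display might look like the following sketch. The workstation name ``mydesk.example.org'' and display number ``:0'' are hypothetical placeholders, not values from this page; csh users would use ``setenv DISPLAY'' instead.

```shell
# Hypothetical workstation name; replace with your own machine and display.
LOCAL_DISPLAY="mydesk.example.org:0"

# Run on the SP login node so X applications open on your workstation.
export DISPLAY="$LOCAL_DISPLAY"
echo "DISPLAY is now $DISPLAY"
```

The ``xhost eagle'' command mentioned above is run on your local machine, not on the SP.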

To debug an executable using TotalView, you actually run ``totalview'' on ``poe'', with your executable appearing as an argument after ``-a''.

The following example starts TotalView on 8 processors across 2 compute nodes using US over the SP switch.
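Following the ``totalview'' on ``poe'' pattern described above, such an invocation might be sketched as below. Here ``a.out'' stands in for your executable, and everything after ``-a'' is passed through to ``poe''; the guard is an addition so the snippet does nothing harmful where TotalView is not installed.

```shell
# Same poe options as for a plain interactive run:
# 8 processes on 2 nodes in pool 1, using the User Space library.
POE_OPTS="-procs 8 -nodes 2 -rmpool 1 -euilib us"

# TotalView debugs poe itself; the executable and its poe options
# follow "-a". $POE_OPTS is unquoted so it splits into arguments.
if command -v totalview >/dev/null 2>&1; then
    totalview poe -a a.out $POE_OPTS
else
    echo "would run: totalview poe -a a.out $POE_OPTS"
fi
```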

After you run the ``totalview'' command, two windows should appear. In the larger window, you will see the assembly code for ``poe''. Type ``G'' (capital G) in this window to cause all processes to ``Go''. If you have not already issued ``kinit -f'', TotalView will fail at this point. Otherwise, TotalView will run for a few seconds and then ask whether you would like to stop your processes before entering ``MAIN''. Answer ``yes'' to stop your program at the beginning, so that you can add breakpoints, etc. before running.

For more information on using TotalView, see ``man totalview'' or type ``?'' within a TotalView window. The TotalView User's Guide is available on the SP in the following location.

For information on the TotalView command-line interface, see the TotalView Command Line Interface Guide, available on the SP in the following location.

Documentation for the TotalView graphical interface and command-line interface is also available directly from Etnus at the following URL.