PICL 2.0 represents a significant upgrade over earlier versions of PICL,
including a new trace file format and new tracing functionality, but the
message passing interface is essentially unchanged. Current PICL programs
should run without problem, with one caveat. The current design philosophy
behind PICL is that it is a low overhead compatibility library that
supports the writing of portable programs, but does not enforce it.
Platform-specific commands or functionality can still be invoked when
necessary to get good performance. For example, nonblocking communication
commands are important for good performance on Intel multiprocessors, and
so are supported in the PICL ports to Intel machines. Simulating
nonblocking communications on machines without native support leads to
performance degradation and potentially confusing changes in the order in
which messages are received. Calling the nonblocking PICL commands on such
systems will cause PICL to print an error message and exit.

This document is meant to serve as a brief summary of PICL 2.0 commands.
When used with the original documentation and the description of PICL
programming models found in picl2.models, it should be sufficient to
understand how to use PICL. New documentation, in the form of man pages,
is in preparation.

_______________________________________________________________________________

PICL routines fall into three general categories: low level point-to-point
communication routines, high level global communication routines, and
tracing routines. In the following, these three categories are briefly
reviewed and the corresponding C routines described. For most platforms,
the Fortran callable routines have the same names and parameters as the C
routines, modulo the usual differences in parameter passing. But, on some
platforms, C and Fortran external names are indistinguishable.
To force the correct routines to be linked, PICL 2.0 also provides
versions of the Fortran callable routines with an "f" suffix in the name.
Thus, for example, to be portable between SUN and RS6000 workstations, use
CALL SEND0F instead of CALL SEND0 when sending a message in a Fortran
program.


LOW LEVEL COMMUNICATION ROUTINES
--------------------------------

The low level commands are listed below. See picl2.models for a
description of the typical host and node program, and the roles that the
low level commands play. There are eleven low level commands in PICL 2.0
that were not in the original release (in order of appearance):
host0, sendbegin0, sendstatus0, sendend0, wait0, recvbegin0, recvstatus0,
recvend0, clocksync0, getdata0, setdata0.
These are described in somewhat more detail than the original commands.
See the reference guide for full descriptions of the original commands.

a) ENABLING COMMANDS:

   Host
   ----
   void open0(int *numproc, int *me, int *host)
        - allocate (up to) the indicated number of processors and open
          the host's communication channel, returning number of node
          processes and local/host id

   void load0(char *file, int node)
        - spawn processes using the indicated executable

   void close0(int release)
        - close host's communication channel, releasing processors if so
          indicated (release == 1)

   Node
   ----
   void open0(int *numproc, int *me, int *host)
        - open node's communication channel and return number of node
          processes, local id, and host id

   void close0()
        - close node's communication channel and exit

b) INFORMATION COMMANDS:

   int host0()
        - indicate whether executable loaded by a host program (>0)
          or not (0)

   void recvinfo0(int *bytes, int *type, int *node)
        - get information on length, type, and source of most recently
          received message

   void who0(int *numproc, int *me, int *host)
        - return number of node processes, local id, and host id

PICL supports both hostless and host/node programming models. host0 is
used to differentiate between the two models.

c) SEND-RELATED COMMANDS

   void send0(char *buf, int bytes, int type, int node)
        - send a message (blocking)

   void sendbegin0(char *buf, int bytes, int type, int node)
        - request that a message be sent and return immediately
          (nonblocking)

   int sendstatus0(int type)
        - check whether the oldest outstanding send request of indicated
          type has completed (1) or not (0)

   void sendend0(int type)
        - wait until oldest outstanding send request of indicated type
          has completed

   void message0(char *message)
        - Send a message of type MSG_TYPE back to standard out. If there
          is a host, then send the message via the host. Otherwise send
          it directly. The message should be a string of length less
          than 80 characters.

The commands sendbegin0, sendstatus0, and sendend0 correspond to isend(),
msgdone(), and msgwait(), respectively, on the Intel machines, and to
mpc_send(), mpc_status(), and mpc_wait() on the IBM SP machines. On the
other multiprocessors, these commands produce error messages and abort the
process. The parameters for sendbegin0 are the same as those for send0.
sendend0 uses type to identify the "sendbegin0" it is matched with. If
multiple sends of that type are outstanding, sendend0() waits until the
first "sendbegin0" of that type is finished.
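
The matching rule just described (a sendend0 of a given type pairs with the
oldest outstanding sendbegin0 of that type) can be pictured with a small
stand-alone model. This is only a sketch of the rule, not PICL code: the
names model_begin and model_end, the request ids, and the fixed-size queue
are all hypothetical, and no messages are actually sent.

```c
#include <stddef.h>

#define MAX_PENDING 16

/* One outstanding nonblocking request in this toy model (hypothetical). */
struct pending {
    int type;   /* message type given to the "begin" call       */
    int id;     /* illustrative tag so requests can be told apart */
};

static struct pending queue[MAX_PENDING];
static size_t npending = 0;

/* Models sendbegin0: remember the request in arrival order.
 * Returns 0 on success, -1 if the toy queue is full. */
int model_begin(int type, int id)
{
    if (npending == MAX_PENDING)
        return -1;
    queue[npending].type = type;
    queue[npending].id = id;
    npending++;
    return 0;
}

/* Models sendend0: complete the OLDEST outstanding request of this type.
 * Returns the id of the completed request, or -1 if none is outstanding. */
int model_end(int type)
{
    size_t i, j;

    for (i = 0; i < npending; i++) {
        if (queue[i].type == type) {
            int id = queue[i].id;
            /* Close the gap, preserving the order of the rest. */
            for (j = i + 1; j < npending; j++)
                queue[j - 1] = queue[j];
            npending--;
            return id;
        }
    }
    return -1;
}
```

In this model, beginning requests (type 7, id 100), (type 9, id 200), and
(type 7, id 101) and then calling model_end(7) twice completes id 100
first and id 101 second, leaving the type 9 request untouched.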

d) RECEIVE-RELATED COMMANDS

   int probe0(int type)
        - test whether a message of the specified type is waiting to be
          received (1) or not (0)

   void wait0(int type)
        - wait until a message of the specified type is available to be
          received

   void recv0(char *buf, int bytes, int type)
        - receive a message

   void recvbegin0(char *buf, int bytes, int type)
        - post a nonblocking receive request

   int recvstatus0(int type)
        - check whether the oldest outstanding receive request of
          indicated type has completed (1) or not (0)

   void recvend0(int type)
        - wait until oldest outstanding receive request of indicated type
          has completed

The commands recvbegin0, recvstatus0, and recvend0 correspond to irecv(),
msgdone(), and msgwait(), respectively, on the Intel machines, and to
mpc_recv(), mpc_status(), and mpc_wait() on the IBM SP machines. On the
other multiprocessors, these commands produce error messages and abort the
process. The parameters for recvbegin0 are the same as those for recv0.
recvend0 uses type to identify the "recvbegin0" it is matched with. If
multiple receive requests of that type are outstanding, recvend0() waits
until the first "recvbegin0" of that type is finished.

e) SYNCHRONIZATION:

   void clocksync0()
        - clock synchronization routine

   void sync0()
        - barrier synchronization

clocksync0() is used to calculate clock offset and drift among the
allocated processors, to normalize subsequent clock0() times, and to
synchronize the processors. If the sync option is specified in tracenode,
then clocksync0 is called implicitly. If clocksync0 is called explicitly
before tracing begins, then tracenode does not have to be called at the
same time on all processors (i.e. the sync option does not need to be
specified) in order for the trace records to make sense to ParaGraph. For
"everyday" type processor synchronization, sync0() is the better choice.
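
To make "offset and drift" concrete, the sketch below shows one way a local
clock reading could be normalized once those two quantities are known. It
assumes a simple linear clock model, local = offset + (1 + drift) *
reference, which is an assumption made for illustration only; this
document does not say how clocksync0 actually computes or applies its
corrections. The function name normalize_clock is hypothetical.

```c
/* Hypothetical helper, not a PICL routine.
 *
 * Assumed linear model (an illustration, not PICL's documented method):
 *     local = offset + (1 + drift) * reference
 * Solving for the reference time gives
 *     reference = (local - offset) / (1 + drift)
 */
double normalize_clock(double local, double offset, double drift)
{
    return (local - offset) / (1.0 + drift);
}
```

For example, with an offset of 2.0 seconds and a drift of 0.25, a local
reading of 7.0 maps back to a reference time of 4.0 under this model.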

f) SUPPORT FOR HETEROGENEOUS DATA TYPES

   void getdata0(char *datatype)
        - get implicit datatype used in interprocessor communications

   void setdata0(char *datatype)
        - set implicit datatype used in interprocessor communications
          (supported data types include "character", "short", "integer",
          "long", "float", "real", and "double")

PICL was originally designed for homogeneous tightly-coupled distributed
memory multiprocessors, and its instrumentation philosophy is based on
minimizing the perturbation of tracing on such machines. But, for ease of
program development, it is also sometimes useful to run PICL programs on a
network of workstations. To support this, one implementation of PICL has
it layered on top of PVM. In the network setting, the workstations may not
all have the same data formats. To allow PVM to take care of this for PICL
programs, the command setdata0 can be used to define what (homogeneous)
type of data is represented by the byte stream being sent or received in
PICL calls. Note that the size of the message passed to, for example,
send0 or recv0, is still in bytes. setdata0 only affects what PVM does
with the messages.

g) SUPPORT FOR BUFFERED COMMUNICATION

   void buffer0(int buffer, int bufsize, int nummsgs, char *userbuf)
        - enable system buffering for blocking send requests

Buffered communication for blocking sends is supported on all current PICL
platforms. For the Intel and nCUBE machines and for the T3D, the amount of
space available for buffering messages can be set at load time. The amount
of system buffer space on the IBM SP can be specified at run time (once),
and the buffer0 command has been introduced to support this functionality.
buffer0 is ignored on the other platforms.
The "buffer" parameter indicates whether to use buffered communication
(!= 0) or not (== 0).
The "bufsize" parameter indicates the number of bytes to allocate for
buffering.
The "nummsgs" parameter indicates the number of outstanding buffered
messages to support.
The "userbuf" parameter is ignored currently. In the MPI implementation of
PICL (under development), "userbuf" will be an (optional) pointer to a
user buffer to use for message buffering.

h) MISCELLANEOUS:

   void check0(int checking)
        - turn PICL parameter checking off (0) or on (1)

   double clock0()
        - read high resolution real time clock

   char *malloc0(int bytes)
        - PICL version of dynamic memory allocation request (obsolete)


TRACING ROUTINES
----------------

The tracing instrumentation and user interface have changed the most in
going from the original PICL to PICL 2.0, but most of the original tracing
functionality has been retained. The change in user interface that a user
is most likely to notice is in the interpretation of the tracing levels.
(See tracelevel description.) The primary changes in functionality deal
with support for collecting profiling-style statistics and for tracing
user-defined events.

There is a great deal of flexibility supported by the tracing interface,
but the following tracing "template" describes how tracing is most often
used in node processes. Note that trace data is never flushed explicitly
in the template. Rather, close0 takes care of the handshaking required to
guarantee that flushing does not interfere with the performance being
measured. open0 can be called any time before close0 in this template.

-----------------------------------------------------------------------------
tracefiles      - specify temporary and/or permanent trace files
                  (required if no host or host does not call traceenable,
                  otherwise data will never be saved. If node process 0
                  opens a tracefile, then data from other processes that
                  have not done so will be funneled through process 0.)
tracestatistics - specify which user events to collect statistics for
                  (optional)
tracelevel      - specify level of tracing
                  (optional, but no data collected if levels not set)
tracenode       - begin tracing
                  (sync option required if collecting data for ParaGraph;
                  sync option not required if only collecting statistics)
     .
     .
     .
traceevent      - to record user events
                  (optional, and called as often as needed)
tracedata       - to record special user event data
                  (optional, and called as often as needed)
     .
     .
     .
close0          - turn off tracing, wait until all processes are finished,
                  renormalize clocks if necessary, then flush trace data
                  to disk one process at a time
-----------------------------------------------------------------------------

a) ENABLING COMMANDS

   Host
   ----
   void traceenable(char *tracefile, int verbose)
        - trace enable routine. (Must be called before open0 for host to
          establish control over node trace collection.)
          - tracefile is the name of the disk file where trace data will
            be written
          - verbose == 1, fields in trace records are labelled
                    != 1, fields are not labelled (ParaGraph readable
                          form)

   void tracehost(int tracesize, int flush)
        - host trace initialization routine:
          - tracesize is the number of bytes to be allocated for trace
            data storage
          - flush == 1, if trace array fills up, send it back to
                        secondary storage
            flush == 2, if trace array fills up, overwrite it
            otherwise,  if trace array fills up, stop tracing

   Node
   ----
   void tracenode(int tracesize, int flush, int sync)
        - node trace initialization routine:
          - tracesize is the number of bytes to be allocated for trace
            data storage
          - flush == 1, if trace array fills up, send it back to
                        secondary storage
            flush == 2, if trace array fills up, overwrite it
            otherwise,  if trace array fills up, stop tracing
          - sync == 0, do nothing
                 == 1, execute clocksync0 to sync the processors
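
The node-side tracing template given at the start of this section can be
sketched in C roughly as follows. So that the sketch compiles without the
PICL library, every PICL routine is replaced by a stand-in stub that only
records its own name; the stubs, the call_log buffer, the function
traced_node_program, the file names, the buffer size, and the event number
are all hypothetical. The record strings "entry" and "exit" are assumed to
be the quoted names used in the traceevent description.

```c
#include <string.h>

/* Stand-in stubs, NOT the PICL library: each one appends its name to
 * call_log so the calling order of the template can be inspected. */
static char call_log[256];

static void log_call(const char *name)
{
    strncat(call_log, name, sizeof call_log - strlen(call_log) - 1);
    strncat(call_log, " ", sizeof call_log - strlen(call_log) - 1);
}

void tracefiles(char *tempfile, char *permfile, int verbose)
{ (void)tempfile; (void)permfile; (void)verbose; log_call("tracefiles"); }

void tracelevel(int picl, int user, int trace)
{ (void)picl; (void)user; (void)trace; log_call("tracelevel"); }

void tracenode(int tracesize, int flush, int sync)
{ (void)tracesize; (void)flush; (void)sync; log_call("tracenode"); }

void open0(int *numproc, int *me, int *host)
{ (void)numproc; (void)me; (void)host; log_call("open0"); }

void traceevent(char *recordstring, int event, int nparams, int *params)
{ (void)recordstring; (void)event; (void)nparams; (void)params;
  log_call("traceevent"); }

void close0(void)   /* the real node close0 also exits the process */
{ log_call("close0"); }

/* The template itself, with illustrative (hypothetical) arguments. */
void traced_node_program(void)
{
    static char tempfile[] = "/tmp/trace";  /* hypothetical temp prefix  */
    static char permfile[] = "run.trf";     /* hypothetical trace file   */
    static char entry[] = "entry";
    static char leave[] = "exit";
    int numproc = 0, me = 0, host = 0;
    int step = 1;                           /* integer data for event 0  */

    call_log[0] = '\0';

    tracefiles(tempfile, permfile, 0);  /* where trace data is saved     */
    tracelevel(1, 1, 1);                /* > 0: generate event records   */
    tracenode(100000, 0, 1);            /* begin tracing, sync == 1      */
    open0(&numproc, &me, &host);        /* any time before close0        */

    traceevent(entry, 0, 1, &step);     /* user event 0 begins           */
    /* ... the computation being measured would go here ... */
    traceevent(leave, 0, 1, &step);     /* user event 0 ends             */

    close0();                           /* stops tracing, flushes data   */
}
```

Note that there is no explicit traceflush call; as stated in the template
discussion, close0 is responsible for flushing.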

b) TRACING CONTROL

   void tracefiles(char *tempfile, char *permfile, int verbose)
        - used for specifying temporary disk storage for trace data, and
          for specifying a trace file distinct from that specified by the
          host in traceenable:
          - tempfile is the prefix (including directory) of the name of
            the disk file to be used for temporary storage of trace data.
            A suffix (the node number) is tacked on to make all temporary
            files unique.
          - permfile is the name of the disk file where this node's trace
            data should be sent for "permanent" storage.
          - verbose == 1, fields in trace records are labelled
                    != 1, fields are not labelled (ParaGraph readable
                          form)

   void tracestatistics(int events, int picltime, int piclcnt,
                        int piclvol, int usertime, int usercnt)
        - trace events initialization routine:
          - events is the number of user events for which statistics are
            to be collected. The event types for which statistics will be
            recorded are types {0,...,events-1}
          - picltime is a switch (0/1) indicating whether time spent in
            PICL events within user events should be measured
          - piclcnt is a switch (0/1) indicating whether PICL event
            occurrences within user events should be measured
          - piclvol is a switch (0/1) indicating whether PICL event
            volumes within user events should be measured
          - usertime is a switch (0/1) indicating whether time spent in
            (other) user events within user events should be measured
          - usercnt is a switch (0/1) indicating whether (other) user
            event occurrences within user events should be measured

   void tracelevel(int picl, int user, int trace)
        - set the types of tracing data collected:
          - picl:  tracing level for low level PICL commands
          - user:  tracing level for user-specified events and global
                   communication commands
          - trace: tracing level for TRACE commands
          For each level:
            if <  0, then tracing disabled.
            if >= 0, then statistics collected.
            if >  0, then event records generated.

   void traceinfo(int *remaining, int *picl, int *user, int *trace)
        - get tracing information:
          - remaining: approximate number of trace messages that can be
                       saved in the remaining free storage in the trace
                       array
          - picl:      tracing level for low level PICL commands
          - user:      tracing level for user-specified events and PORT
                       commands
          - trace:     tracing level for TRACE commands

   void traceexit()
        - stop tracing

   void traceflush()
        - send trace data to the temporary or permanent trace file and
          flush the data array. (Implicitly called in close0 if tracing
          has ever been enabled.)

   void traceport(int enable, int base)
        - enable/disable the tracing of global communication routines as
          user events and set the eventtype "base" for these events:
          - enable = 0; tracing disabled
                   = 1; tracing enabled
          - base   = barrier0 event type
            base+1 = bcast0 event type
            base+2 = bcast1 event type
            base+3 = gather0 event type
            base+4 = gcomb0 event type

Statistics records are a good way to determine general performance without
requiring the collection of detailed trace data. But collecting "nested"
statistics (recording occurrences of events within a given event)
potentially requires a large amount of internal storage. tracestatistics
is used to specify what user event statistics are to be collected, and
whether nested statistics are to be collected. Note that PICL low level
and trace commands are not nested, so tracestatistics only affects user
and global communication events. tracestatistics only allocates memory for
the collection. tracelevel must still be used to enable the collection of
statistics.

c) USER EVENTS

   void traceevent(char *recordstring, int event, int nparams,
                   int *params)
        - used to mark the beginning ("entry"), ending ("exit"), or
          simple occurrence ("mark") of a user event, to label the event
          ("label"), or to write a message to the trace file immediately
          ("message"). The data associated with the entry, exit, and mark
          records should be integer.
          The data associated with the label and message records should
          be character.

   void tracedata(int event, int dataid, char *datatype, int items,
                  char *data)
        - used to save event data (supported data types include
          "character", "integer", "long", "float", "real", and "double")

   void traceblockbegin(int event, int location, int instance)
        - used to mark the beginning of a block of code
          (obsolete; replaced functionally by traceevent)

   void traceblockend(int event, int location, int instance)
        - used to mark the end of a block of code
          (obsolete; replaced functionally by traceevent)

   void tracemark(int event)
        - used to mark the occurrence of an event
          (obsolete; replaced functionally by traceevent)

   void tracemarks(int *markarray, int size)
        - used to record an array of integer data
          (obsolete; replaced functionally by tracedata or traceevent)

   void tracemsg(char *message)
        - used to send a message to the trace file immediately
          (obsolete; functionality now provided by traceevent)

Support for tracing user events has evolved over time, and many of the
"new" commands that were added have since become obsolete. The recommended
commands are now traceevent and tracedata. traceevent is used to record an
event (mark or entry/exit), to label an event, or to send any high
priority messages directly to the trace file. tracedata is used to record
noninteger event data or integer event data that is reasonably associated
with the mark, entry, or exit records.

d) NON-PICL SYSTEM EVENTS

The following commands are used to mark the beginning ("entry"), ending
("exit"), or simple occurrence ("mark") of a piece of code involving a
system event not supported by PICL.

   void fclose1(char *recordstring, int channel)
        - used to mark closing a disk file or a channel

   void fopen1(char *recordstring, char *string, int channel)
        - used to mark opening a disk file or a channel

   void read1(char *recordstring, int bytes, int channel)
        - used to mark reading from a disk file or a channel

   void write1(char *recordstring, int bytes, int channel)
        - used to mark writing to a disk file or a channel

I/O is an important aspect of performance on multiprocessors. But the
vendor-specific I/O interface is constantly evolving, and standardization
is only just now being addressed. With these commands, PICL 2.0 provides a
mechanism for recording performance data on I/O despite the lack of a
corresponding PICL command.

e) GENERIC EVENTS

The following commands are used to mark the beginning ("entry"), ending
("exit"), or simple occurrence ("mark") of a piece of code representing a
generic system or user event, as described below.

   void idle1(char *recordstring)
        - used to mark code as representing idle time

   void overhead1(char *recordstring)
        - used to mark code as representing overhead

   void system1(char *recordstring)
        - used to mark code as representing system overhead

I/O is not the only non-PICL system level performance issue. Also, the
user often knows that a section of code represents redundant or
superfluous work, and wants to label it as such, for example, when
visualizing the trace data with ParaGraph. These commands provide a
mechanism for recording such events.


HIGH LEVEL COMMUNICATION ROUTINES
---------------------------------

The global communication routines have changed very little, and the
original documentation is still essentially correct. The only significant
change is that there are now default values for the architectural
parameters, and setarc0 need not be called before using the global
communication routines. These routines and brief descriptions are listed
below for completeness.

a) ENABLING COMMANDS:

   Host
   ----
   void setarc0(int *nprocs, int *top, int *ord, int *dir)
        - set architectural parameters or restore default values:
          - top=1 (hypercube), dir=1 (forward), ord=0 (natural)

   Node
   ----
   void setarc0(int *nprocs, int *top, int *ord, int *dir)
        - get architectural parameters from host or return default values
          if hostless:
          - top=1 (hypercube), dir=1 (forward), ord=0 (natural)

b) SYNCHRONIZATION:

   void barrier0()
        - barrier synchronization using dimensional exchange generalized
          for non-power-of-two numbers of processors. Cannot be called by
          the host.

c) BROADCAST:

   void bcast0(char *buf, int bytes, int type, int root)
        - broadcast message buf, of given length and type, to all
          processors, using given interconnection topology. This version
          is meant for synchronous applications in which all processors
          communicate at about the same time.

   void bcast1(char *buf, int bytes, int type, int root)
        - broadcast message buf, of given length and type, to all
          processors, using given interconnection topology. This version
          is meant for asynchronous applications in which communication
          and computation are overlapped by pipelining.

   Host versions simply send the message to node 0, which in turn
   broadcasts the message to the remaining nodes.

d) GATHER:

   void gather0(float *vec, int n, int type, int root)
        - gather components of a vector into root from all processors
          using given topology. Caution: vec is overwritten.

   To "gather" results into host, host should post a receive with the
   indicated message type.

e) GLOBAL REDUCTION:

   The global reduction commands are all based on calls to gcomb0,
   described in detail below. The other predefined global reduction
   operators are listed following gcomb0.

   void gcomb0(char *buf, int items, int datatype, int msgtype, int root,
               void (*comb)())
        - componentwise combination of a vector over all processors,
          using given topology:
          - buf       array of data to be combined. Caution: buf is
                      overwritten.
          - items     number of items in array buf.
          - datatype  code number for type of data:
                      = 0 : char
                      = 1 : short  (int*2 in Fortran)
                      = 2 : int
                      = 3 : long   (int*4 in Fortran)
                      = 4 : float  (real*4 in Fortran)
                      = 5 : double (real*8 in Fortran)
          - msgtype   user-defined id to distinguish messages.
          - root      processor in which final result will reside.
          - comb      name of user-supplied function to be applied in
                      combining data. The operation defined by comb must
                      be associative and commutative in order for result
                      to be well defined independent of topology and
                      order. Examples are max, min, +, *, &, |, ^.

   Host simply receives combined result from node 0, which has collected
   combined results from remaining nodes.

   void gand0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise logical and of a vector over all processors,
          using given topology

   void gmax0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise maximum of a vector over all processors

   void gmin0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise minimum of a vector over all processors

   void gor0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise logical or of a vector over all processors

   void gprod0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise product of a vector over all processors

   void gsum0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise sum of a vector over all processors

   void gxor0(char *buf, int items, int datatype, int msgtype, int root)
        - componentwise logical exclusive or of a vector over all
          processors

f) MISCELLANEOUS:

   int antipode0(int root)
        - compute antipodal node from root in ring topology. Used in
          bidirectional ring communication.

   int datasize0(int datatype)
        - compute size of dataitem in bytes, using gcomb0 datatype
          encoding

   void getarc0(int *nprocs, int *top, int *ord, int *dir)
        - return architectural parameter values

   int ginv0(int i)
        - inverse Gray code function

   int gray0(int i)
        - Gray code function

   int neighbor0(int i, int node)
        - neighbor i links away from given node in a ring topology. A
          positive value for i gives i-th neighbor in forward direction,
          a negative value gives i-th neighbor in backward direction.
          Values of 1 and -1 give immediate neighbors, larger values,
          1 < i < p, give more distant "neighbors".
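
As an illustration of what the last three routines compute, the following
stand-alone sketch implements the standard binary-reflected Gray code, its
inverse, and a ring neighbor calculation. It is not the PICL source. The
names gray_code, gray_code_inverse, and ring_neighbor are hypothetical, the
choice of the binary-reflected code is an assumption, and ring_neighbor
takes the number of processors p as an explicit argument and assumes
natural ordering in the forward direction, whereas neighbor0 obtains these
settings from the architectural parameters.

```c
/* Binary-reflected Gray code of i: consecutive values of i map to codes
 * that differ in exactly one bit. */
unsigned gray_code(unsigned i)
{
    return i ^ (i >> 1);
}

/* Inverse of gray_code: XOR together g, g>>1, g>>2, ... which undoes the
 * single shift-and-XOR applied by gray_code. */
unsigned gray_code_inverse(unsigned g)
{
    unsigned i = 0;

    for (; g != 0; g >>= 1)
        i ^= g;
    return i;
}

/* Node that is i links away from `node` on a ring of p nodes numbered
 * 0..p-1 in natural order (an assumption of this sketch). Positive i
 * steps forward, negative i steps backward. Requires p > 0. */
int ring_neighbor(int i, int node, int p)
{
    int r = (node + i) % p;

    /* C's % can yield a negative remainder; wrap it back into 0..p-1. */
    return (r < 0) ? r + p : r;
}
```

For example, gray_code maps 0, 1, 2, 3, 4 to 0, 1, 3, 2, 6, and on a ring
of four nodes ring_neighbor(-1, 0, 4) is node 3.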