Resource Management and Accounting Notebook - page 7 of 150



Resource Management Dictionary

Account: A collection of resource credits allocated to a particular project against which credit or debit transactions can be applied. There may be multiple users associated with a single account and a user may be a member of multiple accounts. An account is also a credential to which scheduling policies and statistical tracking may apply.

Allocation Manager (also Allocation Bank): A program that rations computational resources to projects and users. A user's account is debited upon completion of each job, and the user cannot run further jobs once the account has been exhausted.
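The debit-on-completion behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and charging one credit per processor-hour is an assumed rate that the definitions do not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """A pool of resource credits shared by the members of a project."""
    name: str
    credits: float
    members: set = field(default_factory=set)


class AllocationManager:
    """Hypothetical allocation bank: checks balances and debits finished jobs."""

    def __init__(self):
        self.accounts = {}

    def add_account(self, account):
        self.accounts[account.name] = account

    def can_run(self, user, account_name):
        """A user may run jobs only as a member of a non-exhausted account."""
        account = self.accounts[account_name]
        return user in account.members and account.credits > 0

    def debit(self, account_name, processors, wallclock_hours):
        """Charge a completed job (assumed rate: one credit per processor-hour)."""
        charge = processors * wallclock_hours
        self.accounts[account_name].credits -= charge
        return charge


bank = AllocationManager()
bank.add_account(Account("climate", credits=100.0, members={"alice", "bob"}))
before = bank.can_run("alice", "climate")   # account still has credits
bank.debit("climate", processors=16, wallclock_hours=6.25)
after = bank.can_run("alice", "climate")    # account is now exhausted
print(before, after)
```

Because the account is shared, exhausting it blocks every member, not only the user whose job caused the final debit.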

Class (also Queue): A class is a logical container to which particular species of jobs may be associated. Classes may be tied to particular resources and be associated with specific policies or credentials.

Collector:

Disk: A quantity of local disk available for use by batch jobs. Disk is a consumable resource.

Group: A scheduling or authorization credential typically directly mapping to a UNIX groupid.

Information Service: Acts as a repository for storing and retrieving information of interest to the resource management system. The information service must support remote requests and protect the information from unauthorized access. It should support structured data and complex queries (joins).

Job: A complex unit of work submitted to the queue. A job may comprise a mixture of subjobs and tasks, and this structure is recursive: subjobs may themselves be composed of subjobs and/or tasks. There may or may not be dependencies between tasks. The simplest job consists of a single task.
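One way to picture the recursive structure is a small tree in which a job holds tasks and other jobs. The sketch below is an assumption made for illustration only; the type names and the `flatten_tasks` helper are hypothetical and not part of any resource manager.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Task:
    name: str


@dataclass
class Job:
    name: str
    # A job may mix tasks and subjobs, so the structure is recursive.
    parts: List[Union["Job", Task]] = field(default_factory=list)


def flatten_tasks(job):
    """Return every task in a job, descending into subjobs recursively."""
    tasks = []
    for part in job.parts:
        if isinstance(part, Job):
            tasks.extend(flatten_tasks(part))
        else:
            tasks.append(part)
    return tasks


# The simplest job consists of a single task.
simplest = Job("hello", [Task("run")])

# A job composed of tasks and a subjob.
pipeline = Job("pipeline", [
    Task("stage_in"),
    Job("compute", [Task("solve"), Task("reduce")]),
    Task("stage_out"),
])

task_names = [task.name for task in flatten_tasks(pipeline)]
print(task_names)
```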

Machine (also System): A named collection of (one or more) nodes defined to a resource management system as an autonomous resource on which jobs can be run. This could be an IBM SP, an Origin 2000, a single HP workstation, or a cluster of PCs. Jobs can span multiple machines in the case of co-allocation.

Memory: A quantity of physical memory (RAM). Memory is a consumable resource.

Node: A single processing element of a parallel computer, and the smallest separate computational unit of that computer. A node is a logical container that provides resources to jobs. A node might be a single CPU (e.g., an IBM SP2 thin node) or it might be a multi-CPU SMP node (e.g., a dual Pentium Pro system that is part of a networked cluster used as a parallel machine). Each node is assumed to have local virtual memory and a means of communicating with the other nodes in the system.

Node Monitor (or Node Manager): A system that collects dynamic node information about every node in a compute system. This information may include load average, free memory, etc. The node monitor should include both a daemon on each node and a collection server agent to provide a central point of contact for other components.

Processor (same as CPU): The smallest processing unit. A processor is a consumable resource. Nodes typically consist of one or more processors.

Quality of Service (QOS): A credential or specifier used to assign special services, resources, etc.

Queue Manager: Provides a full database of static job information. Archives data for completed jobs. Provides command routing based on job status (e.g., qdel on a queued versus a running job).

Reservation: A hold reserving a specific collection of resources or resource credits for a specific timeframe for use by jobs which meet specific conditions.
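A reservation therefore has three parts: a set of resources, a timeframe, and a condition that jobs must meet. A minimal sketch, assuming node names as the resources, plain numbers as timestamps, and a predicate over a job description as the condition (all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Reservation:
    nodes: FrozenSet[str]            # the reserved resources
    start: float                     # start of the timeframe
    end: float                       # end of the timeframe (exclusive)
    admits: Callable[[dict], bool]   # condition a job must meet to use the hold

    def blocks(self, job, node, start, end):
        """True if this reservation keeps `job` off `node` during [start, end)."""
        overlaps = start < self.end and end > self.start
        return node in self.nodes and overlaps and not self.admits(job)


# Nodes n1 and n2 are held from t=100 to t=200 for jobs from the "staff" group.
maintenance = Reservation(
    nodes=frozenset({"n1", "n2"}),
    start=100.0,
    end=200.0,
    admits=lambda job: job.get("group") == "staff",
)

user_job = {"group": "users"}
staff_job = {"group": "staff"}
print(maintenance.blocks(user_job, "n1", 150.0, 250.0))
print(maintenance.blocks(staff_job, "n1", 150.0, 250.0))
```

A job is blocked only when all three parts apply: it wants a reserved node, its time window overlaps the hold, and it does not meet the condition.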

Resource: Something that can be used for a period of time. It might be shared or used exclusively. It may or may not be renewable.

Resource Manager: A system that divides up available resources in such a way as to maximize resource utilization, while ensuring that resources are not oversubscribed. Also ensures that each user has a "fair share" of the available resources.

Scheduler: A program that maps jobs to resources subject to local policies.

Swap: A quantity of virtual memory. Swap is a consumable resource.

Task: The smallest schedulable unit of work requiring resources. A task may be a computational task or a non-computational task such as file staging. Tasks may have dependencies on other tasks. A job may be composed of multiple tasks.
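Dependencies between tasks constrain the order in which they can start. The sketch below shows one possible way to derive such an order, assuming dependencies are given as a mapping from each task name to the names it must wait for. The function name and input format are hypothetical, and every task named as a dependency is assumed to appear as a key.

```python
def run_order(dependencies):
    """Return task names so that each task follows all tasks it depends on.

    `dependencies` maps a task name to the set of task names it needs first.
    Tasks that become ready at the same time are emitted in alphabetical order.
    """
    remaining = {task: set(needs) for task, needs in dependencies.items()}
    order = []
    while remaining:
        ready = sorted(task for task, needs in remaining.items() if not needs)
        if not ready:
            # Nothing can start: the remaining tasks wait on each other
            # (or on a task that was never declared).
            raise ValueError("unsatisfiable dependencies among tasks")
        for task in ready:
            order.append(task)
            del remaining[task]
        for needs in remaining.values():
            needs.difference_update(ready)
    return order


# File staging is a non-computational task that the computation depends on.
deps = {
    "stage_in": set(),
    "compute": {"stage_in"},
    "stage_out": {"compute"},
}
order = run_order(deps)
print(order)
```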

Task Manager (also Process Manager, Job Execution Manager, Task Launcher): As a whole, these components set up nodes for a user's job, start the job, monitor the job, possibly stop the job, and clean up the nodes upon job termination.

User: A uniquely identifiable entity which places demands upon the system, for instance by submitting jobs, querying status, etc. A "User" might be a research group rather than a single individual person (unless prohibited by site policy). Conceivably a "User" might not be a living entity, but could be a programmatic construct which submits jobs. A user is represented by a local handle and typically maps directly to a UNIX userid. It may additionally have a mapping to a Global User ID, which ties it to a single entity or purpose.

Wallclock Limit: The estimated maximum duration of a job.

Wallclock Time: The duration of a job as accounted in real time.


Addendum: Paul Hargrove Date: Wed Sep 5 22:12:18 2001 (GMT)
Supplied definitions marked [PHH].

I am uncertain about the distinction desired between the terms "Host" and "Node". I introduced "CPU", which I find unambiguous, and assigned a meaning to "Node" which some might feel belongs to "Host".



Addendum: John Kochmar Date: Thu Sep 6 19:35:39 2001 (GMT)
I've heard people make the following (admittedly weak) distinctions between nodes and hosts: when they talk about a cluster, they refer to the machines as "nodes", implying they're not really "stand alone" but part of some larger entity. When they use the term "host", they mean to imply that it's a resource in and of itself, like a large SMP. I'm not sure I really care much for the distinction, which becomes even more blurred as you start having clusters of SMPs, and meta-clusters of clusters. I'd like to have a name for an entity (be it cluster or SMP or whatever), a name to describe separable components (like host or node), and something to describe each "computational unit" that can run a dedicated process of the job (CPU is OK).