# $Id: initial-nest-reqs.txt,v 1.20 2004/06/17 21:36:00 tjn Exp $
#
# Initial pass at NEST requirements based on Neil's description email.
#
# NOTE THE WORD 'initial'!  I expect these will need some refinement,
# and more importantly the implementation overview will likely need to be
# refined.  Hopefully this will get the ball rolling and we can get a
# 'nest' prototype knocked out that fills the current requirements ASAP.
# Please send feedback to 'oscar-devel' (and/or "naughtont at ornl.gov")
#   --tjn 6/11/04
#
# KEY:
#   "Q:"   - question
#   "N:"   - note
#   "opkg" == "OSCAR Package"


NEST: (N)ode (E)vent and (S)ynchronization (T)ools

NEST is concerned with the synchronization of "OSCAR Packages" for a node
based upon the definitions in the database (ODA) for the given node.  The
primary focus is to improve scalability by limiting global operations,
i.e., only run an API script if necessary rather than "re-run for all".
Additionally, the node descriptions are more fine-grained (per node),
allowing for improved diversity among the cluster nodes without much
overhead.  In order to keep things within reason, the configuration
tracking is limited to "OSCAR Packages".

The following is a general list of requirements for the NEST tool.  An
attempt was made to separate the implementation details from the tool's
requirements (isn't that what the books say to do? :).  Additionally, where
possible an attempt was made to separate out functionality that might
reduce the overall complexity of the tool, e.g., Node State Mgmt.  Lastly,
notes and a brief implementation outline (based on discussions) are listed
at the bottom of this document.

#----------------------------------------------------------------------
1) DATA

 - storage
    o central database ODA
    o ODA (database) records all node config data (limited to
      "OSCAR Packages")
      #N: See also 'NOTE-JE1'
    o ODA (database) is network accessible

 - nodes
    o node_name     -- the network addressable node identifier
                       (IP, hostname, ...)
    o node_grp_name -- group of node_name's, resolves to a unique list
                       of names

 - "nest" config data
    o nest configuration (behavior) parameters are stored in ODA (database)
    o data type - *only* "OSCAR Packages"
      #N: Limited only to "OSCAR Packages" (and underlying RPMs, DEBs, etc.)
       + softw pkgs (RPMs, DEBs)
       + config scripts
       + identified by NAME, NAME+VERSION
       + location of new data/configs (ex. oscar_server:/tftpboot/rpm)
         #N: I think this can also be http://oscar_server/rpms, etc.

 - node config cache
    o "config cache" stored on the local node filesystem
      #N: This is used to assist in the 'diff calculation'

 - node groups
   #Q: I'm not sure if this is a requirement; it has shown up a lot in
   #   discussions but may just be implementation details cropping up.
    o node groups return a unique list of nodes, e.g.,
        grpA ::= (node1, node2, node3)
        grpB ::= (node3, node4, node5)
        grpC ::= (grpA, grpB)
        grpC  => (node1, node2, node3, node4, node5)
      'node3' is listed only once!  (See the sketch following this item.)
   #N: This is important b/c when you split operations over node_grps, you
   #   won't get duplicate operations when adding to the node queue(s)
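#N: Illustrative only -- a minimal Python sketch (not the actual OSCAR/ODA
#   implementation) of how a nested node group definition could resolve to
#   a unique, order-preserving list of node names.  The group data below is
#   just the hypothetical grpA/grpB/grpC example from above.

      # Group definitions: a member is either a node name or another group.
      groups = {
          "grpA": ["node1", "node2", "node3"],
          "grpB": ["node3", "node4", "node5"],
          "grpC": ["grpA", "grpB"],
      }

      def resolve_group(name, groups):
          """Expand a (possibly nested) node group into a duplicate-free,
          order-preserving list of node names."""
          seen, nodes = set(), []

          def walk(item, visiting=frozenset()):
              if item in groups:                 # it's a group -> recurse
                  if item in visiting:           # guard against cyclic definitions
                      return
                  for member in groups[item]:
                      walk(member, visiting | {item})
              elif item not in seen:             # it's a node -> keep first occurrence
                  seen.add(item)
                  nodes.append(item)

          walk(name)
          return nodes

      print(resolve_group("grpC", groups))
      # -> ['node1', 'node2', 'node3', 'node4', 'node5']   ('node3' only once)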
 - node [operation] queues
   #Q: I'm not sure if this is a requirement; it has shown up a lot in
   #   discussions but may just be implementation details cropping up.
    o "oscar_nest" on the server marshals "synch" commands to avoid
      excessive load on the server/ODA -- done via queue entries
    o the node queue stores the operations that a node needs to perform
      (e.g., del OPKG-foo, add OPKG-bar, config OPKG-baz, ...)
    o node queues preserve order of operation & help scalability
   #Q: Not sure where these fit?
   #    - the comments on serializing things were about reducing load on
   #      the server & ODA
   #    - the server is responsible for separation/marshalling of the
   #      queues for node operations
   #N: ORIGINAL TEXT,
   #   "When anyone has made a change in the oscar database that
   #    needs to be reflected on one or more nodes, they can issue
   #    a nest 'synchronize' command on the oscar server, specifying
   #    one or more node names and/or node group names, and the nest
   #    daemon will queue the changes, making sure that each node only
   #    has one nest synchronize operation happening on that node at any
   #    given time.  When synchronize commands to many nodes occur in
   #    rapid succession, nest can limit the total number of nodes being
   #    synchronized at any given time, leaving the rest of the
   #    synchronization requests in the queue, to prevent having too
   #    many nodes pounding the central database or running into some
   #    other cluster limitation.  When a synchronize command for all
   #    of a node or for the same package(s) on a node comes in too
   #    quickly after a previous synchronize command finishes that
   #    overlapped one or more packages being synchronized on the same
   #    node, nest will delay the new synchronize commands.  Nodes by
   #    default will be limited to running only one synchronize operation
   #    per node at once."

 - node state
   #N: inspired by SSS NodeStateMgr states
    o types: {admin, run}
      + admin: build, synch, boot, diag, on, off
      + run:   online, offline
      #N: Std system services just care about "run" states; the
      #   administrative components (eg. NEST) care about "admin" states
      #   too.
    o state descriptions
      + admin:build -- initial phase where the entire build (pxe, etc.) occurs
      + admin:synch -- maint phase where node configs are synchronized
      + admin:boot  -- node in startup phase, not ready for general cmds
      + admin:diag  -- node in diagnostic phase, not avail for general cmds
      + admin:on    -- node in normal running mode (not nec. online though)
      + admin:off   -- node is powered off
      + run:online  -- node is running & available for general user use
      + run:offline -- node is not available for general user use
   #Q: NEED TO DEFINE STATE TRANSITION COMMANDS SEPARATE FROM NEST.

 - pkgs and OSCAR pkgs should be the same; possibly create very "simple"
   OSCAR pkgs that just contain RPM lists to keep things consistent?
   #JE: This depends on how removals are handled, and how users would accept
   #    the notion of being forced to modify ODA in order to add/remove an
   #    rpm from a node(s). ... user satisfaction in this case would
   #    hinge on the interface provided to them.

#----------------------------------------------------------------------
2) LOCATION

 - runs on node(s)
 - runs on the server node
   #N: The server node is treated as just another node so operations are
   #   uniform
 - runs as a standard 'init.d/' type of service
 - service is called "oscar_nest"
   #N: the nest/oscar server doesn't *have* to be the headnode -- this is
   #   noted for the ability to later improve security/restriction to this
   #   oscar mgmt. node, possibly turning off SSH access from compute
   #   nodes to this mgmt. node

#----------------------------------------------------------------------
3) ACTIONS

 - operations to perform
    o private
      + query ODA (database) for node config data (softw pkgs, etc.)
        #N: The query is based on a node_name or node_grp_name, which
        #   resolves to a list of node_name(s).  The queries provide what
        #   "OSCAR Packages" should be on the node.
        #XXX: This needs to be better defined/scoped!!!
      + calculate diff (used internally by something like "synch";
        see the sketch just below)
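#N: Illustrative only -- a minimal Python sketch (hypothetical data
#   structures, not the real ODA query API) of the private 'query' +
#   'calculate diff' step: compare the opkg NAME/VERSION list that ODA says
#   a node should have against the node's local config cache, and produce
#   the delete/add/update operations needed.

      def calc_diff(desired, installed):
          """desired/installed: dicts mapping opkg NAME -> VERSION.
          Returns (deletes, adds, updates) for this node."""
          deletes = [name for name in installed if name not in desired]
          adds    = [name for name in desired   if name not in installed]
          updates = [(name, installed[name], desired[name])
                     for name in desired
                     if name in installed and installed[name] != desired[name]]
          return deletes, adds, updates

      # Hypothetical example: ODA says the node should have pvm-3.4 and
      # torque-1.1; the local cache says pvm-3.3 and lam-7.0 are installed.
      desired   = {"pvm": "3.4", "torque": "1.1"}
      installed = {"pvm": "3.3", "lam": "7.0"}
      print(calc_diff(desired, installed))
      # -> (['lam'], ['torque'], [('pvm', '3.3', '3.4')])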
    o public
      + delete opkg(s)
      + add opkg(s)
      + update opkg(s)
      + "synch" node  [compare & convert into a list of del/add/upgrade cmds]
        #N: only one "synch" operation is allowed per node at a given time
      + "config" opkg(s)  [configure/reconfigure]
        #N: File-level changes are only managed to the extent that they are
        #   made from within the OSCAR Package API

 - description (what, not how)
   #Q: There is some confusion about what is done from the node and what is
   #   done on the server side.  Sometimes it seemed that nodes figured out
   #   what they were supposed to do based on a simple query... other times
   #   it seems they just look at the server, still via a query, and just
   #   pull steps out of the node's "operation queue"... this implies that
   #   the server stuffed the operations into the queue and in turn figured
   #   out/calculated what operations were needed... I thought the point was
   #   that this was to be farmed out to the nodes so they could figure it
   #   out for themselves.  Seems like a conflict that definitely must be
   #   resolved before implementation b/c it implies two different
   #   approaches to telling nodes how to decide what they should do... or
   #   figure out what they should do.

    o query     -- queries the ODA database for a given node's data

    o calc_diff -- looks at node pkg info (from a query) and compares it
                   against the current node config to determine what
                   operations are necessary to make the node match the
                   config data in ODA (note: currently limited to only
                   "OSCAR Packages").  The use of timestamps on node configs
                   in ODA & the local node cache is to speed/assist with
                   this calculation.
                   #Q: Is this calculation done on the node side?

    o delete    -- remove an "OSCAR Package" on the node.  This operation is
                   performed local to the node, i.e., not via a remote C3
                   type of command, based on operations stored in the node's
                   operation queue.

    o add       -- add an "OSCAR Package" on the node.  This operation is
                   performed local to the node, i.e., not via a remote C3
                   type of command, based on operations stored in the node's
                   operation queue.

    o update    -- basically an atomic 'delete/add' combo for an
                   "OSCAR Package" from NAME-verA -> NAME-verB.  This
                   operation is performed local to the node, i.e., not via a
                   remote C3 type of command, based on operations stored in
                   the node's operation queue.
                   #N: this command requires fully qualified pkg names to
                   #   keep things distinct, i.e., 'update foo' must be
                   #   'update foo-1.0 foo-2.0'.  Also must have a
                   #   distinction in the filesystem; currently we just have
                   #   'foo' and update clobbers it.

    o synch     -- general command to synchronize a node to the list of
                   "OSCAR Packages" (i.e., the config) in the database
                   (ODA).  This involves the use of 'query' and 'calc_diff'
                   in order to create a list of operations to perform.

    o config    -- this command allows one to run (or re-run) a given script
                   for a node.  This is necessary b/c it reduces the need
                   for global operations and removes unnecessary global
                   script runs, e.g., "post_install" on all nodes when it is
                   only needed on node5.

 - optional/required operations (scalability/performance/time-to-implement)
   #Q: I think all of the cited commands are required after my initial
   #   survey... true?

 - constraints (scalability/performance/time-to-implement)
   #Q: Don't know... seemed like something that should be mentioned?

 - transport of data? (scp, http-put/get?)
    o password-less SSH/SCP access to the server from the node
    o access to ODA (database) running on the server from the node
      + copy /etc/odapw (db passwd file) from server to node
      + copy /etc/odafunctions (db functions dir) from server to node
      + use password-less 'scp' to copy files from server to node
        (see the sketch just below)
   #Q: Don't know... seemed like something that should be mentioned?
   #   Should probably refine these requirements.
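#N: Illustrative only -- a minimal Python sketch of the bootstrap copy
#   mentioned above, pulling /etc/odapw and /etc/odafunctions from the
#   server to the node over password-less scp.  The 'oscar_server' host
#   name is an assumption, not a spec.

      import subprocess

      SERVER = "oscar_server"                       # hypothetical server name
      FILES  = ["/etc/odapw", "/etc/odafunctions"]  # ODA passwd file + functions dir

      def pull_oda_access_files(server=SERVER, files=FILES):
          """Copy the ODA access files from the server to this node.
          Assumes the files do not already exist locally."""
          for path in files:
              # -r covers /etc/odafunctions (a directory);
              # -p preserves permissions on the password file.
              subprocess.run(["scp", "-rp", "%s:%s" % (server, path), "/etc/"],
                             check=True)

      if __name__ == "__main__":
          pull_oda_access_files()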
 - "oscar_nest" service commands (init.d/ script)
    o start
    o stop
    o restart
    o status

#----------------------------------------------------------------------
4) MODE OF OPERATION/INVOCATION

 - service: daemon vs. invokable script
   #Q: passive/active?
 - input/output format (commands)
   #N: No format was specified; it might make sense to just say the commands
   #   will be ASCII... maybe XML?
 - one nest daemon/service instance per node (lock file)

#----------------------------------------------------------------------
5) COMMUNICATION

 - input/output format (XML, etc.)
 - network (tcp/udp?)
 - transport of data? (scp, http-put/get?)
    o data (updates, etc.) is "pulled" from the server to the node
   #N: No format was specified; it might make sense to just say the commands
   #   will be ASCII... maybe XML?

#----------------------------------------------------------------------
6) REQUIRES (consumes)

 - oda (database)
    o must have network-accessible access to the database (for queries etc.)
      #N: Many of the implementation discussions hinged upon ODA
      #   layout/features, and I'm adding this comment to simply say that
      #   NCSA's thoughts included the use of a hierarchical node_grouping
      #   mechanism.  Stock system default groups (think sets):
      #       {all, oscar_clients, oscar_server, nodenameX}
      #   where "nodenameX" would be a group of a particular node's name,
      #   a 1:1 mapping.  In this scheme,
      #       most general  -> "all" group
      #       most specific -> "nodenameX" group
      #   The basic rule for resolving conflicts on add/del operations:
      #       - most specific overrides most general
      #   That is to say, "nodenameX" group info overrides "all" group info,
      #   and similarly throughout the group items in between.
    o copy /etc/odapw (db passwd file) from server to node
    o copy /etc/odafunctions (db functions dir) from server to node
    o use password-less 'scp' to copy files from server to node
 - remote cmd invocation [password-less] (rsh/ssh)
 - native package mgmt system dependency resolution (ex. DepMan)
 - native package mgmt system installation abstraction (ex. PackMan)
 - state mgmt service? (NodeStateMgr)
 - depends upon oscar-httpd (to get files from the server via http get)
   #N: not sure oscar-httpd is a hard requirement per se

#----------------------------------------------------------------------
7) PROVIDES (produces)

 - softw synchronization for a node (limited to "OSCAR Packages")

#----------------------------------------------------------------------
8) USERS (nest consumers)

 - OSCAR node mgmt GUI
    o synchronize a node via triggers from the install/maint GUI running on
      the server

#----------------------------------------------------------------------
9) USAGE

 - Summary of steps for the initial cluster installation process
   (an illustrative sketch of step 4 follows this list):
    1. Build basic/general image (SIS)
    2. Add OSCAR Pkgs to ODA per node
    3. Push image (SIS) to node(s)
    4. Run NEST
       a. query ODA from the node for OSCAR Pkgs (for "self"/node)
       b. calculate (del/add/reconfig) list for OSCAR Pkgs
          #N: Do "deletes" before "adds"; this is the initial install, so
          #   there are only "adds"
       c. resolve dependencies for native pkgs (i.e., RPM deps for files in
          the OSCAR Pkgs)
       d. copy/pull necessary files to the node (OSCAR Pkg {RPMS, scripts, ...})
       e. update/create timestamp
          #Q: I'm not sure where this timestamp lives or its format
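#N: Illustrative only -- a Python sketch of step 4 above as it might run on
#   the node.  Every helper named here (query_oda, calc_diff, remove,
#   install, resolve_deps, pull_files) is hypothetical and simply stands in
#   for the corresponding step; dependency resolution/installation would
#   really be delegated to a DepMan/PackMan-style layer, and the timestamp
#   location/format is an open question above.

      import time

      def initial_synch(node_name, query_oda, calc_diff, remove, install,
                        resolve_deps, pull_files,
                        stamp_path="/var/lib/oscar_nest/last_synch"):
          """Steps 4a-4e for one node; deletes are done before adds."""
          desired = query_oda(node_name)               # 4a. what ODA says we should have
          deletes, adds, updates = calc_diff(desired)  # 4b. compare w/ local config cache (lists)
          for name, old_ver, new_ver in updates:       #     an update == delete old + add new
              deletes.append("%s-%s" % (name, old_ver))
              adds.append("%s-%s" % (name, new_ver))
          for opkg in deletes:                         #     "deletes" before "adds"
              remove(opkg)
          native = resolve_deps(adds)                  # 4c. native pkg deps (DepMan-style layer)
          pull_files(native)                           # 4d. pull RPMs/scripts from the server
          for opkg in adds:
              install(opkg)
          with open(stamp_path, "w") as f:             # 4e. hypothetical timestamp location/format
              f.write("%d\n" % int(time.time()))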
 - Summary of steps to remove a package from a particular node, e.g.,
   "node3":
    1. Config the opkg (via Configurator), selecting which nodes have the
       config
       a. set the opkg to *not* install on "node3"
          #N: this relies upon the "grouping" mechanism provided by ODA,
          #   where more specific configs override more general settings.
       b. Save the configuration & group binding to ODA
          #N: "group binding" is abstracted from the user
    2. Run NEST "synch" to have the nodes determine the change (only
       "node3" diffs)
       #Q: The timestamp stuff is part of this; not sure about the details,
       #   partly for a performance improvement, not sure if a requirement.

#==============================================================
A1) IMPLEMENTATION OUTLINE

#N: For lack of a better format, I'm going to use short sentences/paragraphs
#   to outline my current understanding of the proposed implementation plan
#   of NEST from discussions with NCSA (Jeremy, Neil, Jason, Terry).
#   Also, this really doesn't cover all of the implementation detail but is
#   more of a usage summary to outline what operations are done and how
#   they're to be performed (high level, as opposed to API specs).

 - The nest tool depends on ODA and its ability to provide "node groups",
   with the initial system defaults being {all, oscar_server, oscar_clients}.

 - Additionally, singleton groups are set up for each node that is defined.

 - NEST sets up "operation queues" for each node.  These are later used by
   the nodes to get the next 'operation' (add pkg-foo, etc.) from their
   queue.
   #N: This needs to be refined.

 - NEST needs to have a few things set up on the node, either via the SIS
   image or some other mechanism at startup.  This copies the ODA password
   file (/etc/odapw) and the ODA db functions directory (/etc/odafunctions)
   to the node.  [possibly via SSH/SCP?]

 - After the nodes have a basic image installed, as is done currently using
   SIS, they will run the 'oscar_nest' service upon reboot/startup.

 - This service queries the database (ODA) for operations to perform.  This
   includes the initial creation of a node's "cache"... which would be empty
   on the first pass but populated after getting info from the central
   server.  These additions on the node are added to the "cache" file to
   speed subsequent checks/synchs on future node updates.

 - TIMESTAMPS...
   #N: There is some talk of timestamps to keep from flooding the server,
   #   but I'm not sure how to fit it in.  Leaving space for discussion
   #   here; one rough sketch follows.
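#N: Illustrative only -- one possible way the timestamp optimization could
#   work (Python sketch; the file location, the oda_config_timestamp value,
#   and the 'force' behavior are all assumptions, echoing JE's brute-force
#   switch suggestion in the notes below).

      import os

      STAMP = "/var/lib/oscar_nest/last_synch_stamp"   # hypothetical cache location

      def needs_synch(oda_config_timestamp, force=False):
          """Return True if a full synch should run on this node.

          oda_config_timestamp: the timestamp recorded in ODA when this
          node's config last changed (however ODA ends up storing it).
          force: bypass the optimization, i.e., brute-force synch."""
          if force or not os.path.exists(STAMP):
              return True
          with open(STAMP) as f:
              last_seen = f.read().strip()
          return last_seen != str(oda_config_timestamp)

      def record_synch(oda_config_timestamp):
          """Remember the config timestamp we just synchronized against."""
          os.makedirs(os.path.dirname(STAMP), exist_ok=True)
          with open(STAMP, "w") as f:
              f.write(str(oda_config_timestamp))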
 - The 'oscar_nest' service could continue to run as a daemon with synch
   intervals on the nodes, or could simply be start/stop'd via a C3 command
   from the headnode as needed.

 - The GUI/Maint tools (running on the server) will trigger nest 'synch'
   operations on the nodes when some change occurs to a package or if the
   package should be re-configured.
   #N: Only those nodes that have this package will perform any actions,
   #   with possible optimizations that could limit this to only the
   #   necessary nodes if flags are made available to pkg authors for such
   #   improvements.

 - #Q: Someone is in charge of adding operations to the "operation queues",
   #   and it appears to be NEST... but I'm not certain how that aspect of
   #   NEST figures all this out.  Those implementation details are not
   #   given/listed.

 - Basically, when changes are made to the cluster (softw update, delete,
   etc.), the information is written to ODA (database) and NEST reads from
   ODA to determine what operations are to take place, given its small
   command set of:  delete, add, update, synch, config

 - Command order of operation seems important, and the general feeling is
   that it should have 'delete' *before* 'add' so that you don't have
   problems with dependency analysis.

 - All package dependency analysis is done by the OSCAR::PackMan abstraction
   layer (i.e., something other than NEST does the analysis).

 - Some care must be taken when doing node 'synch' operations (which simply
   determine what software, i.e., OSCAR Pkgs, is to be dealt with).  This
   includes limiting the number of concurrent operations & instances of NEST
   on a node.  It seems that for now this is limited to 1 operation/instance
   per node at a given time.  Because of this, some state must be kept.  The
   discussions have had NEST maintaining the state, but I (Thomas) have
   encouraged and written things to have some other entity maintain cluster
   state.  This is in part to reduce the complexity of NEST and in part to
   make the facility (state mgmt) more general... adding 'online/offline'
   for things like PBS, etc.

#==============================================================
A2) IMPLEMENTATION OUTLINE (Version 2)

#N: These are notes based on a 6/15/04 teleconf w/ Terry, Jason, & Neil
#   outlining steps and other info related to NEST.

 1. OPDer

 2. configure how & setup cluster
    - define node (node group dynamically created, w/o user interaction)
      (transparently defines a node group)
    - can create a new node group which can contain any existing node group
      (recursive)

 3. assign pkgs to node group
    - pkg foo -> configurator, set fields/info
       o pkg author
       o installed? bit
       o list of available node groups (multi-select box)
         * Possible conflicting configurations will be flashed to the user,
           stating that there are conflicts to be resolved by the user
           manually.
    - "Assign" -> does the checking for conflicts
      * Conflict Analysis (need to do!)
         o when adding a new node to a node group
         o when adding a pkg to a configuration set which is associated with
           another, conflicting configuration set
         o whenever you do anything that might add config data into the DB
      * Question about whether you should ever add conflicting/wrong data to
        the database?

 4. Build image w/ or w/o OSCAR pkgs (NEST used if not adding to the image)

 5. run NEST on node
    - nodes synch to the server; this is powerful b/c the needed config data
      is in ODA; config_sets -> changes propagate via NEST
    - active/passive? NEST -> to have nodes check

    Timestamps (optimization)
    -------------------------
      edit config -> ok, write to DB w/ new timestamp
      nodes keep records of the timestamp(s) from the node's last synch;
      this lets them just check the timestamp on a synch to see if anything
      has changed and act accordingly

    Operation Queues (optimization)
    -------------------------------
      to handle load from node synchs to the server
      * queue node requests
      * possibly use a request counter to throttle node access,
        e.g., ftp "login limit" or dhcp client limits
      * per-node queues
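#N: Illustrative only -- a small Python sketch of the per-node queue +
#   throttle idea described here and in the quoted text back in section 1:
#   one outstanding synch operation per node, and a cap on how many nodes
#   may be synchronizing at once.  The class name and limit are made up.

      from collections import deque

      class SynchQueues:
          def __init__(self, max_concurrent=8):      # hypothetical throttle limit
              self.max_concurrent = max_concurrent
              self.pending = {}                      # node_name -> deque of operations
              self.active  = set()                   # nodes currently synchronizing

          def enqueue(self, node_name, operation):
              """Queue an operation (e.g. 'synch', 'add opkg-foo') for a node."""
              self.pending.setdefault(node_name, deque()).append(operation)

          def next_batch(self):
              """Hand out at most one operation per idle node, respecting the
              cluster-wide concurrency limit; the rest stay queued."""
              batch = []
              for node, ops in self.pending.items():
                  if len(self.active) >= self.max_concurrent:
                      break                          # leave the rest in the queue
                  if node in self.active or not ops:
                      continue                       # one operation per node at a time
                  batch.append((node, ops.popleft()))
                  self.active.add(node)
              return batch

          def done(self, node_name):
              """Mark a node's current operation as finished."""
              self.active.discard(node_name)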
#==============================================================
A3) NOTES

 - ODA (database) records softw pkgs per node
   #Q1: ARE THE SOFTW PKGS OSCAR PKGS, OR ALL OF THE RPM LISTS PER NODE?

   #NOTE-JE1
   #JE: I believe this is to fall out as an rpm list per node.  I don't
   #    think users will be happy if OSCAR is uninstalling their manually
   #    installed rpms as a matter of synchronization though... my feeling
   #    is that the "removal" part of synchronization should probably only
   #    take place on rpms which are part of OSCAR packages.  Any extras
   #    could stay.  That *could* be asking for some problems w/ conflicts,
   #    nodes out of sync, etc.  We'd have to wait and see how it played out
   #    in userland... to see what they like.  The alternative for the users
   #    is to dynamically modify a node(s)'s rpmlist in ODA and re-run a
   #    sync.  That may seem a bit too complicated to simply install an rpm.
   #    But then, they might like it too.  Kind of a hands-off management
   #    model, where users only tweak the definitions of nodes (in ODA), and
   #    then everything falls into place automatically on the next sync.  I
   #    think user acceptance of that model would completely depend on the
   #    interface offered.  It would have to be EASY to view/modify node rpm
   #    list info.

 - compare softw pkg(s) list from the query w/ the current softw pkg(s) on
   the node
   #JE: This action could potentially be optimized out by depending on the
   #    configuration timestamp.  This would open up the door for a
   #    potential problem though... if a user removed an rpm somewhere, and
   #    we were depending on timestamps to know if action was needed.
   #    However, the user can always screw stuff up that way.  I think it
   #    would be a good idea to be able to call a NEST synchronization w/ a
   #    switch or something... that ran a brute-force synchronization as if
   #    no timestamps matched.  That way, the optimization could be bypassed
   #    if the user had done something manually that was causing problems.

 - actions performed on node:
   #Q2: ARE THESE OSCAR PKGS OR JUST SIMPLE RPMS?  HOW DO YOU "CONFIGURE"
   #    THEM?  SHOULD THERE BE ANY DISTINCTION?
   #JE: These are definitely OSCAR packages.  Of course, rpms may underlie
   #    them, but this step has nothing to do w/ rpms unless the package is
   #    being added or removed.  The idea is that when a configuration falls
   #    out of the database, its ODA timestamp (i.e. birth date) is compared
   #    against a local cache.  If the local cache for that config is
   #    non-existent or non-matching, then the package's config script is
   #    re-run.  Basically, this is the API script set to replace the
   #    post_install scripts.  It's re-runnable, and does all the
   #    configuration needed after the rpms are installed, using
   #    configuration parameters saved to ODA if needed.

 - the state of a node (current operation/action) is centrally maintained
   #Q11: IS THIS STATE MAINTAINED BY NEST ALSO???  SEEMS LIKE A LOT FOR ONE
   #     tool; possibly have another tool that maintains these states.
   #JE: It makes sense to me that the "state of NEST" should be written to
   #    ODA by NEST itself.  Who would better know NEST's state than NEST
   #    itself? ;-)
   #TN: seems that someone should be in charge of state in general, and
   #    rolling that into NEST means you must use nest to do node softw
   #    mgmt and to just maintain general node state (online/offline,
   #    etc...)  {extend NSM to have arbitrary states that are ignored by
   #    others}
   #Q12: WHAT ARE THE VALID STATES?  HOW DO YOU SIGNAL A STATE CHANGE?
   #JE: To start off with: normal, synching, shut_down... but more detail
   #    could be available too.  For example, a state could show which
   #    package it's synching at the moment (synching_torque).  This kind of
   #    detail could potentially help in failure analysis if something is
   #    getting stuck or taking too long.  How to signal it?  I think this
   #    just involves a write to a state table in ODA, right?  Any kind of
   #    GUI tool written at a later date could be monitoring the table and
   #    displaying the last recorded state.
   #TN: This seems to be more detail than is necessary and is more of a log
   #    (synch'd torque, synch'd pvm, ...), and I agree writing that to a
   #    central place that could be monitored/watched by another component
   #    seems to make sense, even if that's a simple CLI "tail -f" sort of
   #    thing or a fancy GUI color flashy thing.  Regardless, the data is
   #    central and seems like different (more) data than simple state.
   #    State should be fairly basic to assist with demarcation & state
   #    transitions.
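#N: Illustrative only -- a tiny Python sketch of the "write node state to a
#   central table" idea discussed above.  It uses sqlite3 purely as a
#   stand-in for whatever ODA would provide; the node_state table, its
#   columns, and the state names are assumptions, not an ODA schema.

      import sqlite3, time

      db = sqlite3.connect(":memory:")   # stand-in for the central ODA database
      db.execute("CREATE TABLE IF NOT EXISTS node_state "
                 "(node TEXT PRIMARY KEY, state TEXT, updated INTEGER)")

      def set_node_state(node, state):
          """Record a node's current NEST state (e.g. 'normal', 'synching')."""
          db.execute("INSERT OR REPLACE INTO node_state VALUES (?, ?, ?)",
                     (node, state, int(time.time())))
          db.commit()

      set_node_state("node3", "synching")
      print(db.execute("SELECT * FROM node_state").fetchall())
      # -> [('node3', 'synching', <epoch seconds>)]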
 - defined valid node states, e.g., building, booting, synching, etc.
   #JE: Not sure how we'd have "building" as a reported node state.  NEST
   #    wouldn't have ever run on the node yet.
   #TN: All the more reason to have a third party maintaining state, so that
   #    nest could potentially be a consumer of this simpler service.  It
   #    could be signaled by things being sent into the "build" or "synch"
   #    phase.