|Date and Author(s)|
2003SEP13

Al bugged the group at the September meeting to do more uploads to the notebooks. Having been taught both the value and the skill of notebooking early in my education, I feel bad about being one of those who have fallen down on the job. So here's my attempt to get back into it.

Just as a reminder, I'm developing a set of tools, which I call "warehouse", for passing information through a tree of processes. The idea is that each entity passing information has separate parts to take in information, store it, and pass it to other entities. The entity does not need to know what the information means or how it is formatted to handle it efficiently. The warehouse software can be used in many applications, but initially it is being written as a reference implementation of the SSS System Monitor component. The System Monitor is responsible for collecting and caching information about the state of the nodes in a cluster, and for giving that information to the other SSS components that need it.

Developing the warehouse infrastructure involved producing classes that handle information, and requests for that information, in a threaded setup, with data locks to eliminate race conditions that would corrupt the data under high load. The next step was to create the "source" and "sink" functions so that when information is passed to one warehouse process, another warehouse process on another machine can connect to it and get a copy of that information. The basic capabilities of the warehouse software were successfully tested on Sunday, September 7.

We intend to produce a beta software release of the SSS suite at SC 2003. For the warehouse System Monitor to be a part of that release, it has to have real monitoring data fed into the bottom of the warehouse tree, and the top of the tree must serve out that data in the role of SSS System Monitor.

Here is my checklist of things that need to be done to the warehouse software.
Some of the things on the list do not need to be finished before SC, but are the next steps of development.

a) Feed real monitoring data into the warehouse by contacting the existing node monitor daemon (nmd). This is an interim solution, because in its current configuration nmd depends on the pcp monitoring library, and it is itself unmaintained.
b) Create an XML parser to interpret and respond to requests for node information from the Scheduler component. Verify functionality with the Scheduler (with Scott and Dave).
c) Create code to register with the Service Directory component so that other components can find us easily, and so that we can send and subscribe to events.
d) Add some level of documentation to the package.
e) Package up warehouse as an rpm. This will need two install images: one for the nodes of the cluster, and one for the machine where the information is collected.
f) After spending a week attempting to get my rpm file to integrate into sss-oscar, spend a long evening at home, drinking Margaritas, and trying to remember why I got a job in computer science when my training is in physics.
g) Get warehouse-SM.rpm successfully integrated into sss-oscar.
--- this is what needs to be finished by SC ---
h) Eliminate nmd; access the monitoring libraries directly.
i) [with Scott and Dave Jackson] Update the Scheduler and warehouse to use the ssslib wire protocols for communications.
j) [with ANL folks] Develop an XML schema for updating the System Monitor via the Build and Configuration Manager.
and so on...

The current status:
a) is finished.
b) is 80% done, and I have the information to finish it.
c) should be simple and straightforward.

Plan to get b) and c) done this coming week. We need to be up through g) before SC. Finishing h) and i) would be nice before SC as well. h) is possible in that time frame; it will depend on our progress at the feature freeze date. Item h) eliminates dead code, makes things much easier to build and maintain, and reduces the requirements.
Item i) does not add significant functionality, and so probably won't be pursued until after SC.

2003 October 13

Update: item "b" in the list above is finished. Before I left on vacation on Wednesday of last week, Dave Jackson and I were able to get the Scheduler to talk to the warehouse System Monitor and build a picture of the system using just that information. There are certainly tweaks to be made, but we have established the base functionality.

The code freeze is coming right up, and I need to get item "c" finished before then. I will be working on it in the next couple of days so that it doesn't lag behind. After that, and after we get a few more things ironed out in the System Monitor/Scheduler communication, it will be time to start working on packaging.

Actually, as it turns out, item "h" is finished as well. I couldn't build nmd on the compute nodes on chiba, so I threw together some code that extracts the node name from a "hostname" call and harvests system numbers out of /proc. This got us going, and will suffice for SC, but it makes the code non-portable for the moment, since /proc is specific to Linux.
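
For the record, here is a minimal sketch of the kind of /proc harvesting described above. It is written in Python for brevity and is not the warehouse code itself. The function names, the choice of /proc/loadavg and /proc/meminfo, and the dictionary keys are all assumptions made for illustration; this entry does not say which files or values the real code reads.

```python
import socket
from pathlib import Path


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into 1-, 5-, and 15-minute load averages."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text):
    """Parse /proc/meminfo lines such as 'MemTotal:  1030828 kB' into a dict of numbers."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header or malformed lines whose first field is not a plain number.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def collect_node_state(proc_root="/proc"):
    """Gather a small snapshot of this node. Linux-only, because it reads /proc."""
    root = Path(proc_root)
    # Hypothetical record layout: hostname plus a few numbers from /proc.
    state = {"hostname": socket.gethostname()}
    state.update(parse_loadavg((root / "loadavg").read_text()))
    meminfo = parse_meminfo((root / "meminfo").read_text())
    state["mem_total_kb"] = meminfo.get("MemTotal")
    state["mem_free_kb"] = meminfo.get("MemFree")
    return state


if __name__ == "__main__":
    print(collect_node_state())
```

Keeping the parsing separate from the file reads means the parsers can be checked against saved sample text on any machine, which helps a little with the portability problem: only collect_node_state depends on /proc being present.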