Resource Management and Accounting Notebook - page 105 of 150

Date and Author(s)

Warehouse (System Monitor) status/progress

2003SEP13     
     
Al bugged the group at the September meeting to do more uploads to the     
notebooks.  Having been taught both the value and the skill of notebooking     
early in my education, I feel bad about being one of those that has fallen     
down on the job.  So here's my attempt to get back into it.       
     
Just as a reminder, I'm developing a set of tools for passing information
through a tree of processes that I call "warehouse".  The idea is that each
entity passing information has separate parts to take in information, store
that information, and then pass it to other entities.  The entity does not
need to know what the information means or how it is formatted to be able
to handle it efficiently.
     
The warehouse software will be able to be used in many applications, but     
initially it is being written as a reference implementation of the SSS     
System Monitor component.  The System Monitor is responsible for collecting     
and caching information about the state of the nodes in a cluster, and     
giving that information to other SSS components that need that information.  
      
     
Developing the warehouse infrastructure involved producing classes that     
would handle information and requests for that information in a threaded     
setup, with data locks to eliminate race conditions that would corrupt the     
data under high load.  The next step was to create the "source" and "sink"     
functions so that when information was passed to one warehouse process,     
another warehouse process on another machine could connect to it and get a     
copy of that information.  The basic capabilities of the warehouse software     
were successfully tested on Sunday, September 7.     
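
To make the take-in/store/pass-on idea concrete, here is a minimal sketch in
Python.  It is not the actual warehouse code: the class name, the method
names, and the sample payload are all made up for illustration, and the real
"source" and "sink" functions talk over the network rather than calling each
other directly.  It only shows one lock guarding the store, payloads kept as
opaque bytes, and one warehouse handing a copy of its contents to another.

```python
import threading


class Warehouse:
    """Hypothetical stand-in for one warehouse entity (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}  # key -> opaque payload; never parsed here

    def take_in(self, key, payload):
        # Store the payload as-is, under the lock, so that concurrent
        # writers cannot corrupt the store.
        with self._lock:
            self._items[key] = payload

    def snapshot(self):
        # Copy under the lock so a reader never sees a half-updated store.
        with self._lock:
            return dict(self._items)

    def pass_to(self, other):
        # "Source" side: hand a copy of everything we hold to another
        # warehouse, which acts as the "sink".
        for key, payload in self.snapshot().items():
            other.take_in(key, payload)


def demo():
    leaf, root = Warehouse(), Warehouse()
    # Eight threads feed the leaf warehouse at the same time.
    writers = [
        threading.Thread(target=leaf.take_in, args=(f"node{i}", b"load=0.42"))
        for i in range(8)
    ]
    for t in writers:
        t.start()
    for t in writers:
        t.join()
    leaf.pass_to(root)
    return root.snapshot()
```

Calling demo() returns the root warehouse's copy of the eight payloads.  Note
that neither warehouse ever looks inside a payload, which is the point of the
design.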
     
We intend to produce a beta software release of the SSS suite at SC 2003.      
For the warehouse System Monitor to be a part of that release, it has to     
have real monitoring data fed into the bottom of the warehouse tree, and     
the top of the tree must serve out that data in the role of SSS System
Monitor.       
     
Here is my checklist of things that need to be done to the warehouse     
software.  Some of the things on the list do not need to be finished before
SC, but are the next steps of development.    
     
a) Feed real monitoring data into the warehouse by contacting the existing
node monitor daemon (nmd).  This is an interim solution, because in its
current configuration, nmd depends on the pcp monitoring library, and nmd
itself is unmaintained.
     
b) Create an XML parser to interpret and respond to requests for node
information from the Scheduler component.  Verify functionality with the
Scheduler (with Scott and Dave).
     
c) Create code to register with the Service Directory component so that    
other components can find us easily, and so that we can send and subscribe    
to events.     
     
d) Add some level of documentation to the package.    
    
e) Package up warehouse as an rpm.  This will need to have two install
images, one for the nodes on the cluster, and one for where the information
is collected.
    
f) After spending a week attempting to get my rpm file to integrate into    
sss-oscar, spend a long evening at home, drinking Margaritas, and trying to    
remember why I got a job in computer science when my training is in physics.    
    
g) get warehouse-SM.rpm successfully integrated into sss-oscar.    
    
--- this is what needs to be finished by SC ---    
    
h) eliminate nmd; access monitoring libraries directly    
    
i) [with Scott and Dave Jackson] update the Scheduler and warehouse to    
utilize ssslib wire protocols for communications    
    
j) [with ANL folks] develop XML schema for updating the System Monitor via
the Build and Configuration Manager
    
and so on...    
    
The current status:    
a) is finished    
b) is 80% done, and I have information to finish it out.  c) should be    
simple and straightforward.  Plan to get b) and c) done this coming week.    
    
We need to be up through g) before SC.  Finishing h) and i) would be nice
before SC as well.  h) is possible on that time frame, but it will depend on
our progress at the feature freeze date.  Item h) eliminates dead code,
makes things much easier to build and maintain, and lessens requirements.
Item i) does not add significant functionality, and so probably won't be
pursued until after SC.
  
2003 October 13 
 
Update: item "b" in the list above is finished.  Before I left on vacation
last Wednesday, Dave Jackson and I were able to get the Scheduler to
talk to the warehouse System Monitor and get a picture of the system using
just that information.  There are certainly tweaks to be made, but we have
established the base functionality.
 
The code freeze is coming right up, and I need to get item "c" finished 
before that time.  I will be working on that in the next couple of days so
that it doesn't lag behind.  After that, and after we get a few more things
ironed out with the System Monitor/Scheduler communication, it will be time 
to start working on packaging. 
 
Actually, as it turns out, item "h" is finished as well.  I couldn't build 
the nmd on the compute nodes on chiba, so I threw together some code that 
would extract the information from a "hostname" call, and harvest system 
numbers out of /proc.  This got us going, and will suffice for SC, but it 
makes the code non-portable for the moment.
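
For the record, the approach looks roughly like the sketch below.  It is in
Python for brevity and is not the code that runs on chiba.  The function
names and the choice of /proc/loadavg and /proc/meminfo are assumptions for
illustration, since the note above does not say which files are read, and
socket.gethostname() stands in for the "hostname" call.  The parsers assume
the usual Linux text layout of those two files.

```python
import socket


def parse_loadavg(text):
    # /proc/loadavg begins with the 1-, 5-, and 15-minute load averages.
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text):
    # Lines look like "MemTotal:   16384 kB"; keep the first number after
    # the colon and skip any line that does not have one.
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def collect():
    # Linux only: reading /proc is exactly the non-portable part.
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    return {
        "host": socket.gethostname(),
        **load,
        "mem_total_kb": mem.get("MemTotal"),
        "mem_free_kb": mem.get("MemFree"),
    }
```

Keeping the parsing separate from the file reads means the parsers can be
checked against sample text on any machine, and only collect() has to change
when this is ported off of Linux.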