Scalable Systems Software Project Notebook - page 80 of 88


Date and Author(s)

Meeting Notes for May 10-11 Meeting at Argonne

Hi Folks,

Here are my notes from our Scalable Systems meeting at Argonne. At this meeting we had four big results:


Attendees

Al Geist ORNL
Paul Hargrove LBNL
William McLendon SNL
Ron Oldfield SNL
Neil Pundit SNL
Craig Steffen NCSA
Narayan Desai ANL
Stephen Scott ORNL
Thomas Naughton ORNL
Scott Jackson PNNL
Brett Bode Ames Lab
Rick Bradshaw ANL
Rusty Lusk ANL

Final Agenda and ppt slides

Tuesday May 10
 8:00  Continental Breakfast
 8:30  Al Geist - Project Status
 9:00  Discussion of future ISIC directions presented to Strayer
 9:30  Scott Jackson - Resource Management update
10:30  Break
11:00  Will McLendon - Validation and Testing update
       Ron Oldfield - Integrated SSS test suites
12:00  Lunch (on own at Cafeteria)
 1:30  Paul Hargrove - Process Management and Monitoring update
 2:30  Narayan Desai - Node Build, Configure update (PDF not PPT)
 3:30  Break
 4:00  Craig Steffen - SSSRMAP in ssslib (no slides)
 5:00  Discussion of getting SSS users and feedback
 5:30  Adjourn for dinner

Wednesday May 11
 8:00  Continental Breakfast
 8:30  Thomas Naughton - Next SSS OSCAR software releases and schedule
 9:30  Brett Bode - Discussions and vote on Queue Manager API
10:30  Group discussion of plans for SciDAC-2 (see next page and Geist slides)
11:00  Discussion - FastOS meeting in early June
       SciDAC PI Meeting in late June (Stephen Scott to prepare a poster)
       Set next meeting date: August 17-19, 2005 at ORNL
12:00  Meeting Ends

Meeting Notes

Al Geist – presents project overview and goals for this meeting 
then ideas for CS ISICs in next round of SciDAC (see slides) 
Scott Jackson – production use at more places, e.g., U. Utah Icebox (430 proc)
Incorporation of SSSRMAP into ssslib in progress 
Paper accepted and new documents (see RM notebook)  
SOAP considered as basis for SSSRMAP v4 
Discussion of pros and cons (scalability issues, but ssslib can support) 
Fault tolerance in Gold being developed using hot failover 
New Gold release v2 b2.10.2 includes distributed accounting  
Todo: Simplify allocation management 
      Enabled support for MySQL database
Bamboo QM v1.1 released 
New Fountain component as an alternative to Warehouse; work in progress on
support for SuperMon, Ganglia, and NWPerf
Maui – improved grid scheduler 
   multisite authentication. Support for Globus 4 
Future Work -  increase deployment base, ssslib integration, portability 
   support for LoadLeveler-like multi-step jobs, and PBS job language
   release of Silver meta-scheduler 
Will McLendon – APITest project status: current release v1.0
Latest work – new look using cascading style sheets 
   new capabilities – pass/fail batch files, better parse error reporting  
   User Guide Documentation done (50 pages) and SNL approved 
SW requirements: Python 2.3+, ElementTree, MySQL, ssslib, 
  Twisted (version 2.0 added new dependencies) 
Helping fix bad tests – led to good discussion of this utility 
Future work: config file, test developer GUI, more… 
Ron Oldfield – Testing SSS suites 
2 wks ago hired full time contractor (Tod Cordenbach) plus summer student 
Goals and deliverables for summer work  
  performance testing of SSS-OSCAR 
  comparison to other components 
  write tech report of results 
What is important for each component: scheduler, job launch, queue, I/O,… 
Discussion of metrics. Scalability?  User time, Admin time,  
  HW resource efficiency  
Report what works, what doesn’t, what is performance critical 
Paul Hargrove – PM update 
Checkpoint (BLCR) status: users on four continents, bug fixes,  
Works with Linux 2.6.11, partial AMD64/EM64T port
Next step is process groups/sessions
Open MPI work this summer (student of Lumsdaine)
Have sketch of less restrictive syntax API 
Process manager status: complete rewrite of MPD, more OO and Pythonic
  provided a non-MPD implementation for BG/L using SSS API 
Narayan Desai – BCM update
SSS infrastructure in use at ANL: clusters, BG/L, IA32, PPC64 
Better documentation now in place 
LRS Syntax: spec done, SDK complete, todo ssslib integration 
BG/L: arrived in January, initial Cobalt (SSS) suite in February
  many features being requested, e.g., node modes set in mpirun
  DB2 used for everything 
Cobalt – same as SW on Chiba City.  
  All python components implemented using SSS-SDK 
  several major extensions required for BG/L 
Narayan Desai - Cobalt update for BG/L
Scheduler (bgsched): new implementation 
  needed to be topology aware, use DB2 
  partition unit is 512 nodes. 
Queue Manager (cqm): same SW as Chiba 
  OS change on BG/L is trivial since system rebooted for each job 
Process Manager (bgpm): new implementation 
  compute nodes don’t run full OS so no MPD
  mpirun complicated 
Allocation Manager (am): same as Chiba
  very simple design 
Experiences: SSS really works 
Easy to port, simple approach makes system easy to understand 
Agility required for BG/L 
Comprehensive interfaces expose all information 
  Admins can access internal state 
  component behavior less mysterious 
  extracting new info is easy 
Shipping Cobalt to a couple other sites 
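
As a rough illustration of the 512-node partition unit noted above for bgsched, the sketch below rounds a job's node request up to whole partition units. The function names and the round-up policy are assumptions made for illustration; the notes do not describe how bgsched actually sizes allocations, and topology awareness is not modeled here.

```python
PARTITION_UNIT = 512  # nodes per partition unit, as stated in the notes


def partitions_needed(requested_nodes: int, unit: int = PARTITION_UNIT) -> int:
    """Return how many whole partition units cover a node request.

    Hypothetical helper: rounds up, on the assumption that the partition
    unit is the smallest block the scheduler can hand out.
    """
    if requested_nodes <= 0:
        raise ValueError("requested_nodes must be positive")
    return -(-requested_nodes // unit)  # ceiling division


def allocated_nodes(requested_nodes: int, unit: int = PARTITION_UNIT) -> int:
    """Nodes actually reserved once the request is rounded up to whole units."""
    return partitions_needed(requested_nodes, unit) * unit
```

Under these assumptions, a 600-node request would occupy two partition units (1024 nodes), which shows why small jobs can be costly when the partition unit is large.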
Craig Steffen – (no slides)
Not as much to report. Sidetracked for past three months on other projects
Gives reasons why Warehouse bug fixes are also not done.
Fixes to be done by next OSCAR release 
Graphical display for Warehouse created 
Same interfaces as Maui wrt requesting everything from all nodes 
SSSRMAP into ssslib 
Initial skeleton code for integration into ssslib begun. 
Needs questions answered from Jackson and Narayan to proceed 
Thomas Naughton – SSS OSCAR releases 
Testing for v1.1 release 
Base OSCAR v4.1 includes SSS APItest, which runs post-install tests on packages
Discussion that Debian support will require both RPM and DEB formats 
Future work: complete v1.1 testing, migrate distribution to FRE repository 
  extend SSS component tests, distribute as basic OSCAR “package set” 
  needed ordering within a phase (work around for now) 
Release schedule:  
Version   Freeze    Release      New
v1.0                Nov (SC04)   first full suite release
v1.1      Feb 15    May          Gold update, bug fixes
v1.2      Jun 15    July         RH9 to Fedora2 oscar 4.1, BLCR to linux 2.6,
                                 improved tests, close known bug reports
v2.0b     Aug 15    Sept         less restrictive syntax switch over, perf tests
                                 Silver meta-scheduler, Fedora4
v2.0      Oct 15    Nov (SC05)   bug fixes, minor updates
                                 In oscar 5.0 as package set (after SC05)
Brett Bode – Queue Manager API
Lists all the functions then goes through detailed scheme of each
Bamboo uses SSSRMAP messaging and wire protocol
Authentication – uses ssslib 
Authorization – uses info in SSSRMAP wire protocol 
Questions and discussion of interfaces 
Group has no objections, just suggestions for improvement.