Scalable Systems Software Project Notebook - page 85 of 88


Date and Author(s)

Meeting Notes August 17 in Oak Ridge

This is a place holder till I find my notes

Hi Folks,

Finally! Here are my notes from our Scalable Systems meeting at Oak Ridge. Some highlights:


Al Geist ORNL
Paul Hargrove LBNL
William McLendon SNL
Ron Oldfield SNL
Neil Pundit SNL
Craig Steffen NCSA
Narayan Desai ANL
Stephen Scott ORNL
Thomas Naughton ORNL
David Jackson Ames Lab
Brett Bode Ames Lab
Rick Bradshaw ANL
Rusty Lusk ANL
Wesley Bland ORNL
John Mugler ORNL
Todd Kordenbrock SNL/HP
Gary Skouson PNNL
Paul Egli NCSA

Final Agenda and ppt slides

Tuesday May 10                          
 8:00  Continental Breakfast  CSB room B226                                                      
 8:30  Al Geist - Project Status 
 9:00  Craig Steffen - Race Conditions in Suite  
       and Warehouse update 
 1:30  Paul Hargrove - Process Management and Monitoring update                                                                                                                      
10:30  Break                                                                                               
11:00  Todd Kordenbrock - Robustness and Scalability Testing                                                
12:00  Lunch (on own at cafeteria)
 1:30  Brett Bode - Resource Management update                                                     
 2:30  Narayan Desai - Node Build, Configure, Cobalt update (PDF not PPT)           
 3:30  Break           
 4:00  Craig Steffen - SSSRMAP in ssslib
 5:00  Discussion of getting SSS users and feedback
       Thomas Naughton - SSS-OSCAR update
 5:30  Adjourn for dinner            

Meeting Notes

Al Geist – see slides.
  Started out with a presentation of the "Faster than Light Computer" (magic trick) while waiting for everyone to arrive.

Craig Steffen – Exciting new race condition!
  Nodes go offline – Warehouse doesn't know quickly enough
  Event manager, scheduler, lots of components affected
  Problem grows linearly with system size.
  Order of operations needs to be considered – something we haven't considered before.
  Issue can be reduced, can't be solved
  Good discussion on ways to reduce race conditions.
  SSS use at NCSA
    Paul Egli rewrote Warehouse – many new features added, Sandia user
      Now monitoring sessions
      All configuration is dynamic
      Multiple debugging channels
      Sandia user – tested to 1024 virtual nodes
    Web site – New hire full time on SSS
    Lining up T2 scheduling (500 proc)

Paul Hargrove – Checkpoint Manager
  BLCR status
    AMD64/EM64T port now in beta (crashes some users' machines)
    Recently discovered kernel panic during signal interaction (must fix at hackerfest)
    Next step: process groups/sessions – begin next week
    LRS-XML and Events "real soon now"
    Open MPI checkpoint/restart support by SC2005
    Torque integration done at U. Mich. for PhD thesis (needs hardening)
  Process manager – MPD rewrite ("refactoring")
    Getting a PM stable and working on BG.

Todd K – Scalability and Robustness tests
  ESP2 efficiency ratio 0.9173 on 64 nodes
  Scalability –
    Bamboo: 1000 job submission
    Gold (Java version) – reservation slow – Perl version not tested
    Warehouse – up to 1024 nodes
    Maui on 64 nodes (need more testing)
  Durability –
    Node warm stop – 30 seconds to Maui notification
    Node warm start – 10 seconds
    Node cold stop – 30 seconds
    Single node failure – good
    Resource hog (stress)
    Resource exhaustion – service node (Gold fails in logging package)
  Anomalies
    Maui
    Warehouse
    Gold
    happynsm
  ToDo
    Test BLCR module
    Retest on larger cluster
    Get latest release of all software and retest
    Write report on results.

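The race condition from Craig Steffen's notes above (nodes go offline, but the monitoring data that other components read has not caught up yet) can be illustrated with a small sketch. This is not SSS code: the `Node`, `Monitor`, `dispatch_naive`, and `dispatch_checked` names are hypothetical, and the "probe right before dispatch" mitigation is only one assumed way of narrowing the window, in line with the note that the issue can be reduced but not solved.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical compute node; `up` is its true, current state."""
    name: str
    up: bool = True


class Monitor:
    """Hypothetical stand-in for a monitoring service that polls nodes.

    Readers of `cached_up` see the state as of the last poll, which may
    be stale if a node has failed since then.
    """

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.snapshot = {}
        self.poll()

    def poll(self):
        # Refresh the snapshot from the nodes' true state.
        self.snapshot = {name: n.up for name, n in self.nodes.items()}

    def cached_up(self, name):
        # Possibly stale answer from the last poll.
        return self.snapshot[name]

    def probe(self, name):
        # Direct check of the node's current state (assumed to be cheap here).
        return self.nodes[name].up


def dispatch_naive(monitor, wanted):
    """Pick nodes using only the cached snapshot (exposed to the race)."""
    return [n for n in wanted if monitor.cached_up(n)]


def dispatch_checked(monitor, wanted):
    """Re-check each node just before use.

    This narrows the window but does not close it: a node can still
    fail between the probe and the actual job launch.
    """
    return [n for n in wanted if monitor.cached_up(n) and monitor.probe(n)]


if __name__ == "__main__":
    nodes = [Node("n0"), Node("n1")]
    monitor = Monitor(nodes)      # snapshot taken: both nodes up
    nodes[1].up = False           # n1 goes offline before the next poll
    print(dispatch_naive(monitor, ["n0", "n1"]))
    print(dispatch_checked(monitor, ["n0", "n1"]))
```

In this toy model the naive dispatcher still selects the failed node until the next poll, while the checked dispatcher skips it. With more nodes there are more chances for a failure to fall inside a polling interval.
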
Brett Bode – RM status
  New release of components
    Bamboo v1.1
    Maui 3.2.6p13
    Gold 2b2.10.2
  Gold being used on Utah cluster
  SSS suite on several systems at Ames
  New fountain component – to front end Supermon, ganglia, etc.
    Demos new tool called Goanna for looking at fountain output
    Has same interface as Warehouse – could plug right in
  General release of Gold available.
    New Perl CGI GUI
    No Java dependency at all in Gold
  X509 support in Mcom (for Maui and Silver)
  Cluster scheduler – bunch of new features
  Grid scheduler – enabled basic accounting for grid jobs.
  Future work –
    Gary needs to get up to speed on Gold code
    Make it all work with LRS

Narayan – LRS Conversion status
  All components in center cloud converted to LRS
    Service directory, Event manager, BCM stack, Process Manager
    Targeted for SC05 release
  SSSlib changeover – completed
  SDK support – completed
  Cobalt Overview
    SSS suite on Chiba and BG
    Motivations – scalability, flexibility, simplicity, support for research ideas
    Tools included: parallel programming tools
    Porting has been easy – now running on Linux, MacOS, and BG/L
    Only about 5K lines of code.
    Targeted for Cray XT3, X1, ZeptoOS
    Unique features – small partition support on BG/L, OS Spec support
    Agile – swap out components. User and admin requests easier to satisfy
    Running at ANL and NCAR (evaluation at other BG sites)
    May be running on JAZZ soon.
    Future – better scheduler, new platforms, more front ends, better docs