|Date and Author(s)|
Here are my notes from our Scalable Systems meeting at Argonne. At this meeting we had four big results:
- Planning for SciDAC-2 and discussion of new directions
- Adjusting the schedule and content of the next three SSS-OSCAR releases
- Discussions on how to get more external users and feedback this year
- Presentation and acceptance of the Queue Manager API
|Brett Bode||Ames Lab|firstname.lastname@example.org|
Tuesday May 10
- 8:00 Continental Breakfast
- 8:30 Al Geist - Project Status
- 9:00 Discussion of future ISIC directions presented to Strayer
- 9:30 Scott Jackson - Resource Management update
- 10:30 Break
- 11:00 Will McLendon - Validation and Testing update; Ron Oldfield - Integrated SSS test suites
- 12:00 Lunch (on own at Cafeteria)
- 1:30 Paul Hargrove - Process Management and Monitoring update
- 2:30 Narayan Desai - Node Build, Configure update (PDF not PPT)
- 3:30 Break
- 4:00 Craig Stefan - SSSRMAP in ssslib (no slides)
- 5:00 Discussion of getting SSS users and feedback
- 5:30 Adjourn for dinner
Wednesday May 11
- 8:00 Continental Breakfast
- 8:30 Thomas Naughton - Next SSS OSCAR software releases and schedule
- 9:30 Brett Bode - Discussions and vote on Queue Manager API
- 10:30 Group discussion of plans for SciDAC-2 (see next page and Geist slides)
- 11:00 Discussion
  - FastOS meeting in early June
  - SciDAC PI Meeting in late June (Stephen Scott to prepare a poster)
  - Set next meeting date: August 17-19, 2005 at ORNL
- 12:00 Meeting Ends
Al Geist – presents project overview and goals for this meeting, then ideas for CS ISICs in next round of SciDAC (see slides)

Scott Jackson – Resource Management update
- Production use at more places, e.g. U. Utah Icebox (430 proc)
- Incorporation of SSSRMAP into ssslib in progress
- Paper accepted and new documents (see RM notebook)
- SOAP considered as basis for SSSRMAP v4. Discussion of pros and cons (scalability issues, but ssslib can support)
- Fault tolerance in Gold being developed using hot failover
- New Gold release v2 b2.10.2 includes distributed accounting
- Todo: simplify allocation management
- Enabled support for MySQL database
- Bamboo QM v1.1 released
- New Fountain component, an alternate to Warehouse; work on support for SuperMon, Ganglia, and NWPerf
- Maui: improved grid scheduler, multisite authentication; support for Globus 4
- Future work: increase deployment base, ssslib integration, portability, support for LoadLeveler-like multi-step jobs and PBS job language, release of Silver meta-scheduler

Will McLendon – APITest project status
- Current release v1.0
- Latest work: new look using cascading style sheets; new capabilities include pass/fail batch files and better parse error reporting
- User Guide documentation done (50 pages) and SNL approved
- SW requirements: Python 2.3+, ElementTree, MySQL, ssslib, Twisted (version 2.0 added new dependencies)
- Helping fix bad tests – led to good discussion of this utility
- Future work: config file, test developer GUI, more…

Ron Oldfield – Testing SSS suites
- 2 wks ago hired full time contractor (Tod Cordenbach) plus summer student
- Goals and deliverables for summer work: performance testing of SSS-OSCAR, comparison to other components, write tech report of results
- What is important for each component: scheduler, job launch, queue, I/O,…
- Discussion of metrics. Scalability?
  User time, admin time, HW resource efficiency
- Report what works, what doesn’t, what is performance critical

Paul Hargrove – PM update
- Checkpoint (BLCR) status: users on four continents, bug fixes, works with Linux 2.6.11, partial AMD64/EM64T port
- Next step is process groups/sessions
- OpenMPI work this summer (student of Lumsdaine)
- Have sketch of less restrictive syntax API
- Process manager status: complete rewrite of MPD, more OO and Pythonic; provided a non-MPD implementation for BG/L using SSS API

Narayan Desai – BCM update
- SSS infrastructure in use at ANL: clusters, BG/L, IA32, PPC64
- Better documentation now in place
- LRS Syntax: spec done, SDK complete, todo ssslib integration
- BG/L: arrived in January, initial Cobalt (SSS) suite in February; many features being requested, e.g. node modes set in mpirun; DB2 used for everything
- Cobalt: same as SW on Chiba City. All Python components implemented using SSS-SDK; several major extensions required for BG/L

Narayan Desai – Cobalt update for BG/L
- Scheduler (bgsched): new implementation; needed to be topology aware, use DB2; partition unit is 512 nodes
- Queue Manager (cqm): same SW as Chiba; OS change on BG/L is trivial since system rebooted for each job
- Process Manager (bgpm): new implementation; compute nodes don’t run full OS so no MPD; mpirun complicated
- Allocation Manager (am): same as Chiba; very simple design
- Experiences: SSS really works
  - Easy to port, simple approach makes system easy to understand
  - Agility required for BG/L
  - Comprehensive interfaces expose all information: admins can access internal state, component behavior less mysterious, extracting new info is easy
- Shipping Cobalt to a couple other sites

Craig Stefan – (no slides)
- Not as much to report. Sidetracked for past three months on other projects. Gives reasons
- Warehouse bugs also not done.
  Fixes to be done by next OSCAR release
- Graphical display for Warehouse created
- Same interfaces as Maui wrt requesting everything from all nodes
- SSSRMAP into ssslib: initial skeleton code for integration into ssslib begun. Needs questions answered from Jackson and Narayan to proceed

Thomas Naughton – SSS OSCAR releases
- Testing for v1.1 release
- Base OSCAR v4.1 includes SSS
- APItest runs post-install tests on packages
- Discussion that Debian support will require both RPM and DEB formats
- Future work: complete v1.1 testing, migrate distribution to FRE repository, extend SSS component tests, distribute as basic OSCAR “package set”; needed ordering within a phase (work around for now)

Release schedule:
|Version|Freeze|Release|New|
|v1.0||Nov (SC04)|first full suite release|
|v1.1|Feb 15|May|Gold update, bug fixes|
|v1.2|Jun 15|July|RH9 to Fedora2, oscar 4.1, BLCR to linux 2.6, improved tests, close known bug reports|
|v2.0b|Aug 15|Sept|less restrictive syntax switch over, perf tests, Silver meta-scheduler, Fedora4|
|v2.0|Oct 15|Nov (SC05)|bug fixes, minor updates|
In oscar 5.0 as package set (after SC05)

Brett Bode – Queue Manager API
- Lists all the functions then goes through detailed scheme of each
- Bamboo uses SSSRMAP messaging and wire protocol
- Authentication: uses ssslib
- Authorization: uses info in SSSRMAP wire protocol
- Questions and discussion of interfaces
- Group has no objections, just suggestions for improvement.