Scalable Systems Software Project Notebook - page 42 of 88
Notes from February 24-25 Meeting at Argonne
First let me apologize for taking so long to get the meeting notes
into the notebook. Believe it or not, I have been on a long string
of back-to-back trips, and this is my first week back in the
office since our Scalable Systems meeting in Chicago.
|Al Geist || ORNL || email@example.com
|Don Mason || Cray || firstname.lastname@example.org
|Mike Showerman || NCSA || email@example.com
|Paul Hargrove || LBNL || firstname.lastname@example.org
|Erik DeBenedictis || SNL || email@example.com
|William McLendon || SNL || firstname.lastname@example.org
|Craig Steffen || NCSA || email@example.com
|Dan Stanzione || Clemson || firstname.lastname@example.org
|Neil Pudit || SNL || email@example.com
|Narayan Desai || ANL || firstname.lastname@example.org
|Stephen Scott || ORNL || email@example.com
|Thomas Naughton || ANL || firstname.lastname@example.org
|Scott Jackson || PNNL || Scott.Jackson@pnl.gov
|Brett Bode || Ames Lab || email@example.com
|Phil Pfeiffer || ETSU || firstname.lastname@example.org
|Rick Bradshaw || ANL || email@example.com
|Rusty Lusk || ANL || firstname.lastname@example.org
|John Dawson || USI || email@example.com
|Matt Sottile || LANL || firstname.lastname@example.org
Agenda and ppt slides
Monday February 24 8:00 Continental Breakfast
8:30 Al Geist - View from the Top
Project Status, SciDAC PI mtg, and External Project review.
9:00 Matt Sottile - Science Appliance Project
and leveraging Scalable Systems.
[In the following time slots each of the Working Group Leaders
has an hour to cover the following topics and others of their choosing:
- What areas their working group is addressing
- Progress report on what their group has done
- Present problems being addressed
- Next steps for the group
- Discussion items for the larger group to consider]
9:30 Scott Jackson - Resource Management and Accounting
11:00 Will McLendon - Validation and Testing
1:00 Paul Hargrove - Process Management and Monitoring
Mike Showerman - Monitoring update (no slides)
Rusty Lusk - process manager slides
2:00 Narayan Desai - Node Build and Configuration
3:30 Hacking and Putting together a Large-scale run on Chiba City
Lot of time spent debugging components and interfaces.
Dinner - on your own
Tuesday February 25
8:00 Continental Breakfast
8:30 Discussions, proposals, and voting
Rusty proposes that we write papers on each component,
more in the spirit of documentation and motivation.
Al reports that a Project status report (18 pages) is in
the main notebook. Comments welcome.
Al requests comments on "restricted interface" XML presented
by Rusty the day before. Al states that he would like all
the components to adopt this type of format.
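The restricted-syntax XML itself is not reproduced in these notes. As a rough illustration of the flat, attribute-only style under discussion, here is a minimal Python sketch; the verb and attribute names (get-process-info, pid, host) are invented for illustration and are not taken from the actual SSS schemas.

```python
# Hypothetical sketch of a "restricted syntax" command message: one
# verb element, with match criteria expressed only as attributes.
# Element and attribute names are invented; the real SSS schemas
# presented by Rusty may differ.
import xml.etree.ElementTree as ET

def build_query(verb, criteria):
    """Build a restricted-syntax query element such as
    <get-process-info pid="1234" host="ccn5" />."""
    root = ET.Element(verb)
    for key, value in criteria.items():
        root.set(key, value)
    return ET.tostring(root, encoding="unicode")

msg = build_query("get-process-info", {"pid": "1234", "host": "ccn5"})
print(msg)
```

The appeal of the restricted style is that a component can validate and dispatch on a message like this without a general XML query engine.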
Discussion by the group about whether we can do a live demo
at the external review meeting. Consensus is yes.
11:00 Al Geist - Summary:
PI mtg talk and poster
External Review agenda
next meeting date: June 5-6 at Argonne
Thank our hosts at ANL.
11:30 Meeting Ends
- Getting ready for External review panel evaluation
- Progress and next steps reports from the Working Group leaders
- Demonstration of system components from several groups working together
- Planning for SciDAC PI meeting in March
Lots of new people. Start with introductions
Al's Talk. - (slides above)
His talk was divided into three sections:
Review of results of last meeting,
progress since last meeting, and
expectations for this meeting.
SciDAC PI meeting talk and poster
External review agenda and topics
Matt's Talk - (slides above)
Pink: a 1024-node science appliance. Provides pseudo-SSI that scales to 1024 nodes.
Tolerates failure. Single point for management.
Reduce boot and install time by 100x. Reduce number of FTP per number of nodes.
Science Appliance – very little in common with older Linux.
Software is called Clustermatic – linuxBIOS, Bproc, V9fs, supermon,
Panasas or Lustre (parallel file system by someone else)
Beoboot, asymmetric SSI, private name spaces from Plan 9,
BJS (Bproc Job Scheduler)
Other work –
ZPL (automatic check point)
Debuggers (parallel, relative debugging – Guard); port TotalView.
Latency tolerant applications
Users – SNL/CA, U Penn, Clemson
What are overlap opportunities?
- Each piece can be separated out. Supermon, Bproc
- Remy will be sending more material on collaboration soon
Scott's Talk - (slides above)
RM update. Diagram of architecture and infrastructure services
SC02 demo showed which components were working. They used polling.
Now moving to event-driven components
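To make the polling-to-events shift concrete, here is a minimal publish/subscribe sketch in Python. The event name and payload are invented, and the real SSS components communicate between processes over sockets and XML rather than through in-process callbacks; this only illustrates the control-flow change.

```python
# Minimal sketch of the move from polling to event-driven components:
# instead of each component repeatedly polling another for state
# changes, components register callbacks and are notified when an
# event fires. All names here are invented for illustration.
class EventBus:
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, event, callback):
        """Register a callback to run whenever `event` is published."""
        self._subscribers.setdefault(event, []).append(callback)

    def publish(self, event, payload):
        """Deliver `payload` to every subscriber of `event`."""
        for callback in self._subscribers.get(event, []):
            callback(payload)

bus = EventBus()
seen = []
bus.subscribe("job-finished", lambda job: seen.append(job))
bus.publish("job-finished", {"id": 17, "exit": 0})
```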
Release of initial RM suite – from website http://sss.scl.ameslab.gov/software/
SSSRMAP protocol using HTTP validated
- OpenPBS-sss 2.3.15-1
- Maui scheduler 3.2.6
- Qbank 2.10.4 (accounting system)
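The notes say SSSRMAP was validated over HTTP but record no wire detail. The sketch below frames a hypothetical XML request body in an HTTP/1.1 POST to show the general shape; the request path, headers, and XML elements are assumptions for illustration, not the published SSSRMAP format.

```python
# Illustrative sketch of carrying an SSSRMAP-style XML request over
# HTTP, as the notes say the protocol was validated over HTTP.
# The path, headers, and XML body are invented; the real wire format
# is defined by the SSS resource-management working group.
def frame_request(host, body):
    """Return the raw HTTP/1.1 POST that would carry the XML body."""
    payload = body.encode("utf-8")
    headers = (
        f"POST /SSSRMAP HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: text/xml\r\n"
        f"Content-Length: {len(payload)}\r\n"
        f"\r\n"
    )
    return headers.encode("ascii") + payload

body = '<Request action="Query"><Object>Job</Object></Request>'
raw = frame_request("sched.example.org", body)
```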
Scalability testing performed on all components
Scheduler progress in slides
Queue Manager progress in slides
Accounting and Allocation Manager progress (Qbank and Gold prototype)
Meta-scheduler progress – Globus interface, Gold Information service.
Next work Release 2 of RM interface
Implement and test SSSRMAP security authentication (XML digital sigs)
Discuss need to have SSS wrappers on initial RM suite
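The planned authentication uses XML digital signatures, but the exact scheme is not described in these notes. As a simplified stand-in, this sketch signs and verifies a request body with an HMAC over a shared secret; the digest choice and key handling are illustrative assumptions only, not the SSSRMAP signing design.

```python
# Simplified stand-in for the XML digital signatures mentioned for
# SSSRMAP authentication: an HMAC-SHA1 over the serialized request
# body, base64-encoded. A real XML-signature scheme would wrap this
# in signature elements; the secret handling here is illustrative.
import base64
import hashlib
import hmac

def sign(secret: bytes, body: str) -> str:
    """Return a base64 HMAC signature over the request body."""
    digest = hmac.new(secret, body.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

def verify(secret: bytes, body: str, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)

sig = sign(b"shared-secret", "<Request action='Query'/>")
ok = verify(b"shared-secret", "<Request action='Query'/>", sig)
```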
Will's Talk - (slides above)
Validation and Testing update
-- Users expect a high degree of quality in today’s HPC systems.
QMTest – the RM group is using it (www.codesourcery.com). They like it; it’s “easy”.
App test packages
APITEST – growing out of October discussion
Cluster Integration Toolkit (CIT) –James Laros email@example.com
- C++ driven XML schema scriptable test of network components
- blackbox testing. Tcp, ssslib, portals support, fault injection
- whitebox testing. Try to exercise all paths in a known suite
- v0.1a underway 75% done
- Discussion how this could be useful to Scalable Systems
- management tasks on Cplant – scalable to 1800 nodes
- done in Perl
- create Scalable Systems interface to CIT
- would be a good test of the flexibility of the standard's implementation.
- USI, IBM, and Linux Networx looking at it.
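APITEST itself is C++ and driven by an XML schema; the small Python sketch below only illustrates the black-box idea behind it: drive a component through scripted request/expected-response pairs, including a fault-injection case with deliberately malformed input. The component under test is faked here, and all names are invented.

```python
# Toy illustration of black-box testing in the APITEST spirit:
# scripted request/expected-response pairs run against a component
# endpoint, including one fault-injection case. The "component" is
# a stand-in function; in APITEST the target is a real network
# component and the script comes from an XML schema.
def fake_component(request: str) -> str:
    """Stand-in for a component endpoint that answers XML requests."""
    if not request.startswith("<"):
        return "<error>malformed request</error>"
    return "<ok/>"

script = [
    ("<get-status/>", "<ok/>"),                              # normal path
    ("not-xml-at-all", "<error>malformed request</error>"),  # fault injection
]

# Collect any (request, actual-response) pairs that deviate.
failures = [(req, got) for req, want in script
            if (got := fake_component(req)) != want]
```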
Paul's Talk - (slides above)
Process management report. Moving beyond prototypes of:
Mike - Monitoring – job, system, node, and meta-version
- beta-code April release awaiting legal OK
- will do scalability test today
- working on XML interface for checkpoint/restart (draft in May)
what data is needed – an extensible framework defined
stream and single item. working on scalability now
Rusty - Process Manager
schematic of PM component in slides
MPD-2 in python and distributed with MPICH-2
- supports separate executables, arguments, and environment variables
New XML for PM (with queries that allow wildcards and ranges)
The combination of published interfaces, XML, and the communication library gives us power greater than the sum of its parts.
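The new PM XML allows queries with wildcards and ranges. This Python sketch shows one plausible interpretation, expanding a bracketed host range and applying a shell-style wildcard; the "ccn[1-4]" syntax is an assumption for illustration, not the actual SSS query grammar.

```python
# Sketch of the wildcard-and-range queries mentioned for the new
# process-manager XML. The range syntax "prefix[lo-hi]" and the
# host names are invented for illustration only.
import fnmatch
import re

def expand_range(expr):
    """Expand 'ccn[1-4]' into ['ccn1', 'ccn2', 'ccn3', 'ccn4'];
    return non-range expressions unchanged."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]

def match(hosts, pattern):
    """Shell-style wildcard match over a host list."""
    return [h for h in hosts if fnmatch.fnmatch(h, pattern)]

hosts = expand_range("ccn[1-4]")
```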
Narayan's Talk - (slides above)
Build and configure report
Tests suggest scalability to 2000 host clusters
-- more protocol support, high availability option.
Build and configuration components
-- complete implementation on Chiba City
-- second OSCAR implementation underway
-- three components
Restriction Based Syntax for XML interfaces
- hardware manager (needs a more modular, extensible design)
- build system
- node manager (admin control panel for a cluster) system diagnostics
APIs need more documentation to describe event handling protocol
John Dawson asks about the license. Al says it will be like MPI's.
Don Mason asks about the license (not GNU, please!) and about holding a workshop for industry
Talk with Remy about Science Appliance collaboration
Here is his reply
Having spoken with nearly everyone about the relationship between the Science Appliance project (SA) and the Scalable Systems Software project (SSS) and gotten to the point where I think we're all on the same page, it seems like a good idea to actually write that understanding down... if any of this is not what anyone is expecting, please let me know.
At a highest level, we would like to make sure that the two projects are complementary even though they have different emphasis and focus. Fortunately, we don't think this will be difficult to do.
The plan is as follows:
- Narayan Desai of ANL has worked a bit with the Science Appliance programs and has done fairly extensive development as part of SSS. He and I are confident that the Science Appliance components (linuxbios, the bproc system, supermon, ...) can fit well into the Scalable Systems Software component framework.
- Thus, the goal will be to adapt the SA tools and possibly the SSS interfaces in order to eventually allow the SA tools to be used without modification in a system using SSS components.
- Narayan and Matt Sottile (LANL) are getting together tomorrow to look at Supermon in particular in some detail to see how the SSS interface specification maps onto the Supermon API.
- The exact mechanism that we'll be using to get the SA tools to speak the SSS interface specifications is still to be determined - it may involve changes to the SA code, wrappers for various things, or encapsulation in SSS tools. In any case, these shouldn't be too difficult, because the SSS project has developed a library for the interface, and because the components match fairly well. Most likely this will involve effort by both LANL and ANL.
- Our goal will be to develop a specific plan for what needs to be done and how to do it during the next 3 months, and then to start doing it. We may be able to do this via the regular SSS meetings, or it may be better for the LANL and ANL folks to get together for a day or so and work through the details.
- Once we understand the best way to do this and we get it done, we'll be able to demonstrate a cluster that, when running the SSS infrastructure, can also use SA components. We've talked about a system that might be running bproc on some nodes and the more standard linux installation on other nodes at the same time. (In fact, we have some interest in actually doing this on a production cluster at ANL to support users...) In the best case, this kind of interaction would be possible to demonstrate late this summer.
A few other notes, while we're on the subject:
- This exercise will be valuable to SSS, because it will help validate the component model that SSS has developed. Or, if the SSS component model needs to be adapted somewhat, this is also worth understanding.
- I think this will also be useful to SA because it will increase the exposure of the tools to folks across the Lab community, and also make it more clear that people can use the SA tools independent of one another.
- Communication between the two projects has already been pretty good. For example, an SA person has been at all or nearly all of the SSS meetings.
Please let me know if you have issues with any of this plan. I'd like to make sure we're all on the same page so that we can get busy on this.... :-)
Talk with Rusty about writing a paper on each component.
Groups Work on large scalability test on Chiba City and XTORC