Scalable Systems Software Project Notebook - page 42 of 88


Date and Author(s)

Notes from February 24-25 Meeting at Argonne

Hi Folks,

First, let me apologize for taking so long to get the meeting notes into the notebook. Believe it or not, I have been on a long string of back-to-back trips, and this is my first week back in the office since our Scalable Systems meeting in Chicago.

Attendees

Name Organization Email
Al Geist ORNL gst@ornl.gov
Don Mason Cray dmm@cray.com
Mike Showerman NCSA mshow@ncsa.edu
Paul Hargrove LBNL phhargrove@lbl.gov
Erik DeBenedictis SNL epdeben@sandia.gov
William McLendon SNL wcmclen@sandia.gov
Craig Steffen NCSA csteffen@ncsa.uiuc.edu
Dan Stanzione Clemson dstanzi@clemson.edu
Neil Pundit SNL pundit@sandia.gov
Narayan Desai ANL desai@mcs.anl.gov
Stephen Scott ORNL scottsl@ornl.gov
Thomas Naughton ORNL naughtont@ornl.gov
Scott Jackson PNNL Scott.Jackson@pnl.gov
Brett Bode Ames Lab brett@ameslab.gov
Phil Pfeiffer ETSU phil@etsu.edu
Rick Bradshaw ANL bradshaw@mcs.anl.gov
Rusty Lusk ANL lusk@mcs.anl.gov
John Dawson USI jkd@unlimitedscale.com
Matt Sottile LANL matt@lanl.gov

Agenda and ppt slides

              
Monday February 24 8:00  Continental Breakfast                          
                          
 8:30  Al Geist - View from the Top                         
         Project Status, SciDAC PI mtg, and External Project review.             
             
 9:00  Matt Sottile - Science Appliance Project    
       and leveraging Scalable Systems.             
                          
               
     [In the following time slots each of the Working Group Leaders                  
     has an hour to cover the following topics and others of their choosing:                          
        - What areas their working group is addressing                          
        - Progress report on what their group has done                          
        - Present problems being addressed                          
        - Next steps for the group                          
        - Discussion items for the larger group to consider]               
             
         
9:30  Scott Jackson - Resource Management and Accounting                                  
             
10:30  Break                          
             
11:00  Will McLendon - Validation and Testing
                                   
12:00  Lunch                          
                          
 1:00  Paul Hargrove - Process Management and Monitoring    
       Mike Showerman - Monitoring update (no slides)
       Rusty Lusk - process manager slides                  
             
 2:00  Narayan Desai - Node Build and Configuration                               
              
 3:00  Break                          
                          
 3:30  Hacking and Putting together a Large-scale run on Chiba City                 
       Lot of time spent debugging components and interfaces.             
                         
 5:30  Adjourn                          
                          
       Dinner - on your own                          
                          
         

Tuesday February 25 8:00  Continental Breakfast

 8:30  Discussions, proposals, and voting
       Rusty proposes that we write papers on each component, more in the
       spirit of documentation, motivation, and semantics.
       Al reports that a Project status report (18 pages) is in the main
       notebook. Comments welcome.
       Al requests comments on the "restricted interface" XML presented by
       Rusty the day before. Al states that he would like all the components
       to adopt this type of format.
       Discussion by the group about whether we can do a live demo at the
       external review meeting. Consensus is yes.

10:30  Break

11:00  Al Geist - Summary:
         PI mtg talk and poster
         External Review agenda
         next meeting date: June 5-6 at Argonne
         Thank our hosts at ANL.

11:30  Meeting Ends

Meeting Goals

Meeting notes

Lots of new people. Start with introductions

Al's Talk. - (slides above)
His talk was divided into three sections: Review of results of last meeting, progress since last meeting, and expectations for this meeting.

SciDAC PI meeting talk and poster
External review agenda and topics

Matt's Talk - (slides above)
Pink: a 1024 node science appliance. Provide pseudo SSI that scales to 1024.
Tolerates failure. Single point for management.
Reduce boot and install time by x100. Reduce number of FTP per number of nodes.
Science Appliance – very little in common with older linux.
Software is called Clustermatic – linuxBIOS, Bproc, V9fs, supermon,
Panasas or Lustre (parallel file system by someone else)
Beoboot, asymmetric SSI, private name spaces from Plan 9,
BJS (Bproc Job Scheduler)
Other work –
ZPL (automatic check point)
Debuggers (parallel, relative debugging – Guard); port TotalView.
Latency tolerant applications
Users – SNL/CA, U Penn, Clemson
What are overlap opportunities?

Scott's Talk - (slides above)
RM update. Diagram of architecture and infrastructure services
SC02 demo: which components were working. They used polling.
Now moving to event driven components
Release of initial RM suite – from website http://sss.scl.ameslab.gov/software/

SSSRMAP protocol using HTTP validated
Scalability testing performed on all components
Scheduler progress in slides
Queue Manager progress in slides
Accounting and Allocation Manager progress (Qbank and Gold prototype)
Meta-scheduler progress – Globus interface, Gold Information service.
Next work: Release 2 of RM interface
Implement and test SSSRMAP security authentication (XML digital sigs)
Discuss need to have SSS wrappers on initial RM suite

Will's Talk - (slides above)
Validation and Testing update
-- Users expect a high degree of quality in today’s HPC.
Strategies
QMTest – RM group is using it (www.codesourcery.com). They like it -- it's “easy”
App test packages
APITEST – growing out of October discussion

  1. C++ driven XML schema scriptable test of network components
  2. blackbox testing. Tcp, ssslib, portals support, fault injection
  3. whitebox testing. Try to exercise all paths in a known suite
  4. v0.1a underway 75% done
  5. Discussion how this could be useful to Scalable Systems
Cluster Integration Toolkit (CIT) –James Laros jhlaros@sandia.gov
  1. management tasks on Cplant – scalable to 1800 nodes
  2. done in Perl
  3. create Scalable Systems interface to CIT
  4. would be a good test of the implementation and flexibility of the standard.
  5. USI, IBM, and Linux Networx looking at it.

Paul's Talk - (slides above)
Process management report. Moving beyond prototypes of:
Checkpoint manager

Mike - Monitoring – job, system, node, and meta-version
what data is needed – an extensible framework defined
stream and single item. working on scalability now
Rusty - Process Manager
schematic of PM component in slides
MPD-2 in python and distributed with MPICH-2
- supports separate executables, arguments, and environment variables
New XML for PM (with queries that allow wildcards and ranges)
Combination of published interfaces, XML, and communication lib gives us a power greater than the sum of its parts.
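
To make the "wildcards and ranges" idea concrete, here is a minimal sketch of how such a query could be evaluated. It is not the actual PM schema: the element name get-process-state, the process element, the host and pid attributes, and the node names are all hypothetical and chosen only for illustration.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical query document: element and attribute names are illustrative
# only and are NOT taken from the real process manager interface.
QUERY = """
<get-process-state>
  <process host="node*" pid="100-199"/>
</get-process-state>
"""

# Made-up process table standing in for whatever the PM actually tracks.
PROCESSES = [
    {"host": "node12", "pid": 150},
    {"host": "node12", "pid": 250},
    {"host": "login1", "pid": 120},
]


def pid_matches(spec, pid):
    """Match a pid against '*', a single number, or an inclusive 'lo-hi' range."""
    if spec == "*":
        return True
    if "-" in spec:
        lo, hi = spec.split("-", 1)
        return int(lo) <= pid <= int(hi)
    return int(spec) == pid


def run_query(xml_text, processes):
    """Return the processes selected by any <process> pattern in the query."""
    root = ET.fromstring(xml_text)
    selected = []
    for proc in processes:
        for pattern in root.iter("process"):
            host_ok = fnmatch.fnmatchcase(proc["host"], pattern.get("host", "*"))
            pid_ok = pid_matches(pattern.get("pid", "*"), proc["pid"])
            if host_ok and pid_ok:
                selected.append(proc)
                break  # one matching pattern is enough for this process
    return selected


if __name__ == "__main__":
    print(run_query(QUERY, PROCESSES))
```

In this sketch a missing attribute is treated as a wildcard, so a bare process element would select everything; the real interface may define this differently.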

Narayan's Talk - (slides above)
Build and configure report
Tests suggest scalability to 2000 host clusters
Communication Infrastructure
-- more protocol support, high availability option.
Build and configuration components
-- complete implementation on Chiba City
-- second OSCAR implementation underway
-- three components

Restriction Based Syntax for XML interfaces
API augmentation
APIs need more documentation to describe event handling protocol

John Dawson asks about license. Al says like MPI.
Don Mason asks about license (not GNU please!) and holding a workshop for industry
Talk with Remy about Science Appliance collaboration
Here is his reply


Having spoken with nearly everyone about the relationship between the Science Appliance project (SA) and the Scalable Systems Software project (SSS) and gotten to the point where I think we're all on the same page, it seems like a good idea to actually write that understanding down... if any of this is not what anyone is expecting, please let me know.

At a highest level, we would like to make sure that the two projects are complementary even though they have different emphasis and focus. Fortunately, we don't think this will be difficult to do.

The plan is as follows:

There's the plan.

A few other notes, while we're on the subject:

Please let me know if you have issues with any of this plan. I'd like to make sure we're all on the same page so that we can get busy on this.... :-)

Talk with Rusty about writing a paper on each component.

Groups Work on large scalability test on Chiba City and XTORC

meeting adjourned