Toward the Next Generation of Parallel and Resilient Algorithms

Michael A. Heroux, Computing Research Division, Sandia National Laboratories

For decades parallel computing has been the focus of intense research and development in selected fields, and numerous large-scale parallel applications have been developed. SPMD via MPI has been a dominant approach to parallelism to date, but this approach alone will be insufficient going forward. Presently we are on the threshold of large-scale algorithm re-design and implementation across most application areas, but the path to developing these algorithms is uncertain. There are many competing concerns, and the number of choices is growing. Furthermore, resilience is becoming an issue and may force algorithm developers to explicitly manage system faults, beyond checkpoint-restart.

In this presentation we discuss some of the principles of parallel algorithm development that have produced today’s approaches, and how we can address these principles going forward. We also discuss what must change in order to move forward and give ideas for developing parallel algorithms now that will have sustained value in the future.

Michael A. Heroux worked at Cray Research from 1988 to 1998, the last three years as part of Silicon Graphics. During his first five years he developed mathematical libraries for sparse and dense systems of equations on Cray systems. Following this, he worked in the application division, focusing on solution methods for fluid dynamics, oil and gas, and structural applications, both for commercial applications such as FIDAP and FLUENT and for individual customer applications. During his final three years he managed several groups of scientists focused on new application capabilities in science and engineering, and parallel applications. During these years he was also the applications representative on future architecture teams, including the Cray T3E and SV2 systems. Presently Dr. Heroux is a Distinguished Member of the Technical Staff at Sandia National Laboratories, working on new algorithm development and robust parallel implementation of solver components for problems of interest to Sandia and the broader scientific and engineering community. He leads development of the Trilinos Project, an effort to provide state-of-the-art solution methods in a state-of-the-art software framework. Trilinos is a 2004 R&D 100 award-winning product, freely available as Open Source and actively developed by dozens of researchers. In addition to Trilinos, Dr. Heroux works on the development of scalable parallel scientific and engineering applications and maintains his interest in the interaction of scientific/engineering applications and high-performance computer architectures. He leads the Mantevo project, which is focused on the development of Open Source, portable mini-applications and mini-drivers for scientific and engineering applications. Dr. Heroux is a telecommuter for Sandia, maintaining an office at home in rural central Minnesota and at St. John's University, where he is Scientist in Residence in the Computer Science Department.
He is a member of the Society for Industrial and Applied Mathematics (SIAM) and past chair of the SIAM Activity Group on Supercomputing. He is a Distinguished Member of the Association for Computing Machinery (ACM). He is the Editor-in-Chief for the ACM Transactions on Mathematical Software, Subject Area Editor for the Journal on Parallel and Distributed Computing, and Associate Editor for the SIAM Journal on Scientific Computing.