Data

Oak Ridge National Laboratory conducts research in core technologies for handling data at all stages of the scientific discovery process, from managing the input and output of simulations and experiments, through processing and visualization of data, to presentation and dissemination of results to collaborators. Current practices and I/O solutions are increasingly incapable of processing the fast-growing data volumes produced by new high performance computing resources and experimental facilities. There is an urgent need for scalable data science that enables analysis and visualization at large scale and allows simulations to be validated against experiments. A particular focus is on exploring and developing new techniques for massive-scale scientific data. These include new methods for scalable and in-situ analysis of big data, development of dynamic workflows for in-situ data processing, new tools that enable the highest-performing I/O for scientific simulations, frameworks that make it easier to evaluate I/O performance across a variety of domains, and the use of domain and user knowledge to optimize data management.

Collaboratory


International Collaboration Framework for Extreme Scale Experiments (ICEE)

ICEE is a project focused on the research and development of collaborative environments for large-scale scientific data exploration. In many science domains, such as high-energy physics, fusion, and climate, remote and even international collaborations are becoming common and produce ever more data, which existing workflow management systems are hard pressed to keep pace with.

Our work is to research and develop solutions that can dramatically increase the data handling capability of collaborative workflow systems by leveraging the in transit processing system ADIOS and integrating it with FastBit to provide selective data access. These new features will allow our collaborative system to significantly improve data flow management for distributed workflows. The improved data processing capability will enable large international projects to make near real-time collaborative decisions. We are also investigating data mining features to provide feedback and recommendations while users are constructing or modifying a workflow.

Overall, the ICEE framework will allow researchers to conduct distributed analyses on extreme scale data efficiently and easily. It will enable collaborative decisions in near real-time for geographically distributed teams, reduce the turn-around time on large instruments, and improve scientific productivity.

Dashboards

As simulations scale to extreme processor counts, the complexity and size of the scientific data being generated is ever-growing. Modern supercomputers and experimental facilities are producing an unprecedented amount of data that must be visualized and analyzed in order to understand the mysteries of the universe. Rather than dispersed pen-and-paper researchers, the problem-solving approach of leadership science involves teams of collaborating scientists. Scientific social networks, gateways, and dashboards are an important manifestation of this phenomenon. The eSiMon dashboard is one project that researches and tackles the challenges of collaboration in high performance computing. Our research is based on the hypothesis that the best way to introduce collaboration into these traditionally resistant domains is to use lightweight and easy-to-use tools that are well integrated with scientists’ existing environments. The goal is to collect and combine resources and expertise in one converging system so that individual researchers do not have to put extra effort into becoming experts in related disciplines or new technologies. The hidden and powerful advantage of such an integrated portal is the capability to persistently record the process leading to a scientific breakthrough. The paths to discoveries, as well as the results themselves, are fundamental to understanding and promoting scientific advances. The eSiMon research goal is to enable more science by exploring the best collaboration methods and technologies.

eSiMon (Contact Roselyne Tchoua)

eSiMon (electronic Simulation and Monitoring) is a single point of access to the simulation: a one-stop shop for collaborating scientists to monitor runs and access results. It is lightweight in order to allow easy access by team members from any browser and platform. At the top level, collaborators from different but related disciplines view the same general presentation of the status of the simulation. eSiMon is the first medium between scientists and their data; therefore, we aim to present users directly with lists of scientific variable names and 1D and 2D graphics instead of burdening them with the intricacies of directories, file names, and formats. We hide the complexity in the back end by using data lineage and provenance tracking techniques. From this at-a-glance view, users can drill deeper into the aspects of the results that pertain to their specific fields. eSiMon allows users to identify problems or areas of special interest in the vast datasets produced by a simulation. Collaborators can share access to running and past simulation results. Examples of actions performed on eSiMon include visualizing images and movies of variables evolving over time, annotating runs and particular results, downloading subsets of the data, and performing preliminary lightweight analysis on the data. By enabling transparent sharing of metadata, resources, data, and analysis routines, we expect to minimize the time scientists spend on mundane tasks and repetitive preliminary steps. We anticipate that after using eSiMon, users will have a clearer picture of where they wish to do more in-depth analysis. Our ultimate goal is to seamlessly lead them to the most exciting and challenging part of their work. The vision for the next generation of eSiMon is to enable job submission, validation, and verification, independently of where the data, scripts, and resources are physically located.

Data Tools


SDAV (Data Management/Scientific Software Tools) (Contact Scott Klasky)

Scalable Data Management, Analysis, and Visualization (SDAV) provides comprehensive expertise in scientific data management, analysis, and visualization aimed at transferring state-of-the-art techniques into operational use by application scientists on leadership-class computing facilities. Our team works directly with application scientists, assisting them by applying the best tools and technologies at our disposal and learning from the scientists where our tools fall short. Technical solutions to any shortcomings are implemented to ensure that our tools overcome mission-critical challenges in the scientific discovery process. These tools are further developed and improved as the computing platforms change.

Data Management: Experts in our field expect that, as concurrency grows, there will be a widening gap between computational and I/O capacity, and this will be further stressed by energy demands. Our approach is to perform as much work as possible while the data is still resident in application memory, a use model often referred to as “in-situ.”

Simulations are generating an unprecedented amount of data, facilitated by the rapidly increasing computational capabilities of leading compute resources. This presents significant challenges. One challenge lies in hardware trends: the enormous increases in compute power are not being matched by corresponding increases in bandwidth to storage. Cost and power constrain the feasibility of dramatically larger storage deployments. A second challenge lies in extracting knowledge from these volumes of data. Research in data management infrastructure has created capabilities that can assist in this process, but the available tools are not widely used and deployed. These are not just future challenges, but rather, they are already causing bottlenecks that substantially impact the quality and productivity of scientific research performed with HPC machines.

Scientific Software Tools: A sustainable software infrastructure requires quality assurance, regression testing, distribution, and tracking feedback from the users. Our intent is to deliver a software infrastructure to the scientific community that couples the best practices from both research and development.

Source: http://sdav-scidac.org/


Steven Chu Announces the Scalable Data Management, Analysis, and Visualization Institute

BERKELEY, Calif., March 30, 2012 -- As scientists around the world address some of society’s biggest challenges, they increasingly rely on tools ranging from powerful supercomputers to one-of-a-kind experimental facilities to dedicated high-bandwidth research networks. But whether they are investigating cleaner sources of energy, studying how to treat diseases, improving energy efficiency, understanding climate change, or addressing environmental issues, scientists all face a common problem: massive amounts of data that must be stored, shared, analyzed, and understood. And the amount of data continues to grow; scientists who are already falling behind are in danger of being engulfed by massive datasets. Today Energy Secretary Steven Chu announced a $25 million, five-year initiative to help scientists better extract insights from today’s increasingly massive research datasets: the Scalable Data Management, Analysis, and Visualization (SDAV) Institute. SDAV will be funded through DOE’s Scientific Discovery through Advanced Computing (SciDAC) program and led by Arie Shoshani of Lawrence Berkeley National Laboratory (Berkeley Lab).

Source: http://www.hpcwire.com/hpcwire/2012-03-30/steven_chu_announces_the_scalable_data_management_analysis_and_visualization_institute.html

In-situ Scientific Data Processing

One performance bottleneck in massive-scale scientific simulation has been the movement of data to and from persistent storage. Analytics and visualization tasks consume the data generated by the simulation, and using the storage system to hold that data adds unnecessary I/O to the processing pipeline. This is the key insight driving our research into new techniques for in-situ data processing pipelines. Based on our research with the componentized I/O framework ADIOS and collaborations with multiple domain scientists, we have proposed a paradigm shift that reduces the dependence on storing data on disk by instead processing data near where it is generated. We have already demonstrated the effectiveness of this "staged" processing of data for I/O-intensive tasks and are now exploring the scheduling and placement of dynamic workflows to minimize end-to-end data processing latency.
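
As a concrete illustration of this staged model, the sketch below shows how an analysis task might attach to a simulation's output stream through the ADIOS read interface (described further in the IO section below) and process each step as it is produced, rather than reading files back from disk. This is a minimal sketch only: the stream name ("restart.bp"), the variable name ("temperature", assumed 1-D), and the choice of the DataSpaces staging method are illustrative assumptions, and exact call signatures vary across ADIOS 1.x versions.

    /* Sketch: an analysis reader that consumes simulation output as a stream
     * via ADIOS staging instead of reading it back from disk.  Stream name,
     * variable name, and the DataSpaces method are illustrative assumptions. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include "adios_read.h"

    int main (int argc, char **argv)
    {
        MPI_Comm comm = MPI_COMM_WORLD;
        MPI_Init (&argc, &argv);

        adios_read_init_method (ADIOS_READ_METHOD_DATASPACES, comm, "");

        /* Connect to the writer's output stream; wait up to 60 s for the first step. */
        ADIOS_FILE *fp = adios_read_open ("restart.bp", ADIOS_READ_METHOD_DATASPACES,
                                          comm, ADIOS_LOCKMODE_ALL, 60.0);

        while (adios_errno != err_end_of_stream)
        {
            /* Ask for the variable's metadata so we can size the buffer (1-D assumed). */
            ADIOS_VARINFO *vi = adios_inq_var (fp, "temperature");
            double *buf = malloc (vi->dims[0] * sizeof (double));

            adios_schedule_read (fp, NULL, "temperature", 0, 1, buf);
            adios_perform_reads (fp, 1);        /* blocking: data is ready afterwards */

            /* ... in-situ analysis or visualization on buf goes here ... */

            free (buf);
            adios_free_varinfo (vi);
            adios_release_step (fp);            /* let the writer reuse staging buffers */
            adios_advance_step (fp, 0, 60.0);   /* block until the next step arrives    */
        }

        adios_read_close (fp);
        adios_read_finalize_method (ADIOS_READ_METHOD_DATASPACES);
        MPI_Finalize ();
        return 0;
    }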

 

IO

For the past several decades, computing power in High Performance Computing (HPC) has doubled roughly every 1.8 years, and we will soon face the exascale (10^18 operations per second) computing era. However, IO technology is not scaling to match this growth in computing power and, in fact, has become a serious bottleneck in many scientific applications. IO performance is especially critical in big data science, where scientific breakthroughs are sought through large-scale data analysis in fields such as physics, cosmology, and biology, to name a few. Our research focus is to provide cutting-edge IO performance for scientific applications and to help scientists achieve scientific breakthroughs. As part of this effort, we have developed the Adaptable IO System (ADIOS), a simple and flexible IO middleware designed to orchestrate various IO components and harness application performance for large-scale, data-intensive scientific applications. ADIOS has been successfully applied in many real-world scientific applications, such as the Gyrokinetic Toroidal Code (GTC), the plasma fusion simulation code XGC, and the combustion simulation code S3D, and has proven its value by delivering increased IO performance.

ADIOS (Contact Qing Liu)

The Adaptable IO System (ADIOS) is a parallel IO middleware designed for large-scale scientific simulations. The goal of this project is to provide fast, adaptable, and scalable IO interfaces so that scientific codes can run highly efficiently across all computing platforms. By changing settings in an external XML file, a user can select file IO, such as MPI-IO or POSIX IO, or staging IO, by which data is streamed to auxiliary compute nodes over high-speed interconnects for further processing, without having to change the source code or even recompile. So far, ADIOS has been integrated into many codes with large allocations at leadership computing facilities, including S3D (combustion), GTC (fusion), GTC-P (fusion), XGC (fusion), SCEC (earthquake), and Chimera (astrophysics), and significant IO improvements have been achieved for these codes. ADIOS has also been adopted by the DOE SDAV Institute as a software framework into which compression, indexing, and other technologies will be further incorporated.


The Adaptable IO System (ADIOS) provides a simple, flexible way for scientists to describe the data in their code that may need to be written, read, or processed outside of the running simulation. By providing an XML file, external to the code, that describes the various data elements, their types, and how they should be processed for this run, the routines in the host code (either Fortran or C) can transparently change how they process the data.

The in-code IO routines were modeled after standard Fortran POSIX IO routines for simplicity and clarity. The additional complexity, including organization into hierarchies, data type specifications, process grouping, and how to process the data, is stored in an XML file that is read once at code startup. Based on the settings in this XML file, the data will be processed differently. For example, you could select MPI individual IO, MPI collective IO, POSIX IO, an asynchronous IO technique, a visualization engine, or even NULL for no output, and cause the code to process the data differently without having to change the source code or even recompile.

The real goal of this system is to provide a level of adaptability such that scientists can change how the IO in their code works simply by changing a single entry in the XML file and restarting the code. The ability to exercise control on a per-element basis, and not just for a data grouping such as a restart, diagnostic output, or analysis output, makes this approach very flexible. Along with this level of detail, a user can also simply change which transport method is used for a data type such as a restart, analysis, or diagnostic write.
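
To make the XML-driven model concrete, here is a minimal sketch of what the write path can look like from a C host code using the ADIOS 1.x API. The group name ("restart"), file name, variable names, and the XML fragment in the comment are illustrative assumptions, and exact XML attributes and call signatures vary slightly across ADIOS versions. Note that switching from the MPI method to, say, POSIX or a staging method is a one-line change in the XML, not in this code.

    /* Minimal ADIOS 1.x write sketch (C).  Assumes an XML file along these lines
     * (element and attribute names may differ slightly between versions):
     *
     *   <adios-config host-language="C">
     *     <adios-group name="restart" coordination-communicator="comm">
     *       <var name="NX"          type="integer"/>
     *       <var name="temperature" type="double" dimensions="NX"/>
     *     </adios-group>
     *     <method group="restart" method="MPI"/>   <!-- change "MPI" to "POSIX",
     *          a staging method, etc. to redirect the output -->
     *     <buffer size-MB="20" allocate-time="now"/>
     *   </adios-config>
     */
    #include <stdint.h>
    #include <mpi.h>
    #include "adios.h"

    int main (int argc, char **argv)
    {
        int      rank, i, NX = 1000;
        double   temperature[1000];
        int64_t  fd;                          /* ADIOS group/file handle */
        uint64_t group_size, total_size;
        MPI_Comm comm = MPI_COMM_WORLD;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (comm, &rank);
        for (i = 0; i < NX; i++)
            temperature[i] = rank * NX + i;

        adios_init ("config.xml", comm);      /* XML is read once at startup */

        adios_open (&fd, "restart", "restart.bp", "w", comm);
        group_size = sizeof (int) + NX * sizeof (double);
        adios_group_size (fd, group_size, &total_size);   /* declare bytes to write */
        adios_write (fd, "NX", &NX);
        adios_write (fd, "temperature", temperature);
        adios_close (fd);                     /* data leaves via the method chosen in the XML */

        adios_finalize (rank);
        MPI_Finalize ();
        return 0;
    }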

For the transport method implementer, the system provides a series of standard function calls to encode and decode data in the standardized .bp file format, as well as "interactive" processing of the data. It does this by providing direct downcalls into the implementation for each data item written, a callback when processing a data stream once a data item has been identified along with its dimensions, and a second callback once the data has been read, giving the implementation the option to allocate memory and process the data as close to the data source as is reasonable.

Website: http://www.olcf.ornl.gov/center-projects/adios/


ADIOS team releases version 1.4. A unified read API was released in ADIOS 1.4.0 for data processing from both files and staging. It lays the foundation for future releases with new in situ processing features. In this way, the same ADIOS software supports both high-performance file I/O for existing LCF applications and the in situ analytics, visualization, and code-coupling frameworks looking forward to exascale computing. A visualization schema has been created as well: applications can describe, alongside the definition of the output dataset, the representation of the data. This will help visualization tools create a generic ADIOS data reader for all applications using ADIOS and also help users of those applications visualize the data according to the developers' intentions. An I/O skeleton generation and testing tool, skel, has also been released with this version. Skel generates code from the ADIOS XML input file to imitate the output characteristics of an application. It allows up-to-date I/O skeletons to be maintained for various scientific applications and provides a unified method for testing I/O performance, enabling valid comparisons over time and across I/O solutions.
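
A hedged sketch of what reading back a subset of a dataset through the unified read API can look like in C follows. The file name, variable name, and dimensions are illustrative assumptions, and the same calls are intended to work with a staging read method substituted for ADIOS_READ_METHOD_BP.

    /* Sketch of the ADIOS 1.4 unified read API: open a .bp file and read only a
     * bounding box of a (hypothetical) 2-D global array named "temperature". */
    #include <stdint.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include "adios_read.h"

    int main (int argc, char **argv)
    {
        MPI_Comm comm = MPI_COMM_WORLD;
        MPI_Init (&argc, &argv);

        adios_read_init_method (ADIOS_READ_METHOD_BP, comm, "");
        ADIOS_FILE *fp = adios_read_open_file ("restart.bp", ADIOS_READ_METHOD_BP, comm);

        /* Select a 100 x 100 corner of the global array instead of the whole variable. */
        uint64_t start[2] = {0, 0};
        uint64_t count[2] = {100, 100};
        ADIOS_SELECTION *sel = adios_selection_boundingbox (2, start, count);

        double *temperature = malloc (count[0] * count[1] * sizeof (double));
        adios_schedule_read (fp, sel, "temperature", 0, 1, temperature);
        adios_perform_reads (fp, 1);          /* 1 = blocking; data is ready afterwards */

        adios_selection_delete (sel);
        adios_read_close (fp);
        adios_read_finalize_method (ADIOS_READ_METHOD_BP);
        free (temperature);
        MPI_Finalize ();
        return 0;
    }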

 

In transit Scientific Data Management


Center for Exascale Simulation of Combustion in Turbulence (ExaCT) (Contact Hasan Abbasi)

Combustion plays a central role in the energy economy for both the nation and the rest of the world, with sustained demand expected to last throughout the century. Efficient development of combustion technology will have a significant impact on the nation's energy consumption. The advent of exascale computing will enable high fidelity simulations of complex chemistry coupled with turbulent transport, allowing simulations to greatly aid in the design of new engines and new fuels. However, the upcoming exascale age will also require substantial advances in techniques for obtaining insight from the vast deluge of data the simulations are expected to produce. Our focus is to investigate new techniques to support exascale scientific data management, analysis, and visualization (SDMAV), to interact with both application scientists and hardware architects, and to provide input to the vendors who will supply the next generation of high performance hardware. To this end, we are studying methods for improving the efficiency of data movement, using in situ analysis and visualization to minimize or even eliminate data movement entirely, and placing analysis and visualization tasks to optimize power consumption while meeting performance targets.


Combustion currently provides 85% of the nation's energy needs. In fact, the continued demand for abundant combustion-fueled energy will persist well into this century. This places enormous pressure to improve the combustion efficiency in transportation and power generation devices while simultaneously developing more diverse fuel streams, including carbon neutral biofuels. Ultimately, to shorten the design cycle of new fuels optimally tailored to work with novel fuel-efficient, clean engines requires fundamental advances in combustion science. One key avenue of study in this area is the development of predictive models for engineering design. These predictive models couple chemistry with turbulent transport under real-world conditions. Exascale computing will enable first principles direct numerical simulation (DNS) of turbulent combustion science at higher Reynolds number, higher pressures, and with greater chemical complexity than current petascale computing.

One of the primary challenges to achieving exascale computing is designing new architectures that will work under the enormous power and cost constraints. The mission of co-design within the Center for Exascale Simulation of Combustion in Turbulence (ExaCT) is to absorb the sweeping changes necessary for exascale computing into software and ensure that the hardware is developed to meet the requirements to perform these real-world combustion computations.

ExaCT will perform the multi-disciplinary research required to iteratively co-design all aspects of combustion simulation, including math algorithms for partial differential equations, programming models, scientific data management and analytics for in situ uncertainty quantification and topological analysis, and architectural simulation to explore hardware tradeoffs with combustion proxy applications that represent the workload to the exascale ecosystem. ExaCT comprises six DOE laboratories (SNL, ORNL, LLNL, LANL, LBNL, NREL) and five university partners (The University of Texas at Austin, Stanford University, Georgia Institute of Technology, The University of Utah, and Rutgers, The State University of New Jersey), involving the multi-disciplinary interaction of combustion scientists, computer scientists, and applied mathematicians. For additional information, please contact the Center Director, Dr. Jacqueline Chen of Sandia National Laboratories, at jhchen@sandia.gov.

Website: http://exactcodesign.org/

 

Provenance Capture Mining (Contact Jong Youl Choi)

Provenance systems, originally designed to collect metadata about events, processes, and data in order to provide data lineage and audit trails, have recently been receiving increased attention in collaborative multi-user environments, since the collected and accumulated provenance information can be used to extract the knowledge of one's peers with various data mining and machine learning techniques. Thanks to advances in statistical learning in recent years, machine learning and data mining techniques have been applied in many domains and have proven successful in discovering previously hidden knowledge from massive collections of data.

Our research goal is to develop a systematic way to store and index provenance information about data access in scientific applications, and to mine that information to improve data access performance, provide machine-guided parameter selection, and more.

ADIOS-P (Contact Jong Youl Choi)

The Adaptable IO System (ADIOS) was developed to provide a simple and flexible way to manage IO-related tasks in large-scale, data-intensive scientific applications, and it plays a central role in many real-world scientific applications, such as the Gyrokinetic Toroidal Code (GTC), the plasma fusion simulation code XGC, and the combustion simulation code S3D. ADIOS-P is a project to extend that success. Building on the ADIOS framework, we take a further step toward supporting intelligence in data management. Our research goal is twofold: i) to support provenance through ADIOS by collecting and indexing the various metadata generated during data access and processing, and ii) to provide a systematic way to exploit the collected information to enhance IO performance and tune performance parameters in applications. In other words, ADIOS-P will provide not only data lineage and audit trails, but also a framework for knowledge discovery and data mining, uncovering hidden knowledge in collaborative multi-user environments and in large-scale simulations with multiple components.

 

Statistics

We focus on two areas of statistics that are important to scalable data science: (1) research toward the development of scalable analytics and (2) research toward understanding high-dimensional response functions. (1) The development of scalable analytics lags far behind the simulation sciences in its use of high performance computing resources. Those who develop statistical analytics prefer to work with high-level programming languages that are close to the mathematics. They know what can be asynchronous in the mathematics but not the additional intricacies needed for developing and running codes on large computational platforms. As a result, there exist large, diverse collections of serial analytical tools, but only a scant number of the analytics are scalable. We are developing high-level methods for programming with big data (pbd) to engage and enable this community to prototype new scalable codes. (2) Computational science codes often have large numbers of parameters that influence their output. It is often difficult to understand which parameters and parameter interactions are important over an input region of interest. While statistical techniques exist for attributing variability to parameters, they rely on designed sample spaces, which are typically not available in simulation science collections, either because the parameter space is too large or simply because statistical design was not used in selecting parameter combinations. We use a combination of surrogate models and analysis of variance to provide variability attribution and parameter effect estimation techniques.
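
As a concrete illustration of the variability attribution described above, one standard formulation (a sketch, not necessarily the exact estimator used in this work) is the functional ANOVA, or Sobol, decomposition of a response Y = f(X_1, ..., X_p) with independent inputs, evaluated over a fitted surrogate \hat{f}:

    \operatorname{Var}(Y) = \sum_i V_i + \sum_{i<j} V_{ij} + \cdots,
    \qquad V_i = \operatorname{Var}_{X_i}\!\big(\mathbb{E}[\,Y \mid X_i\,]\big),
    \qquad S_i = \frac{V_i}{\operatorname{Var}(Y)}.

The first-order index S_i attributes the share of output variance explained by parameter X_i alone, and higher-order terms such as S_{ij} capture parameter interactions. Computing the conditional expectations on the inexpensive surrogate \hat{f} rather than on the simulation itself is what makes this attribution feasible when the original runs were not statistically designed.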

Presentation: Applied Statistics for the Office of Science

pbd: programming with big data (Contact George Ostrouchov)

pbd: "programming with big data" is a set of high level language programming tools, written as R packages, that enable high level programming with big data in R without the need to micro-manage distributed data. The intent is to use a familiar serial programming syntax in R, which is close to data mathematics, while being mindful of the data distribution: "Old syntax with a new mindset." We provide the ability to program analytics sequences with minimal data movement among the distributed components. Our goal is to engage and enable analytics developers in R who have a mathematical mindset to create the needed diversity in scalable analytics. Currently we have four packages that are pending release: pbdMPI - a more intuitive and faster R interface to MPI, pbdSLAP - connecting scalable linear algebra libraries to R, pbdDMAC - intuitive distributed matrix algebra in R, and pbdBMTK - a benchmarking toolkit for pbd codes.