Digital Forensics and the Cloud
I work with a research group that generally focuses on text analysis/mining and Bayesian networks, but has recently applied its strengths to the area of digital forensics. Specifically, the group developed tools that are used by local police departments to aid in the prosecution of child pornographers. In one of our more recent meetings, we began discussing the role that cloud computing can play in this problem domain: how it can help, how it can hurt, and what work needs to be done to address the resultant issues. While our collaboration effort is still in the nascent stages, we’ve established a handful of “knowns” that are worthy of broader conversation. As with most technologies, there is both a good and a dark side to the use of cloud computing. My goal in this article is not to paint the cloud with a black brush, but rather to highlight some unique issues and call to mind challenges that exist and must be dealt with.
Does Amazon’s Cluster Compute Platform Still Represent Cloud Computing?
I’m sitting at the airport in New Orleans, after having attended the first half of the ACM/IEEE 2010 Supercomputing conference. This was the first time I have attended this conference, and it was certainly interesting to participate.
During the workshop I participated in on Sunday (Petascale Data Analytics on Clouds: Trends, Challenges, and Opportunities), there arose a conversation regarding the Amazon EC2 “cluster compute instances” and their having reached a spot on the Top 500 list. What surprised me, however, was not that they were mentioned (I actually expected them to receive more attention than they did), but that they were described as not being “real” cloud computing. The point was made that they represented some sort of special configuration that was done just for the tests, and that the offering was somehow significantly different from what the general populace could acquire. The two primary individuals involved in the exchange have significant history in classic HPC and have at least a degree of “anti-cloud” bias, but I am responsible for helping influence the viewpoint of one of these folks, so I’ve been thinking a bit over the past few days about how to properly articulate the inaccuracies of the argument… and wondering if it really matters anyway.
Cloud Computing: Beyond the Buzz
Everyone (or so it seems) is talking about the cloud...
I have the privilege of being on the ground, working every day with these technologies, and I’m seeing the actual transformation – the people who are beginning to embrace it, the scientists who are interested in using it, and some of the problems it has actually solved. Rather than hyping an idea or pushing a particular technology, I thought I’d take this opportunity to discuss some examples of the work we’ve been involved with and where we think it is headed.
Evaluating the role of Cloud Computing for Scientific Discovery
The project that is currently consuming the majority of my time is that of evaluating the role that Cloud Computing can play in the scientific computing realm. We spent a significant amount of time working on data movement issues and optimizations and are currently looking at a range of scientific codes that "fit" into the different cloud architectures.
Recently I've been working on a desktop application named BiLab. BiLab is an interactive workbench for computational biologists, similar in some ways to MATLAB but focused (not surprisingly) on the computational biologist or molecular biologist. The goal is to tie together a number of existing and proven tools into a single interface, allowing the user to interact with numerous tools from one application. Additionally, the tool includes interpreters which allow the output from one tool to be used as input to the next, saving the user the headache of manipulating file formats.
Scientific Tool for Applications Harnessing the Cloud (STAHC)
STAHC is a rather simple tool that allows applications that are not HPC-aware, but have problem sets that can be data parallelized, to benefit from the computational power available in the cloud. The user of the tool completes a manifest file, which is then fed into STAHC. STAHC packages up the source application and input data, uploads the package to a cloud computing platform, calls the provisioning APIs of that platform to deploy the application and input data to the specified number of nodes, executes the computation, and then aggregates and downloads the output data, after which the cloud deployment is torn down. All of this occurs without the domain scientist having to adjust or port his code to any cloud-specific interfaces.
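The split/execute/aggregate portion of that workflow can be illustrated with a minimal sketch. This is not STAHC's code or manifest format: `partition` and `run_data_parallel` are hypothetical names, and local threads stand in for provisioned cloud nodes. The point it tries to show is that the application function itself stays unchanged and unaware of how the work is distributed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def partition(items: Sequence[T], n_nodes: int) -> List[List[T]]:
    """Deal the inputs round-robin into n_nodes chunks (hypothetical helper)."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return [list(items[i::n_nodes]) for i in range(n_nodes)]


def run_data_parallel(app: Callable[[T], R], inputs: Sequence[T], n_nodes: int) -> List[R]:
    """Run an unmodified, cloud-unaware `app` over `inputs` on n_nodes workers.

    Assumption: threads are a local stand-in for cloud nodes; a real tool
    would package the application, provision nodes, and copy data instead.
    """
    chunks = partition(inputs, n_nodes)

    def run_on_node(chunk: List[T]) -> List[R]:
        # Each "node" simply applies the original application to its share.
        return [app(item) for item in chunk]

    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        per_node = list(pool.map(run_on_node, chunks))

    # Aggregate: element j of node i came from original position i + j * n_nodes.
    results: List[R] = [None] * len(inputs)  # type: ignore[list-item]
    for node, outputs in enumerate(per_node):
        for j, out in enumerate(outputs):
            results[node + j * n_nodes] = out
    return results


if __name__ == "__main__":
    print(run_data_parallel(lambda x: x * x, list(range(10)), n_nodes=3))
```

Round-robin partitioning is only one possible choice here; it keeps chunk sizes within one item of each other and makes it easy to restore the original ordering during aggregation.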
One issue of significance in cloud computing is that of exposing large datasets in a cloud-friendly fashion. Exposed Walrus is a set of tools that extends the existing Walrus platform (the S3-compatible storage implementation in the open source Eucalyptus platform) and allows an organization to publish large collections of data in situ through an S3-compatible interface. So long as the original data store is visible from the Walrus server, these tools allow that data to be proxied through and published to consumers as if it had been directly loaded into the Walrus platform.
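The idea of publishing data in place can be sketched as a read-through mapping from object keys to files on an existing data store. The following is an illustrative sketch only and does not reflect how Exposed Walrus or Walrus is implemented: `resolve_key`, `list_keys`, and `get_object` are hypothetical names, and a local directory stands in for the original data store.

```python
from pathlib import Path
from typing import Iterator, List


def resolve_key(root: Path, key: str) -> Path:
    """Map an object key to a file under root, refusing paths that escape it."""
    root = root.resolve()
    candidate = (root / key).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"key escapes the published root: {key!r}")
    return candidate


def list_keys(root: Path, prefix: str = "") -> List[str]:
    """List files under root as object keys, optionally filtered by prefix."""
    root = root.resolve()
    keys = (p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    return sorted(k for k in keys if k.startswith(prefix))


def get_object(root: Path, key: str, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Stream a file's bytes in chunks straight from the original store.

    Nothing is copied into separate object storage; the data is read where
    it already lives and passed through to the consumer.
    """
    path = resolve_key(root, key)
    with path.open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

A consumer-facing service built on such functions would still need to speak the S3 protocol (authentication, request signing, XML listings); the sketch covers only the key-to-file mapping and streaming.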
Windows Azure GAC Viewer
Developing for the cloud often introduces new challenges, and one of them is overcoming (or becoming comfortable with) the opacity of the machines on which one's code is running. The Windows Azure GAC Viewer is a utility made available to the community that presents a number of screens giving low-level details about a typical machine running in Windows Azure. Additionally, a user can have the tool screen his project file to verify that he has the appropriate project references in place.
When I'm not working on the projects listed above, I'm tinkering with GPGPU development using CUDA as well as statistical analysis using R.
Contact Info and Other Details
My ORNL Directory Entry has my primary address information.
Sending Me Large Files
Occasionally our email system strips large or otherwise suspicious file attachments. If you would like to send me a large file or anything else that seems to have trouble getting through email, please use our file upload system.
I'm available on a number of different social media sites, although my most used (currently) is Twitter. If you are interested in following me, I'm available as @argodev.
I am currently a member of the research staff at Oak Ridge National Laboratory where I'm working with the Computer Science Research Group (part of the Computer Science and Mathematics Division and the Computing and Computational Sciences Directorate). My research focus is cloud computing and similar distributed technologies. Prior to coming to ORNL, I obtained over 8 years of industry experience working in large-scale datacenters and building systems for provisioning and maintaining large-scale systems software and services.
Robert E. Gillen