Oak Ridge National Laboratory
/Scratch As a Cache
To sustain emerging data-intensive scientific applications,
High Performance Computing (HPC) centers invest a notable
fraction of their operating budget in a specialized fast
storage system, the scratch space, which is designed to hold
the data of currently running and soon-to-run HPC jobs.
In practice, however, it is often used as a general-purpose
file system, wherein users arbitrarily store their data without
any consideration for the center's overall performance. To remedy
this, centers periodically scan the scratch space in an attempt
to purge transient and stale data. This practice of supporting a
cache workload with a file system and disjoint tools for staging
and purging results in suboptimal use of the scratch space.
In this work, we address the above issues by proposing a new perspective in which the HPC scratch space is treated as a cache, and data population, retention, and eviction tools are integrated with scratch management. In our approach, data is moved to the scratch space only when it is needed, and unneeded data is removed as soon as possible. We also design a new job-workflow-aware caching policy that leverages user-supplied hints to manage the cache.
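The abstract does not spell out the caching policy itself, so the following is only a minimal illustrative sketch of the idea under stated assumptions, not the authors' implementation. It assumes a hypothetical hint format in which each dataset lists the jobs that will consume it, mapped to each job's expected start time. Data is staged just before it is needed. When space runs out, the dataset whose next hinted use lies furthest in the future is evicted (a Belady-style heuristic chosen here as an assumption). A dataset is purged as soon as its last consumer job finishes. All names (`Dataset`, `WorkflowAwareScratchCache`, `stage`, `job_finished`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size: int                    # bytes
    consumers: dict[str, float]  # hypothetical hint: job_id -> expected start time


class WorkflowAwareScratchCache:
    """Hypothetical model of scratch-as-a-cache driven by user-supplied job hints."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.entries: dict[str, Dataset] = {}

    def _next_use(self, ds: Dataset) -> float:
        # Earliest expected start among jobs that still need this dataset.
        return min(ds.consumers.values(), default=float("inf"))

    def stage(self, ds: Dataset) -> list[str]:
        """Stage a dataset shortly before its consumers run; return evicted names."""
        if ds.size > self.capacity:
            raise ValueError(f"{ds.name} is larger than the scratch capacity")
        if ds.name in self.entries:  # already resident, nothing to do
            return []
        evicted = []
        while self.used + ds.size > self.capacity:
            # Assumed policy: evict the resident dataset needed furthest in the future.
            victim = max(self.entries.values(), key=self._next_use)
            self._remove(victim.name)
            evicted.append(victim.name)
        self.entries[ds.name] = ds
        self.used += ds.size
        return evicted

    def job_finished(self, job_id: str) -> list[str]:
        """Drop a finished job from every hint and purge data no job still needs."""
        purged = []
        for name, ds in list(self.entries.items()):
            ds.consumers.pop(job_id, None)
            if not ds.consumers:
                self._remove(name)
                purged.append(name)
        return purged

    def _remove(self, name: str) -> None:
        ds = self.entries.pop(name)
        self.used -= ds.size


if __name__ == "__main__":
    cache = WorkflowAwareScratchCache(capacity=100)
    cache.stage(Dataset("input_a", 60, {"job1": 10.0}))
    cache.stage(Dataset("input_b", 30, {"job2": 50.0}))
    print(cache.stage(Dataset("input_c", 40, {"job3": 20.0})))  # ['input_b']
    print(cache.job_finished("job1"))                            # ['input_a']
```

In the usage example, staging `input_c` overflows the 100-byte capacity, so `input_b` is evicted because its only consumer (`job2`) is hinted to start last. When `job1` completes, `input_a` has no remaining consumers and is purged immediately instead of waiting for a periodic scan.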