Driven by the trends of increasing core count and the bandwidth-wall problem, the size of last-level caches (LLCs) has greatly increased; hence, researchers have explored non-volatile memories (NVMs), which provide high density and consume low leakage power. Since NVMs have low write endurance and existing cache management policies are write-variation-unaware, effective wear-leveling techniques are required to achieve reasonable cache lifetimes with NVMs. We present EqualWrites, a technique for mitigating intra-set write variation. Our technique works by recording the number of writes to each block and changing the cache-block location of a hot data item, so that future writes are redirected to a cold block and wear-leveling is achieved. Simulation experiments have been performed using an x86-64 simulator and benchmarks from SPEC06 and the HPC (high-performance computing) field. The results show that for single, dual and quad-core system configurations, EqualWrites improves cache lifetime by 6.31X, 8.74X and 10.54X, respectively. Its implementation overhead is very small, and it provides a larger improvement in lifetime than three other intra-set wear-leveling techniques and a cache replacement policy.
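To make the mechanism concrete, below is a minimal Python sketch of the core idea, not the paper's implementation: per-block write counters plus a remapping of a hot data item to the least-written block of its set. The 4-way set, the hot_threshold value and the victim choice are illustrative assumptions.

```python
# Illustrative sketch of intra-set wear-leveling by counter-triggered remapping.
class CacheSet:
    def __init__(self, associativity=4, hot_threshold=16):
        self.ways = [None] * associativity        # data-item tag held by each physical block
        self.write_counts = [0] * associativity   # lifetime writes absorbed by each block
        self.writes_since_move = [0] * associativity
        self.hot_threshold = hot_threshold        # assumed trigger, not the paper's value

    def write(self, tag):
        way = self._lookup_or_allocate(tag)
        self.write_counts[way] += 1
        self.writes_since_move[way] += 1
        if self.writes_since_move[way] >= self.hot_threshold:
            self._move_hot_item(way)              # wear-leveling trigger

    def _lookup_or_allocate(self, tag):
        if tag in self.ways:
            return self.ways.index(tag)
        victim = min(range(len(self.ways)), key=lambda w: self.write_counts[w])
        self.ways[victim] = tag                   # simplified fill: least-written block
        return victim

    def _move_hot_item(self, hot_way):
        cold_way = min(range(len(self.ways)), key=lambda w: self.write_counts[w])
        if cold_way != hot_way:
            # Swap the mapping so future writes to the hot item land on the cold block.
            self.ways[hot_way], self.ways[cold_way] = self.ways[cold_way], self.ways[hot_way]
        self.writes_since_move[hot_way] = 0
        self.writes_since_move[cold_way] = 0

if __name__ == "__main__":
    s = CacheSet()
    for _ in range(200):
        s.write("hot-item")                       # a single write-intensive data item
    print("per-block write counts:", s.write_counts)  # spread across the set, not one block
```

Running the sketch shows the writes of the single hot item being rotated across all four physical blocks instead of accumulating on one.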
With increasing system core count, the size of the last-level cache (LLC) has increased, and since SRAM consumes high leakage power, LLC power consumption is becoming a significant fraction of processor power consumption. To address this, researchers have used embedded DRAM (eDRAM) LLCs, which consume low leakage power. However, eDRAM caches consume a significant amount of energy in the form of refresh energy. In this paper, we propose ESTEEM, an energy saving technique for embedded DRAM caches. ESTEEM uses dynamic cache reconfiguration to turn off a portion of the cache and save both leakage and refresh energy. It logically divides the cache sets into multiple modules and turns off a possibly different number of ways in each module. Microarchitectural simulations confirm that ESTEEM is effective in improving performance and energy efficiency and provides better results than a recently proposed eDRAM cache energy saving technique, namely Refrint. For single and dual-core simulations, the average memory subsystem (LLC + main memory) energy saving on using ESTEEM is 25.8% and 32.6%, respectively, and the average weighted speedups are 1.09X and 1.22X, respectively. Additional experiments confirm that ESTEEM works well for a wide range of system parameters.
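Since the saving comes from turned-off ways contributing neither leakage nor refresh energy, the sketch below shows the energy bookkeeping for one interval. All per-way energy numbers and the four-module configuration are made-up placeholders for illustration, not values from the paper.

```python
# Rough energy bookkeeping for an eDRAM LLC with per-module way turn-off.
LEAKAGE_NJ_PER_WAY_PER_MS = 40.0    # assumed leakage energy of one active way per millisecond
REFRESH_NJ_PER_WAY_PER_MS = 25.0    # assumed refresh energy of one active way per millisecond

def interval_energy_nj(active_ways_per_module, interval_ms):
    """Leakage + refresh energy of the LLC over one interval, given how many
    ways are kept on in each logical module; inactive ways cost nothing here."""
    total_active_ways = sum(active_ways_per_module)
    per_way = LEAKAGE_NJ_PER_WAY_PER_MS + REFRESH_NJ_PER_WAY_PER_MS
    return total_active_ways * per_way * interval_ms

if __name__ == "__main__":
    full_cache = [8, 8, 8, 8]        # 4 modules of an 8-way cache, all ways on
    reconfigured = [8, 4, 2, 6]      # each module keeps a different number of ways
    e_full = interval_energy_nj(full_cache, interval_ms=5.0)
    e_reco = interval_energy_nj(reconfigured, interval_ms=5.0)
    print(f"interval energy saving: {100.0 * (e_full - e_reco) / e_full:.1f}%")
```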
Recent trends of CMOS scaling and the use of large last-level caches (LLCs) have led to a significant increase in the leakage energy consumption of LLCs; hence, managing their energy consumption has become extremely important in modern processor design. Conventional cache energy saving techniques require offline profiling or provide only a coarse granularity of cache allocation. We present FlexiWay, a cache energy saving technique that uses dynamic cache reconfiguration. FlexiWay logically divides the cache sets into multiple (e.g. 16) modules and dynamically turns off a suitable, and possibly different, number of cache ways in each module. FlexiWay has very small implementation overhead and provides fine-grain cache allocation even with caches of typical associativity, e.g. an 8-way cache. Microarchitectural simulations have been performed using an x86-64 simulator and workloads from the SPEC2006 suite, and FlexiWay has been compared with two conventional energy saving techniques. The results show that FlexiWay provides the largest energy saving and incurs only a small loss in performance. For single, dual and quad-core systems, the average energy savings using FlexiWay are 26.2%, 25.7% and 22.4%, respectively.
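The sketch below illustrates how a per-module way count could be chosen at fine granularity from per-recency-position hit counters gathered over the previous interval. The heuristic, the threshold and the profiling data are assumptions for illustration only, not FlexiWay's actual decision algorithm.

```python
# Illustrative per-module way-allocation heuristic for an 8-way cache.
ASSOC = 8            # 8-way cache, as in the abstract's example
MIN_WAYS = 1         # always keep at least one way on per module (assumption)

def choose_active_ways(hits_per_position, hit_threshold=32):
    """hits_per_position[i] = hits observed at recency position i (0 = MRU) in this
    module during the last interval. Keep enough ways to cover positions that
    still supply a useful number of hits."""
    active = MIN_WAYS
    for pos in range(MIN_WAYS, ASSOC):
        if hits_per_position[pos] >= hit_threshold:
            active = pos + 1
    return active

if __name__ == "__main__":
    # Hypothetical profiling data for a cache divided into 4 logical modules.
    modules = [
        [900, 400, 120, 60, 35, 10, 4, 1],          # benefits from ~5 ways
        [1500, 50, 8, 2, 1, 0, 0, 0],               # benefits from ~2 ways
        [700, 650, 600, 580, 500, 450, 400, 390],   # needs all 8 ways
        [300, 5, 1, 0, 0, 0, 0, 0],                 # needs only 1 way
    ]
    config = [choose_active_ways(m) for m in modules]
    print("active ways per module:", config)        # e.g. [5, 2, 8, 1]
```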
Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive RAM, offer several advantages and challenges when compared to conventional memory technologies, such as DRAM and magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit the advantages and mitigate the disadvantages of NVMs when used for designing memory systems, and, in particular, secondary storage (e.g. solid state drives) and main memory. We classify these software techniques along several dimensions to highlight their similarities and differences. Given that NVMs are growing in popularity, we believe that this survey will motivate further research in the field of software technology for NVMs.
To address the limitations of SRAM, such as high leakage and low density, researchers have explored the use of non-volatile memory (NVM) devices, such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM), for designing on-chip caches. A crucial limitation of NVMs, however, is that their write endurance is low, and the large intra-set write variation introduced by existing cache management policies may further exacerbate this problem, reducing cache lifetime significantly. We present EqualChance, a technique to increase cache lifetime by reducing intra-set write variation. EqualChance works by periodically changing the physical cache-block location of a write-intensive data item within a set to achieve wear-leveling. Simulations using workloads from the SPEC CPU2006 suite and the HPC (high-performance computing) field show that EqualChance improves cache lifetime by 4.29X. Its implementation overhead is small, and it incurs very little performance and energy loss.
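The motivation can be made concrete with a small lifetime calculation: under the usual raw-lifetime view, the cache fails when its most heavily written block exhausts its endurance, so reducing the maximum per-block write count directly scales lifetime. The endurance value and write distributions below are illustrative only, not data from the paper.

```python
# Why leveling intra-set write variation extends lifetime: lifetime is limited
# by the maximum per-block write count, not the average.
ENDURANCE = 4_000_000          # assumed per-block write endurance of the NVM cells

def relative_lifetime(writes_per_block):
    """Intervals survivable before the worst-worn block hits the endurance limit,
    assuming the observed per-interval write distribution repeats."""
    return ENDURANCE / max(writes_per_block)

if __name__ == "__main__":
    # Same total write traffic to a 4-way set, distributed two different ways.
    unleveled = [9_000, 600, 300, 100]        # one hot block absorbs most writes
    leveled   = [2_500, 2_500, 2_500, 2_500]  # wear-leveled distribution
    gain = relative_lifetime(leveled) / relative_lifetime(unleveled)
    print(f"lifetime gain from leveling: {gain:.2f}X")
```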
As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths; hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs), such as workload partitioning, that enable using both the CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide researchers with insights into the working and application scope of HCTs and motivate them to further harness the computational power of CPUs and GPUs to achieve the goal of exascale performance.
As the number of cores on a chip increases and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amounts of data. In the face of such challenges, data compression is a promising approach to increase effective memory system capacity and also provide performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help researchers gain insight into the potential role of compression in the memory components of future extreme-scale systems.
[Download source code from: https://code.ornl.gov/3d_cache_modeling_tool/destiny] To enable the design of large caches, novel memory technologies (such as non-volatile memory) and novel fabrication approaches (e.g. 3D stacking) have been explored. The existing modeling tools, however, cover only a few memory technologies, CMOS technology nodes and fabrication approaches. We present DESTINY, a tool for modeling 3D (and 2D) cache designs using SRAM, embedded DRAM (eDRAM), spin transfer torque RAM (STT-RAM), resistive RAM (ReRAM) and phase change RAM (PCM). DESTINY is very useful for performing design-space exploration across several dimensions, such as optimizing for a target (e.g. latency, area or energy-delay product) for a given memory technology, or choosing the suitable memory technology or fabrication method (i.e. 2D vs. 3D) for a given optimization target. DESTINY has been validated against several cache prototypes. We believe that DESTINY will boost studies of next-generation memory architectures used in systems ranging from mobile devices to extreme-scale supercomputers.
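The sketch below shows the style of design-space sweep such a tool supports: enumerating (technology, layer-count) candidates and ranking them by energy-delay product. The run_model function and every number in it are fabricated placeholders standing in for an invocation of the tool; they are not DESTINY's real interface, configuration format or data.

```python
# Illustrative design-space exploration loop around a cache modeling tool.
from itertools import product

TECHNOLOGIES = ["SRAM", "eDRAM", "STT-RAM", "ReRAM", "PCM"]
LAYERS = [1, 2, 4]                  # 2D (1 layer) vs. 3D-stacked candidates

def run_model(tech, layers):
    """Placeholder: pretend to return (access latency in ns, access energy in nJ)
    for a fixed-capacity cache built with `tech` across `layers` dies."""
    fake = {"SRAM": (2.0, 0.40), "eDRAM": (3.0, 0.25), "STT-RAM": (5.0, 0.15),
            "ReRAM": (7.0, 0.10), "PCM": (10.0, 0.30)}
    latency, energy = fake[tech]
    # Crude made-up scaling: stacking shortens wires but adds a little energy.
    return latency / (layers ** 0.5), energy * (1.0 + 0.05 * (layers - 1))

def energy_delay_product(latency_ns, energy_nj):
    return latency_ns * energy_nj

if __name__ == "__main__":
    best = min(product(TECHNOLOGIES, LAYERS),
               key=lambda c: energy_delay_product(*run_model(*c)))
    print("best (technology, layers) for EDP under these placeholder numbers:", best)
```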
Recent trends of CMOS scaling and an increasing number of on-chip cores have led to a large increase in the size of on-chip caches. Since SRAM has low density and consumes a large amount of leakage power, its use in designing on-chip caches has become more challenging. To address this issue, researchers are exploring the use of several emerging memory technologies, such as embedded DRAM, spin transfer torque RAM, resistive RAM, phase change RAM and domain wall memory. In this paper, we survey the architectural approaches proposed for designing memory systems and, specifically, caches with these emerging memory technologies. To highlight their similarities and differences, we present a classification of these technologies and architectural approaches based on their key characteristics. We also briefly summarize the challenges in using these technologies for architecting caches. We believe that this survey will help the readers gain insights into emerging memory device technologies and their potential use in designing future computing systems.
As evidenced by the popularity of MPI (Message Passing Interface), message passing is an effective programming technique for managing coarse-grained concurrency on distributed computers. Unfortunately, debugging message-passing applications can be difficult. Software complexity, data races, and scheduling dependencies can make programming errors challenging to locate with manual, interactive debugging techniques. This article describes Umpire, a new tool for detecting programming
This annotated bibliography reviews current research in dynamic and interactive program steering. In particular, we review systems-related research addressing dynamic program steering, raising issues in operating and language systems, mechanisms and algorithms for dynamic program adaptation, program monitoring and the associated data storage techniques, and the design of dynamically steerable or adaptable programs. We define program steering as the capacity to control the execution of...
This paper presents the challenges encountered in and potential solutions to designing scalable Software Transactional Memory (STM) for large-scale distributed memory systems with thousands of nodes. We introduce Global Transactional Memory (GTM), a generalized and scalable STM design supporting a dynamic programming model based on thread-level parallelism, Single Process Multiple Data (SPMD) parallelism, and remote procedure call
Future-looking high-end computing initiatives will deploy powerful, large-scale computing platforms that leverage novel component technologies for superior node performance in advanced system architectures with tens or even hundreds of thousands of nodes. Recent advances in performance tools and modeling methodologies suggest that it is feasible to acquire such systems intelligently and achieve excellent performance, while also significantly reducing the
... Group Lead: Scott Hemmert Participants and Contributors: Jim Ang, Brian Carnes, Patrick Chiang, Doug Doerfler, Sudip Dosanjh, Parks Fields, Ken Koch, Jim Laros, Matt Leininger, John Noe, Terri Quinn, Josep Torrellas, Jeff Vetter, Cheryl Wampler, Andy White ...
The submitted manuscript has been authored by a contractor of the US Government under Contract No. DE-AC05-00OR22725. Accordingly, the US Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this ...
... Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter Oak Ridge National Laboratory {spaffordkl,jsmeredith,vetter}@ornl.gov Abstract. As heterogeneous computing platforms become more prevalent, the programmer must account for complex memory hierarchies in addition to ...
As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies these effects and presents some
Balancing Productivity and Performance on the Cell Broadband Engine Sadaf R Alam, Jeremy S Meredith, Jeffrey S Vetter Oak Ridge National Laboratory Oak Ridge, TN 37830 [email protected] [email protected] [email protected] ...
... Multi-paradigm, multi-threaded and multi-core computing devices available today provide several orders of magnitude performance improvement ... Eldorado[12]) systems —available in desktop computing systems as well as proposed for Petaflops-scale supercomputing platforms ...
... Any adaptive control system must implicitly or explicitly monitor pertinent system states, determine what changes are needed, and realize those changes to ...
Abstract. As cluster computing has grown, so has its use for large scientific calculations. Recently, many researchers have experimented with using MPI between nodes of a clustered machine and OpenMP within a node, to manage the use of parallel processing. ...
Presents the welcome message from the conference proceedings.
Characterizing the behavior of a scientific application and its associated proxy application is essential for determining whether the proxy application actually mimics the full application. To support our ongoing characterization activities, we have developed the Oxbow toolkit and an associated data store infrastructure for collecting, storing, and querying this characterization information. This paper presents recent updates to the Oxbow toolkit and introduces the Oxbow project's Performance Analytics Data Store (PADS). To demonstrate the insights possible when using the toolkit and data store, we compare the characterizations of several full and proxy applications, along with the High Performance Linpack (HPL) and High Performance Conjugate Gradient (HPCG) benchmarks. Using techniques such as cluster visualizations of PADS data across many experiments, we found unexpected similarities and differences between proxy applications, and a greater similarity of proxy applications to HPCG than to HPL along many dimensions.
As detailed in recent reports, HPC architectures will continue to change over the next decade in an effort to improve energy efficiency, reliability, and performance. At this time of significant disruption, it is critically important to understand specific application requirements so that these architectural changes can include features that satisfy the requirements of contemporary extreme-scale scientific applications. To address this need, we have developed a methodology, supported by a toolkit, that allows us to investigate detailed computation, memory, and communication behaviors of applications at varying levels of resolution. Using this methodology, we performed a broad-based, detailed characterization of 12 contemporary scalable scientific applications and benchmarks. Our analysis reveals numerous behaviors that sometimes contradict conventional wisdom about scientific applications. For example, the results reveal that only one of our applications executes more floating-point instructions than other types of instructions. In another example, we found that communication topologies are very regular, even for applications that, at first glance, should be highly irregular. These observations emphasize the necessity of measurement-driven analysis of real applications and help prioritize features that should be included in future architectures.


Driven by the trends of increasing core count and the bandwidth-wall problem, the size of last-level caches (LLCs) has greatly increased. Since SRAM consumes high leakage power, researchers have explored the use of non-volatile memories (NVMs) for designing caches, as they provide high density and consume low leakage power. However, since NVMs have low write endurance and existing cache management policies are write-variation-unaware, effective wear-leveling techniques are required to achieve reasonable cache lifetimes with NVMs. We present WriteSmoothing, a technique for mitigating intra-set write variation in NVM caches. WriteSmoothing logically divides the cache sets into multiple modules. For each module, it collectively records the number of writes to each way over all sets of the module. It then periodically makes the most frequently written ways in a module unavailable, shifting the write pressure to the other ways in the sets of that module. Extensive simulation results show that, on average, for single and dual-core system configurations, WriteSmoothing improves cache lifetime by 2.17X and 2.75X, respectively. Its implementation overhead is small, and it works well for a wide range of algorithm and system parameters.
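A minimal sketch of the bookkeeping follows: one write counter per (module, way) aggregated over the module's sets, and an end-of-interval step that withholds the most-written ways for the next interval. The interval handling and the number of ways disabled at a time are illustrative assumptions, not the paper's parameters.

```python
# Illustrative module-level way-blocking for intra-set wear-leveling.
ASSOC = 8
WAYS_DISABLED_PER_MODULE = 2     # illustrative choice, not the paper's parameter

class Module:
    def __init__(self):
        self.way_writes = [0] * ASSOC      # writes to each way, summed over the module's sets
        self.unavailable = set()           # ways blocked for the current interval

    def record_write(self, way):
        self.way_writes[way] += 1

    def rebalance(self):
        """End-of-interval step: block the most-written ways for the next interval."""
        hottest = sorted(range(ASSOC), key=lambda w: self.way_writes[w], reverse=True)
        self.unavailable = set(hottest[:WAYS_DISABLED_PER_MODULE])

    def allocatable_ways(self):
        return [w for w in range(ASSOC) if w not in self.unavailable]

if __name__ == "__main__":
    m = Module()
    # Skewed traffic: ways 0 and 1 absorb most writes during this interval.
    for way, count in [(0, 500), (1, 300), (2, 40), (3, 20)]:
        for _ in range(count):
            m.record_write(way)
    m.rebalance()
    print("ways withheld next interval:", sorted(m.unavailable))   # [0, 1]
    print("ways still allocatable:", m.allocatable_ways())
```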
Use of NVM (non-volatile memory) devices such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM) for designing on-chip caches holds the promise of providing a high-density, low-leakage alternative to SRAM. However, the low write endurance of NVMs, along with the write variation introduced by existing cache management schemes, may significantly limit the lifetime of NVM caches. We present LastingNVCache, a technique for improving the lifetime of NVM caches by mitigating intra-set write variation. LastingNVCache works on the key idea that by periodically flushing a frequently written data item, that item can be made to load into a cold block of the set the next time it is accessed. Through this, future writes to the data item are redirected from a hot block to a cold block, which improves the cache lifetime. Microarchitectural simulations show that LastingNVCache provides 6.36X, 9.79X, and 10.94X improvement in lifetime for single, dual and quad-core systems, respectively. Its implementation overhead is small, and it outperforms a recently proposed technique for improving the lifetime of NVM caches.
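The sketch below illustrates the flush-and-refill idea: once a block has absorbed a threshold number of writes since its last fill, the data item it holds is flushed (written back and invalidated), and the next miss to that item fills the least-written block of the set. The threshold and the fill policy are assumptions for illustration, not the paper's exact mechanism.

```python
# Illustrative flush-and-refill wear-leveling for one NVM cache set.
class NVSet:
    def __init__(self, associativity=4, flush_threshold=16):
        self.tags = [None] * associativity
        self.writes = [0] * associativity        # lifetime writes per physical block
        self.writes_since_fill = [0] * associativity
        self.flush_threshold = flush_threshold   # assumed trigger, not the paper's value

    def access_write(self, tag):
        way = self._find(tag)
        if way is None:
            way = self._fill(tag)                # miss: place item in the coldest block
        self.writes[way] += 1
        self.writes_since_fill[way] += 1
        if self.writes_since_fill[way] >= self.flush_threshold:
            self.tags[way] = None                # flush: write back and invalidate

    def _find(self, tag):
        return self.tags.index(tag) if tag in self.tags else None

    def _fill(self, tag):
        victim = min(range(len(self.tags)), key=lambda w: self.writes[w])
        self.tags[victim] = tag
        self.writes_since_fill[victim] = 0
        return victim

if __name__ == "__main__":
    s = NVSet()
    for _ in range(200):
        s.access_write("hot-item")               # one write-intensive data item
    print("per-block write counts:", s.writes)   # spread across the set instead of one block
```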