Dr. Sparsh Mittal is currently working as an assistant professor in the ECE department at IIT Roorkee, India. He is also a joint faculty at the Mehta Family School of DS and AI at IIT Roorkee. He received the B.Tech. degree from IIT Roorkee, India, and the Ph.D. degree from Iowa State University (ISU), USA. He has worked as a Post-Doctoral Research Associate at Oak Ridge National Lab (ORNL), USA, and as an assistant professor at IIT Hyderabad, India. He was the graduating topper of his batch in B.Tech and has received a fellowship from ISU and a performance award from ORNL. His research interests are processor architectures for machine learning, neural network accelerators, computer architecture, VLSI, high-performance computing and approximate computing.
He has published more than 100 papers at top venues. His research has been covered by InsideHPC, HPCWire, Phys.org and ScientificComputing. Also, his research on mobile-phone usage detection was praised by the education minister of India and was covered by leading newspapers. He has given invited talks at the ISC Conference in Germany, New York University, and the University of Michigan. In Stanford's 2021 list of the world's top researchers in the field of Computer Hardware and Architecture, he was ranked as number 71 (for the whole career) and as number 3 (for the year 2020 alone). He has more than 4000 followers on Academia and ResearchGate.
Recent advances in pre-trained neural language models have substantially enhanced the performance of numerous natural language processing (NLP) tasks. However, some existing models require pretraining on a large dataset. Moreover, on using a deep network with sequentially connected transformer blocks, there is a data loss across these blocks. To overcome these challenges, we propose LiBERTy, a novel network for natural language understanding. LiBERTy uses a novel TransLSTM module, which takes the representations from the BERT block as input and feeds them to an LSTM that functions as a pooling layer. The use of an LSTM as a pooler helps the model sequentially encode the feature map into hidden states and understand semantic interrelations. The output of the TransLSTM module is fed to a classifier, which uses multiple 1D-CONV blocks, a 1D adaptive average pooling layer and a "fully-connected" (FC) layer, followed by the ArcFace loss. The ArcFace loss helps in achieving inter-class separability and intra-class compactness. Our proposed strategies increase the efficiency of model pre-training and the performance of both natural language understanding (NLU) and downstream tasks. We showcase the efficacy of LiBERTy by applying it to three tasks: (1) disaster tweet classification on the HumAID dataset, (2) fine-grained emotion analysis on the GoEmotions dataset and (3) named entity recognition on the TASTEset dataset.
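To make the TransLSTM data flow concrete, here is a minimal PyTorch sketch, assuming a bert-base-uncased backbone and an illustrative LSTM hidden size; it follows the description above but is not the authors' implementation.

```python
# Hypothetical sketch: BERT token representations pooled by an LSTM, as in
# the TransLSTM module described above. Sizes and names are illustrative.
import torch.nn as nn
from transformers import BertModel

class TransLSTM(nn.Module):
    def __init__(self, bert_hidden=768, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The LSTM acts as the pooler: it reads the token representations
        # sequentially, encoding semantic interrelations in its hidden state.
        self.lstm = nn.LSTM(bert_hidden, lstm_hidden, batch_first=True)

    def forward(self, input_ids, attention_mask):
        feats = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(feats)   # h_n: (1, batch, lstm_hidden)
        return h_n.squeeze(0)            # pooled representation for the classifier
```

The pooled vector would then feed the 1D-CONV classifier head trained with the ArcFace loss.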
This research investigates how heavy-ion irradiation affects the single event transient (SET) response of 14nm silicon-on-insulator (SOI) FinFETs. Researchers generally use a TCAD tool (e.g., Sentaurus TCAD) for developing a SET pulse current model. However, TCAD simulations are time-consuming, which prohibits efficient design-space exploration. We propose efficient models for predicting SET pulse current with high accuracy, using (1) polynomial chaos (PC) based models, (2) ML regression techniques, and (3) artificial neural network and 1D convolutional neural network (1D-CNN) based models. The strike of a heavy ion leads to transient behavior, which is very different from the normal behavior; hence, for all the above predictors, we also evaluate the corresponding piecewise predictors. While TCAD tools take 4 hours for each simulation on a high-end computer, our proposed models have much lower latency (a few seconds), which allows designers to explore a larger design space. Our proposed piecewise 1D-CNN model achieves a state-of-the-art MSE of 2.15 × 10⁻⁶ mA². Overall, our study provides insights into how PC and ML-based regression models can be used to enhance the efficiency of SET analysis in circuit design.
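As an illustration of the piecewise idea, the sketch below fits separate regressors for the transient and post-transient regions, assuming a known split point in time; the features, split point and choice of random forests are hypothetical, not the paper's exact models.

```python
# Hedged sketch of a piecewise predictor for SET pulse current.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_piecewise(X, t, y, t_split):
    """Fit one regressor before and one after the assumed transient boundary."""
    pre, post = t < t_split, t >= t_split
    m_pre = RandomForestRegressor().fit(X[pre], y[pre])
    m_post = RandomForestRegressor().fit(X[post], y[post])

    def predict(Xq, tq):
        out = np.empty(len(Xq))
        sel = tq < t_split
        if sel.any():
            out[sel] = m_pre.predict(Xq[sel])     # pre-transient model
        if (~sel).any():
            out[~sel] = m_post.predict(Xq[~sel])  # post-transient model
        return out

    return predict
```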
Text erasure from an image is helpful for various tasks such as image editing and privacy preservation. We present TPFNet, a novel one-stage network for text removal from images. TPFNet has two parts: feature synthesis and image generation. Since noise can be more effectively removed from low-resolution images, the first part operates on low-resolution images; it uses PVT or EfficientNet-B as the encoder. Further, we use a novel multi-headed decoder that generates a high-pass filtered image and a segmentation map, along with a text-free image. The segmentation branch helps locate the text precisely, and the high-pass branch helps in learning the image structure. The second part uses the features learned in the first part to predict a high-resolution text-free image. To precisely locate the text, TPFNet employs an adversarial loss that is conditional on the segmentation map rather than the input image. On the Oxford, SCUT, SCUT-EnsText and ICDAR datasets, TPFNet outperforms recent networks on nearly all the metrics; e.g., on the Oxford dataset, TPFNet achieves a higher PSNR (higher is better) and a lower text-detection precision (lower is better) than MTRNet++. The source code can be obtained from https://github.com/CandleLabAI/TPFNet.
By exploiting the gap between the user's accuracy requirement and the hardware's accuracy capability, approximate circuit design offers enormous gains in efficiency for a minor accuracy loss. In this paper, we propose two approximate floating point multipliers (AxFPMs), named DTCL (decomposition, truncation and chunk-level leading-one quantization) and TDIL (truncation, decomposition and ignoring LSBs). Both AxFPMs introduce approximation in mantissa multiplication. DTCL works by rounding and truncating LSBs and quantizing each chunk. TDIL works by truncating LSBs and ignoring the least important terms in the multiplication. Further, both techniques multiply the more significant terms by simple exponent addition or shift-and-add operations. These AxFPMs are configurable and allow trading off accuracy with hardware overhead. Compared to the exact floating-point multiplier (FPM), DTCL(4,8,8) reduces area, energy and delay by 11.0%, 69% and 61%, respectively, while incurring a mean relative error of only 2.37%. On a range of approximate applications from the machine learning, deep learning and image processing domains, our AxFPMs greatly improve efficiency with only a minor loss in accuracy. For example, for image sharpening and Gaussian smoothing, all DTCL and TDIL variants achieve a PSNR of more than 30dB. The source code is available at https://github.com/CandleLabAI/ApproxFloatingPointMultiplier.
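To give a flavor of how mantissa truncation trades accuracy for cost, the following is a minimal Python sketch of a truncation-based approximate floating-point multiply. It is illustrative only: the `keep` parameter is hypothetical and the code does not reproduce the exact DTCL/TDIL designs.

```python
# Hedged sketch: approximate FP32 multiply via mantissa LSB truncation.
# Assumes normalized, finite inputs; not the paper's exact scheme.
import struct

def float_to_parts(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    man = bits & 0x7FFFFF          # 23-bit mantissa, hidden leading 1 removed
    return sign, exp, man

def approx_mul(a, b, keep=8):
    """Multiply floats after truncating each mantissa to `keep` MSBs."""
    sa, ea, ma = float_to_parts(a)
    sb, eb, mb = float_to_parts(b)
    drop = 23 - keep
    # Restore the hidden 1, then zero out the LSBs of each mantissa.
    ma = ((ma | (1 << 23)) >> drop) << drop
    mb = ((mb | (1 << 23)) >> drop) << drop
    # Exact arithmetic on the truncated operands yields the approximate product.
    va = (-1) ** sa * ma * 2.0 ** (ea - 127 - 23)
    vb = (-1) ** sb * mb * 2.0 ** (eb - 127 - 23)
    return va * vb

print(approx_mul(3.14159, 2.71828), 3.14159 * 2.71828)  # approximate vs. exact
```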
The rapid growth in the volume and complexity of PCB design has encouraged researchers to explore automatic visual inspection of PCB components. Automatic identification of PCB components such as resistors, transistors, etc., can provide several benefits, such as producing a bill of materials, defect detection, and e-waste recycling. Yet, visual identification of PCB components is challenging since PCB components have different shapes, sizes, and colors depending on the material used and the functionality.
The paper proposes a lightweight and novel neural network, Dilated Involutional Pyramid Network (DInPNet), for the classification of PCB components on the FICS-PCB dataset. DInPNet uses involutions in place of convolutions; involutions have characteristics inverse to those of convolutions, being location-specific and channel-agnostic. We introduce the dilated involutional pyramid (DInP) block, which consists of an involution that transforms the input feature map into a low-dimensional space for reduced computational cost, followed by a pairwise pyramidal fusion of dilated involutions that resamples the feature map back. This enables learning representations for a large effective receptive field while considerably bringing down the number of parameters. DInPNet, with a total of 531,485 parameters, achieves 95.48% precision, 95.65% recall, and 92.59% MCC (Matthews correlation coefficient). To our knowledge, we are the first to use involution for PCB component classification. The code is released at https://github.com/CandleLabAI/DInPNet-PCB-Component-Classification.
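For reference, below is a minimal PyTorch sketch of the basic involution operator (Li et al., CVPR 2021) that DInPNet builds on: the kernel is generated per spatial location (location-specific) and shared across channel groups (channel-agnostic). The kernel-generation network, group count and reduction ratio are illustrative, not DInPNet's exact DInP block.

```python
# Minimal involution layer sketch; parameters are illustrative.
import torch
import torch.nn as nn

class Involution(nn.Module):
    def __init__(self, channels, k=3, groups=1, reduction=4):
        super().__init__()
        self.k, self.groups = k, groups
        # Generate a k*k kernel per pixel (per group) from the input itself.
        self.kernel_gen = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, k * k * groups, 1))
        self.unfold = nn.Unfold(k, padding=k // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        kernel = self.kernel_gen(x).view(b, self.groups, 1, self.k * self.k, h * w)
        patches = self.unfold(x).view(b, self.groups, c // self.groups,
                                      self.k * self.k, h * w)
        # Location-specific kernel, shared across the channels of each group.
        out = (kernel * patches).sum(dim=3)
        return out.view(b, c, h, w)

print(Involution(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```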
In "vision and language" problems, multimodal inputs are simultaneously processed for combined visual and textual understanding for image-text embedding. In this paper, we discuss the necessity of considering the difference between the... more
In "vision and language" problems, multimodal inputs are simultaneously processed for combined visual and textual understanding for image-text embedding. In this paper, we discuss the necessity of considering the difference between the feature space and the distribution when performing multimodal learning. We deal with this problem through deep learning and a generative model approach. We introduce a novel network, GAFNet (Global Attention Fourier Net), which learns through large-scale pre-training over three image-text datasets (COCO, SBU, and CC-3M), for achieving high performance on downstream vision and language tasks. We propose a GAF (Global Attention Fourier) module, which integrates multiple modalities into one latent space. GAF module is independent of the type of modality, and it allows combining shared representations at each stage. Various ways of thinking about the relationships between different modalities directly affect the model's design. In contrast to previous research, our work considers visual grounding as a pretrainable and transferable quality instead of something that must be trained from scratch. We show that GAFNet is a versatile network that can be used for a wide range of downstream tasks. Experimental results demonstrate that our technique achieves state-of-theart performance on multimodal classification on the Cri-sisMD dataset and image generation on the COCO dataset. For image-text retrieval, our technique achieves competitive performance.
We propose a novel deep learning model named ACLNet for cloud segmentation from ground images. ACLNet uses both a deep neural network and a machine learning (ML) algorithm to extract complementary features. Specifically, it uses EfficientNet-B0 as the backbone, "à trous spatial pyramid pooling" (ASPP) to learn at multiple receptive fields, and a "global attention module" (GAM) to extract fine-grained details from the image. ACLNet also uses k-means clustering to extract cloud boundaries more precisely. ACLNet is effective for both daytime and nighttime images. It provides a lower error rate, higher recall and higher F1-score than state-of-the-art cloud segmentation models. We will release the source code of ACLNet as open source.
This paper proposes a novel merged-accumulation-based approximate MAC (multiply-accumulate) unit, MEGA-MAC, for accelerating error-resilient applications. MEGA-MAC utilizes a novel rearrangement and compression strategy in the multiplication stage and a novel approximate "carry predicting adder" (CPA) in the accumulation stage. The addition and multiplication operations are merged, which reduces the delay. MEGA-MAC provides knobs to exercise a tradeoff between accuracy and resource overhead. Compared to the accurate MAC unit, MEGA-MAC(8,6) (i.e., a MEGA-MAC unit with a chunk size of 6 bits, operating on 8-bit input operands) reduces the power-delay-product (PDP) by 49.4%, while incurring a mean error percentage of only 4.2%. Compared to state-of-the-art approximate MAC units, MEGA-MAC achieves a better balance between resource saving and accuracy loss. The source code is available at https://sites.google.com/view/mega-mac-approximate-mac-unit/.
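To give a feel for carry prediction, below is a hedged Python sketch of a chunked adder whose inter-chunk carries are predicted from the operands' chunk MSBs instead of propagated exactly; the chunking and prediction rule are illustrative, not MEGA-MAC's exact CPA.

```python
# Hedged sketch of chunk-based addition with predicted inter-chunk carries.
def approx_add(a, b, width=16, chunk=6):
    result, carry = 0, 0
    mask = (1 << chunk) - 1
    for lo in range(0, width, chunk):
        ca, cb = (a >> lo) & mask, (b >> lo) & mask
        s = ca + cb + carry
        result |= (s & mask) << lo
        # Predict the next chunk's carry-in from the current chunk MSBs rather
        # than waiting for the exact carry chain (this is what cuts the delay
        # in hardware, where chunks can then be computed in parallel).
        carry = ((ca >> (chunk - 1)) & (cb >> (chunk - 1))) & 1
    return result

print(approx_add(51000, 12345), 51000 + 12345)  # approximate vs. exact sum
```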
In the deep sub-micron region, "spin-transfer torque RAM" (STT-RAM) suffers from "read-disturbance error" (RDE), whereby a read operation disturbs the stored data. Mitigation of RDE requires restore operations, which impose latency and energy penalties. Hence, RDE presents a crucial threat to the scaling of STT-RAM. In this paper, we offer three techniques to reduce the restore overhead. First, we avoid the restore operations for those reads where the block will get updated at a higher-level cache in the near future. Second, we identify read-intensive blocks using a lightweight mechanism and then migrate these blocks to a small SRAM buffer; on a future read to these blocks, the restore operation is avoided. Third, for data blocks having zero value, the write operation is avoided and only a flag is set; based on this flag, both read and restore operations to such a block are avoided. We combine these three techniques to design our final policy, named CORIDOR. Compared to a baseline policy, which performs a restore operation after each read, CORIDOR achieves a 31.6% reduction in total energy and brings the relative CPI (cycles-per-instruction) to 0.64×. By contrast, an ideal RDE-free STT-RAM saves 42.7% energy and brings the relative CPI to 0.62×. Thus, our CORIDOR policy achieves nearly the same performance as an ideal RDE-free STT-RAM cache, and it reaches three-fourths of the energy saving achieved by the ideal RDE-free cache. We also compare CORIDOR with four previous techniques and show that CORIDOR provides higher restore energy savings than these techniques.
This paper presents a multi-level design for spin-orbit torque (SOT) assisted spin-transfer torque (STT) based four-bit magnetic random access memory (MRAM). Multi-level cell (MLC) design is an effective solution to increase the storage capacity of MRAM. Conventional SOT-MRAMs enable an energy-efficient, fast, and reliable write operation. However, unlike STT-MRAM, these cells take more area and require two access transistors per cell, which poses significant challenges to the use of SOT-MRAMs for high-density memory applications. To address these issues, we propose a multi-level cell that can store four bits and requires only three access transistors. The effective area per bit of the proposed cell is nearly 58% lower than that of the conventional one-bit SOT-MRAM cell. The combined effect of SOT and STT has been incorporated to design a SOT-STT based MLC that enables a more energy-efficient and faster write operation than regular MLCs. The results show that the SOT-STT based four-bit MLC is 52.9% and 40% more efficient in terms of latency and energy consumption, respectively, when compared to a three-bit SOT/STT based MLC.
Recent years have witnessed significant interest in "generative adversarial networks" (GANs) due to their ability to generate high-fidelity data. Many GAN models have been proposed for a diverse range of domains, from natural language processing to image processing. GANs have high compute and memory requirements. Also, since they involve both convolution and deconvolution operations, they do not map well to conventional accelerators designed for convolution operations. Evidently, there is a need for customized accelerators to achieve high efficiency with GANs. In this work, we present a survey of techniques and architectures for accelerating GANs. We organize the works on key parameters to bring out their differences and similarities. Finally, we present research challenges that are worthy of attention in the near future. More than summarizing the state-of-the-art, this survey seeks to spark further research in the field of GAN accelerators.
Strategies to improve the resilience of applications require the ability to distinguish vulnerability differences across application components and selectively apply protection. Hence, quantitatively modeling application vulnerability, as a method to capture vulnerability variance within the application, is critical to evaluating and improving system resilience. Traditional methods cannot effectively quantify vulnerability, because they lack a holistic view of system resilience and come with prohibitive evaluation costs. In this paper, we introduce a data-driven methodology to analyze application vulnerability based on a novel resilience metric, the data vulnerability factor (DVF). DVF integrates both the application and the specific hardware into the resilience analysis. To calculate DVF, we extend a performance modeling language to provide a fast modeling solution. Furthermore, we measure six representative computational kernels; we demonstrate the value of DVF by quantifying the impact of algorithm optimization on vulnerability and by quantifying the effectiveness of a hardware protection mechanism.
Research Interests: Algorithms, Parallel Algorithms, Computer Architecture, Computer Security and Reliability, Chemistry, Computer Engineering, Resilience, Metrics, Performance, Modeling and Simulation, Error Correction Coding, Statistical Modeling, Algorithm, Domain Specific Languages, Data Structure, Performance Modeling, Reliability, Soft Errors, Analytic Modeling, and Main Memory
The CPU is a powerful, pervasive, and indispensable platform for running deep learning (DL) workloads in systems ranging from mobile devices to extreme-end servers. In this paper, we present a survey of techniques for optimizing DL applications on CPUs. We include methods proposed for both inference and training, and those offered in the context of mobile, desktop/server, and distributed systems. We identify the areas of strength and weakness of CPUs in the field of DL. This paper will interest practitioners and researchers in the areas of artificial intelligence, computer architecture, mobile systems, and parallel computing.
3D convolution neural networks (CNNs) have shown excellent predictive performance on tasks such as action recognition from videos. Since 3D CNNs have unique characteristics and extremely high compute/memory overheads, executing them on accelerators designed for 2D CNNs provides sub-optimal performance. To overcome these challenges, researchers have recently proposed architectures for 3D CNNs. In this paper, we present a survey of hardware accelerators and hardware-aware algorithmic optimizations for 3D CNNs. We include only those CNNs that perform 3D convolution and not those that perform only 2D convolution on 2D or 3D data. We highlight their key ideas and underscore their similarities and differences. We believe that this survey will spark a great deal of research towards the design of ultra-efficient 3D CNN accelerators of tomorrow.
Intermittent computing (ImC) refers to the scenario where periods of program execution are separated by reboots. ImC systems are generally powered by energy-harvesting (EH) devices: they start executing a program when the accumulated energy reaches a threshold and stop when the energy buffer is exhausted. Since ImC does not depend on a fixed supply of power, it can be used in a wide range of scenarios/devices such as medical implants, wearables, IoT sensors, extraterrestrial systems and so on. Although attractive, ImC also brings challenges such as avoiding data loss and data inconsistency, and striking the right balance between performance, energy and quality of result. In this paper, we present a survey of techniques and systems for ImC. We organize the works on key metrics to expose their similarities and differences. This paper will equip researchers with knowledge of recent developments in ImC and also motivate them to address the remaining challenges for reaping the full potential of ImC.
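As a toy illustration of the consistency challenge, the sketch below models a computation that commits its progress to non-volatile state between power bursts, so a reboot resumes rather than restarts; the energy model and task are entirely hypothetical.

```python
# Toy model of intermittent execution with checkpointing to non-volatile state.
import random

nv = {"i": 0, "acc": 0}            # non-volatile checkpoint (survives reboots)

def run_with_power_failures(n):
    while nv["i"] < n:
        energy = random.randint(3, 8)          # harvested energy for this burst
        i, acc = nv["i"], nv["acc"]            # restore volatile state
        while energy > 0 and i < n:
            acc += i                           # one unit of work
            i += 1
            energy -= 1
        nv["i"], nv["acc"] = i, acc            # commit checkpoint atomically
        # a power failure between bursts loses no committed work, and the
        # program never observes an inconsistent mix of old and new state
    return nv["acc"]

print(run_with_power_failures(100))            # == sum(range(100)) == 4950
```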
"Recurrent neural networks" (RNNs) are powerful artificial intelligence models that have shown remarkable effectiveness in several tasks such as music generation, speech recognition and machine translation. RNN computations involve both... more
"Recurrent neural networks" (RNNs) are powerful artificial intelligence models that have shown remarkable effectiveness in several tasks such as music generation, speech recognition and machine translation. RNN computations involve both intra-timestep and inter-timestep dependencies. Due to these features, hardware acceleration of RNNs is more challenging than that of CNNs. Recently, several researchers have proposed hardware architectures for RNNs. In this paper, we present a survey of GPU/FPGA/ASIC-based accelerators and optimization techniques for RNNs. We highlight the key ideas of different techniques to bring out their similarities and differences. Improvements in deep-learning algorithms have inevitably gone hand-in-hand with the improvements in the hardware-accelerators. Nevertheless, there is a need and scope of even greater synergy between these two fields. This survey seeks to synergize the efforts of researchers in the area of deep learning, computer architecture, and chip-design.
As DNNs become increasingly common in mission-critical applications, ensuring their reliable operation has become crucial. Conventional resilience techniques fail to account for the unique characteristics of DNN algorithms/accelerators, and hence, they are infeasible or ineffective. In this paper, we present a survey of techniques for studying and optimizing the reliability of DNN accelerators and architectures. The reliability issues we cover include soft/hard errors arising due to process variation, voltage scaling, timing errors, DRAM errors due to refresh-rate scaling and thermal effects, etc. We organize the research projects into several categories to bring out their key attributes. This paper underscores the importance of designing for reliability as the first principle, rather than merely retrofitting for it.
Problems from a wide variety of application domains can be modeled as "nondeterministic finite automata" (NFAs); hence, efficient execution of NFAs can improve the performance of several key applications. However, traditional architectures such as CPUs and GPUs are not inherently suited for executing NFAs, and hence, special-purpose architectures are required for accelerating them. Micron's automata processor (AP) exploits the massively parallel in-memory processing capability of DRAM for executing NFAs and hence can provide orders-of-magnitude performance improvement over traditional architectures. In this paper, we present a survey of techniques that propose architectural optimizations to the AP and use it for accelerating problems from various application domains. This paper will be useful not only for computer architects and processor designers but also for researchers in the fields of bioinformatics, data mining, machine learning and others.
Intel's Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that compare Phi with CPUs and GPUs, or that execute applications collaboratively across them. This paper will be useful for computer architects, developers seeking to accelerate their applications and researchers in the area of high-performance computing.
Value prediction holds the promise of significantly improving performance and energy efficiency. However, if values are predicted incorrectly, significant performance overheads are incurred due to execution rollbacks. To address these overheads, value approximation has been introduced; it leverages the observation that rollbacks are not necessary as long as the application-level loss in quality due to value misprediction is acceptable to the user. However, in the context of Graphics Processing Units (GPUs), our evaluations show that the existing approximate value predictors are not optimal in improving prediction accuracy, as they do not consider memory request order, a key characteristic in determining the accuracy of value prediction. As a result, the overall data-movement reduction benefits are capped, since it is necessary to limit the percentage of predicted values (i.e., the prediction coverage) to keep the application-level error acceptable. To this end, we propose a new Address-Stride Assisted Approximate Value Predictor (ASAP) that explicitly considers memory addresses and their request-order information so as to provide high value prediction accuracy. We take advantage of our new observation that the stride between memory request addresses and the stride between their corresponding data values are highly correlated in several applications. Therefore, ASAP predicts values only for those requests that have regular strides in their addresses. We evaluate ASAP on a diverse set of GPGPU applications. The results show that ASAP can significantly improve value prediction accuracy over previously proposed mechanisms at the same coverage, or can achieve higher coverage (leading to higher performance/energy improvements) under a fixed error threshold.
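The core idea lends itself to a small sketch: predict only when the address stride is regular and the corresponding value stride is confirmed. The Python fragment below is a hypothetical illustration of that policy, not ASAP's hardware design.

```python
# Hedged sketch of address-stride assisted value prediction.
def asap_predict(history):
    """history: list of (address, value) for recent requests, oldest first."""
    if len(history) < 3:
        return None
    (a0, v0), (a1, v1), (a2, v2) = history[-3:]
    addr_stride = a1 - a0
    if addr_stride == 0 or a2 - a1 != addr_stride:
        return None                    # irregular addresses: do not predict
    value_stride = v1 - v0
    if v2 - v1 != value_stride:
        return None                    # value stride not confirmed yet
    # Regular address stride + confirmed value stride => predict the next value.
    return v2 + value_stride

print(asap_predict([(0x100, 10), (0x104, 12), (0x108, 14)]))  # -> 16
```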
The design of hardware accelerators for neural network (NN) applications involves walking a tight rope amidst the constraints of low power, high accuracy and throughput. NVIDIA's Jetson is a promising platform for embedded machine learning that seeks to achieve a balance between these objectives. In this paper, we provide a survey of works that evaluate and optimize neural network applications on the Jetson platform. We review both hardware and algorithmic optimizations performed for running NN algorithms on Jetson and show the real-life applications where these algorithms have been applied. We also review works that compare Jetson with similar platforms. While the survey focuses on Jetson as an exemplar embedded system, many of the ideas and optimizations will apply just as well to existing and future embedded systems.
It is widely believed that the ability to run AI algorithms on low-cost, low-power platforms will be crucial for achieving the "AI for all" vision. This survey seeks to provide a glimpse of the recent progress towards that goal.
Mobile web traffic has now surpassed desktop web traffic and has become the primary means for service providers to reach out to billions of end-users. Due to this trend, optimization of mobile web browsing (MWB) has gained significant attention. In this paper, we present a survey of techniques, proposed in the last 6-7 years, for improving the efficiency of web browsing on mobile systems. We review techniques from both the networking domain (e.g., proxy and browser enhancements) and the processor-architecture domain (e.g., hardware customization, thread-to-core scheduling). We organize the research works based on key parameters to highlight their similarities and differences. Beyond summarizing the recent works, this survey aims to emphasize the need for architecting for MWB as the first principle, instead of retrofitting for it.
Deep convolutional neural networks (CNNs) have recently shown very high accuracy in a wide range of cognitive tasks, and due to this, they have received significant interest from researchers. Given the high computational demands of CNNs, custom hardware accelerators are vital for boosting their performance. The high energy efficiency, computing capability and reconfigurability of FPGAs make them a promising platform for hardware acceleration of CNNs. In this paper, we present a survey of techniques for implementing and optimizing CNN algorithms on FPGAs. We organize the works into several categories to bring out their similarities and differences. This paper is expected to be useful for researchers in the areas of artificial intelligence, hardware architecture and system design.
[For clarity, the same information has been put in the uploaded PDF file.] The code can be downloaded from https://code.ornl.gov/3d_cache_modeling_tool/destiny . The technical report is at http://goo.gl/qzyWFE, which is an extension of our DATE 2015 paper http://goo.gl/3nKAM2 . The manual for the DESTINY tool is at https://code.ornl.gov/3d_cache_modeling_tool/destiny/blob/master/Doc/DESTINY_Documentation.pdf
Research Interests: Computer Architecture, Computer Engineering, Open Source Software, Modeling and Simulation, Cache Memory Design Issues, Design Tools, Memristor, Cache Memory, Spin Transfer Torque, SRAM Design, Design Space Exploration, Non-Volatile Memory Technologies, Spin Transfer Torque MRAM, Phase Change Memory (PCM) Materials, Non-Volatile Memory, Embedded DRAM, and STT-RAM
[For downloading the code, click on the drive.google.com link above, which is the same as https://drive.google.com/folderview?id=0B3CSJpITzNscMVBpb3pfUFcwVzQ&usp=sharing . For clarity, the same information has been put in the uploaded PDF file.] These codes were used in the following paper: https://www.academia.edu/3982638/A_Study_of_Successive_Over-relaxation_SOR_Method_Parallelization_Over_Modern_HPC_Languages
Research Interests: Parallel Computing, High Performance Computing, Programming Languages, Computer Engineering, Computational Fluid Dynamics, Parallel Programming, Open Source Software, Linear Algebra, Computer Programming, High Performance Scientific Computing, Multithreading, High Performance Computing (HPC), High Performance Computing Languages, Google's Go Language, Chapel Language, and Successive Over-relaxation
Summary of "A Survey Of Architectural Techniques for Near-Threshold Computing" https://www.academia.edu/15478916/A_Survey_Of_Architectural_Techniques_for_Near-Threshold_Computing
Summary of "A Survey of Techniques for Cache Partitioning in Multicore Processors" https://www.academia.edu/31779426/A_Survey_of_Techniques_for_Cache_Partitioning_in_Multicore_Processors
Summary of "A Survey of Techniques for Architecting TLBs" https://www.academia.edu/29585076/A_Survey_of_Techniques_for_Architecting_TLBs
A summary of "A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors", CSUR, 2016 https://www.academia.edu/18301534/A_Survey_Of_Techniques_for_Architecting_and_Managing_Asymmetric_Multicore_Processors
Covers: (1) PCM resistance drift error, (2) PCM write disturbance error, (3) STT-RAM read disturbance error, and (4) STT-RAM write errors.
An overview of cache bypassing techniques
Ongoing process scaling and the push for performance have led to increasingly severe reliability challenges for memory systems. For example, process variation (PV)--deviation of parameters from their nominal specifications--threatens to slow down and even pause technological scaling, and addressing it is important for continuing the benefits of chip miniaturization. In this talk, I will present a brief background on memory reliability challenges, viz., PV and soft errors, along with their impact on systems ranging from embedded systems to supercomputers. Then, several architectural strategies for managing PV in different processor components (e.g., core, cache and main memory) and memory technologies (e.g., SRAM, embedded DRAM, DRAM and non-volatile memory) will be discussed.
PPT for the GLSVLSI 2016 paper, which is available at https://www.academia.edu/25461561/Reducing_Soft-error_Vulnerability_of_Caches_using_Data_Compression
DESTINY code for download: https://code.ornl.gov/3d_cache_modeling_tool/destiny . DESTINY research paper: http://www.academia.edu/9741921/Exploring_Design_Space_of_3D_NVM_and_eDRAM_Caches_Using_DESTINY_Tool . DESTINY manual: https://code.ornl.gov/3d_cache_modeling_tool/destiny/blob/master/Doc/DESTINY_Documentation.pdf . Join the DESTINY user mailing list and see previous posts at https://elist.ornl.gov/pipermail/destiny-help/ .
This is the PPT for the paper http://www.academia.edu/16541472/AYUSH_Extending_Lifetime_of_SRAM-NVM_Way-based_Hybrid_Caches_Using_Wear-leveling published in IEEE MASCOTS 2015
With the increasing number of on-chip cores and CMOS scaling, the size of last-level caches (LLCs) is on the rise; hence, managing their leakage energy consumption has become vital for continuing to scale performance. In multicore systems, the locality of the memory access stream is significantly reduced due to multiplexing of access streams from different running programs; hence, leakage energy saving techniques such as decay cache, which rely on memory access locality, do not save a large amount of energy. Techniques based on way-level allocation provide very coarse granularity, and techniques based on offline profiling become infeasible for a large number of cores. We present MASTER, a multicore cache energy saving technique using dynamic cache reconfiguration. MASTER uses online profiling to predict the energy consumption of running programs at multiple LLC sizes. Using these estimates, suitable cache quotas are allocated to different programs using a cache-coloring scheme, and the unused LLC space is turned off to save energy. Even for 4-core systems, the implementation overhead of MASTER is only 0.8% of the L2 size. We evaluate MASTER using out-of-order simulations with multiprogrammed workloads from SPEC2006 and compare it with conventional cache leakage energy saving techniques. The results show that MASTER gives the highest energy saving and does not harm performance or cause unfairness. For 2- and 4-core simulations, the average savings in memory subsystem (LLC plus main memory) energy over a shared baseline LLC are 15% and 11%, respectively. Also, the average values of weighted speedup and fair speedup are close to one (>0.98).
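A schematic of the coloring step may help: given the per-program LLC sizes predicted by online profiling, page colors are allocated proportionally and leftover colors are gated off. All numbers and names below are illustrative, not MASTER's exact algorithm.

```python
# Hypothetical sketch of cache-coloring-based quota allocation.
def allocate_colors(predicted_best_size_mb, total_colors=64, cache_size_mb=4):
    """predicted_best_size_mb: {program: energy-optimal LLC share in MB}."""
    colors_per_mb = total_colors / cache_size_mb
    alloc = {p: int(s * colors_per_mb) for p, s in predicted_best_size_mb.items()}
    gated_off = total_colors - sum(alloc.values())  # unused colors are turned off
    return alloc, gated_off

alloc, off = allocate_colors({"mcf": 2.0, "bzip2": 0.5})
print(alloc, "colors gated off:", off)  # {'mcf': 32, 'bzip2': 8} colors gated off: 24
```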
Recent trends of CMOS scaling and the use of large last-level caches (LLCs) have led to a significant increase in the leakage energy consumption of LLCs; hence, managing their energy consumption has become extremely important in modern processor design. Conventional cache energy saving techniques require offline profiling or provide only coarse granularity of cache allocation. We present FlexiWay, a cache energy saving technique that uses dynamic cache reconfiguration. FlexiWay logically divides the cache sets into multiple (e.g., 16) modules and dynamically turns off a suitable, and possibly different, number of cache ways in each module. FlexiWay has very small implementation overhead, and it provides fine-grain cache allocation even with caches of typical associativity, e.g., an 8-way cache. Microarchitectural simulations have been performed using an x86-64 simulator and workloads from the SPEC2006 suite, and FlexiWay has been compared with two conventional energy saving techniques. The results show that FlexiWay provides the largest energy saving and incurs only a small loss in performance. For single, dual and quad-core systems, the average energy savings using FlexiWay are 26.2%, 25.7% and 22.4%, respectively.
To address the limitations of SRAM, such as high leakage and low density, researchers have explored the use of non-volatile memory (NVM) devices, such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM), for designing on-chip caches. A crucial limitation of NVMs, however, is that their write endurance is low, and the large intra-set write variation introduced by existing cache management policies may further exacerbate this problem, thereby reducing cache lifetime significantly. We present EqualChance, a technique to increase cache lifetime by reducing intra-set write variation. EqualChance works by periodically changing the physical cache-block location of a write-intensive data item within a set to achieve wear-leveling. Simulations using workloads from the SPEC CPU2006 suite and the HPC (high-performance computing) field show that EqualChance improves cache lifetime by 4.29X. Also, its implementation overhead is small, and it incurs very small performance and energy loss.
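A toy model conveys the intra-set wear-leveling idea: when a data item becomes write-intensive, relocate it to the least-worn way of its set. The set size, threshold and swap policy below are illustrative, not EqualChance's exact mechanism.

```python
# Toy illustration of intra-set wear-leveling by block relocation.
def writeback(set_writes, set_data, item, hot_threshold=100):
    """set_writes[i]: writes absorbed by way i; set_data[i]: block in way i."""
    way = set_data.index(item)
    set_writes[way] += 1
    if set_writes[way] % hot_threshold == 0:
        cold = min(range(len(set_writes)), key=set_writes.__getitem__)
        # Swap the hot item into the least-worn way to even out the wear.
        set_data[way], set_data[cold] = set_data[cold], set_data[way]

writes, data = [0, 0, 0, 0], ["A", "B", "C", "D"]
for _ in range(500):
    writeback(writes, data, "A")     # "A" is write-intensive
print(writes)                        # wear spreads across ways: [200, 100, 100, 100]
```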
With each CMOS technology generation, leakage energy has been increasing at an exponential rate; hence, managing the energy consumption of large, last-level caches is becoming a critical research issue in modern chip design. Saving cache energy in QoS systems is especially challenging, since, to avoid missing deadlines, a suitable balance needs to be struck between energy saving and performance loss. In this paper, we propose CASHIER, a cache energy saving technique for quality-of-service systems. Cashier uses dynamic profiling to estimate the memory subsystem energy and execution time of the program under multiple last-level cache (LLC) configurations. It then reconfigures the LLC to an energy-efficient configuration with a view to meeting the deadline. In QoS systems, the allowed slack may be specified either as a percentage of baseline execution time or as absolute slack, and Cashier can work for both these cases. The experiments show the effectiveness of Cashier in saving cache energy. For example, for an L2 cache size of 2MB and 5% allowed slack over baseline, the average saving in memory subsystem energy by using Cashier is 23.6%.
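The selection step can be sketched in a few lines: among the profiled LLC configurations, pick the lowest-energy one whose predicted execution time fits within the deadline. The profile numbers and names below are illustrative, not Cashier's actual estimator.

```python
# Hedged sketch of deadline-aware configuration selection.
def pick_config(profiles, baseline_time, slack_pct):
    """profiles: {config: (predicted_time, predicted_energy)}."""
    deadline = baseline_time * (1 + slack_pct / 100.0)
    feasible = {c: e for c, (t, e) in profiles.items() if t <= deadline}
    # Fall back to the baseline configuration if nothing meets the deadline.
    return min(feasible, key=feasible.get) if feasible else "baseline"

profiles = {"2MB": (100, 50), "1MB": (103, 40), "512KB": (112, 33)}
print(pick_config(profiles, baseline_time=100, slack_pct=5))  # -> "1MB"
```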
As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors; hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM pose serious obstacles to its use in designing large on-chip caches, and for this reason, researchers are exploring non-volatile memory (NVM) devices, such as spin transfer torque RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making them a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help readers gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.