  • Dr. Sparsh Mittal is currently working as an assistant professor in the ECE department at IIT Roorkee, India.
Approximate computing offers significant gains in efficiency at the cost of minor errors. In this paper, we show that since approximate computing legitimizes controlled imprecision, this very relaxation can be exploited by an adversary to insert Trojans into approximate circuits. Since the minor errors introduced by the Trojan may be indistinguishable from those introduced by approximate computing, these Trojans can easily evade detection, yet they can severely degrade the end application's "quality of result" (QoR). By contrast, the conventional exact computing paradigm does not tolerate errors; hence, any inserted Trojan can be easily detected. Thus, we show that approximate circuits are more vulnerable to attacks, and this may nullify their efficiency advantages. We demonstrate our ideas through the two most foundational circuits, approximate adders and multipliers. We categorize the existing approximate adders and multipliers into broad families from the perspective of Trojan insertion strategies that an adversary might employ. We present a generalized framework to identify the suitable hardware Trojan insertion and masking sites within each family of approximate adders and multipliers. We also discuss the implications of these threats for a real-life application. Our work strongly emphasizes the need for better security measures and provides insights that will guide the development of robust digital systems capable of balancing the intricacies of approximation and security.
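To make the threat concrete, below is a toy Python sketch (not the paper's framework) of a lower-part-OR style approximate adder with a hypothetical Trojan whose output corruption hides inside the legitimate approximation error; the bit widths and the trigger pattern are illustrative assumptions.

```python
# Toy sketch: an approximate adder whose lower k bits are OR-ed instead of
# added, plus a hypothetical Trojan that fires only on a rare input pattern,
# so its error is masked by the expected approximation error.

def approx_add(a, b, k=4, width=16, trojan=False):
    mask = (1 << k) - 1
    lower = (a | b) & mask                    # approximate lower part (OR instead of add)
    upper = ((a >> k) + (b >> k)) << k        # exact upper part
    result = (upper | lower) & ((1 << width) - 1)
    # Hypothetical Trojan: trigger on a specific lower-bit pattern and corrupt
    # a higher-order bit, i.e., an error larger than the approximation allows.
    if trojan and (a & mask) == 0b1010 and (b & mask) == 0b0101:
        result ^= 1 << (k + 1)
    return result

if __name__ == "__main__":
    a, b = 0x12FA, 0x0345
    print(hex(a + b), hex(approx_add(a, b)), hex(approx_add(a, b, trojan=True)))
```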
Early diagnosis plays a pivotal role in effectively treating numerous diseases, especially in healthcare scenarios where prompt and accurate diagnoses are essential. Contrastive learning (CL) has emerged as a promising approach for medical tasks, offering advantages over traditional supervised learning methods. However, in healthcare, patient metadata contains valuable clinical information that can enhance representations, yet existing CL methods often overlook this data. In this study, we propose a novel approach that leverages both clinical information and imaging data in contrastive learning to enhance model generalization and interpretability. Furthermore, existing contrastive methods may be prone to sampling bias, which can lead to the model capturing spurious relationships and exhibiting unequal performance across protected subgroups frequently encountered in medical settings. To address these limitations, we introduce Patient-aware Contrastive Learning (PaCL), featuring an inter-class separability objective (IeSO) and an intra-class diversity objective (IaDO). IeSO harnesses rich clinical information to refine samples, while IaDO ensures the necessary diversity among samples to prevent class collapse. We demonstrate the effectiveness of PaCL both theoretically through causal refinements and empirically across six real-world medical imaging tasks spanning three imaging modalities: ophthalmology, radiology, and dermatology. Notably, PaCL outperforms previous techniques across all six tasks.
Recent advances in pre-trained neural language models have substantially enhanced the performance of numerous natural language processing (NLP) tasks. However, some existing models require pretraining on a large dataset. Moreover, when using a deep network with sequentially connected transformer blocks, there is data loss across these blocks. To overcome these challenges, we propose LiBERTy, a novel network for natural language understanding. LiBERTy uses a novel TransLSTM module, which takes the representations from the BERT block as input and feeds them to an LSTM that functions as a pooling layer. The use of an LSTM as a pooler helps the model sequentially encode the feature map into hidden states and understand semantic interrelations. The output of the TransLSTM module is fed to a classifier, which uses multiple 1D-CONV blocks, a 1D adaptive average pooling layer, and a "fully-connected" (FC) layer, followed by the ArcFace loss. ArcFace loss helps in achieving inter-class separability and intra-class compactness. Our proposed strategies increase the efficiency of model pre-training and the performance of both natural language understanding (NLU) and downstream tasks. We showcase the efficacy of LiBERTy by applying it to three tasks: (1) disaster tweet classification on the HumAID dataset, (2) fine-grained emotion analysis on the GoEmotions dataset, and (3) named entity recognition on the TASTEset dataset.
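A minimal PyTorch sketch of the TransLSTM-style pooling idea described above is given below; the hidden sizes, the random tensor standing in for BERT outputs, and the classifier head are illustrative assumptions rather than the released LiBERTy code.

```python
# Minimal sketch: LSTM-as-pooler over transformer token representations,
# followed by a 1D-CONV + adaptive-average-pooling + FC classifier head.
import torch
import torch.nn as nn

class TransLSTMPooler(nn.Module):
    def __init__(self, hidden=768, lstm_hidden=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(hidden, lstm_hidden, batch_first=True)  # pooler
        self.conv = nn.Conv1d(lstm_hidden, 128, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, bert_hidden_states):            # (B, seq_len, hidden)
        seq_out, _ = self.lstm(bert_hidden_states)     # (B, seq_len, lstm_hidden)
        x = seq_out.transpose(1, 2)                    # (B, lstm_hidden, seq_len)
        x = self.pool(torch.relu(self.conv(x))).squeeze(-1)
        return self.fc(x)                              # logits (ArcFace loss would follow)

model = TransLSTMPooler()
dummy_bert_output = torch.randn(2, 64, 768)            # stand-in for BERT outputs
print(model(dummy_bert_output).shape)                  # torch.Size([2, 10])
```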
Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of ordinary people. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and the accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.
This research investigates how heavy-ion irradiation affects the single event transient (SET) response of a 14nm silicon-on-insulator (SOI) FinFET. Researchers generally use a TCAD tool (e.g., Sentaurus TCAD) for developing a SET pulse current model. However, TCAD simulations are time-consuming, which prohibits efficient design-space exploration. We propose efficient models for predicting SET pulse current with high accuracy. We use (1) polynomial chaos (PC) based models, (2) ML regression techniques, and (3) artificial neural network and 1D-convolutional neural network based models. The strike of a heavy ion leads to transient behavior, which is very different from the normal behavior. Hence, for all the above predictors, we also evaluate the corresponding piecewise predictors. While TCAD tools take 4 hours for each simulation on a high-end computer, our proposed models have much lower latency (e.g., a few seconds). This allows designers to explore a larger design space. Our proposed piecewise 1D-CNN model achieves a state-of-the-art MSE of 2.15 × 10⁻⁶ mA². Overall, our study provides insights into how PC and ML-based regression models can be used to enhance the efficiency of SET analysis in circuit design.
Approximate computing (AC) techniques provide overall performance gains in terms of power and energy savings at the cost of a minor loss in application accuracy. For this reason, AC has emerged as a viable method for efficiently supporting several compute-intensive applications, e.g., machine learning, deep learning, and image processing, that can tolerate bounded errors in computations. However, most prior techniques do not consider the possibility of soft errors or malicious bit-flips in AC systems. These errors may interact with approximation-introduced errors in unforeseen ways, leading to disastrous consequences, such as the failure of computing systems. A recent research effort, FTApprox (DATE'21), proposes an error-resilient approximate data format. FTApprox stores two blocks, starting from the one containing the most significant valid (MSV) bit. It also stores the location of the MSV block and protects them using error-correcting bits (ECBs). However, FTApprox has crucial limitations, such as a lack of flexibility and the redundant storage of zeros in the MSV block. In this paper, we propose a novel storage format named Versatile Approximate Data Format (VADF) for storing approximate integer numbers while providing resilience to soft errors. VADF prescribes rules for storing, for example, a 32-bit number as either an 8-bit, 12-bit, or 16-bit number. VADF identifies the MSV bit and stores a certain number of bits following the MSV bit. It also stores the location of the MSV bit and protects it by ECBs. VADF does not explicitly store the MSV bit itself, which prevents VADF from accruing significant errors. VADF incurs lower error than both truncation methodologies and FTApprox. We further evaluate five image-processing and machine-learning applications and confirm that VADF provides higher application quality than FTApprox in the presence and absence of soft errors. Finally, VADF allows the use of narrow arithmetic units. For example, instead of using a 32-bit multiplier/adder, one can first use VADF (or FTApprox) to compress the data and then use an 8-bit multiplier/adder. Through this approach, VADF facilitates 95.97% and 79.3% energy savings in multiplication and addition, respectively. However, the subsequent reconversion of the 8-bit output data to 32-bit data using Inv-VADF(16,3,32) diminishes the energy savings by 9.6% for addition and 0.56% for multiplication, respectively. The code is available at https://github.com/CandleLabAI/VADF-ApproximateDataFormat-TECS.
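The following Python sketch illustrates the general MSV-based idea described above (store the MSV-bit position plus a few bits after it, without the MSV bit itself); it is not the exact VADF specification, and the field widths and the absence of error-correcting bits are assumptions.

```python
# Illustrative sketch of MSV-based compression of a 32-bit unsigned integer.

def msv_encode(value, kept_bits=7):
    """Return (msv_position, payload); the MSV bit itself (always 1) is not stored."""
    if value == 0:
        return 0, 0
    msv = value.bit_length() - 1                    # position of the MSV bit
    shift = max(msv - kept_bits, 0)
    payload = (value >> shift) & ((1 << kept_bits) - 1)
    return msv, payload

def msv_decode(msv, payload, kept_bits=7):
    if msv == 0 and payload == 0:
        return 0
    shift = max(msv - kept_bits, 0)
    return (1 << msv) | (payload << shift)          # re-insert the implicit MSV bit

if __name__ == "__main__":
    x = 0x000ABCDE
    msv, payload = msv_encode(x)
    print(hex(x), "->", hex(msv_decode(msv, payload)))   # original vs. approximation
```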
PCB component classification and segmentation can be helpful for PCB waste recycling. However, the variance in shapes and sizes of PCB components presents crucial challenges. We propose PCBSegClassNet, a novel deep neural network for PCB component classification and segmentation. The network uses a two-branch design that captures the global context in one branch and spatial features in the other. The fusion of the two branches allows the effective segmentation of components of various sizes and shapes. We reinterpret the skip connections as a learning module to learn features efficiently. We propose a texture enhancement module that utilizes texture information and spatial features to obtain precise boundaries of components. We introduce a loss function that combines DICE, IoU, and SSIM loss functions to guide the training process for precise pixel-level, patch-level, and map-level segmentation. Our network outperforms all previous state-of-the-art networks on both segmentation and classification tasks. For example, it achieves a DICE score of 96.3% and an IoU score of 92.7% on the FPIC dataset. From the FPIC dataset, we crop the images of 25 component classes and release these 19,158 images in open source as the "FPIC-Component dataset". On this dataset, our network achieves a classification accuracy of 95.2%. Our model is much more lightweight than previous networks and achieves a segmentation throughput of 122 frames per second on a single GPU. We also showcase its ability to count the instances of each component on a PCB.
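As a sketch of how such a combined segmentation loss can be assembled, the PyTorch snippet below implements the DICE and IoU terms (the SSIM term is omitted for brevity, and the weights are illustrative assumptions, not the paper's values).

```python
# Dice + IoU components of a combined segmentation loss.
import torch

def dice_iou_loss(pred, target, w_dice=0.5, w_iou=0.5, eps=1e-6):
    """pred: sigmoid probabilities, target: binary mask; both (B, 1, H, W)."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    dice = (2 * inter + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    union = pred.sum(dim=1) + target.sum(dim=1) - inter
    iou = (inter + eps) / (union + eps)
    return (w_dice * (1 - dice) + w_iou * (1 - iou)).mean()

pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(dice_iou_loss(pred, target))
```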
Text erasure from an image is helpful for various tasks such as image editing and privacy preservation. We present TPFNet, a novel one-stage network for text removal from images. TPFNet has two parts: feature synthesis and image generation. Since noise can be more effectively removed from low-resolution images, the feature-synthesis part operates on low-resolution images and uses PVT or EfficientNet-B as the encoder. Further, we use a novel multi-headed decoder that generates a high-pass filtered image and a segmentation map, along with a text-free image. The segmentation branch helps locate the text precisely, and the high-pass branch helps in learning the image structure. The image-generation part uses the features learned in the feature-synthesis part to predict a high-resolution text-free image. To precisely locate the text, TPFNet employs an adversarial loss that is conditional on the segmentation map rather than the input image. On the Oxford, SCUT, SCUT-EnsText, and ICDAR datasets, TPFNet outperforms recent networks on nearly all the metrics. For example, on the Oxford dataset, TPFNet achieves a higher PSNR (higher is better) and a lower text-detection precision (lower is better) than MTRNet++. The source code can be obtained from https://github.com/CandleLabAI/TPFNet.
By exploiting the gap between the user's accuracy requirement and the hardware's accuracy capability, approximate circuit design offers enormous gains in efficiency for a minor accuracy loss. In this paper, we propose two approximate floating-point multipliers (AxFPMs), named DTCL (decomposition, truncation and chunk-level leading-one quantization) and TDIL (truncation, decomposition and ignoring LSBs). Both AxFPMs introduce approximation in mantissa multiplication. DTCL works by rounding and truncating LSBs and quantizing each chunk. TDIL works by truncating LSBs and ignoring the least important terms in the multiplication. Further, both techniques multiply the more significant terms by simple exponent addition or shift-and-add operations. These AxFPMs are configurable and allow trading off accuracy with hardware overhead. Compared to the exact floating-point multiplier (FPM), DTCL(4,8,8) reduces area, energy and delay by 11.0%, 69% and 61%, respectively, while incurring a mean relative error of only 2.37%. On a range of approximate applications from the machine learning, deep learning and image processing domains, our AxFPMs greatly improve efficiency with only a minor loss in accuracy. For example, for image sharpening and Gaussian smoothing, all DTCL and TDIL variants achieve a PSNR of more than 30dB. The source code is available at https://github.com/CandleLabAI/ApproxFloatingPointMultiplier.
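The basic ingredient these AxFPMs build on, truncating low-order mantissa bits before multiplication, can be illustrated with the generic Python sketch below; this is not the DTCL/TDIL algorithm itself, and the number of dropped bits is an assumption.

```python
# Generic mantissa-truncation sketch for float32 values.
import struct

def truncate_mantissa(x, drop_bits=12):
    """Zero out the `drop_bits` least-significant mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << drop_bits) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def approx_fp_mul(a, b, drop_bits=12):
    return truncate_mantissa(a, drop_bits) * truncate_mantissa(b, drop_bits)

a, b = 3.14159, 2.71828
exact, approx = a * b, approx_fp_mul(a, b)
print(exact, approx, abs(exact - approx) / exact)   # small relative error
```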
The rapid growth in the volume and complexity of PCB design has encouraged researchers to explore automatic visual inspection of PCB components. Automatic identification of PCB components such as resistors, transistors, etc., can provide several benefits, such as producing a bill of materials, detecting defects, and recycling e-waste. Yet, visual identification of PCB components is challenging since they have different shapes, sizes, and colors depending on the material used and their functionality.

The paper proposes a lightweight and novel neural network, Dilated Involutional Pyramid Network (DInPNet), for the classification of PCB components on the FICS-PCB dataset. DInPNet uses involutions in place of convolutions; involutions possess characteristics inverse to those of convolutions, being location-specific and channel-agnostic. We introduce the dilated involutional pyramid (DInP) block, which consists of an involution for transforming the input feature map into a low-dimensional space for reduced computational cost, followed by a pairwise pyramidal fusion of dilated involutions that resample the feature map back. This enables learning representations for a large effective receptive field while at the same time bringing down the number of parameters considerably. DInPNet, with a total of 531,485 parameters, achieves 95.48% precision, 95.65% recall, and 92.59% MCC (Matthews correlation coefficient). To our knowledge, we are the first to use involution for PCB component classification. The code is released at https://github.com/CandleLabAI/DInPNet-PCB-Component-Classification.
In "vision and language" problems, multimodal inputs are simultaneously processed for combined visual and textual understanding for image-text embedding. In this paper, we discuss the necessity of considering the difference between the... more
In "vision and language" problems, multimodal inputs are simultaneously processed for combined visual and textual understanding for image-text embedding. In this paper, we discuss the necessity of considering the difference between the feature space and the distribution when performing multimodal learning. We deal with this problem through deep learning and a generative model approach. We introduce a novel network, GAFNet (Global Attention Fourier Net), which learns through large-scale pre-training over three image-text datasets (COCO, SBU, and CC-3M), for achieving high performance on downstream vision and language tasks. We propose a GAF (Global Attention Fourier) module, which integrates multiple modalities into one latent space. GAF module is independent of the type of modality, and it allows combining shared representations at each stage. Various ways of thinking about the relationships between different modalities directly affect the model's design. In contrast to previous research, our work considers visual grounding as a pretrainable and transferable quality instead of something that must be trained from scratch. We show that GAFNet is a versatile network that can be used for a wide range of downstream tasks. Experimental results demonstrate that our technique achieves state-of-theart performance on multimodal classification on the Cri-sisMD dataset and image generation on the COCO dataset. For image-text retrieval, our technique achieves competitive performance.
Wildfires can cause significant damage to forests and endanger wildlife. Detecting these forest fires at the initial stages helps the authorities in preventing them from spreading further. In this paper, we first propose a novel technique, termed the CIELAB-color technique, which detects fire based on the color of the fire in the CIELAB color space. We train state-of-the-art CNNs to detect fire. Since deep learning (CNNs) and image processing have complementary strengths, we combine their strengths to propose an ensemble architecture. It uses two CNNs and the CIELAB-color technique and then performs majority voting to decide the final fire/no-fire prediction. We finally propose a chain-of-classifiers technique, which first tests an image using the CIELAB-color technique. If an image is flagged as no-fire, then it further checks the image using a CNN. This technique has a lower model size than the ensemble technique. On the FLAME dataset, the ensemble technique provides 93.32% accuracy, outperforming both previous works (88.01% accuracy) and the individual use of either the CNNs or the CIELAB-color technique. The source code can be obtained from https://github.com/CandleLabAI/FireDetection.
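A minimal OpenCV sketch of the CIELAB-color idea is shown below: fire-like pixels are flagged by thresholds on the L*, a*, and b* channels. The threshold values are illustrative assumptions, not the calibrated values used in the paper.

```python
# Flag fire-like pixels via thresholds in CIELAB color space.
import cv2
import numpy as np

def cielab_fire_mask(bgr_image, l_min=120, a_min=130, b_min=130):
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)   # 8-bit L, a, b channels
    L, a, b = cv2.split(lab)
    # Fire regions tend to be bright (high L) and shifted toward red/yellow
    # (high a and b in OpenCV's 0-255 encoding of a* and b*).
    mask = (L >= l_min) & (a >= a_min) & (b >= b_min)
    return mask.astype(np.uint8) * 255

if __name__ == "__main__":
    img = cv2.imread("frame.jpg")                       # hypothetical input frame
    if img is not None:
        print("fire-like pixels:", int(cielab_fire_mask(img).sum() // 255))
```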
Dehazing refers to removing the haze and restoring the details from hazy images. In this paper, we propose ClarifyNet, a novel, end-to-end trainable, convolutional neural network architecture for single image dehazing. We note that a high-pass filter detects sharp edges, texture, and other fine details in the image, whereas a low-pass filter detects color and contrast information. Based on this observation, our key idea is to train ClarifyNet on ground-truth haze-free images, low-pass filtered images, and high-pass filtered images. Accordingly, we present a shared-encoder multi-decoder model, ClarifyNet, which employs interconnected parallelization. While training, ground-truth haze-free images, low-pass filtered images, and high-pass filtered images undergo multistage filter fusion and attention. By utilizing a weighted loss function composed of SSIM loss and L1 loss, we extract and propagate complementary features. We comprehensively evaluate ClarifyNet on the I-HAZE, O-HAZE, Dense-Haze, NH-HAZE, SOTS-Indoor, SOTS-Outdoor, HSTS, and Middlebury datasets. We use PSNR and SSIM metrics and compare the results with previous works. For most datasets, ClarifyNet provides the highest scores. On using EfficientNet-B6 as the backbone, ClarifyNet has 18M parameters (model size of ∼71MB) and a throughput of 8 frames per second while processing images of size 2048×1024.
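The extra training targets mentioned above can be generated as in the sketch below: a Gaussian blur serves as the low-pass image and the (clipped) residual as the high-pass image; the kernel size and sigma are assumptions.

```python
# Generate low-pass and high-pass training targets from a ground-truth image.
import cv2

def make_filtered_targets(gt_image, ksize=(9, 9), sigma=3):
    low_pass = cv2.GaussianBlur(gt_image, ksize, sigma)   # color/contrast content
    high_pass = cv2.subtract(gt_image, low_pass)          # edges and texture (clipped residual)
    return low_pass, high_pass

if __name__ == "__main__":
    gt = cv2.imread("clear_image.png")                     # hypothetical haze-free ground truth
    if gt is not None:
        lp, hp = make_filtered_targets(gt)
        print(lp.shape, hp.shape)
```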
Objective: Automated cell nuclei segmentation is vital for the histopathological diagnosis of cancer. However, nuclei segmentation from “hematoxylin and eosin” (HE) stained “whole slide images” (WSIs) remains a challenge due to noise-induced intensity variations and uneven staining. The goal of this paper is to propose a novel deep learning model for accurately segmenting the nuclei in HE-stained WSIs.
Approach: We introduce FEEDNet, a novel encoder-decoder network that uses LSTM units and "feature enhancement blocks" (FE-blocks). Our proposed FE-block avoids the loss of location information incurred by pooling layers by concatenating a downsampled version of the original image to preserve pixel intensities. FEEDNet uses an LSTM unit to capture multi-channel representations compactly. Further, for datasets that provide class information, we train a multiclass segmentation model, which generates masks corresponding to each class at the output. Using this information, we generate more accurate binary masks than those generated by conventional binary segmentation models.
Main results: We have thoroughly evaluated FEEDNet on the CoNSeP, Kumar, and CPM-17 datasets. FEEDNet achieves the best value of PQ (panoptic quality) on the CoNSeP and CPM-17 datasets and the second-best value of PQ on the Kumar dataset. The 32-bit floating-point version of FEEDNet has a model size of 64.90 MB. With INT8 quantization, the model size reduces to only 16.51 MB, with a negligible loss in predictive performance on the Kumar and CPM-17 datasets and a minor loss on the CoNSeP dataset.
Significance: Our proposed idea of generalized class-aware binary segmentation is shown to be accurate on a variety of datasets. FEEDNet has a smaller model size than previous nuclei segmentation networks, which makes it suitable for execution on memory-constrained edge devices. Together with its state-of-the-art predictive performance, this makes FEEDNet a compelling choice for nuclei segmentation. The source code can be obtained from https://github.com/CandleLabAI/FEEDNet.
We propose a novel deep learning model named ACLNet for cloud segmentation from ground images. ACLNet uses both a deep neural network and a classical machine learning (ML) algorithm to extract complementary features. Specifically, it uses EfficientNet-B0 as the backbone, "à trous spatial pyramid pooling" (ASPP) to learn at multiple receptive fields, and a "global attention module" (GAM) to extract fine-grained details from the image. ACLNet also uses k-means clustering to extract cloud boundaries more precisely. ACLNet is effective for both daytime and nighttime images. It provides a lower error rate, higher recall, and higher F1-score than state-of-the-art cloud segmentation models. We will release the source code of ACLNet as open source.
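The k-means step can be sketched as below: pixel colors are clustered into two groups and the brighter cluster is taken as cloud, refining the boundary predicted by the network; the choice of two clusters and RGB features is an assumption.

```python
# Refine cloud boundaries by clustering pixel colors with k-means.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cloud_mask(rgb_image):
    h, w, _ = rgb_image.shape
    pixels = rgb_image.reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
    # Assume the brighter cluster corresponds to clouds.
    cloud_cluster = np.argmax([pixels[labels == k].mean() for k in (0, 1)])
    return (labels == cloud_cluster).reshape(h, w)

mask = kmeans_cloud_mask(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
print(mask.shape, mask.mean())
```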
In this paper, we present novel bit-flip attack (BFA) algorithms for DNNs, along with techniques for defending against the attack. Our attack algorithms leverage information about the layer importance, such that a layer is considered important if it has high-ranked feature maps. We first present a class-wise targeted attack that degrades the accuracy of just one class in the dataset. Comparative evaluation with related works shows the effectiveness of our attack algorithm. We finally propose multiple novel defense strategies against untargeted BFAs. We comprehensively evaluate the robustness of both large-scale CNNs (VGG19, ResNeXt50, AlexNet and ResNet) and compact CNNs (MobileNet-v2, ShuffleNet, GoogLeNet and SqueezeNet) against BFAs. We also reveal a valuable insight that compact CNNs are highly vulnerable not only to well-crafted BFAs such as ours, but even to random BFAs. Also, defense strategies are less effective on compact CNNs.
This paper proposes a novel merged-accumulation-based approximate MAC (multiply-accumulate) unit, MEGA-MAC, for accelerating error-resilient applications. MEGA-MAC utilizes a novel rearrangement and compression strategy in the multiplication stage and a novel approximate "carry predicting adder" (CPA) in the accumulation stage. Addition and multiplication operations are merged, which reduces the delay. MEGA-MAC provides knobs to exercise a tradeoff between accuracy and resource overhead. Compared to the accurate MAC unit, MEGA-MAC(8,6) (i.e., a MEGA-MAC unit with a chunk size of 6 bits, operating on 8-bit input operands) reduces the power-delay-product (PDP) by 49.4%, while incurring a mean error percentage of only 4.2%. Compared to state-of-the-art approximate MAC units, MEGA-MAC achieves a better balance between resource savings and accuracy loss. The source code is available at https://sites.google.com/view/mega-mac-approximate-mac-unit/.
In recent years, there has been an enormous interest in using deep learning to classify underwater images to identify various objects such as fish, plankton, coral reefs, seagrass, submarines, and gestures of sea-divers. This classification is essential for measuring the health and quality of water bodies and protecting endangered species. Further, it has applications in oceanography, marine economy and defense, environment protection, underwater exploration, and human-robot collaborative tasks. This paper presents a survey of deep learning techniques for performing underwater image classification. We underscore the similarities and differences of several methods. We believe that underwater image classification is one of the killer applications that will test the ultimate success of deep learning techniques. Towards realizing that goal, this survey seeks to inform researchers about the state of the art in deep learning on underwater images and also motivate them to push its frontiers forward. Index Terms: Deep neural networks, artificial intelligence, autonomous underwater vehicle, transfer learning.
Crowd counting is the process of counting or estimating the number of individuals in a crowd. There has been a rapid surge in the amount of Unmanned Aerial Vehicle (UAV) images over the last few years. However, efficient crowd counting techniques for UAV images have hardly come into the focus of the research community. Crowd counting from UAV images has its unique challenges compared to crowd counting from images of natural scenes. Moreover, solving the problem in real time makes the task even harder. In this paper, we introduce an attention-based encoder-decoder model called Attention-based Real-time CrowdNet (ARCN). ARCN is a computationally efficient, density-estimation-based crowd counting model. It can perform crowd counting from UAV images in real time with high accuracy. Ours is the first work that proposes a real-time density map estimation and crowd counting model for drone-based images. The key idea of our work is to add "Convolutional Block Attention Module" (CBAM) blocks in between the bottleneck layers of the MobileCount architecture. The proposed ARCN model achieves an MAE of 19.9 and an MSE of 27.7 on the DroneCrowd dataset. Also, on an NVIDIA RTX 2080 Ti GPU, ARCN has a processing speed of 48 FPS, making it a real-time technique. The pretrained model is available at https://bit.ly/3na7LUy
In the deep sub-micron region, "spin-transfer torque RAM" (STT-RAM) suffers from "read-disturbance error" (RDE), whereby a read operation disturbs the stored data. Mitigation of RDE requires restore operations, which impose latency and energy penalties. Hence, RDE presents a crucial threat to the scaling of STT-RAM. In this paper, we offer three techniques to reduce the restore overhead. First, we avoid the restore operations for those reads where the block will get updated at a higher-level cache in the near future. Second, we identify read-intensive blocks using a lightweight mechanism and then migrate these blocks to a small SRAM buffer. On a future read to these blocks, the restore operation is avoided. Third, for data blocks having zero value, a write operation is avoided, and only a flag is set. Based on this flag, both read and restore operations to this block are avoided. We combine these three techniques to design our final policy, named CORIDOR. Compared to a baseline policy, which performs a restore operation after each read, CORIDOR achieves a 31.6% reduction in total energy and brings the relative CPI (cycles-per-instruction) to 0.64×. By contrast, an ideal RDE-free STT-RAM saves 42.7% energy and brings the relative CPI to 0.62×. Thus, our CORIDOR policy achieves nearly the same performance as an ideal RDE-free STT-RAM cache. Also, it reaches three-fourths of the energy saving achieved by the ideal RDE-free cache. We also compare CORIDOR with four previous techniques and show that CORIDOR provides higher restore energy savings than these techniques.
As von Neumann computing architectures become increasingly constrained by data-movement overheads, researchers have started exploring in-memory computing (IMC) techniques to offset these overheads. Due to the widespread use of SRAM, IMC techniques for SRAM hold the promise of accelerating a broad range of computing systems and applications. In this article, we present a survey of techniques for in-memory computing using SRAM memory. We review the use of SRAM-IMC for implementing Boolean, search and arithmetic operations, and accelerators for machine learning (especially neural networks) and automata computing. This paper aims to accelerate co-design efforts by informing researchers in both the algorithm and hardware architecture fields about the recent developments in SRAM-based IMC techniques.
This paper presents a multi-level design for spin-orbit torque (SOT) assisted spin-transfer torque (STT) based four-bit magnetic random access memory (MRAM). Multi-level cell (MLC) design is an effective solution to increase the storage capacity of MRAM. Conventional SOT-MRAMs enable an energy-efficient, fast, and reliable write operation. However, unlike STT-MRAM, these cells take more area and require two access transistors per cell. This poses significant challenges in the use of SOT-MRAMs for high-density memory applications. To address these issues, we propose a multi-level cell that can store four bits and requires only three access transistors. The effective area per bit of the proposed cell is nearly 58% lower than that of the conventional one-bit SOT-MRAM cell. The combined effect of SOT and STT has been incorporated to design a SOT-STT based MLC that enables a more energy-efficient and faster write operation than regular MLCs. The results show that the SOT-STT based four-bit MLC is 52.9% and 40% more efficient in terms of latency and energy consumption, respectively, when compared to a three-bit SOT/STT based MLC.
Recent years have witnessed a significant interest in "generative adversarial networks" (GANs) due to their ability to generate high-fidelity data. Many models of GANs have been proposed for a diverse range of domains, ranging from natural language processing to image processing. GANs have high compute and memory requirements. Also, since they involve both convolution and deconvolution operations, they do not map well to conventional accelerators designed for convolution operations. Evidently, there is a need for customized accelerators to achieve high efficiency with GANs. In this work, we present a survey of techniques and architectures for accelerating GANs. We organize the works on key parameters to bring out their differences and similarities. Finally, we present research challenges that are worthy of attention in the near future. More than summarizing the state of the art, this survey seeks to spark further research in the field of GAN accelerators.
As "deep neural networks" (DNNs) achieve increasing accuracy, they are getting employed in increasingly diverse applications, including security-critical applications such as medical and defense. This immense use of DNNs has motivated researchers to closely scrutinize their security vulnerabilities and propose countermeasures, especially in the context of hardware security. In this paper, we present a survey of techniques for the hardware security of DNNs. For each research work, we highlight the threat model, the key idea for launching the attack, and the defense strategies. We organize the works into salient categories to highlight their strengths and limitations. This paper aims to equip researchers with the knowledge of recent advances in DNN security and motivate them to think of security as the first principle.
Strategies to improve the resilience of applications require the ability to distinguish vulnerability differences across application components and selectively apply protection. Hence, quantitatively modeling application vulnerability, as a method to capture vulnerability variance within the application, is critical to evaluate and improve system resilience. Traditional methods cannot effectively quantify vulnerability, because they lack a holistic view to examine system resilience and come with prohibitive evaluation costs. In this paper, we introduce a data-driven methodology to analyze application vulnerability based on a novel resilience metric, the data vulnerability factor (DVF). DVF integrates both the application and the specific hardware into the resilience analysis. To calculate DVF, we extend a performance modeling language to provide a fast modeling solution. Furthermore, we measure six representative computational kernels; we demonstrate the value of DVF by quantifying the impact of algorithm optimization on vulnerability and by quantifying the effectiveness of a hardware protection mechanism.
"Unmanned aerial vehicles" (UAVs) are now being used for a wide range of surveillance applications. Specifically, the detection of on-ground vehicles from UAV images has attracted significant attention due to its potential in applications... more
"Unmanned aerial vehicles" (UAVs) are now being used for a wide range of surveillance applications. Specifically, the detection of on-ground vehicles from UAV images has attracted significant attention due to its potential in applications such as traffic management, parking lot management, and facilitating rescue operations in disaster zones and rugged terrains. This paper presents a survey of deep learning techniques for performing on-ground vehicle detection from aerial imagery captured using UAVs (also known as drones). We review the works in terms of their approach to improve accuracy and reduce computation overhead and their optimization objective. We show the similarities and differences of various techniques and also highlight the future challenges in this area. This survey will benefit researchers in the area of artificial intelligence, traffic surveillance, and applications of UAVs.
The CPU is a powerful, pervasive, and indispensable platform for running deep learning (DL) workloads in systems ranging from mobile devices to extreme-end servers. In this paper, we present a survey of techniques for optimizing DL applications on CPUs. We include the methods proposed for both inference and training and those offered in the context of mobile, desktop/server, and distributed systems. We identify the areas of strength and weakness of CPUs in the field of DL. This paper will interest practitioners and researchers in the areas of artificial intelligence, computer architecture, mobile systems, and parallel computing.
3D convolutional neural networks (CNNs) have shown excellent predictive performance on tasks such as action recognition from videos. Since 3D CNNs have unique characteristics and extremely high compute/memory overheads, executing them on accelerators designed for 2D CNNs provides sub-optimal performance. To overcome these challenges, researchers have recently proposed architectures for 3D CNNs. In this paper, we present a survey of hardware accelerators and hardware-aware algorithmic optimizations for 3D CNNs. We include only those CNNs that perform 3D convolution and not those that perform only 2D convolution on 2D or 3D data. We highlight their key ideas and underscore their similarities and differences. We believe that this survey will spark a great deal of research towards the design of ultra-efficient 3D CNN accelerators of tomorrow.
As convolutional neural networks (CNNs) improve in accuracy, their model size and computational overheads have also increased. These overheads make it challenging to deploy CNNs on resource-constrained devices. Pruning is a promising technique to mitigate these overheads. In this paper, we propose a novel pruning technique called CURATING that views the pruning of CNNs as a multi-objective optimization problem. CURATING retains filters that (i) are very different (less redundant) from each other in terms of their representation, (ii) have a high saliency score, i.e., they reduce the model accuracy drastically if pruned, and (iii) are likely to produce higher activations. We treat a filter specific to an output channel as a probability distribution over spatial filters to measure the similarity between filters. The similarity matrix is leveraged to create filter embeddings, and we constrain our optimization problem to retain a diverse set of filters based on these filter embeddings. On a range of CNNs over well-known datasets, CURATING exercises a better or comparable tradeoff between model size, accuracy, and inference latency than existing techniques. For example, while pruning VGG16 on the ILSVRC-12 dataset, CURATING achieves higher accuracy and a smaller model size than previous techniques.
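The redundancy measure described above can be sketched as follows: each output-channel filter is flattened, turned into a probability distribution, and compared pairwise; the use of a softmax and a cosine-style similarity here is an illustrative choice, not necessarily the paper's exact formulation.

```python
# Pairwise similarity between filters treated as probability distributions.
import torch
import torch.nn.functional as F

def filter_similarity_matrix(conv_weight):
    """conv_weight: (out_channels, in_channels, kH, kW) tensor."""
    flat = conv_weight.flatten(1)          # one vector per output-channel filter
    dist = F.softmax(flat, dim=1)          # interpret each filter as a distribution
    dist = F.normalize(dist, dim=1)
    return dist @ dist.t()                 # (out_channels, out_channels) similarity

w = torch.randn(64, 32, 3, 3)
S = filter_similarity_matrix(w)
print(S.shape)                             # filters with high off-diagonal similarity
                                           # are candidates for pruning
```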
Intermittent computing (ImC) refers to the scenario where periods of program execution are separated by reboots. ImC systems are generally powered by energy-harvesting (EH) devices: they start executing a program when the accumulated energy reaches a threshold and stop when the energy buffer is exhausted. Since ImC does not depend on a fixed supply of power, it can be used in a wide range of scenarios/devices such as medical implants, wearables, IoT sensors, extraterrestrial systems and so on. Although attractive, ImC also brings challenges such as avoiding data loss and data inconsistency, and striking the right balance between performance, energy and quality of the result. In this paper, we present a survey of techniques and systems for ImC. We organize the works on key metrics to expose their similarities and differences. This paper will equip researchers with the knowledge of recent developments in ImC and also motivate them to address the remaining challenges for reaping the full potential of ImC.
In recent years, researchers have focused on reducing the model size and number of computations (measured as "multiply-accumulate" or MAC operations) of DNNs. The energy consumption of a DNN depends on both the number of MAC operations and the energy efficiency of each MAC operation. The former can be estimated at design time; however, the latter depends on the intricate data reuse patterns and underlying hardware architecture. Hence, estimating it at design time is challenging. This work shows that the conventional approach to estimating data reuse, viz. arithmetic intensity, does not always correctly estimate the degree of data reuse in DNNs since it gives equal importance to all the data types. We propose a novel model, termed "data type aware weighted arithmetic intensity" (DI), which accounts for the unequal importance of different data types in DNNs. We evaluate our model on 25 state-of-the-art DNNs on two GPUs. We show that our model accurately models data reuse for all possible data reuse patterns for different types of convolution and different types of layers. We show that our model is a better indicator of the energy efficiency of DNNs. We also show its generality using the central limit theorem.
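The contrast between plain arithmetic intensity and a data-type-aware weighted variant can be sketched for a single convolution layer as below; the per-data-type weights are purely hypothetical, the paper derives its own weighting for DI.

```python
# Plain vs. data-type-aware weighted arithmetic intensity for one conv layer.

def conv_layer_counts(c_in, c_out, h, w, k, bytes_per_elem=4):
    macs = c_out * c_in * k * k * h * w
    ifm_bytes = c_in * h * w * bytes_per_elem
    wgt_bytes = c_out * c_in * k * k * bytes_per_elem
    ofm_bytes = c_out * h * w * bytes_per_elem
    return macs, ifm_bytes, wgt_bytes, ofm_bytes

macs, ifm, wgt, ofm = conv_layer_counts(c_in=64, c_out=128, h=56, w=56, k=3)

arithmetic_intensity = macs / (ifm + wgt + ofm)       # every data type weighted equally
w_ifm, w_w, w_ofm = 1.0, 1.5, 1.0                      # hypothetical per-type weights
weighted_intensity = macs / (w_ifm * ifm + w_w * wgt + w_ofm * ofm)
print(round(arithmetic_intensity, 1), round(weighted_intensity, 1))
```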
The remarkable predictive performance of deep neural networks (DNNs) has led to their adoption in service domains of unprecedented scale and scope. However, the widespread adoption and growing commercialization of DNNs have underscored the importance of intellectual property (IP) protection. Devising techniques to ensure IP protection has become necessary due to the increasing trend of outsourcing the DNN computations to untrusted accelerators in cloud-based services. The design methodologies and hyper-parameters of DNNs are crucial information, and leaking them may cause massive economic loss to the organization. Furthermore, the knowledge of a DNN's architecture can increase the success probability of an adversarial attack where an adversary perturbs the inputs and alters the prediction. In this work, we devise a two-stage attack methodology, "DeepPeep", which exploits the distinctive characteristics of design methodologies to reverse-engineer the architecture of building blocks in compact DNNs. We show the efficacy of "DeepPeep" on P100 and P4000 GPUs. Additionally, we propose intelligent design maneuvering strategies for thwarting IP theft through the DeepPeep attack and propose "Secure MobileNet-V1". Interestingly, compared to vanilla MobileNet-V1, Secure MobileNet-V1 provides a significant reduction in inference latency (≈60%) and improvement in predictive performance (≈2%) with very low memory and computation overheads.
As the capabilities of mobile phones have increased, the potential for their negative use has also increased tremendously. For example, the use of mobile phones while driving or in high-security zones can lead to accidents, information leaks and security breaches. In this paper, we use deep-learning algorithms, viz., the single shot multibox detector (SSD) and the faster region-based convolutional neural network (Faster-RCNN), to detect mobile phone usage. We highlight the importance of mobile phone usage detection and the challenges involved in it. We have used a subset of the State Farm Distracted Driver Detection dataset from Kaggle, which we term the KaggleDriver dataset. In addition, we have created a dataset on mobile phone usage, which we term the IITH dataset on mobile phone usage (IITH-DMU). Although small, IITH-DMU is more generic than the KaggleDriver dataset, since it has images with a higher amount of variation in foreground and background objects. Ours is possibly the first work to perform mobile-phone detection for a wide range of scenarios. On the KaggleDriver dataset, the AP at 0.5 IoU is 98.97% with SSD and 98.84% with Faster-RCNN. On the IITH-DMU dataset, these numbers are 92.6% for SSD and 95.92% for Faster-RCNN. These pretrained models and the datasets are available at sites.google.com/view/mobile-phone-usage-detection.
"Recurrent neural networks" (RNNs) are powerful artificial intelligence models that have shown remarkable effectiveness in several tasks such as music generation, speech recognition and machine translation. RNN computations involve both... more
"Recurrent neural networks" (RNNs) are powerful artificial intelligence models that have shown remarkable effectiveness in several tasks such as music generation, speech recognition and machine translation. RNN computations involve both intra-timestep and inter-timestep dependencies. Due to these features, hardware acceleration of RNNs is more challenging than that of CNNs. Recently, several researchers have proposed hardware architectures for RNNs. In this paper, we present a survey of GPU/FPGA/ASIC-based accelerators and optimization techniques for RNNs. We highlight the key ideas of different techniques to bring out their similarities and differences. Improvements in deep-learning algorithms have inevitably gone hand-in-hand with the improvements in the hardware-accelerators. Nevertheless, there is a need and scope of even greater synergy between these two fields. This survey seeks to synergize the efforts of researchers in the area of deep learning, computer architecture, and chip-design.
Systolic architectures have been widely used for efficient processing of DNNs in both edge devices and datacenter servers. The number of processing elements (PEs) in a fixed-sized systolic accelerator is well matched for large and compute-bound DNNs; however, memory-bound DNNs suffer from PE underutilization and fail to achieve peak performance and energy efficiency. To mitigate this underutilization, numerous dataflow techniques and micro-architectural heuristics have been proposed. In this work, we address this challenge at the algorithm front and propose data reuse aware co-optimization (DRACO), which improves the PE utilization of memory-bound DNNs without any additional need for dataflow/micro-architecture modifications. Furthermore, unlike previous co-optimization methods, DRACO not only maximizes performance and energy efficiency but also boosts the representational power and improves the predictive performance of DNNs. We perform an extensive experimental study to understand the role of computational complexity and PE utilization in (inference) latency optimization. The key finding of this work is that improving PE utilization does not always improve the performance of a DNN; it also depends on the computational overhead of improving PE utilization.

To the best of our knowledge, DRACO is the first work that resolves the resource underutilization challenge at the algorithm level and demonstrates a trade-off between computational efficiency, PE utilization, and predictive performance of DNNs. Compared to the state-of-the-art row-stationary dataflow, DRACO achieves 41.8% and 42.6% improvement in average PE utilization and inference latency (respectively) with negligible loss in predictive performance for MobileNetV1 on a 64×64 systolic array. DRACO provides seminal insights for utilization-aware DNN design methodologies that can fully leverage the computation power of systolic array-based hardware accelerators.
The capability of the self-attention mechanism to model long-range dependencies has catapulted its deployment in vision models. Unlike convolution operators, self-attention offers an infinite receptive field and enables compute-efficient modeling of global dependencies. However, the existing state-of-the-art attention mechanisms incur high compute and/or parameter overheads, and hence are unfit for compact convolutional neural networks (CNNs). In this work, we propose a simple yet effective "Ultra-Lightweight Subspace Attention Mechanism" (ULSAM), which infers different attention maps for each feature map subspace. We argue that learning separate attention maps for each feature subspace enables multi-scale and multi-frequency feature representation, which is more desirable for fine-grained image classification. Our method of subspace attention is orthogonal and complementary to the existing state-of-the-art attention mechanisms used in vision models. ULSAM is end-to-end trainable and can be deployed as a plug-and-play module in pre-existing compact CNNs. Notably, our work is the first attempt that uses a subspace attention mechanism to increase the efficiency of compact CNNs. To show the efficacy of ULSAM, we perform experiments with MobileNet-V1 and MobileNet-V2 as backbone architectures on ImageNet-1K and three fine-grained image classification datasets. We achieve ≈13% and ≈25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on ImageNet-1K and the fine-grained image classification datasets (respectively). Code and trained models are available at https://github.com/Nandan91/ULSAM.
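A simplified PyTorch sketch of the per-subspace attention idea follows: the channels are split into subspaces and one spatial attention map is learned per subspace; this follows the idea described above but is not a line-by-line ULSAM implementation.

```python
# Simplified subspace attention: one spatial attention map per channel subspace.
import torch
import torch.nn as nn

class SubspaceAttention(nn.Module):
    def __init__(self, channels, num_subspaces=4):
        super().__init__()
        assert channels % num_subspaces == 0
        self.g = num_subspaces
        sub = channels // num_subspaces
        self.attn = nn.ModuleList(nn.Conv2d(sub, 1, kernel_size=1) for _ in range(self.g))

    def forward(self, x):                              # x: (B, C, H, W)
        outs = []
        for feat, conv in zip(x.chunk(self.g, dim=1), self.attn):
            b, c, h, w = feat.shape
            attn = torch.softmax(conv(feat).flatten(2), dim=-1).view(b, 1, h, w)
            outs.append(feat + feat * attn)            # re-weighted subspace features
        return torch.cat(outs, dim=1)

x = torch.randn(2, 64, 32, 32)
print(SubspaceAttention(64)(x).shape)                  # torch.Size([2, 64, 32, 32])
```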
The number of groups (g) in group convolution (GConv) is selected to boost the predictive performance of deep neural networks (DNNs) in a compute and parameter efficient manner. However, we show that a naive selection of g in GConv creates an imbalance between the computational complexity and the degree of data reuse, which leads to suboptimal energy efficiency in DNNs. We devise an optimum group size model, which enables a balance between computational cost and data movement cost, thus optimizing the energy efficiency of DNNs. Based on the insights from this model, we propose an "energy-efficient group convolution" (E2GC) module where, unlike previous implementations of GConv, the group size (G) remains constant. Further, to demonstrate the efficacy of the E2GC module, we incorporate this module in the design of MobileNet-V1 and ResNeXt-50 and perform experiments on two GPUs, P100 and P4000. We show that, at comparable computational complexity, DNNs with a constant group size (E2GC) are more energy-efficient than DNNs with a fixed number of groups (FgGC). For example, on the P100 GPU, the energy efficiency of MobileNet-V1 and ResNeXt-50 is increased by 10.8% and 4.73% (respectively) when E2GC modules substitute the FgGC modules in both DNNs. Furthermore, through our extensive experimentation with the ImageNet-1K and Food-101 image classification datasets, we show that the E2GC module enables a trade-off between the generalization ability and the representational power of a DNN. Thus, the predictive performance of DNNs can be optimized by selecting an appropriate G. The code and trained models are available at https://github.com/iithcandle/E2GC-release.
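The difference between a fixed number of groups (FgGC) and a constant group size (E2GC-style) can be sketched with standard grouped convolutions as below; the channel counts and group parameters are illustrative.

```python
# Fixed number of groups vs. constant group size in grouped convolutions.
import torch.nn as nn

def fggc_conv(in_ch, out_ch, num_groups=8):
    # Fixed number of groups: the group size grows with the channel count.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, groups=num_groups)

def e2gc_conv(in_ch, out_ch, group_size=16):
    # Constant group size G: the number of groups grows with the channel count.
    assert in_ch % group_size == 0
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, groups=in_ch // group_size)

for c in (64, 128, 256):
    fggc, e2gc = fggc_conv(c, c), e2gc_conv(c, c)
    print(c, "FgGC group size:", c // 8, "| E2GC number of groups:", e2gc.groups)
```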
As DNNs become increasingly common in mission-critical applications, ensuring their reliable operation has become crucial. Conventional resilience techniques fail to account for the unique characteristics of DNN algorithms and accelerators, and hence they are infeasible or ineffective. In this paper, we present a survey of techniques for studying and optimizing the reliability of DNN accelerators and architectures. The reliability issues we cover include soft/hard errors arising due to process variation, voltage scaling and timing errors, DRAM errors due to refresh-rate scaling and thermal effects, etc. We organize the research projects into several categories to bring out their key attributes. This paper underscores the importance of designing for reliability as a first principle, rather than merely retrofitting for it.
The rise of deep learning (DL) has been fuelled by improvements in accelerators. Due to its unique features, the GPU continues to be the most widely used accelerator for DL applications. In this paper, we present a survey of architecture- and system-level techniques for optimizing DL applications on GPUs. We review techniques for both inference and training, and for both single-GPU and distributed multi-GPU systems. We bring out the similarities and differences of different works and highlight their key attributes. This survey will be useful for both novices and experts in the fields of machine learning, processor architecture and high-performance computing.
Problems from a wide variety of application domains can be modeled as a "nondeterministic finite automaton" (NFA), and hence efficient execution of NFAs can improve the performance of several key applications. However, traditional architectures such as CPUs and GPUs are not inherently suited to executing NFAs, so special-purpose architectures are required for accelerating them. Micron's automata processor (AP) exploits the massively parallel in-memory processing capability of DRAM for executing NFAs, and hence it can provide orders-of-magnitude performance improvement over traditional architectures. In this paper, we present a survey of techniques that propose architectural optimizations to the AP and use it for accelerating problems from various application domains. This paper will be useful not only for computer architects and processor designers but also for researchers in the fields of bioinformatics, data mining, machine learning and others.
Intel's Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance-optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that compare Phi with CPUs and GPUs or execute applications collaboratively across them. This paper will be useful for computer architects, developers seeking to accelerate their applications, and researchers in the area of high-performance computing.
Value prediction holds the promise of significantly improving performance and energy efficiency. However, if values are predicted incorrectly, significant performance overheads are incurred due to execution rollbacks. To address these overheads, value approximation has been introduced; it leverages the observation that rollbacks are unnecessary as long as the application-level loss in quality due to value misprediction is acceptable to the user. However, in the context of Graphics Processing Units (GPUs), our evaluations show that existing approximate value predictors are suboptimal because they do not consider memory request order, a key characteristic in determining the accuracy of value prediction. As a result, the overall data-movement reduction benefits are capped, since the percentage of predicted values (i.e., the prediction coverage) must be limited to keep the application-level error acceptable. To this end, we propose a new Address-Stride Assisted Approximate Value Predictor (ASAP) that explicitly considers memory addresses and their request-order information so as to provide high value-prediction accuracy. We take advantage of our new observation that the stride between memory request addresses and the stride between their corresponding data values are highly correlated in several applications. Therefore, ASAP predicts values only for those requests that have regular strides in their addresses. We evaluate ASAP on a diverse set of GPGPU applications. The results show that ASAP can significantly improve value-prediction accuracy over previously proposed mechanisms at the same coverage, or can achieve higher coverage (leading to higher performance/energy improvements) under a fixed error threshold.
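The correlation that ASAP exploits is easy to illustrate in software: when both the addresses and the values of a stream of memory requests form regular strides, later values can be predicted from the address alone. The following is only a conceptual toy, not the hardware predictor described in the paper.

def predict(base_addr, base_val, addr_stride, val_stride, addr):
    # Predict the value at 'addr' from the learned address stride and value stride.
    steps = (addr - base_addr) // addr_stride
    return base_val + steps * val_stride

# Example stream: an array with a[i] = 100 + 4*i stored at 8-byte-strided addresses.
addrs  = [0x1000 + 8 * i for i in range(8)]
values = [100 + 4 * i for i in range(8)]

# Learn the strides from the first two (address, value) pairs...
addr_stride = addrs[1] - addrs[0]
val_stride  = values[1] - values[0]

# ...and predict the remaining values from their addresses only.
for a, v in zip(addrs[2:], values[2:]):
    p = predict(addrs[0], values[0], addr_stride, val_stride, a)
    print(f"addr={hex(a)}  actual={v}  predicted={p}  correct={p == v}")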
The design of hardware accelerators for neural network (NN) applications involves walking a tightrope amid the constraints of low power, high accuracy and high throughput. NVIDIA's Jetson is a promising platform for embedded machine learning that seeks to achieve a balance between these objectives. In this paper, we provide a survey of works that evaluate and optimize neural network applications on the Jetson platform. We review both hardware and algorithmic optimizations performed for running NN algorithms on Jetson and describe the real-life applications in which these algorithms have been deployed. We also review works that compare Jetson with similar platforms. While the survey focuses on Jetson as an exemplar embedded system, many of the ideas and optimizations apply just as well to existing and future embedded systems.
It is widely believed that the ability to run AI algorithms on low-cost, low-power platforms will be crucial for achieving the "AI for all" vision. This survey seeks to provide a glimpse of the recent progress towards that goal.
Mobile web traffic has now surpassed desktop web traffic and has become the primary means for service providers to reach out to billions of end-users. Due to this trend, optimization of mobile web browsing (MWB) has gained significant attention. In this paper, we present a survey of techniques, proposed in the last 6-7 years, for improving the efficiency of web browsing on mobile systems. We review techniques from both the networking domain (e.g., proxy and browser enhancements) and the processor-architecture domain (e.g., hardware customization, thread-to-core scheduling). We organize the research works based on key parameters to highlight their similarities and differences. Beyond summarizing the recent works, this survey aims to emphasize the need to architect for MWB as a first principle, instead of retrofitting for it.
The rising overheads of data movement and the limitations of general-purpose processing architectures have led to a huge surge of interest in the "processing-in-memory" (PIM) approach and "neural network" (NN) architectures. Spintronic memories facilitate efficient implementation of the PIM approach and NN accelerators, and offer several advantages over conventional memories. In this paper, we present a survey of spintronic architectures for PIM and NNs. We organize the works based on their main attributes to underscore their similarities and differences. This paper will be useful for researchers in the areas of artificial intelligence, hardware architecture, chip design and memory systems.
In modern processors, data movement consumes two orders of magnitude more energy than a floating-point operation, and hence data movement is becoming the primary bottleneck in scaling the performance of modern processors within a fixed power budget. Intelligent data-encoding techniques hold the promise of reducing data-movement energy. In this paper, we present a survey of encoding techniques for reducing data-movement energy. By classifying the works on key metrics, we bring out their similarities and differences. This paper is expected to be useful for computer architects, processor designers and researchers in the area of interconnect and memory-system design.
The recent trend in deep neural network (DNN) research is to make the networks more compact. The motivation behind designing compact DNNs is to improve energy efficiency: by virtue of having a lower memory footprint, compact DNNs require fewer off-chip accesses, which improves energy efficiency. However, we show that making DNNs compact has indirect and subtle implications which are not well understood. Reducing the number of parameters in DNNs increases the number of activations, which, in turn, increases the memory footprint. We evaluate several recently proposed compact DNNs on a Tesla P100 GPU and show that their "activations to parameters ratio" ranges from 1.4 to 32.8, and their "memory-footprint to model-size ratio" ranges from 15 to 443. This shows that a higher number of activations causes a large memory footprint, which increases on-chip/off-chip data movement. Furthermore, these parameter-reducing techniques reduce the arithmetic intensity, which increases the on-chip/off-chip memory bandwidth requirement. Due to these factors, the energy efficiency of compact DNNs may be significantly reduced, which is against the original motivation for designing compact DNNs.
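The tension between parameters and activations can be seen with back-of-the-envelope arithmetic. The layer shapes below are purely illustrative; they are not the networks evaluated in the paper.

def std_conv(c_in, c_out, k, h, w):
    params = k * k * c_in * c_out
    activations = c_out * h * w                   # output feature map
    return params, activations

def dw_separable_conv(c_in, c_out, k, h, w):
    params = k * k * c_in + c_in * c_out          # depthwise + pointwise filters
    activations = c_in * h * w + c_out * h * w    # intermediate + final feature maps
    return params, activations

p1, a1 = std_conv(128, 128, 3, 28, 28)
p2, a2 = dw_separable_conv(128, 128, 3, 28, 28)
print(f"standard : params={p1:7d}  activations={a1:7d}  act/param={a1/p1:5.2f}")
print(f"separable: params={p2:7d}  activations={a2:7d}  act/param={a2/p2:5.2f}")
# The separable variant cuts parameters by roughly 8x but doubles the activations,
# so its activations-to-parameters ratio (and memory-footprint pressure) is far higher.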
Deep convolutional neural networks (CNNs) have recently shown very high accuracy on a wide range of cognitive tasks, and due to this they have received significant interest from researchers. Given the high computational demands of CNNs, custom hardware accelerators are vital for boosting their performance. The high energy efficiency, computing capabilities and reconfigurability of FPGAs make them a promising platform for hardware acceleration of CNNs. In this paper, we present a survey of techniques for implementing and optimizing CNN algorithms on FPGAs. We organize the works into several categories to bring out their similarities and differences. This paper is expected to be useful for researchers in the areas of artificial intelligence, hardware architecture and system design.
The DESTINY tool can be downloaded from https://code.ornl.gov/3d_cache_modeling_tool/destiny . The technical report (http://goo.gl/qzyWFE) is an extension of our DATE 2015 paper (http://goo.gl/3nKAM2). The DESTINY manual is available at https://code.ornl.gov/3d_cache_modeling_tool/destiny/blob/master/Doc/DESTINY_Documentation.pdf
The code can be downloaded from https://drive.google.com/folderview?id=0B3CSJpITzNscMVBpb3pfUFcwVzQ&usp=sharing . These codes were used in the following paper: https://www.academia.edu/3982638/A_Study_of_Successive_Over-relaxation_SOR_Method_Parallelization_Over_Modern_HPC_Languages
Summary of "A Survey Of Architectural Techniques for Near-Threshold Computing" https://www.academia.edu/15478916/A_Survey_Of_Architectural_Techniques_for_Near-Threshold_Computing
Summary of "A Survey of Techniques for Cache Partitioning in Multicore Processors" https://www.academia.edu/31779426/A_Survey_of_Techniques_for_Cache_Partitioning_in_Multicore_Processors
Summary of "A Survey of Techniques for Architecting TLBs" https://www.academia.edu/29585076/A_Survey_of_Techniques_for_Architecting_TLBs
A summary of "A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors", CSUR, 2016 https://www.academia.edu/18301534/A_Survey_Of_Techniques_for_Architecting_and_Managing_Asymmetric_Multicore_Processors
Covers: (1) PCM resistance drift errors, (2) PCM write disturbance errors, (3) STT-RAM read disturbance errors, and (4) STT-RAM write errors.
An overview of cache bypassing techniques
Ongoing process scaling and the push for performance have led to increasingly severe reliability challenges for memory systems. For example, process variation (PV)--the deviation of parameters from their nominal specifications--threatens to slow down and even pause technological scaling, and addressing it is important for continuing the benefits of chip miniaturization. In this talk, I will present a brief background on memory reliability challenges, viz., PV and soft errors, along with their impact on systems ranging from embedded systems to supercomputers. Then, several architectural strategies for managing PV in different processor components (e.g., core, cache and main memory) and memory technologies (e.g., SRAM, embedded DRAM, DRAM and non-volatile memory) will be discussed.
With the increasing number of on-chip cores and CMOS scaling, the size of last-level caches (LLCs) is on the rise, and hence managing their leakage energy consumption has become vital for continuing to scale performance. In multicore systems, the locality of the memory access stream is significantly reduced due to the multiplexing of access streams from different running programs, and hence leakage-energy-saving techniques such as decay cache, which rely on memory access locality, do not save a large amount of energy. Techniques based on way-level allocation provide very coarse granularity, and techniques based on offline profiling become infeasible for a large number of cores. We present MASTER, a multicore cache energy-saving technique using dynamic cache reconfiguration. MASTER uses online profiling to predict the energy consumption of running programs at multiple LLC sizes. Using these estimates, suitable cache quotas are allocated to different programs using a cache-coloring scheme, and the unused LLC space is turned off to save energy. Even for 4-core systems, the implementation overhead of MASTER is only 0.8% of the L2 size. We evaluate MASTER using out-of-order simulations with multiprogrammed workloads from SPEC2006 and compare it with conventional cache leakage-energy-saving techniques. The results show that MASTER gives the highest energy saving and does not harm performance or cause unfairness. For 2- and 4-core simulations, the average savings in memory-subsystem (LLC plus main memory) energy over a shared baseline LLC are 15% and 11%, respectively. Also, the average values of weighted speedup and fair speedup are close to one (>0.98).
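As a rough illustration of the quota-allocation step, the sketch below picks per-program color quotas that minimize the total estimated energy and turns off the leftover colors. The per-program energy estimates are fabricated; MASTER obtains such estimates through online profiling, and its reconfiguration hardware is not modeled here.

from itertools import product

TOTAL_COLORS = 16
# energy[program][colors] -> estimated memory-subsystem energy (arbitrary units, fabricated)
energy = {
    "progA": {2: 9.0, 4: 6.5, 8: 6.8},
    "progB": {2: 7.0, 4: 4.0, 8: 4.3},
}

best = None
for quota in product(*[sorted(curve) for curve in energy.values()]):
    if sum(quota) <= TOTAL_COLORS:
        total = sum(energy[prog][q] for prog, q in zip(energy, quota))
        if best is None or total < best[0]:
            best = (total, dict(zip(energy, quota)))

total_energy, quotas = best
colors_off = TOTAL_COLORS - sum(quotas.values())
print(f"quotas={quotas}  colors turned off={colors_off}  estimated energy={total_energy}")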
Recent trends of CMOS scaling and the use of large last-level caches (LLCs) have led to a significant increase in the leakage energy consumption of LLCs, and hence managing their energy consumption has become extremely important in modern processor design. Conventional cache energy-saving techniques require offline profiling or provide only a coarse granularity of cache allocation. We present FlexiWay, a cache energy-saving technique that uses dynamic cache reconfiguration. FlexiWay logically divides the cache sets into multiple (e.g., 16) modules and dynamically turns off a suitable, and possibly different, number of cache ways in each module. FlexiWay has a very small implementation overhead, and it provides fine-grained cache allocation even with caches of typical associativity, e.g., an 8-way cache. Microarchitectural simulations have been performed using an x86-64 simulator and workloads from the SPEC2006 suite, and FlexiWay has been compared with two conventional energy-saving techniques. The results show that FlexiWay provides the largest energy saving and incurs only a small loss in performance. For single-, dual- and quad-core systems, the average energy savings using FlexiWay are 26.2%, 25.7% and 22.4%, respectively.
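A much-simplified view of the per-module allocation: each module independently keeps the smallest number of ways whose estimated miss increase stays below a threshold, and the remaining ways are turned off. The miss curves and the threshold rule below are fabricated assumptions for illustration, not FlexiWay's actual decision logic.

ASSOC = 8          # ways per set
THRESHOLD = 0.02   # acceptable extra miss rate per module (illustrative)

# miss_curves[m][k-1] = estimated miss rate of module m when k ways are enabled (fabricated)
miss_curves = [
    [0.40, 0.18, 0.10, 0.06, 0.05, 0.05, 0.05, 0.05],   # saturates early
    [0.50, 0.35, 0.25, 0.18, 0.12, 0.08, 0.06, 0.05],   # needs most of its ways
    [0.20, 0.08, 0.05, 0.04, 0.04, 0.04, 0.04, 0.04],   # very cache-friendly
]

for m, curve in enumerate(miss_curves):
    full = curve[ASSOC - 1]
    ways = next(k for k in range(1, ASSOC + 1) if curve[k - 1] - full <= THRESHOLD)
    print(f"module {m}: enable {ways}/{ASSOC} ways, turn off {ASSOC - ways}")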
To address the limitations of SRAM, such as high leakage and low density, researchers have explored the use of non-volatile memory (NVM) devices, such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM), for designing on-chip caches. A crucial limitation of NVMs, however, is that their write endurance is low, and the large intra-set write variation introduced by existing cache management policies may further exacerbate this problem, thereby reducing cache lifetime significantly. We present EqualChance, a technique to increase cache lifetime by reducing intra-set write variation. EqualChance works by periodically changing the physical cache-block location of a write-intensive data item within a set to achieve wear-leveling. Simulations using workloads from the SPEC CPU2006 suite and the HPC (high-performance computing) field show that EqualChance improves cache lifetime by 4.29X. Also, its implementation overhead is small, and it incurs very small performance and energy losses.
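The relocation idea can be mimicked with a toy simulation of one cache set: after a fixed number of writes, the data in the most-written way is swapped with the data in the least-written way. The trigger condition, swap policy and synthetic write stream are simplifications for illustration, not the exact mechanism of the paper.

import random

ASSOC, INTERVAL, TOTAL_WRITES = 8, 64, 10_000
random.seed(1)

way_writes = [0] * ASSOC                       # writes absorbed by each physical way of the set
location = {blk: blk for blk in range(ASSOC)}  # logical block -> physical way it currently occupies

for t in range(1, TOTAL_WRITES + 1):
    blk = 0 if random.random() < 0.7 else random.randrange(ASSOC)   # block 0 is write-hot
    way_writes[location[blk]] += 1
    if t % INTERVAL == 0:
        hot = max(range(ASSOC), key=way_writes.__getitem__)
        cold = min(range(ASSOC), key=way_writes.__getitem__)
        inverse = {way: blk for blk, way in location.items()}
        location[inverse[hot]], location[inverse[cold]] = cold, hot  # move hot data to the cold way

print("writes per way:", way_writes)
print(f"max/mean write ratio: {max(way_writes) / (sum(way_writes) / ASSOC):.2f}  (1.0 = perfectly even)")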
Driven by the trends of increasing core counts and the bandwidth-wall problem, the size of last-level caches (LLCs) has greatly increased. Since SRAM consumes high leakage power, researchers have explored the use of non-volatile memories (NVMs) for designing caches, as they provide high density and consume low leakage power. However, since NVMs have low write endurance and existing cache management policies are write-variation-unaware, effective wear-leveling techniques are required for achieving reasonable cache lifetimes with NVMs. We present WriteSmoothing, a technique for mitigating intra-set write variation in NVM caches. WriteSmoothing logically divides the cache sets into multiple modules. For each module, WriteSmoothing collectively records the number of writes to each way, aggregated over the sets of the module. It then periodically makes the most frequently written ways in a module unavailable, to shift the write pressure to other ways in the sets of that module. Extensive simulation results show that, on average, for single- and dual-core system configurations, WriteSmoothing improves cache lifetime by 2.17X and 2.75X, respectively. Also, its implementation overhead is small, and it works well for a wide range of algorithm and system parameters.
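The way-masking idea can likewise be sketched as a toy: per-way write counts are accumulated over an interval, and the hottest ways are then made unavailable for the next interval so that write pressure shifts elsewhere. The interval length, the number of blocked ways and the synthetic write stream below are all assumptions for illustration.

import random

ASSOC, INTERVAL, TOTAL_WRITES, WAYS_TO_BLOCK = 8, 512, 40_000, 2
random.seed(3)

lifetime = [0] * ASSOC     # lifetime writes per way (what limits endurance)
recent = [0] * ASSOC       # writes per way in the current interval
blocked = set()            # ways temporarily made unavailable

for t in range(1, TOTAL_WRITES + 1):
    r = random.random()
    way = 0 if r < 0.5 else (1 if r < 0.75 else random.randrange(ASSOC))   # skewed write stream
    if way in blocked:      # the write is steered to one of the remaining ways
        way = random.choice([w for w in range(ASSOC) if w not in blocked])
    lifetime[way] += 1
    recent[way] += 1
    if t % INTERVAL == 0:   # end of interval: rest the hottest ways during the next interval
        blocked = set(sorted(range(ASSOC), key=recent.__getitem__)[-WAYS_TO_BLOCK:])
        recent = [0] * ASSOC

print("lifetime writes per way:", lifetime)
print(f"max/mean write ratio: {max(lifetime) / (sum(lifetime) / ASSOC):.2f}")
# Without the blocking step, way 0 alone would absorb over half of all writes in this stream.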
The use of NVM (non-volatile memory) devices such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM) for designing on-chip caches holds the promise of providing a high-density, low-leakage alternative to SRAM. However, the low write endurance of NVMs, along with the write variation introduced by existing cache management schemes, may significantly limit the lifetime of NVM caches. We present LastingNVCache, a technique for improving the lifetime of NVM caches by mitigating intra-set write variation. LastingNVCache works on the key idea that by periodically flushing a frequently written data item, the block can be made to load into a cold block of the set on its next access. Through this, future writes to that data item are redirected from a hot block to a cold block, which improves cache lifetime. Microarchitectural simulations show that LastingNVCache provides 6.36X, 9.79X, and 10.94X improvement in lifetime for single-, dual- and quad-core systems, respectively. Also, its implementation overhead is small, and it outperforms a recently proposed technique for improving the lifetime of NVM caches.
With each CMOS technology generation, leakage energy has been increasing at an exponential rate, and hence managing the energy consumption of large, last-level caches is becoming a critical research issue in modern chip design. Saving cache energy in QoS systems is especially challenging, since, to avoid missing deadlines, a suitable balance needs to be struck between energy saving and performance loss. In this paper, we propose CASHIER, a Cache Energy Saving Technique for Quality of Service Systems. Cashier uses dynamic profiling to estimate the memory-subsystem energy and execution time of the program under multiple last-level cache (LLC) configurations. It then reconfigures the LLC to an energy-efficient configuration with a view to meeting the deadline. In QoS systems, the allowed slack may be specified either as a percentage of the baseline execution time or as an absolute slack, and Cashier can work for both cases. The experiments show the effectiveness of Cashier in saving cache energy. For example, for an L2 cache size of 2MB and a 5% allowed slack over the baseline, the average saving in memory-subsystem energy using Cashier is 23.6%.
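The configuration-selection step can be summarized as: among the profiled LLC configurations, choose the lowest-energy one whose predicted execution time still meets the deadline (baseline time plus the allowed slack, given as a percentage or as an absolute value). The numbers below are fabricated; the actual technique obtains such estimates through dynamic profiling.

def pick_config(profiles, baseline_time, slack, slack_is_percent=True):
    deadline = baseline_time * (1 + slack / 100.0) if slack_is_percent else baseline_time + slack
    feasible = [p for p in profiles if p["time"] <= deadline]
    # Fall back to the fastest configuration if nothing meets the deadline.
    return min(feasible, key=lambda p: p["energy"]) if feasible else min(profiles, key=lambda p: p["time"])

profiles = [  # active LLC size, predicted execution time (s), predicted memory-subsystem energy (J)
    {"llc": "2MB",   "time": 10.0, "energy": 50.0},
    {"llc": "1MB",   "time": 10.3, "energy": 41.0},
    {"llc": "512KB", "time": 10.9, "energy": 38.0},
    {"llc": "256KB", "time": 12.5, "energy": 44.0},
]

print(pick_config(profiles, baseline_time=10.0, slack=5))                              # 5% slack -> 1MB
print(pick_config(profiles, baseline_time=10.0, slack=1.0, slack_is_percent=False))    # 1 s slack -> 512KB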
As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors and hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM poses serious obstacles in its use for designing large on-chip caches and for this reason, researchers are exploring non-volatile memory (NVM) devices, such as spin torque transfer RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making them a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help the reader gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.