
Core Placement Optimization for Multi-chip Many-core Neural Network Systems with Reinforcement Learning

Authors: Nan Wu, Lei Deng, Guoqi Li, Yuan Xie
Article No.: 11, Pages 1-27
Published: 19 October 2020

Abstract

Multi-chip many-core neural network systems provide high parallelism by benefiting from decentralized execution, and they can be scaled to very large systems at reasonable fabrication cost. As multi-chip many-core systems scale up, communication-latency-related effects account for a growing portion of system performance. While previous work mainly focuses on core placement within a single chip, two principal issues remain unresolved: the communication problems caused by the non-uniform, hierarchical on/off-chip communication capability of multi-chip systems, and the scalability of heuristic-based approaches in a factorially growing search space. To this end, we propose a reinforcement-learning-based method that automatically optimizes core placement via deep deterministic policy gradient, gathering information about the environment through a series of trials (i.e., placements) and using convolutional neural networks to extract spatial features of different placements. Experimental results indicate that, compared with a naive sequential placement, the proposed method achieves a 1.99× increase in throughput and a 50.5% reduction in latency; compared with simulated annealing, an effective technique for approximating the global optimum in an extremely large search space, our method improves throughput by 1.22× and reduces latency by 18.6%. We further demonstrate that the proposed method can find optimal placements that exploit the different communication properties arising from different system configurations, and that it works in a topology-agnostic manner.
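The cost asymmetry that drives this placement problem can be illustrated with a toy model. All parameters below (2 chips, 4×4 meshes, hop and link costs) are illustrative assumptions, and random-swap hill climbing stands in for the learned policy; the paper's actual method uses deep deterministic policy gradient with CNN feature extraction.

```python
import random

# Toy multi-chip system: CHIPS chips, each an N x N mesh of cores.
# Costs are assumed values; real systems expose a far larger gap
# between on-chip mesh hops and off-chip (e.g., SerDes) links.
CHIPS, N = 2, 4
ON_CHIP_HOP = 1.0     # cost per on-chip mesh hop (assumed)
OFF_CHIP_LINK = 20.0  # extra cost for crossing a chip boundary (assumed)

def comm_cost(placement, traffic):
    """Total cost of a placement mapping logical core -> (chip, x, y)."""
    total = 0.0
    for (src, dst), vol in traffic.items():
        c1, x1, y1 = placement[src]
        c2, x2, y2 = placement[dst]
        cost = (abs(x1 - x2) + abs(y1 - y2)) * ON_CHIP_HOP
        if c1 != c2:                      # hierarchical off-chip penalty
            cost += OFF_CHIP_LINK
        total += vol * cost
    return total

# Pipeline-style traffic: logical core i sends to core i + 1.
num_cores = CHIPS * N * N
traffic = {(i, i + 1): 1.0 for i in range(num_cores - 1)}

# Naive sequential placement: core i goes to the i-th physical slot.
slots = [(c, x, y) for c in range(CHIPS) for x in range(N) for y in range(N)]
sequential = dict(enumerate(slots))

# Stand-in for the learned policy: random-swap hill climbing over trials.
random.seed(0)
placement, cur = dict(sequential), comm_cost(sequential, traffic)
for _ in range(2000):
    a, b = random.sample(range(num_cores), 2)
    placement[a], placement[b] = placement[b], placement[a]
    new = comm_cost(placement, traffic)
    if new < cur:
        cur = new
    else:
        placement[a], placement[b] = placement[b], placement[a]  # revert

print("sequential:", comm_cost(sequential, traffic), "optimized:", cur)
```

A learned policy replaces the blind swaps with guided trials, but the objective it optimizes has the same shape: minimize traffic-weighted distance while avoiding expensive chip-boundary crossings.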





      Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 26, Issue 2
      March 2021
      220 pages
      ISSN:1084-4309
      EISSN:1557-7309
      DOI:10.1145/3430836

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Publication History

      Published: 19 October 2020
      Accepted: 01 August 2020
      Revised: 01 July 2020
      Received: 01 April 2020
      Published in TODAES Volume 26, Issue 2


      Author Tags

      1. Multi-chip many-core architecture
      2. core placement optimization
      3. machine learning for system
      4. neural network accelerator

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • NSF

Article Metrics

• Downloads (last 12 months): 341
• Downloads (last 6 weeks): 41
Reflects downloads up to 27 Aug 2024

Cited By

• (2023) Approaching the mapping limit with closed-loop mapping strategy for deploying neural networks on neuromorphic hardware. Frontiers in Neuroscience 17. DOI: 10.3389/fnins.2023.1168864. Online: 18 May 2023.
• (2023) A Survey of Machine Learning for Computer Architecture and Systems. ACM Computing Surveys 55, 3, 1-39. DOI: 10.1145/3494523. Online: 30 Apr 2023.
• (2023) Policy Gradient-Based Core Placement Optimization for Multichip Many-Core Systems. IEEE Transactions on Neural Networks and Learning Systems 34, 8, 4529-4543. DOI: 10.1109/TNNLS.2021.3117878. Online: Aug 2023.
• (2023) CANNON: Communication-Aware Sparse Neural Network Optimization. IEEE Transactions on Emerging Topics in Computing 11, 4, 882-894. DOI: 10.1109/TETC.2023.3289778. Online: Oct 2023.
• (2023) HeterGenMap: An Evolutionary Mapping Framework for Heterogeneous NoC-Based Neuromorphic Systems. IEEE Access 11, 144095-144112. DOI: 10.1109/ACCESS.2023.3345168.
• (2022) Coordinated Batching and DVFS for DNN Inference on GPU Accelerators. IEEE Transactions on Parallel and Distributed Systems 33, 10, 2496-2508. DOI: 10.1109/TPDS.2022.3144614. Online: 1 Oct 2022.
• (2022) Preferred Benchmarking Criteria for Systematic Taxonomy of Embedded Platforms (STEP) in Human System Interaction Systems. 15th International Conference on Human System Interaction (HSI), 1-7. DOI: 10.1109/HSI55341.2022.9869470. Online: 28 Jul 2022.
• (2022) Rapid Design-Space Exploration for Low-Power Manycores Under Process Variation Utilizing Machine Learning. IEEE Access 10, 70187-70203. DOI: 10.1109/ACCESS.2022.3187140.
• (2021) RAMAN: Reinforcement Learning Inspired Algorithm for Mapping Applications onto Mesh Network-on-Chip. ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP), 52-58. DOI: 10.1109/SLIP52707.2021.00019. Online: Nov 2021.
• (2021) VLSI Structure-aware Placement for Convolutional Neural Network Accelerator Units. 58th ACM/IEEE Design Automation Conference (DAC), 1117-1122. DOI: 10.1109/DAC18074.2021.9586294. Online: 5 Dec 2021.
