# Papers that use sparsity in deep learning

This is a list of papers curated for the paper “Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks”.

The following list is automatically generated from sparsity.bib. To contribute to this list, please open a Pull Request that adds new BibTeX entries to that file.
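As a rough illustration of what such a contribution looks like (the entry key, authors, title, venue, and URL below are purely hypothetical placeholders, not taken from sparsity.bib), a new entry could be formatted like this:

```bibtex
% Hypothetical placeholder entry -- replace the key and every field
% with the actual metadata of the paper being added to sparsity.bib.
@inproceedings{doe2021sparseexample,
  author    = {Doe, Jane and Smith, John},
  title     = {An Illustrative Paper on Sparsity in Deep Learning},
  booktitle = {Proceedings of a Machine Learning Conference},
  year      = {2021},
  url       = {https://arxiv.org/abs/0000.00000}
}
```

Since this README is regenerated from sparsity.bib, only the .bib file needs to be edited; the rendered list below is updated automatically.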

## Papers

Achille, Alessandro, Matteo Rovere, and Stefano Soatto. 2019. “Critical Learning Periods in Deep Neural Networks.” http://arxiv.org/abs/1711.08856.

Afghan, Sher, and Uwe Naumann. 2020. “Interval Adjoint Significance Analysis for Neural Networks.” In International Conference on Computational Science, 365–78. Springer.

Aghasi, Alireza, Afshin Abdi, Nam Nguyen, and Justin Romberg. 2017. “Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee.” http://arxiv.org/abs/1611.05162.

Ahmad, Subutai, and Luiz Scheinkman. 2019. “How Can We Be so Dense? The Benefits of Using Highly Sparse Representations.” http://arxiv.org/abs/1903.11257.

Aji, Alham Fikri, and Kenneth Heafield. 2017. “Sparse Communication for Distributed Gradient Descent.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 440–45. http://arxiv.org/abs/1704.05021.

Albericio, J., P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. 2016. “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.” In 2016 Acm/Ieee 43rd Annual International Symposium on Computer Architecture (Isca), 1–13. https://doi.org/10.1109/ISCA.2016.11.

Alistarh, Dan, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. “QSGD: Communication-Efficient Sgd via Gradient Quantization and Encoding.” http://arxiv.org/abs/1610.02132.

Alistarh, Dan, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. “The Convergence of Sparsified Gradient Methods.” In Advances in Neural Information Processing Systems, 5973–83. http://arxiv.org/abs/1809.10505.

Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2019. “A Convergence Theory for Deep Learning via over-Parameterization.” http://arxiv.org/abs/1811.03962.

Almahairi, Amjad, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. 2016. “Dynamic Capacity Networks.” http://arxiv.org/abs/1511.07838.

Alvarez, Jose M., and Mathieu Salzmann. 2017. “Compression-Aware Training of Deep Networks.” http://arxiv.org/abs/1711.02638.

Alwani, Manoj, Han Chen, Michael Ferdman, and Peter Milder. 2016. “Fused-Layer Cnn Accelerators.” In The 49th Annual Ieee/Acm International Symposium on Microarchitecture, 22. IEEE Press.

Amari, Shun-ichi. 1998. “Natural Gradient Works Efficiently in Learning.” Neural Computation 10 (2): 251–76. https://doi.org/10.1162/089976698300017746.

Anwar, Sajid, Kyuyeon Hwang, and Wonyong Sung. 2017. “Structured Pruning of Deep Convolutional Neural Networks.” ACM Journal on Emerging Technologies in Computing Systems (JETC) 13 (3): 1–18.

Atashgahi, Zahra, Ghada Sokar, Tim van der Lee, Elena Mocanu, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. 2020. “Quick and Robust Feature Selection: The Strength of Energy-Efficient Sparse Training for Autoencoders.” http://arxiv.org/abs/2012.00560.

Azarian, Kambiz, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. 2020. “Learned Threshold Pruning.” http://arxiv.org/abs/2003.00075.

Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations.”

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. “Layer Normalization.” http://arxiv.org/abs/1607.06450.

Baalen, Mart van, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. 2020. “Bayesian Bits: Unifying Quantization and Pruning.” http://arxiv.org/abs/2005.07093.

Baldi, Pierre, and Peter J Sadowski. 2013. “Understanding Dropout.” In Advances in Neural Information Processing Systems, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 26:2814–22. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2013/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf.

Bartoldson, Brian R., Ari S. Morcos, Adrian Barbu, and Gordon Erlebacher. 2020. “The Generalization-Stability Tradeoff in Neural Network Pruning.” http://arxiv.org/abs/1906.03728.

Basu, Debraj, Deepesh Data, Can Karakus, and Suhas N Diggavi. 2020. “Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations.” IEEE Journal on Selected Areas in Information Theory 1 (1): 217–26. http://arxiv.org/abs/1906.02367.

Baykal, Cenk, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. 2018. “Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds.” arXiv Preprint arXiv:1804.05345.

Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM J. Img. Sci. 2 (1): 183–202. https://doi.org/10.1137/080716542.

Bellec, Guillaume, David Kappel, Wolfgang Maass, and Robert Legenstein. 2018. “Deep Rewiring: Training Very Sparse Deep Networks.” http://arxiv.org/abs/1711.05136.

Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020. “Longformer: The Long-Document Transformer.” http://arxiv.org/abs/2004.05150.

Bengio, Emmanuel, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2016. “Conditional Computation in Neural Networks for Faster Models.” http://arxiv.org/abs/1511.06297.

Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. 2013. “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.” http://arxiv.org/abs/1308.3432.

Ben-Nun, Tal, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. 2019. “A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.” http://arxiv.org/abs/1901.10183.

Ben-Nun, Tal, and Torsten Hoefler. 2018. “Demystifying Parallel and Distributed Deep Learning: An in-Depth Concurrency Analysis.” http://arxiv.org/abs/1802.09941.

Betzel, Richard F, John D Medaglia, Lia Papadopoulos, Graham L Baum, Ruben Gur, Raquel Gur, David Roalf, Theodore D Satterthwaite, and Danielle S Bassett. 2017. “The Modular Organization of Human Anatomical Brain Networks: Accounting for the Cost of Wiring.” Network Neuroscience 1 (1): 42–68.

Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. 2018. “Benchmark Analysis of Representative Deep Neural Network Architectures.” IEEE Access 6: 64270–7. https://doi.org/10.1109/access.2018.2877890.

Blalock, Davis, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. “What Is the State of Neural Network Pruning?” http://arxiv.org/abs/2003.03033.

Bourely, Alfred, John Patrick Boueri, and Krzysztof Choromonski. 2017. “Sparse Neural Networks Topologies.” http://arxiv.org/abs/1706.05683.

Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/2005.14165.

Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2017. “SGD Learns over-Parameterized Networks That Provably Generalize on Linearly Separable Data.” http://arxiv.org/abs/1710.10174.

Burrascano, P. 1993. “A Pruning Technique Maximizing Generalization.” In Proceedings of 1993 International Conference on Neural Networks (Ijcnn-93-Nagoya, Japan), 1:347–50 vol.1. https://doi.org/10.1109/IJCNN.1993.713928.

Carreira-Perpinan, M. A., and Y. Idelbayev. 2018. “"Learning-Compression" Algorithms for Neural Net Pruning.” In 2018 Ieee/Cvf Conference on Computer Vision and Pattern Recognition, 8532–41. https://doi.org/10.1109/CVPR.2018.00890.

Castellano, G., A. M. Fanelli, and M. Pelillo. 1997. “An Iterative Pruning Algorithm for Feedforward Neural Networks.” IEEE Transactions on Neural Networks 8 (3): 519–31. https://doi.org/10.1109/72.572092.

Castellano, Giovanna, and Anna Maria Fanelli. 2000. “Variable Selection Using Neural-Network Models.” Neurocomputing 31 (1-4): 1–13.

Chandrasekaran, Hema, Hung-Han Chen, and Michael T. Manry. 2000. “Pruning of Basis Functions in Nonlinear Approximators.” Neurocomputing 34 (1): 29–53. https://doi.org/10.1016/S0925-2312(00)00311-8.

Changpinyo, Soravit, Mark Sandler, and Andrey Zhmoginov. 2017. “The Power of Sparsity in Convolutional Neural Networks.” http://arxiv.org/abs/1702.06257.

Chao, Shih-Kang, Zhanyu Wang, Yue Xing, and Guang Cheng. 2020. “Directional Pruning of Deep Neural Networks.” http://arxiv.org/abs/2006.09358.

Chauvin, Yves. 1989. “A Back-Propagation Algorithm with Optimal Use of Hidden Units.” In Advances in Neural Information Processing Systems 1, 519–26. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Chellapilla, Kumar, Sidd Puri, and Patrice Simard. 2006. “High Performance Convolutional Neural Networks for Document Processing.” In Tenth International Workshop on Frontiers in Handwriting Recognition.

Chen, Chia-Yu, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. 2017. “AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training.” In 32nd Aaai Conference on Artificial Intelligence, 2827–35. http://arxiv.org/abs/1712.02679.

Chen, Jianda, Shangyu Chen, and Sinno Jialin Pan. 2020. “Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning.” Advances in Neural Information Processing Systems 33.

Chen, Tianlong, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. “The Lottery Ticket Hypothesis for Pre-Trained Bert Networks.” http://arxiv.org/abs/2007.12223.

Chen, Y., T. Krishna, J. S. Emer, and V. Sze. 2017. “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.” IEEE Journal of Solid-State Circuits 52 (1): 127–38. https://doi.org/10.1109/JSSC.2016.2616357.

Chen, Yu-Hsin, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. “Eyeriss V2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices.” http://arxiv.org/abs/1807.07928.

Cheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2020. “A Survey of Model Compression and Acceleration for Deep Neural Networks.” http://arxiv.org/abs/1710.09282.

Chérief-Abdellatif, Badr-Eddine. 2019. “Convergence Rates of Variational Inference in Sparse Deep Learning.” http://arxiv.org/abs/1908.04847.

Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. “cuDNN: Efficient Primitives for Deep Learning.” http://arxiv.org/abs/1410.0759.

Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. “Generating Long Sequences with Sparse Transformers.” http://arxiv.org/abs/1904.10509.

Cho, Minsu, Ameya Joshi, and Chinmay Hegde. 2020. “ESPN: Extremely Sparse Pruned Networks.” http://arxiv.org/abs/2006.15741.

Choudhary, Tejalal, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. “A Comprehensive Survey on Model Compression and Acceleration.” Artificial Intelligence Review, 1–43.

Cibas, Tautvydas, Françoise Fogelman Soulié, Patrick Gallinari, and Sarunas Raudys. 1996. “Variable Selection with Neural Networks.” Neurocomputing 12 (2): 223–48. https://doi.org/10.1016/0925-2312(95)00121-2.

Cohen, Joseph Paul, Henry Z. Lo, and Wei Ding. 2017. “RandomOut: Using a Convolutional Gradient Norm to Rescue Convolutional Filters.” http://arxiv.org/abs/1602.05931.

Collins, Maxwell D., and Pushmeet Kohli. 2014. “Memory Bounded Deep Convolutional Networks.” CoRR abs/1412.1442. http://arxiv.org/abs/1412.1442.

Correia, Gonçalo M, Vlad Niculae, and André FT Martins. 2019. “Adaptively Sparse Transformers.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp). http://arxiv.org/abs/1909.00015.

Cosentino, Justin, Federico Zaiter, Dan Pei, and Jun Zhu. 2019. “The Search for Sparse, Robust Neural Networks.” http://arxiv.org/abs/1912.02386.

Cui, Baiyun, Yingming Li, Ming Chen, and Zhongfei Zhang. 2019. “Fine-Tune BERT with Sparse Self-Attention Mechanism.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp), 3539–44.

Dai, Bin, Chen Zhu, and David Wipf. 2018. “Compressing Neural Networks Using the Variational Information Bottleneck.” http://arxiv.org/abs/1802.10399.

Dai, Xiaoliang, Hongxu Yin, and Niraj K. Jha. 2018. “NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm.” http://arxiv.org/abs/1711.02017.

d’Ascoli, Stéphane, Levent Sagun, Joan Bruna, and Giulio Biroli. 2020. “Finding the Needle in the Haystack with Convolutions: On the Benefits of Architectural Bias.” http://arxiv.org/abs/1906.06766.

Dave, Shail, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. 2020. “Hardware Acceleration of Sparse and Irregular Tensor Computations of Ml Models: A Survey and Insights.” http://arxiv.org/abs/2007.00864.

Davies, Peter, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. 2020. “Distributed Variance Reduction with Optimal Communication.” http://arxiv.org/abs/2002.09268.

Deng, L., G. Li, S. Han, L. Shi, and Y. Xie. 2020. “Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey.” Proceedings of the IEEE 108 (4): 485–532. https://doi.org/10.1109/JPROC.2020.2976475.

Denil, Misha, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. 2014. “Predicting Parameters in Deep Learning.” http://arxiv.org/abs/1306.0543.

Denton, Emily L, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. “Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation.” Advances in Neural Information Processing Systems 27: 1269–77.

Dettmers, Tim, and Luke Zettlemoyer. 2019. “Sparse Networks from Scratch: Faster Training Without Losing Performance.” http://arxiv.org/abs/1907.04840.

De Vivo, Luisa, Michele Bellesi, William Marshall, Eric A Bushong, Mark H Ellisman, Giulio Tononi, and Chiara Cirelli. 2017. “Ultrastructural Evidence for Synaptic Scaling Across the Wake/Sleep Cycle.” Science 355 (6324): 507–10.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86.

Dey, S., K. Huang, P. A. Beerel, and K. M. Chugg. 2019. “Pre-Defined Sparse Neural Networks with Hardware Acceleration.” IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (2): 332–45. https://doi.org/10.1109/JETCAS.2019.2910864.

Diering, Graham H, Raja S Nirujogi, Richard H Roth, Paul F Worley, Akhilesh Pandey, and Richard L Huganir. 2017. “Homer1a Drives Homeostatic Scaling-down of Excitatory Synapses During Sleep.” Science 355 (6324): 511–15.

Ding, Xiaohan, Guiguang Ding, Yuchen Guo, and Jungong Han. 2019. “Centripetal Sgd for Pruning Very Deep Convolutional Networks with Complicated Structure.” http://arxiv.org/abs/1904.03837.

Ding, Xiaohan, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. 2019. “Global Sparse Momentum Sgd for Pruning Very Deep Neural Networks.” http://arxiv.org/abs/1909.12778.

Dolan, William B, and Chris Brockett. 2005. “Automatically Constructing a Corpus of Sentential Paraphrases.” In Proceedings of the Third International Workshop on Paraphrasing (Iwp2005).

Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” http://arxiv.org/abs/2012.00152.

Dong, Xiao, Lei Liu, Guangli Li, Jiansong Li, Peng Zhao, Xueying Wang, and Xiaobing Feng. 2019. “Exploiting the Input Sparsity to Accelerate Deep Neural Networks: Poster.” In Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Ppopp 2019, Washington, Dc, Usa, February 16-20, 2019, 401–2. https://doi.org/10.1145/3293883.3295713.

Dong, Xin, Shangyu Chen, and Sinno Jialin Pan. 2017. “Learning to Prune Deep Neural Networks via Layer-Wise Optimal Brain Surgeon.” http://arxiv.org/abs/1705.07565.

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In Proceedings of the Ninth International Conference on Learning Representations. http://arxiv.org/abs/2010.11929.

Dryden, Nikoli, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. 2016. “Communication Quantization for Data-Parallel Training of Deep Neural Networks.” In 2nd Workshop on Machine Learning in Hpc Environments (Mlhpc), 1–8.

Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019. “Gradient Descent Provably Optimizes over-Parameterized Neural Networks.” http://arxiv.org/abs/1810.02054.

Dutta, Aritra, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. 2020. “On the Discrepancy Between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning.” In Proceedings of the Aaai Conference on Artificial Intelligence, 34 (04): 3817–24. http://arxiv.org/abs/1911.08250.

Elsen, Erich, Marat Dukhan, Trevor Gale, and Karen Simonyan. 2019. “Fast Sparse Convnets.” http://arxiv.org/abs/1911.09723.

Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. 2019. “Neural Architecture Search: A Survey.” http://arxiv.org/abs/1808.05377.

Engelbrecht, Andries Petrus, Ian Cloete, and Jacek M Zurada. 1995. “Determining the Significance of Input Parameters Using Sensitivity Analysis.” In International Workshop on Artificial Neural Networks, 382–88. Springer.

Engelbrecht, A. P. 2001. “A New Pruning Heuristic Based on Variance Analysis of Sensitivity Information.” IEEE Transactions on Neural Networks 12 (6): 1386–99. https://doi.org/10.1109/72.963775.

Engelbrecht, A. P., and I. Cloete. 1996. “A Sensitivity Analysis Algorithm for Pruning Feedforward Neural Networks.” In Proceedings of International Conference on Neural Networks (Icnn’96), 2:1274–8 vol.2. https://doi.org/10.1109/ICNN.1996.549081.

Evci, Utku, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. “Rigging the Lottery: Making All Tickets Winners.” http://arxiv.org/abs/1911.11134.

Evci, Utku, Yani A. Ioannou, Cem Keskin, and Yann Dauphin. 2020. “Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win.” http://arxiv.org/abs/2010.03533.

Evci, Utku, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. 2020. “The Difficulty of Training Sparse Neural Networks.” http://arxiv.org/abs/1906.10732.

Fan, Angela, Edouard Grave, and Armand Joulin. 2020. “Reducing Transformer Depth on Demand with Structured Dropout.” In Proceedings of the Eighth International Conference on Learning Representations. http://arxiv.org/abs/1909.11556.

Fedus, William, Barret Zoph, and Noam Shazeer. 2021. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” http://arxiv.org/abs/2101.03961.

Finnoff, William, Ferdinand Hergert, and Hans Georg Zimmermann. 1993. “Improving Model Selection by Nonconvergent Methods.” Neural Networks 6 (6): 771–83.

Fletcher, L., V. Katkovnik, F. E. Steffens, and A. P. Engelbrecht. 1998. “Optimizing the Number of Hidden Nodes of a Feedforward Artificial Neural Network.” In 1998 Ieee International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227), 2:1608–12 vol.2. https://doi.org/10.1109/IJCNN.1998.686018.

Frankle, Jonathan, and Michael Carbin. 2019. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” http://arxiv.org/abs/1803.03635.

Frankle, Jonathan, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2020a. “Linear Mode Connectivity and the Lottery Ticket Hypothesis.” http://arxiv.org/abs/1912.05671.

———. 2020b. “Stabilizing the Lottery Ticket Hypothesis.” http://arxiv.org/abs/1903.01611.

———. 2021. “Pruning Neural Networks at Initialization: Why Are We Missing the Mark?” http://arxiv.org/abs/2009.08576.

Frankle, Jonathan, David J. Schwab, and Ari S. Morcos. 2020. “The Early Phase of Neural Network Training.” http://arxiv.org/abs/2002.10365.

Friedman, J., T. Hastie, and R. Tibshirani. 2010. “A Note on the Group Lasso and a Sparse Group Lasso.” http://arxiv.org/abs/1001.0736.

Friston, K.J. 2008. “Hierarchical Models in the Brain.” PLOS Computational Biology 4 (11): e1000211. https://doi.org/10.1371/journal.pcbi.1000211.

Gaier, Adam, and David Ha. 2019. “Weight Agnostic Neural Networks.” http://arxiv.org/abs/1906.04358.

Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning, edited by Maria Florina Balcan and Kilian Q. Weinberger, 48:1050–9. Proceedings of Machine Learning Research. New York, New York, USA: PMLR. http://proceedings.mlr.press/v48/gal16.html.

Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 30:3581–90. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/84ddfb34126fc3a48ee38d7044e87276-Paper.pdf.

Gale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of Sparsity in Deep Neural Networks.” http://arxiv.org/abs/1902.09574.

Gale, Trevor, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. “Sparse Gpu Kernels for Deep Learning.” http://arxiv.org/abs/2006.10901.

Ganesh, Prakhar, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. “Compressing Large-Scale Transformer-Based Models: A Case Study on BERT.” http://arxiv.org/abs/2002.11985.

Ge, Dongdong, Xiaoye Jiang, and Yinyu Ye. 2011. “A Note on the Complexity of Lp Minimization.” Mathematical Programming 129 (2): 285–99.

Georgiadis, Georgios. 2019. “Accelerating Convolutional Neural Networks via Activation Map Compression.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7085–95.

Ghiasi, Golnaz, Tsung-Yi Lin, and Quoc V Le. 2018. “DropBlock: A Regularization Method for Convolutional Networks.” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 31:10727–37. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/7edcfb2d8f6a659ef4cd1e6c9b6d7079-Paper.pdf.

Ghosh, Joydeep, and Kagan Tumer. 1994. “Structural Adaptation and Generalization in Supervised Feed-Forward Networks.” J. Artif. Neural Netw. 1 (4): 431–58.

Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In AISTATS, edited by Yee Whye Teh and D. Mike Titterington, 9:249–56. JMLR Proceedings. JMLR.org. http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#GlorotB10.

Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. 2011. “Deep Sparse Rectifier Neural Networks.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–23.

Golub, Maximilian, Guy Lemieux, and Mieszko Lis. 2019. “Full Deep Neural Network Training on a Pruned Weight Budget.” http://arxiv.org/abs/1806.06949.

Gomez, Aidan N., Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. 2019. “Learning Sparse Networks Using Targeted Dropout.” http://arxiv.org/abs/1905.13678.

Gondimalla, Ashish, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. “SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks.” In Proceedings of the 52nd Annual Ieee/Acm International Symposium on Microarchitecture, 151–65. MICRO ’52. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3352460.3358291.

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Networks.” http://arxiv.org/abs/1406.2661.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2672–80. http://arxiv.org/abs/1406.2661.

Gopalakrishnan, Soorya, Zhinus Marzi, Upamanyu Madhow, and Ramtin Pedarsani. 2018. “Combating Adversarial Attacks Using Sparse Representations.” http://arxiv.org/abs/1803.03880.

Gordon, Ariel, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. 2018. “Morphnet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 1586–95.

Gordon, Mitchell A., Kevin Duh, and Nicholas Andrews. 2020. “Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning.” In Proceedings of the 5th Workshop on Representation Learning for Nlp, 143–55. http://arxiv.org/abs/2002.08307.

Grönquist, Peter, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, and Torsten Hoefler. 2020. “Deep Learning for Post-Processing Ensemble Weather Forecasts.” http://arxiv.org/abs/2005.08748.

Gropp, William, Torsten Hoefler, Rajeev Thakur, and E. Lusk. 2014. Using Advanced MPI: Modern Features of the Message-Passing Interface. Cambridge, MA: MIT Press.

Gropp, William, Torsten Hoefler, Rajeev Thakur, and Jesper Larsson Träff. 2011. “Performance Expectations and Guidelines for MPI Derived Datatypes.” In Recent Advances in the Message Passing Interface (Eurompi’11), 6960:150–59. Santorini, Greece: Springer.

Grunwald, Peter. 2004. “A Tutorial Introduction to the Minimum Description Length Principle.” http://arxiv.org/abs/math/0406077.

Grünwald, Peter D. 2007. The Minimum Description Length Principle. MIT Press.

Gudovskiy, Denis, Alec Hodgkinson, and Luca Rigazio. 2018. “DNN Feature Map Compression Using Learned Representation over GF(2).” In Proceedings of the European Conference on Computer Vision (Eccv), 0–0.

Guerra, Luis, Bohan Zhuang, Ian Reid, and Tom Drummond. 2020. “Automatic Pruning for Quantized Neural Networks.” http://arxiv.org/abs/2002.00523.

Guo, Demi, Alexander M. Rush, and Yoon Kim. 2020. “Parameter-Efficient Transfer Learning with Diff Pruning.” http://arxiv.org/abs/2012.07463.

Guo, Fu-Ming, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. 2019. “Reweighted Proximal Pruning for Large-Scale Language Representation.” http://arxiv.org/abs/1909.12486.

Guo, Qipeng, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. “Star-Transformer.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1315–25. http://arxiv.org/abs/1902.09113.

Guo, Yiwen, Anbang Yao, and Yurong Chen. 2016. “Dynamic Network Surgery for Efficient Dnns.” http://arxiv.org/abs/1608.04493.

Guo, Yiwen, Chao Zhang, Changshui Zhang, and Yurong Chen. 2018. “Sparse Dnns with Improved Adversarial Robustness.” In Advances in Neural Information Processing Systems, 242–51.

Gupta, Manish, and Puneet Agrawal. 2020. “Compression of Deep Learning Models for Text: A Survey.” http://arxiv.org/abs/2008.05221.

Gupta, Udit, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, and David Brooks. 2019. “MASR: A Modular Accelerator for Sparse Rnns.” http://arxiv.org/abs/1908.08976.

Hagiwara, Masafumi. 1993. “Removal of Hidden Units and Weights for Back Propagation Networks.” In Proceedings of 1993 International Conference on Neural Networks (Ijcnn-93-Nagoya, Japan), 1:351–54. IEEE.

———. 1994. “A Simple and Effective Method for Removal of Hidden Units and Weights.” Neurocomputing 6 (2): 207–18. https://doi.org/10.1016/0925-2312(94)90055-8.

Han, Hong-Gui, and Jun-Fei Qiao. 2013. “A Structure Optimisation Algorithm for Feedforward Neural Network Construction.” Neurocomputing 99: 347–57.

Han, Song, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, et al. 2017. “ESE: Efficient Speech Recognition Engine with Sparse Lstm on Fpga.” http://arxiv.org/abs/1612.00694.

Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.” http://arxiv.org/abs/1602.01528.

Han, Song, Huizi Mao, and William J. Dally. 2016. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” http://arxiv.org/abs/1510.00149.

Han, Song, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, et al. 2017. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks.” http://arxiv.org/abs/1607.04381.

Han, Song, Jeff Pool, John Tran, and William Dally. 2015. “Learning Both Weights and Connections for Efficient Neural Network.” In Advances in Neural Information Processing Systems, edited by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, 28:1135–43. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf.

Hansen, Lars Kai, and others. 1994. “Controlled Growth of Cascade Correlation Nets.” In International Conference on Artificial Neural Networks, 797–800. Springer.

Hanson, Stephen, and Lorien Pratt. 1989. “Comparing Biases for Minimal Network Construction with Back-Propagation.” In Advances in Neural Information Processing Systems, edited by D. Touretzky, 1:177–85. Morgan-Kaufmann. https://proceedings.neurips.cc/paper/1988/file/1c9ac0159c94d8d0cbedc973445af2da-Paper.pdf.

Hassibi, Babak, and David G. Stork. 1992. “Second Order Derivatives for Network Pruning: Optimal Brain Surgeon.” In Advances in Neural Information Processing Systems 5, [Nips Conference], 164–71. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Hawkins, J. 2017. “Special Report : Can We Copy the Brain? - What Intelligent Machines Need to Learn from the Neocortex.” IEEE Spectrum 54 (6): 34–71. https://doi.org/10.1109/MSPEC.2017.7934229.

Hayou, Soufiane, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2020. “Pruning Untrained Neural Networks: Principles and Analysis.” http://arxiv.org/abs/2002.08797.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.” http://arxiv.org/abs/1502.01852.

He, K., G. Gkioxari, P. Dollár, and R. Girshick. 2017. “Mask R-Cnn.” In 2017 Ieee International Conference on Computer Vision (Iccv), 2980–8. https://doi.org/10.1109/ICCV.2017.322.

He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition.” In IEEE Conference on Computer Vision and Pattern Recognition (Cvpr), 770–78.

He, Yang, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. “Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration.” http://arxiv.org/abs/1811.00250.

He, Yihui, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2019. “AMC: AutoML for Model Compression and Acceleration on Mobile Devices.” http://arxiv.org/abs/1802.03494.

He, Yihui, Xiangyu Zhang, and Jian Sun. 2017. “Channel Pruning for Accelerating Very Deep Neural Networks.” http://arxiv.org/abs/1707.06168.

Hebb, Donald O. 1949. The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.

Hegde, Kartik, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. “ExTensor: An Accelerator for Sparse Tensor Algebra.” In Proceedings of the 52nd Annual Ieee/Acm International Symposium on Microarchitecture, 319–33. MICRO ’52. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3352460.3358275.

Hendrycks, Dan, and Thomas Dietterich. 2019. “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations.” In Proceedings of the Seventh International Conference on Learning Representations. http://arxiv.org/abs/1903.12261.

Hendrycks, Dan, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2019. “Natural Adversarial Examples.” http://arxiv.org/abs/1907.07174.

Herculano-Houzel, Suzana, Bruno Mota, Peiyan Wong, and Jon H. Kaas. 2010. “Connectivity-Driven White Matter Scaling and Folding in Primate Cerebral Cortex.” Proceedings of the National Academy of Sciences 107 (44): 19008–13. https://doi.org/10.1073/pnas.1012590107.

Hill, P., A. Jain, M. Hill, B. Zamirai, C. Hsu, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars. 2017. “DeftNN: Addressing Bottlenecks for Dnn Execution on Gpus via Synapse Vector Elimination and Near-Compute Data Fission.” In 2017 50th Annual Ieee/Acm International Symposium on Microarchitecture (Micro), 786–99.

Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” http://arxiv.org/abs/1207.0580.

Hinton, Geoffrey E, and Drew Van Camp. 1993. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In Proceedings of the Sixth Annual Conference on Computational Learning Theory, 5–13.

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” http://arxiv.org/abs/1503.02531.

Hoefler, Torsten, and Roberto Belli. 2015. “Scientific Benchmarking of Parallel Computing Systems.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 73:1–73:12. Austin, TX, USA: ACM.

Hooker, Sara, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. 2019. “What Do Compressed Deep Neural Networks Forget?” http://arxiv.org/abs/1911.05248.

Hooker, Sara, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. 2020. “Characterising Bias in Compressed Models.” http://arxiv.org/abs/2010.03058.

Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” http://arxiv.org/abs/1704.04861.

Hoyer, Patrik O. 2004. “Non-Negative Matrix Factorization with Sparseness Constraints.” Journal of Machine Learning Research 5 (Nov): 1457–69.

Hu, Hengyuan, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. “Network Trimming: A Data-Driven Neuron Pruning Approach Towards Efficient Deep Architectures.” http://arxiv.org/abs/1607.03250.

Hu, Yuwei, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020. “FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’20. Atlanta, Georgia: IEEE Press.

Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In Computer Vision – Eccv 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Cham: Springer International Publishing.

Huang, Zehao, and Naiyan Wang. 2018. “Data-Driven Sparse Structure Selection for Deep Neural Networks.” http://arxiv.org/abs/1707.01213.

Huang, Ziyue, Yilei Wang, Ke Yi, and others. 2019. “Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation.” In Advances in Neural Information Processing Systems, 6371–81.

Hubara, Itay, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. “Binarized Neural Networks.” In Proceedings of the 30th International Conference on Neural Information Processing Systems, 4114–22. NIPS’16. Red Hook, NY, USA: Curran Associates Inc.

Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. “SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size.” http://arxiv.org/abs/1602.07360.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” http://arxiv.org/abs/1502.03167.

Ivanov, Andrei, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. “Data Movement Is All You Need: A Case Study on Optimizing Transformers.” http://arxiv.org/abs/2007.00072.

Ivkin, Nikita, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, and others. 2019. “Communication-Efficient Distributed SGD with Sketching.” In Advances in Neural Information Processing Systems, 13144–54. http://arxiv.org/abs/1903.04488.

Jacobs, Robert A, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. “Adaptive Mixtures of Local Experts.” Neural Computation 3 (1): 79–87.

Niehues, Jan, Roldano Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Elizabeth Salesky, Ramon Sanabria, Loïc Barrault, Lucia Specia, and Marcello Federico. 2019. “The Iwslt 2019 Evaluation Campaign.” In 16th International Workshop on Spoken Language Translation 2019.

Janowsky, Steven A. 1989. “Pruning Versus Clipping in Neural Networks.” Physical Review A 39 (12): 6600.

Jayakumar, Siddhant, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. 2020. “Top-Kast: Top-K Always Sparse Training.” Advances in Neural Information Processing Systems 33.

Jiang, Peng, and Gagan Agrawal. 2018. “A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication.” In Advances in Neural Information Processing Systems, 2525–36.

Jin, Sian, Sheng Di, Xin Liang, Jiannan Tian, Dingwen Tao, and Franck Cappello. 2019. “DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression.” In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 159–70. HPDC ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3307681.3326608.

Jin, Xiaojie, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. 2016. “Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods.” http://arxiv.org/abs/1607.05423.

Jones, Sari, Lars Nyberg, Johan Sandblom, Anna Stigsdotter Neely, Martin Ingvar, Karl Magnus Petersson, and Lars Bäckman. 2006. “Cognitive and Neural Plasticity in Aging: General and Task-Specific Limitations.” Neuroscience & Biobehavioral Reviews 30 (6): 864–71.

Jordan, Michael I, and Robert A Jacobs. 1994. “Hierarchical Mixtures of Experts and the Em Algorithm.” Neural Computation 6 (2): 181–214.

Jorge, Pau de, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, and Puneet K. Dokania. 2020. “Progressive Skeletonization: Trimming More Fat from a Network at Initialization.” http://arxiv.org/abs/2006.09081.

Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis.” http://arxiv.org/abs/1802.08435.

Kameyama, K., and Y. Kosugi. 1991. “Automatic Fusion and Splitting of Artificial Neural Elements in Optimizing the Network Size.” In Conference Proceedings 1991 Ieee International Conference on Systems, Man, and Cybernetics, 1633–8 vol.3. https://doi.org/10.1109/ICSMC.1991.169926.

Kang, Minsoo, and Bohyung Han. 2020. “Operation-Aware Soft Channel Pruning Using Differentiable Masks.” http://arxiv.org/abs/2007.03938.

Kanjilal, P. P., P. K. Dey, and D. N. Banerjee. 1993. “Reduced-Size Neural Networks Through Singular Value Decomposition and Subset Selection.” Electronics Letters 29 (17): 1516–8. https://doi.org/10.1049/el:19931010.

Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” http://arxiv.org/abs/2001.08361.

Karimireddy, Sai Praneeth, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. 2019. “Error Feedback Fixes SignSGD and Other Gradient Compression Schemes.” In Proceedings of the Thirty-Sixth International Conference on Machine Learning, 3252–61. http://arxiv.org/abs/1901.09847.

Karnin, E. D. 1990. “A Simple Procedure for Pruning Back-Propagation Trained Neural Networks.” IEEE Transactions on Neural Networks 1 (2): 239–42. https://doi.org/10.1109/72.80236.

Kerr, Jason N. D., David Greenberg, and Fritjof Helmchen. 2005. “Imaging Input and Output of Neocortical Networks in Vivo.” Proceedings of the National Academy of Sciences 102 (39): 14063–8. https://doi.org/10.1073/pnas.0506029102.

Kim, D., J. Ahn, and S. Yoo. 2018. “ZeNA: Zero-Aware Neural Network Accelerator.” IEEE Design Test 35 (1): 39–46. https://doi.org/10.1109/MDAT.2017.2741463.

Kingma, Diederik P, Tim Salimans, and Max Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In Advances in Neural Information Processing Systems, edited by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, 28:2575–83. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/bc7316929fe1545bf0b98d114ee3ecb8-Paper.pdf.

Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” http://arxiv.org/abs/1312.6114.

Kodryan, Maxim, Artem Grachev, Dmitry Ignatov, and Dmitry Vetrov. 2019. “Efficient Language Modeling with Automatic Relevance Determination in Recurrent Neural Networks.” In Proceedings of the 4th Workshop on Representation Learning for Nlp (Repl4nlp-2019), 40–48.

Konečnỳ, Jakub, and Peter Richtárik. 2018. “Randomized Distributed Mean Estimation: Accuracy Vs. Communication.” Frontiers in Applied Mathematics and Statistics 4: 62. http://arxiv.org/abs/1611.07555.

Krogh, Anders, and John A. Hertz. 1991. “A Simple Weight Decay Can Improve Generalization.” In Proceedings of the 4th International Conference on Neural Information Processing Systems, 950–57. NIPS’91. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Krueger, David, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. 2017. “Zoneout: Regularizing Rnns by Randomly Preserving Hidden Activations.” International Conference on Learning Representations (ICLR).

Kung, H. T., Bradley McDanel, and Sai Qian Zhang. 2018. “Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization.” http://arxiv.org/abs/1811.04770.

Kunstner, Frederik, Philipp Hennig, and Lukas Balles. 2019. “Limitations of the Empirical Fisher Approximation for Natural Gradient Descent.” In Advances in Neural Information Processing Systems, 4156–67.

Kurtz, Mark, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. 2020. “Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks.” In International Conference on Machine Learning, 5533–43. PMLR.

Kusupati, Aditya, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. 2020. “Soft Threshold Weight Reparameterization for Learnable Sparsity.” http://arxiv.org/abs/2002.03231.

Kuzmin, Andrey, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, and Max Welling. 2019. “Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks.” http://arxiv.org/abs/1912.09802.

Kwiatkowski, Tom, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, et al. 2019. “Natural Questions: A Benchmark for Question Answering Research.” Transactions of the Association for Computational Linguistics 7: 453–66.

Lample, Guillaume, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. “Large Memory Layers with Product Keys.” http://arxiv.org/abs/1907.05242.

Larsson, Gustav, Michael Maire, and Gregory Shakhnarovich. 2017. “FractalNet: Ultra-Deep Neural Networks Without Residuals.” International Conference on Learning Representations (ICLR).

Lauret, Philippe, Eric Fock, and Thierry Alex Mara. 2006. “A Node Pruning Algorithm Based on a Fourier Amplitude Sensitivity Test Method.” IEEE Transactions on Neural Networks 17 (2): 273–93.

Lavin, A., and S. Gray. 2016. “Fast Algorithms for Convolutional Neural Networks.” In 2016 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), 4013–21. https://doi.org/10.1109/CVPR.2016.435.

Lebedev, Vadim, and Victor Lempitsky. 2015. “Fast Convnets Using Group-Wise Brain Damage.” http://arxiv.org/abs/1506.02515.

Le Cun, Yann, John S. Denker, and Sara A. Solla. 1990. “Optimal Brain Damage.” In Advances in Neural Information Processing Systems 2, 598–605. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Lee, Namhoon, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. 2020. “A Signal Propagation Perspective for Pruning Neural Networks at Initialization.” http://arxiv.org/abs/1906.06307.

Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2019. “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity.” http://arxiv.org/abs/1810.02340.

Lee, Namhoon, Thalaiyasingam Ajanthan, Philip H. S. Torr, and Martin Jaggi. 2020. “Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training.” http://arxiv.org/abs/2003.11316.

Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” http://arxiv.org/abs/2006.16668.

Li, Hao, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. “Pruning Filters for Efficient Convnets.” http://arxiv.org/abs/1608.08710.

Li, J., S. Jiang, S. Gong, J. Wu, J. Yan, G. Yan, and X. Li. 2019. “SqueezeFlow: A Sparse Cnn Accelerator Exploiting Concise Convolution Rules.” IEEE Transactions on Computers 68 (11): 1663–77. https://doi.org/10.1109/TC.2019.2924215.

Li, Xiaoya, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, and Jiwei Li. 2020. “SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection.” http://arxiv.org/abs/2003.09833.

Li, Yuanzhi, Colin Wei, and Tengyu Ma. 2020. “Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks.” http://arxiv.org/abs/1907.04595.

Li, Yunqiang, Silvia Laura Pintea, and Jan van Gemert. 2021. “Less Bits Is More: How Pruning Deep Binary Networks Increases Weight Capacity.” https://openreview.net/forum?id=Hy8JM_Fvt5N.

Li, Zhuohan, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. 2020. “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.” http://arxiv.org/abs/2002.11794.

Liebenwein, Lucas, Cenk Baykal, Harry Lang, Dan Feldman, and Daniela Rus. 2020. “Provable Filter Pruning for Efficient Neural Networks.” http://arxiv.org/abs/1911.07412.

Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2019. “Continuous Control with Deep Reinforcement Learning.” http://arxiv.org/abs/1509.02971.

Lillicrap, Timothy P, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. 2020. “Backpropagation and the Brain.” Nature Reviews Neuroscience, 1–12.

Lim, Hyeontaek, David Andersen, and Michael Kaminsky. 2019. “3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning.” In Proceedings of the Conference on Systems and Machine Learning. http://arxiv.org/abs/1802.07389.

Lin, Ji, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. “Runtime Neural Pruning.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 30:2181–91. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/a51fb975227d6640e4fe47854476d133-Paper.pdf.

Lin, Tao, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020. “Dynamic Model Pruning with Feedback.” http://arxiv.org/abs/2006.07253.

Lin, Yujun, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2018. “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.” In Proceedings of the Sixth International Conference on Learning Representations. http://arxiv.org/abs/1712.01887.

Lin, Zi, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. 2020. “Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior.” In Findings of the Association for Computational Linguistics: EMNLP 2020, 719–30. http://arxiv.org/abs/2010.01791.

Lison, Pierre, Jörg Tiedemann, Milen Kouylekov, and others. 2019. “OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora.” In LREC 2018, Eleventh International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA).

Liu, Baoyuan, Min Wang, H. Foroosh, M. Tappen, and M. Penksy. 2015. “Sparse Convolutional Neural Networks.” In 2015 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), 806–14. https://doi.org/10.1109/CVPR.2015.7298681.

Liu, Lanlan, and Jia Deng. 2018. “Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-Offs by Selective Execution.” http://arxiv.org/abs/1701.00299.

Liu, Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. 2019. “Dynamic Sparse Graph for Efficient Deep Learning.” http://arxiv.org/abs/1810.00859.

Liu, Tianlin, and Friedemann Zenke. 2020. “Finding Trainable Sparse Networks Through Neural Tangent Transfer.” http://arxiv.org/abs/2006.08228.

Liu, Xingyu, Jeff Pool, Song Han, and William J. Dally. 2018. “Efficient Sparse-Winograd Convolutional Neural Networks.” International Conference on Learning Representations (ICLR).

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” http://arxiv.org/abs/1907.11692.

Liu, Zhuang, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. “Learning Efficient Convolutional Networks Through Network Slimming.” http://arxiv.org/abs/1708.06519.

Liu, Zhuang, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2019. “Rethinking the Value of Network Pruning.” http://arxiv.org/abs/1810.05270.

Liu, Ziwei, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. “Deep Learning Face Attributes in the Wild.” In Proceedings of the Ieee International Conference on Computer Vision, 3730–8. http://arxiv.org/abs/1411.7766.

Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2018. “Bayesian Sparsification of Gated Recurrent Neural Networks.” http://arxiv.org/abs/1812.05692.

Loshchilov, Ilya, and Frank Hutter. 2019. “Decoupled Weight Decay Regularization.” In Proceedings of the Seventh International Conference on Learning Representations. http://arxiv.org/abs/1711.05101.

Louizos, Christos, Karen Ullrich, and Max Welling. 2017. “Bayesian Compression for Deep Learning.” http://arxiv.org/abs/1705.08665.

Louizos, Christos, Max Welling, and Diederik P. Kingma. 2018. “Learning Sparse Neural Networks Through _L_0 Regularization.” http://arxiv.org/abs/1712.01312.

Luo, Jian-Hao, and Jianxin Wu. 2019. “AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference.” http://arxiv.org/abs/1805.08941.

Luo, Jian-Hao, Jianxin Wu, and Weiyao Lin. 2017. “ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression.” http://arxiv.org/abs/1707.06342.

Ly, Alexander, Maarten Marsman, Josine Verhagen, Raoul Grasman, and Eric-Jan Wagenmakers. 2017. “A Tutorial on Fisher Information.” http://arxiv.org/abs/1705.01064.

Lym, Sangkug, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. “PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November. https://doi.org/10.1145/3295500.3356156.

Madaan, Divyam, Jinwoo Shin, and Sung Ju Hwang. 2020. “Adversarial Neural Pruning with Latent Vulnerability Suppression.” http://arxiv.org/abs/1908.04355.

Maddison, Chris J., Andriy Mnih, and Yee Whye Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.” International Conference on Learning Representations (ICLR).

Makhzani, Alireza, and Brendan Frey. 2015. “Winner-Take-All Autoencoders.” http://arxiv.org/abs/1409.2752.

Malach, Eran, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. 2020. “Proving the Lottery Ticket Hypothesis: Pruning Is All You Need.” http://arxiv.org/abs/2002.00585.

Malaviya, Chaitanya, Pedro Ferreira, and André FT Martins. 2018. “Sparse and Constrained Attention for Neural Machine Translation.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). http://arxiv.org/abs/1805.08241.

Mallya, Arun, and Svetlana Lazebnik. 2018. “PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning.” http://arxiv.org/abs/1711.05769.

Manessi, Franco, Alessandro Rozza, Simone Bianco, Paolo Napoletano, and Raimondo Schettini. 2018. “Automated Pruning for Deep Neural Network Compression.” 2018 24th International Conference on Pattern Recognition (ICPR), August. https://doi.org/10.1109/icpr.2018.8546129.

Mao, Huizi, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. 2017. “Exploring the Regularity of Sparse Structure in Convolutional Neural Networks.” http://arxiv.org/abs/1705.08922.

Mariet, Zelda, and Suvrit Sra. 2017. “Diversity Networks: Neural Network Compression Using Determinantal Point Processes.” http://arxiv.org/abs/1511.05077.

Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” http://arxiv.org/abs/1503.05671.

Martins, Andre, and Ramon Astudillo. 2016. “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification.” In International Conference on Machine Learning, 1614–23. http://arxiv.org/abs/1602.02068.

Mattson, Peter, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, et al. 2020. “MLPerf Training Benchmark.” http://arxiv.org/abs/1910.01500.

McCandlish, Sam, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. “An Empirical Model of Large-Batch Training.” http://arxiv.org/abs/1812.06162.

McCarley, J. S., Rishav Chakravarti, and Avirup Sil. 2020. “Structured Pruning of a BERT-Based Question Answering Model.” http://arxiv.org/abs/1910.06360.

Mehta, Rahul. 2019. “Sparse Transfer Learning via Winning Lottery Tickets.” http://arxiv.org/abs/1905.07785.

Meng, Fanxu, Hao Cheng, Ke Li, Huixiang Luo, Xiaowei Guo, Guangming Lu, and Xing Sun. 2020. “Pruning Filter in Filter.” http://arxiv.org/abs/2009.14410.

Mhaskar, Hrushikesh, and Tomaso Poggio. 2016. “Deep Vs. Shallow Networks : An Approximation Theory Perspective.” http://arxiv.org/abs/1608.03287.

Michel, Paul, Omer Levy, and Graham Neubig. 2019. “Are Sixteen Heads Really Better Than One?” http://arxiv.org/abs/1905.10650.

Millidge, Beren, Alexander Tschantz, and Christopher L. Buckley. 2020. “Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs.” http://arxiv.org/abs/2006.04182.

Mishra, Asit K., Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. 2017. “WRPN: Wide Reduced-Precision Networks.” CoRR abs/1709.01134. http://arxiv.org/abs/1709.01134.

Mittal, Deepak, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. 2018. “Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks.” http://arxiv.org/abs/1801.10447.

Miyato, Takeru, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. “Spectral Normalization for Generative Adversarial Networks.” In Proceedings of the Sixth International Conference on Learning Representations. http://arxiv.org/abs/1802.05957.

Mocanu, Decebal Constantin, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. “Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science.” Nature Communications 9 (1): 1–12.

Molchanov, Dmitry, Arseniy Ashuha, and Dmitry Vetrov. 2016. “Dropout-Based Automatic Relevance Determination.” In Bayesian Deep Learning Workshop, NIPS.

Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” http://arxiv.org/abs/1701.05369.

Molchanov, Pavlo, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. “Importance Estimation for Neural Network Pruning.” http://arxiv.org/abs/1906.10771.

Molchanov, Pavlo, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. “Pruning Convolutional Neural Networks for Resource Efficient Inference.” http://arxiv.org/abs/1611.06440.

Moody, John E. 1991. “Note on Generalization, Regularization and Architecture Selection in Nonlinear Learning Systems.” In Neural Networks for Signal Processing Proceedings of the 1991 Ieee Workshop, 1–10. IEEE.

Morcos, Ari S., Haonan Yu, Michela Paganini, and Yuandong Tian. 2019. “One Ticket to Win Them All: Generalizing Lottery Ticket Initializations Across Datasets and Optimizers.” http://arxiv.org/abs/1906.02773.

Mostafa, Hesham, and Xin Wang. 2019. “Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization.” http://arxiv.org/abs/1902.05967.

Mozer, Michael C, and Paul Smolensky. 1988. “Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment.” Advances in Neural Information Processing Systems 1: 107–15.

Mrázová, I., and Z. Reitermanová. 2011. “A New Sensitivity-Based Pruning Technique for Feed-Forward Neural Networks That Improves Generalization.” In The 2011 International Joint Conference on Neural Networks, 2143–50. https://doi.org/10.1109/IJCNN.2011.6033493.

Mukherjee, Sayan, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. 2006. “Learning Theory: Stability Is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization.” Advances in Computational Mathematics 25 (1-3): 161–93.

Mussay, Ben, Daniel Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. 2020. “Data-Independent Structured Pruning of Neural Networks via Coresets.” http://arxiv.org/abs/2008.08316.

Narang, Sharan, Erich Elsen, Gregory Diamos, and Shubho Sengupta. 2017. “Exploring Sparsity in Recurrent Neural Networks.” http://arxiv.org/abs/1704.05119.

Narasimha, Pramod L., Walter H. Delashmit, Michael T. Manry, Jiang Li, and Francisco Maldonado. 2008. “An Integrated Growing-Pruning Method for Feedforward Network Training.” Neurocomputing 71 (13): 2831–47. https://doi.org/10.1016/j.neucom.2007.08.026.

Neklyudov, Kirill, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Structured Bayesian Pruning via Log-Normal Multiplicative Noise.” http://arxiv.org/abs/1705.07283.

Neyshabur, Behnam. 2020. “Towards Learning Convolutions from Scratch.” http://arxiv.org/abs/2007.13657.

Neyshabur, Behnam, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. 2018. “Towards Understanding the Role of over-Parametrization in Generalization of Neural Networks.” http://arxiv.org/abs/1805.12076.

Ngiam, J., Z. Chen, D. Chia, P. W. Koh, Q. V. Le, and A. Y. Ng. 2010. “Tiled Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 23, 1279–87.

Niculae, Vlad, and Mathieu Blondel. 2017. “A Regularized Framework for Sparse and Structured Neural Attention.” In Advances in Neural Information Processing Systems, 3338–48. http://arxiv.org/abs/1705.07704.

Nilsson, Nils J. 2009. The Quest for Artificial Intelligence: A History of Ideas and Achievements. Cambridge University Press.

Niu, Yue, Rajgopal Kannan, Ajitesh Srivastava, and Viktor Prasanna. 2020. “Reuse Kernels or Activations? A Flexible Dataflow for Low-Latency Spectral Cnn Acceleration.” In Proceedings of the 2020 Acm/Sigda International Symposium on Field-Programmable Gate Arrays, 266–76. FPGA ’20. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3373087.3375302.

Niu, Yue, Hanqing Zeng, Ajitesh Srivastava, Kartik Lakhotia, Rajgopal Kannan, Yanzhi Wang, and Viktor Prasanna. 2019. “SPEC2: SPECtral Sparse Cnn Accelerator on Fpgas.” http://arxiv.org/abs/1910.11103.

Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. 2015. “Learning Deconvolution Network for Semantic Segmentation.” http://arxiv.org/abs/1505.04366.

Nowlan, Steven J, and Geoffrey E Hinton. 1992. “Simplifying Neural Networks by Soft Weight-Sharing.” Neural Computation 4 (4): 473–93.

Nvidia. 2020. “NVIDIA A100 Tensor Core Gpu Architecture.”

Olshausen, Bruno A, and David J Field. 1996. “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images.” Nature 381 (6583): 607–9.

Orseau, Laurent, Marcus Hutter, and Omar Rivasplata. 2020. “Logarithmic Pruning Is All You Need.” http://arxiv.org/abs/2006.12156.

Osawa, Kazuki, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. 2019. “Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June. https://doi.org/10.1109/cvpr.2019.01264.

Pan, Wei, Hao Dong, and Yike Guo. 2016. “DropNeuron: Simplifying the Structure of Deep Neural Networks.” http://arxiv.org/abs/1606.07326.

Parashar, Angshuman, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. “SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks.” http://arxiv.org/abs/1708.04485.

Park, Jongsoo, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2017. “Faster Cnns with Direct Sparse Convolutions and Guided Pruning.” http://arxiv.org/abs/1608.01409.

Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. “Image Transformer.” In International Conference on Machine Learning, 4055–64. http://arxiv.org/abs/1802.05751.

Pedersen, Morten, Lars Hansen, and Jan Larsen. 1996. “Pruning with Generalization Based Weight Saliencies: Lambda Obd, Lambda Obs.” In Advances in Neural Information Processing Systems, edited by D. Touretzky, M. C. Mozer, and M. Hasselmo, 8:521–27. MIT Press. https://proceedings.neurips.cc/paper/1995/file/3473decccb0509fb264818a7512a8b9b-Paper.pdf.

Pensia, Ankit, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. 2020. “Optimal Lottery Tickets via Subsetsum: Logarithmic over-Parameterization Is Sufficient.” http://arxiv.org/abs/2006.07990.

Plummer, Bryan A., Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. 2020. “Shapeshifter Networks: Cross-Layer Parameter Sharing for Scalable and Effective Deep Learning.” http://arxiv.org/abs/2006.10598.

Polyak, A., and L. Wolf. 2015. “Channel-Level Acceleration of Deep Face Representations.” IEEE Access 3: 2163–75. https://doi.org/10.1109/ACCESS.2015.2494536.

Pooch, Udo W., and Al Nieder. 1973. “A Survey of Indexing Techniques for Sparse Matrices.” ACM Comput. Surv. 5 (2): 109–33. https://doi.org/10.1145/356616.356618.

Prabhu, Ameya, Girish Varma, and Anoop Namboodiri. 2018. “Deep Expander Networks: Efficient Deep Networks from Graph Theory.” http://arxiv.org/abs/1711.08757.

Prasanna, Sai, Anna Rogers, and Anna Rumshisky. 2020. “When BERT Plays the Lottery, All Tickets Are Winning.” http://arxiv.org/abs/2005.00561.

Prechelt, Lutz. 1997. “Connection Pruning with Static and Adaptive Pruning Schedules.” Neurocomputing 16 (1): 49–61. https://doi.org/10.1016/S0925-2312(96)00054-9.

Qin, E., A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna. 2020. “SIGMA: A Sparse and Irregular Gemm Accelerator with Flexible Interconnects for Dnn Training.” In 2020 Ieee International Symposium on High Performance Computer Architecture (Hpca), 58–70. https://doi.org/10.1109/HPCA47549.2020.00015.

Raihan, Md Aamir, and Tor M. Aamodt. 2020. “Sparse Weight Activation Training.” http://arxiv.org/abs/2001.01969.

Rakin, Adnan Siraj, Zhezhi He, Li Yang, Yanzhi Wang, Liqiang Wang, and Deliang Fan. 2020. “Robust Sparse Regularization: Defending Adversarial Attacks via Regularized Sparse Network.” In Proceedings of the 2020 on Great Lakes Symposium on Vlsi, 125–30. GLSVLSI ’20. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3386263.3407651.

Ramanujan, Vivek, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. “What’s Hidden in a Randomly Weighted Neural Network?” http://arxiv.org/abs/1911.13299.

Rasmussen, Carl Edward, and Zoubin Ghahramani. 2001. “Occam’s Razor.” In Advances in Neural Information Processing Systems, 294–300.

Reagen, B., P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Wei, and D. Brooks. 2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In 2016 Acm/Ieee 43rd Annual International Symposium on Computer Architecture (Isca), 267–78. https://doi.org/10.1109/ISCA.2016.32.

Reed, R. 1993. “Pruning Algorithms-a Survey.” IEEE Transactions on Neural Networks 4 (5): 740–47. https://doi.org/10.1109/72.248452.

Renda, Alex, Jonathan Frankle, and Michael Carbin. 2020. “Comparing Rewinding and Fine-Tuning in Neural Network Pruning.” http://arxiv.org/abs/2003.02389.

Renggli, Cédric, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, and Torsten Hoefler. 2019. “SparCML: High-Performance Sparse Communication for Machine Learning.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–15. http://arxiv.org/abs/1802.08021.

Reuther, Albert, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. 2020. “Survey of Machine Learning Accelerators.” http://arxiv.org/abs/2009.00993.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. “Stochastic Backpropagation and Variational Inference in Deep Latent Gaussian Models.” In International Conference on Machine Learning. Vol. 2.

Rhu, Minsoo, Mike O’Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. “Compressing Dma Engine: Leveraging Activation Sparsity for Training Deep Neural Networks.” In 2018 Ieee International Symposium on High Performance Computer Architecture (Hpca), 78–91. IEEE.

Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2021. “A Primer in BERTology: What We Know About How BERT Works.” Transactions of the Association for Computational Linguistics 8: 842–66. http://arxiv.org/abs/2002.12327.

Rosenbaum, Clemens, Tim Klinger, and Matthew Riemer. 2017. “Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning.” http://arxiv.org/abs/1711.01239.

Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Prentice Hall Press.

Sainath, T. N., B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. 2013. “Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets.” In 2013 Ieee International Conference on Acoustics, Speech and Signal Processing, 6655–9. https://doi.org/10.1109/ICASSP.2013.6638949.

Sanh, Victor, Thomas Wolf, and Alexander M. Rush. 2020. “Movement Pruning: Adaptive Sparsity by Fine-Tuning.” http://arxiv.org/abs/2005.07683.

Savarese, Pedro, Hugo Silva, and Michael Maire. 2020. “Winning the Lottery with Continuous Sparsification.” http://arxiv.org/abs/1912.04427.

Scardapane, Simone, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2017. “Group Sparse Regularization for Deep Neural Networks.” Neurocomputing 241: 81–89. https://doi.org/10.1016/j.neucom.2017.02.029.

Scheffler, Paul, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, and Luca Benini. 2020. “Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra.” http://arxiv.org/abs/2011.08070.

See, Abigail, Minh-Thang Luong, and Christopher D. Manning. 2016. “Compression of Neural Machine Translation Models via Pruning.” http://arxiv.org/abs/1606.09274.

Sehwag, Vikash, Shiqi Wang, Prateek Mittal, and Suman Jana. 2020. “HYDRA: Pruning Adversarially Robust Neural Networks.” http://arxiv.org/abs/2002.10509.

Seide, Frank, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. “1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs.” In Fifteenth Annual Conference of the International Speech Communication Association.

Sharma, Aditya, Nikolas Wolfe, and Bhiksha Raj. 2017. “The Incredible Shrinking Neural Network: New Perspectives on Learning Representations Through the Lens of Pruning.” http://arxiv.org/abs/1701.04465.

Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” http://arxiv.org/abs/1701.06538.

Shi, Shaohuai, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, and Xiaowen Chu. 2019. “A Distributed Synchronous SGD Algorithm with Global Top-K Sparsification for Low Bandwidth Networks.” In 2019 Ieee 39th International Conference on Distributed Computing Systems Workshop on Networks, 2238–47. http://arxiv.org/abs/1901.04359.

Shi, Shaohuai, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, and Xiaowen Chu. 2019. “A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification.” In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 3411–7.

Shokri, Reza, and Vitaly Shmatikov. 2015. “Privacy-Preserving Deep Learning.” In Proceedings of the 22nd Acm Sigsac Conference on Computer and Communications Security, 1310–21.

Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” http://arxiv.org/abs/1703.00810.

Sietsma, Jocelyn, and Robert JF Dow. 1991. “Creating Artificial Neural Networks That Generalize.” Neural Networks 4 (1): 67–79.

Sietsma, Jocelyn, and Robert JF Dow. 1988. “Neural Net Pruning-Why and How.” In IEEE 1988 International Conference on Neural Networks, 1:325–33. https://doi.org/10.1109/ICNN.1988.23864.

Sifre, Laurent, and Stéphane Mallat. 2014. “Rigid-Motion Scattering for Image Classification.” PhD thesis, Ecole Polytechnique, CMAP.

Singh, Sidak Pal, and Dan Alistarh. 2020. “WoodFisher: Efficient Second-Order Approximation for Neural Network Compression.” http://arxiv.org/abs/2004.14340.

Sinha, Samarth, Zhengli Zhao, Anirudh Goyal, Colin A Raffel, and Augustus Odena. 2020. “Top-K Training of GANs: Improving GAN Performance by Throwing Away Bad Samples.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/2002.06224.

Smith, Samuel L., Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2018. “Don’t Decay the Learning Rate, Increase the Batch Size.” http://arxiv.org/abs/1711.00489.

Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. “Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank.” In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–42.

Srinivas, Suraj, and R. Venkatesh Babu. 2015. “Data-Free Parameter Pruning for Deep Neural Networks.” http://arxiv.org/abs/1507.06149.

———. 2016. “Learning Neural Network Architectures Using Backpropagation.” http://arxiv.org/abs/1511.05497.

Srinivas, Suraj, Akshayvarun Subramanya, and R. Venkatesh Babu. 2016. “Training Sparse Neural Networks.” http://arxiv.org/abs/1611.06694.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (56): 1929–58.

Stich, Sebastian U, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. “Sparsified SGD with Memory.” In Advances in Neural Information Processing Systems, 4447–58. http://arxiv.org/abs/1809.07599.

Strom, Nikko. 2015. “Scalable Distributed DNN Training Using Commodity GPU Cloud Computing.” In Sixteenth Annual Conference of the International Speech Communication Association.

Ström, Nikko. 1997. “Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks.” In Fifth European Conference on Speech Communication and Technology.

Su, Jingtong, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D. Lee. 2020. “Sanity-Checking Pruning Methods: Random Tickets Can Win the Jackpot.” http://arxiv.org/abs/2009.11094.

Suau, Xavier, Luca Zappella, and Nicholas Apostoloff. 2019. “Filter Distillation for Network Compression.” http://arxiv.org/abs/1807.10585.

Sun, Haobo, Yingxia Shao, Jiawei Jiang, Bin Cui, Kai Lei, Yu Xu, and Jiang Wang. 2019. “Sparse Gradient Compression for Distributed SGD.” In International Conference on Database Systems for Advanced Applications, 139–55. Springer.

Sun, Xu, Xuancheng Ren, Shuming Ma, and Houfeng Wang. 2017. “meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting.” In Proceedings of the Thirty-Fourth International Conference on Machine Learning. http://arxiv.org/abs/1706.06197.

Sun, Yi, Xiaogang Wang, and Xiaoou Tang. 2015. “Sparsifying Neural Network Connections for Face Recognition.” http://arxiv.org/abs/1512.01891.

Suresh, Ananda Theertha, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. 2017. “Distributed Mean Estimation with Limited Communication.” In International Conference on Machine Learning, 3329–37. http://arxiv.org/abs/1611.00429.

Suzuki, Kenji, Isao Horiba, and Noboru Sugie. 2001. “A Simple Neural Network Pruning Algorithm with Application to Filter Synthesis.” Neural Processing Letters 13 (1): 43–53.

Sze, V., Y. Chen, T. Yang, and J. S. Emer. 2017. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey.” Proceedings of the IEEE 105 (12): 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. “Going Deeper with Convolutions.” In Computer Vision and Pattern Recognition (Cvpr). http://arxiv.org/abs/1409.4842.

Szegedy, C., V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. “Rethinking the Inception Architecture for Computer Vision.” In 2016 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), 2818–26. Los Alamitos, CA, USA: IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.308.

Tamura, S., M. Tateishi, M. Matumoto, and S. Akita. 1993. “Determination of the Number of Redundant Hidden Units in a Three-Layered Feedforward Neural Network.” In Proceedings of 1993 International Conference on Neural Networks (Ijcnn-93-Nagoya, Japan), 1:335–38. https://doi.org/10.1109/IJCNN.1993.713925.

Tan, Chong Min John, and Mehul Motani. 2020. “DropNet: Reducing Neural Network Complexity via Iterative Pruning.” In Proceedings of the 37th International Conference on Machine Learning, edited by Hal Daumé III and Aarti Singh, 119:9356–66. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v119/tan20a.html.

Tan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. “MnasNet: Platform-Aware Neural Architecture Search for Mobile.” http://arxiv.org/abs/1807.11626.

Tan, Mingxing, and Quoc V. Le. 2020. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” http://arxiv.org/abs/1905.11946.

Tanaka, Hidenori, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. 2020. “Pruning Neural Networks Without Any Data by Iteratively Conserving Synaptic Flow.” http://arxiv.org/abs/2006.05467.

Tang, Hanlin, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. 2019. “DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression.” In Proceedings of the Thirty-Sixth International Conference on Machine Learning, 6155–65. http://arxiv.org/abs/1905.05957.

Tang, Yehui, Yunhe Wang, Yixing Xu, Dacheng Tao, Chunjing Xu, Chao Xu, and Chang Xu. 2021. “SCOP: Scientific Control for Reliable Neural Network Pruning.” http://arxiv.org/abs/2010.10732.

Tang, Zhenheng, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. 2020. “Communication-Efficient Distributed Deep Learning: A Comprehensive Survey.” http://arxiv.org/abs/2003.06307.

Tartaglione, Enzo, Skjalg Lepsøy, Attilio Fiandrotti, and Gianluca Francini. 2018. “Learning Sparse Neural Networks via Sensitivity-Driven Regularization.” http://arxiv.org/abs/1810.11764.

Tay, Yi, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. “Long Range Arena: A Benchmark for Efficient Transformers.” In Proceedings of the Ninth International Conference on Learning Representations. http://arxiv.org/abs/2011.04006.

Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. “Efficient Transformers: A Survey.” http://arxiv.org/abs/2009.06732.

Tenney, Ian, Dipanjan Das, and Ellie Pavlick. 2019. “BERT Rediscovers the Classical NLP Pipeline.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4593–4601. http://arxiv.org/abs/1905.05950.

Theis, Lucas, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. 2018. “Faster Gaze Prediction with Dense Networks and Fisher Pruning.” http://arxiv.org/abs/1801.05787.

Thimm, Georg, and Emile Fiesler. 1995. “Evaluating Pruning Methods.” In Proceedings of the 1995 International Symposium on Artificial Neural Networks (ISANN ’95). National Chiao-Tung University, Hsinchu, Taiwan.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.

Tipping, Michael E. 2001. “Sparse Bayesian Learning and the Relevance Vector Machine.” Journal of Machine Learning Research 1 (Jun): 211–44.

Tompson, Jonathan, Ross Goroshin, Arjun Jain, Yann LeCun, and Christopher Bregler. 2015. “Efficient Object Localization Using Convolutional Networks.” http://arxiv.org/abs/1411.4280.

Tsuzuku, Yusuke, Hiroto Imachi, and Takuya Akiba. 2018. “Variance-Based Gradient Compression for Efficient Distributed Deep Learning.” In Proceedings of the Sixth International Conference on Learning Representations, Workshop Track. http://arxiv.org/abs/1802.06058.

Ullrich, Karen, Edward Meeds, and Max Welling. 2017. “Soft Weight-Sharing for Neural Network Compression.” http://arxiv.org/abs/1702.04008.

Unat, Didem, Anshu Dubey, Torsten Hoefler, John Shalf, Mark Abraham, Mauro Bianco, Bradford L. Chamberlain, et al. 2017. “Trends in Data Locality Abstractions for HPC Systems.” IEEE Transactions on Parallel and Distributed Systems (TPDS) 28 (10).

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” http://arxiv.org/abs/1706.03762.

Verdenius, Stijn, Maarten Stol, and Patrick Forré. 2020. “Pruning via Iterative Ranking of Sensitivity Statistics.” http://arxiv.org/abs/2006.00896.

Voita, Elena, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.” http://arxiv.org/abs/1905.09418.

Wan, Li, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. 2013. “Regularization of Neural Networks Using Dropconnect.” In Proceedings of the 30th International Conference on Machine Learning, edited by Sanjoy Dasgupta and David McAllester, 28:1058–66. Proceedings of Machine Learning Research 3. Atlanta, Georgia, USA: PMLR. http://proceedings.mlr.press/v28/wan13.html.

Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” In Proceedings of the Seventh International Conference on Learning Representations. http://arxiv.org/abs/1804.07461.

Wang, Chaoqi, Roger Grosse, Sanja Fidler, and Guodong Zhang. 2019. “EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis.” http://arxiv.org/abs/1905.05934.

Wang, Hongyi, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. 2018. “ATOMO: Communication-Efficient Learning via Atomic Sparsification.” In Advances in Neural Information Processing Systems, 9850–61. http://arxiv.org/abs/1806.04090.

Wang, Linnan, Wei Wu, Junyu Zhang, Hang Liu, George Bosilca, Maurice Herlihy, and Rodrigo Fonseca. 2020. “FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks.” In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 113–24.

Wang, Ziheng, Jeremy Wohlwend, and Tao Lei. 2020. “Structured Pruning of Large Language Models.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (Emnlp), 6151–62. http://arxiv.org/abs/1910.04732.

Wangni, Jianqiao, Jialei Wang, Ji Liu, and Tong Zhang. 2018. “Gradient Sparsification for Communication-Efficient Distributed Optimization.” In Advances in Neural Information Processing Systems, 1299–1309. http://arxiv.org/abs/1710.09854.

Warstadt, Alex, Amanpreet Singh, and Samuel R Bowman. 2019. “Neural Network Acceptability Judgments.” Transactions of the Association for Computational Linguistics 7: 625–41. http://arxiv.org/abs/1805.12471.

Wei, Bingzhen, Xu Sun, Xuancheng Ren, and Jingjing Xu. 2017. “Minimal Effort Back Propagation for Convolutional Neural Networks.” http://arxiv.org/abs/1709.05804.

Wen, Wei, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. “Learning Structured Sparsity in Deep Neural Networks.” http://arxiv.org/abs/1608.03665.

White, David, and Panos A. Ligomenides. 1993. “GANNet: A Genetic Algorithm for Optimizing Topology and Weights in Neural Network Design.” In Proceedings of the International Workshop on Artificial Neural Networks: New Trends in Neural Computation, 322–27. IWANN ’93. Berlin, Heidelberg: Springer-Verlag.

Whitley, D., and C. Bogart. 1990. “The Evolution of Connectivity: Pruning Neural Networks Using Genetic Algorithms.” In Proceedings of the International Joint Conference on Neural Networks (Washington, DC), 134–37. IEEE Press.

Williams, Adina, Nikita Nangia, and Samuel R Bowman. 2018. “A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. http://arxiv.org/abs/1704.05426.

Williams, P. M. 1995. “Bayesian Regularization and Pruning Using a Laplace Prior.” Neural Computation 7 (1): 117–43. https://doi.org/10.1162/neco.1995.7.1.117.

Wortsman, Mitchell, Ali Farhadi, and Mohammad Rastegari. 2019. “Discovering Neural Wirings.” http://arxiv.org/abs/1906.00586.

Wortsman, Mitchell, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. 2020. “Supermasks in Superposition.” http://arxiv.org/abs/2006.14769.

Wu, Yuhuai, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. 2017. “Second-Order Optimization for Deep Reinforcement Learning Using Kronecker-Factored Approximation.” In Advances in Neural Information Processing Systems, 5285–94. http://papers.nips.cc/paper/7112-second-order-optimization-for-deep-reinforcement-learning-using-kronecker-factored-approximation.

Xiao, Xia, Zigeng Wang, and Sanguthevar Rajasekaran. 2019. “AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters.” In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett, 32:13681–91. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/4efc9e02abdab6b6166251918570a307-Paper.pdf.

Xu, Jinhua, and Daniel WC Ho. 2006. “A New Training and Pruning Algorithm Based on Node Dependence and Jacobian Rank Deficiency.” Neurocomputing 70 (1-3): 544–58.

Yang, Dingqing, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, and Mieszko Lis. 2020. “Procrustes: A Dataflow and Accelerator for Sparse Deep Neural Network Training.” http://arxiv.org/abs/2009.10976.

Yang, Huanrui, Wei Wen, and Hai Li. 2020. “DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures.” http://arxiv.org/abs/1908.09979.

Yang, Tien-Ju, Yu-Hsin Chen, and Vivienne Sze. 2017. “Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning.” http://arxiv.org/abs/1611.05128.

Ye, Jianbo, Xin Lu, Zhe Lin, and James Z Wang. 2018. “Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers.” http://arxiv.org/abs/1802.00124.

Ye, Mao, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, and Qiang Liu. 2020. “Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection.” http://arxiv.org/abs/2003.01794.

Yin, Penghang, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. 2019. “Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets.” http://arxiv.org/abs/1903.05662.

You, Haoran, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2020. “Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks.” http://arxiv.org/abs/1909.11957.

You, Zhonghui, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. 2019. “Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks.” http://arxiv.org/abs/1909.08174.

Yu, D., F. Seide, G. Li, and L. Deng. 2012. “Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition.” In 2012 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 4409–12. https://doi.org/10.1109/ICASSP.2012.6288897.

Yu, Jiecao, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. “Scalpel: Customizing Dnn Pruning to the Underlying Hardware Parallelism.” ACM SIGARCH Computer Architecture News 45 (2): 548–60.

Yu, Ruichi, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. 2018. “NISP: Pruning Networks Using Neuron Importance Score Propagation.” http://arxiv.org/abs/1711.05908.

Yu, Xin, Zhiding Yu, and Srikumar Ramalingam. 2018. “Learning Strict Identity Mappings in Deep Residual Networks.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 4432–40. http://arxiv.org/abs/1804.01661.

Yuan, Ming, and Yi Lin. 2006. “Model Selection and Estimation in Regression with Grouped Variables.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1): 49–67.

Yun, Chulhee, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. “O(n) Connections Are Expressive Enough: Universal Approximability of Sparse Transformers.” In Advances in Neural Information Processing Systems.

Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. 2020. “Big Bird: Transformers for Longer Sequences.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/2007.14062.

Zeng, Wenyuan, and Raquel Urtasun. 2019. “MLPrune: Multi-Layer Pruning for Automated Neural Network Compression.” https://openreview.net/forum?id=r1g5b2RcKm.

Zeng, Xiaoqin, and Daniel S Yeung. 2006. “Hidden Neuron Pruning of Multilayer Perceptrons Using a Quantified Sensitivity Measure.” Neurocomputing 69 (7-9): 825–37.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” http://arxiv.org/abs/1611.03530.

Zhang, Jeff (Jun), Parul Raj, Shuayb Zarar, Amol Ambardekar, and Siddharth Garg. 2019. “CompAct: On-Chip Compression of Activations for Low Power Systolic Array Based Cnn Acceleration.” ACM Trans. Embed. Comput. Syst. 18 (5s). https://doi.org/10.1145/3358178.

Zhang, Jiaqi, Xiangru Chen, Mingcong Song, and Tao Li. 2019. “Eager Pruning: Algorithm and Architecture Support for Fast Training of Deep Neural Networks.” In Proceedings of the 46th International Symposium on Computer Architecture, 292–303. ISCA ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3307650.3322263.

Zhang, Jie-Fang, Ching-En Lee, C. Liu, Y. Shao, Stephen W. Keckler, and Zhengya Zhang. 2019. “SNAP: A 1.67–21.55 TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS.” 2019 Symposium on VLSI Circuits, C306–C307.

Zhang, S., Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. “Cambricon-X: An Accelerator for Sparse Neural Networks.” In 2016 49th Annual Ieee/Acm International Symposium on Microarchitecture (Micro), 1–12. https://doi.org/10.1109/MICRO.2016.7783723.

Zhang, Xiangyu, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. 2015. “Efficient and Accurate Approximations of Nonlinear Convolutional Networks.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 1984–92.

Zhang, Zhekai, Hanrui Wang, Song Han, and William J. Dally. 2020. “SpArch: Efficient Architecture for Sparse Matrix Multiplication.” http://arxiv.org/abs/2002.08947.

Zhao, Guangxiang, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. 2019. “Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection.” http://arxiv.org/abs/1912.11637.

Zhao, Qibin, Masashi Sugiyama, and Andrzej Cichocki. 2017. “Learning Efficient Tensor Representations with Ring Structure Networks.” http://arxiv.org/abs/1705.08286.

Zhou, Guian, and Jennie Si. 1999. “Subset-Based Training and Pruning of Sigmoid Neural Networks.” Neural Networks 12 (1): 79–89.

Zhou, Hao, Jose M Alvarez, and Fatih Porikli. 2016. “Less Is More: Towards Compact Cnns.” In European Conference on Computer Vision, 662–77. Springer.

Zhou, Hattie, Janice Lan, Rosanne Liu, and Jason Yosinski. 2020. “Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask.” http://arxiv.org/abs/1905.01067.

Zhou, X., Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. 2018. “Cambricon-S: Addressing Irregularity in Sparse Neural Networks Through a Cooperative Software/Hardware Approach.” In 2018 51st Annual Ieee/Acm International Symposium on Microarchitecture (Micro), 15–28. https://doi.org/10.1109/MICRO.2018.00011.

Zhou, Xiao, Weizhong Zhang, Zonghao Chen, Shizhe Diao, and Tong Zhang. 2021. “Efficient Neural Network Training via Forward and Backward Propagation Sparsification.” Advances in Neural Information Processing Systems.

Zhou, Xiao, Weizhong Zhang, Hang Xu, and Tong Zhang. 2021. “Effective Sparsification of Neural Networks with Global Sparsity Constraint.” In Proceedings of the Ieee/Cvf Conference on Computer Vision and Pattern Recognition, 3599–3608.

Zhu, Jingyang, Jingbo Jiang, Xizi Chen, and Chi-Ying Tsui. 2017. “SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity.” http://arxiv.org/abs/1711.01263.

Zhu, Jingyang, Zhiliang Qian, and Chi-Ying Tsui. 2016. “LRADNN: High-Throughput and Energy-Efficient Deep Neural Network Accelerator Using Low Rank Approximation.” In 2016 21st Asia and South Pacific Design Automation Conference (Asp-Dac), 581–86. https://doi.org/10.1109/ASPDAC.2016.7428074.

Zhu, Michael, and Suyog Gupta. 2017. “To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression.” http://arxiv.org/abs/1710.01878.

Zhuang, Tao, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. 2020. “Neuron-Level Structured Pruning Using Polarization Regularizer.” Advances in Neural Information Processing Systems 33.

Zhuang, Zhuangwei, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. 2019. “Discrimination-Aware Channel Pruning for Deep Neural Networks.” http://arxiv.org/abs/1810.11809.