Pruning & Sparse NN: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. Conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method that reduces the storage and computation required by neural networks by an order of magnitude by learning only the important connections. This reduces the number of parameters of AlexNet by 9× and that of VGGNet by 13×, without affecting their accuracy.
S. Han, J. Pool, J. Tran, W. J. Dally, “Learning both Weights and Connections for Efficient Neural Networks”, NIPS’15. |
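The idea of keeping only the important connections can be illustrated with a minimal sketch. It assumes that importance is judged by weight magnitude and that pruned connections are held at zero during retraining through a binary mask; the function names `prune_by_magnitude` and `masked_update`, the toy weights, and the 50% sparsity are hypothetical choices for illustration, not taken from the paper.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a flat list of weights.

    Returns (pruned_weights, mask), where mask[i] is 1 for a kept connection.
    Assumption: magnitude is used as the importance criterion.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices in order of increasing magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    removed = set(order[:n_prune])
    mask = [0 if i in removed else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def masked_update(weights, grads, mask, lr):
    """One gradient step that leaves pruned connections at zero."""
    return [(w - lr * g) if m else 0.0 for w, g, m in zip(weights, grads, mask)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.03]
pruned, mask = prune_by_magnitude(weights, sparsity=0.5)
reduction = len(weights) / sum(mask)  # parameter reduction factor
print(pruned, mask, reduction)
```

In this toy case half of the connections survive, so the parameter count drops by 2×; the reductions quoted above come from pruning and retraining real networks, which this sketch does not reproduce.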
Deep Compression: Large deep neural network models improve prediction accuracy but place a large demand on memory access, which is 100× more power hungry than ALU operations. “Deep Compression” introduces a three-stage pipeline of pruning, trained quantization, and Huffman coding, which work together to reduce the storage requirement of deep neural networks. On the ImageNet dataset, AlexNet was compressed by 35×, from 240MB to 6.9MB, and VGGNet by 49×, from 552MB to 11.3MB, without affecting their accuracy. This algorithm helps bring deep learning into mobile apps.
S. Han, H. Mao, W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, ICLR’16. Best Paper Award. |
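The last two stages of the pipeline described above, quantization with shared weights and Huffman coding, can be sketched on a toy list of surviving (already pruned) weights. This is only an illustration under stated assumptions: the codebook is picked by hand rather than trained, the storage of sparse indices is ignored, and the names `quantize` and `huffman_code_lengths` are hypothetical.

```python
import heapq
from collections import Counter


def quantize(weights, codebook):
    """Replace each weight by the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(w - codebook[k]))
            for w in weights]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie breaker, {symbol: depth in the tree}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


nonzero_weights = [0.9, 0.4, -0.7, 0.3, 0.35, 0.42, 0.38, 0.88]
codebook = [-0.7, 0.4, 0.9]  # hand-picked shared values (assumption)
indices = quantize(nonzero_weights, codebook)
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[i] for i in indices)
float_bits = 32 * len(nonzero_weights)
print(indices, lengths, huffman_bits, float_bits)
```

Because one shared value dominates, the Huffman code spends a single bit on it, and the eight indices take 11 bits instead of the 256 bits needed for 32-bit floats (the small codebook itself is stored once and is not counted here).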
Efficient Inference Engine (EIE): Executing DNNs on inexpensive, low-power embedded platforms requires executing compressed, sparse DNNs. EIE is the first hardware accelerator for these highly efficient networks. EIE exploits weight sparsity and weight sharing, and can skip zero activations from ReLU. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster, and 24,000× and 3,000× more energy efficient, than a CPU and a GPU respectively. EIE distributes both storage and computation to parallelize a sparsified layer across multiple processing elements (PEs), which achieves load balance and good scalability. EIE has been covered by TheNextPlatform, HackerNews, TechEmergence and Embedded Vision.
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA’16. |
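The three properties EIE exploits (sparse weights, shared weights, and zero activations) can be shown in software with a small matrix-vector product. This is a sketch of the computation only, not of the hardware: the column-wise storage as `(row, code)` pairs, the function name `sparse_matvec`, and the toy values are assumptions for illustration.

```python
def sparse_matvec(columns, codebook, activations, n_rows):
    """Compute y = W @ a for a sparse W with shared weights.

    columns[j] lists the nonzeros of column j as (row, code) pairs, where
    code indexes into codebook. Returns (y, number of multiplications).
    """
    y = [0.0] * n_rows
    multiplies = 0
    for col, a in enumerate(activations):
        if a == 0.0:
            continue  # skip zero activations, e.g. those produced by ReLU
        for row, code in columns[col]:
            y[row] += codebook[code] * a  # look up the shared weight value
            multiplies += 1
    return y, multiplies


codebook = [-0.5, 0.25, 1.0]
# Dense equivalent of W (3 rows x 4 columns):
#   [[1.0,  0.0,  0.0, 0.25],
#    [0.0, -0.5,  0.0, 0.0 ],
#    [0.0,  0.25, 0.0, 1.0 ]]
columns = [[(0, 2)], [(1, 0), (2, 1)], [], [(0, 1), (2, 2)]]
activations = [2.0, 0.0, 3.0, 4.0]
y, multiplies = sparse_matvec(columns, codebook, activations, n_rows=3)
print(y, multiplies)
```

Only 3 multiplications are performed here, compared with 12 for the dense 3×4 product: column 1 is skipped because its activation is zero, and column 2 holds no nonzero weights.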
Efficient Speech Recognition Engine (ESE): ESE takes the approach of EIE one step further to address not only feedforward neural networks but also recurrent neural networks (RNN and LSTM). The recurrent nature produces complicated data dependencies, which are more challenging to handle than those of feedforward neural nets. To deal with this problem, we designed a data flow that can effectively schedule the complex LSTM operations using multiple EIE cores. ESE also presents an effective model compression algorithm for LSTM that takes hardware efficiency into account and compresses the LSTM by 20× without hurting accuracy. Implemented on a Xilinx XCKU060 FPGA running at 200MHz, ESE has a processing power of 282 GOPS working directly on a compressed sparse LSTM network, corresponding to 2.52 TOPS on an uncompressed dense network.
S. Han, J. Kang, H. Mao, Y. Li, D. Xie, H. Luo, Y. Wang, H. Yang, W. J. Dally, “ESE: Efficient Speech Recognition Engine for Compressed LSTM”, FPGA’17. Best Paper Award. |
Dense-Sparse-Dense Training (DSD): A critical issue in training large neural networks is preventing overfitting while still providing enough model capacity. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks to achieve higher accuracy. DSD training can improve the prediction accuracy of a wide range of neural networks (CNNs, RNNs and LSTMs) on the tasks of image classification, caption generation and speech recognition. The DSD training flow produces the same model architecture and doesn’t incur any inference-time overhead.
S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, W. J. Dally, “DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow”, ICLR’17. |
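The three phases of the dense-sparse-dense flow can be sketched on a toy linear model. The sketch assumes that the sparse phase prunes by weight magnitude and that the final dense phase restores the pruned connections starting from zero; the helper names `train` and `dsd_train`, the data, and the hyperparameters are hypothetical. It shows the shape of the flow only and makes no claim about accuracy gains.

```python
def train(w, data, mask, lr=0.1, steps=200):
    """Full-batch gradient descent on squared error; masked weights stay 0."""
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(data)
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w


def dsd_train(w, data, sparsity):
    dense_mask = [1] * len(w)
    # Phase 1 (dense): ordinary training of all connections.
    w = train(w, data, dense_mask)
    # Phase 2 (sparse): prune small weights, then train under the mask.
    n_prune = int(len(w) * sparsity)
    order = sorted(range(len(w)), key=lambda i: abs(w[i]))
    removed = set(order[:n_prune])
    mask = [0 if i in removed else 1 for i in range(len(w))]
    w_sparse = train([wi * mi for wi, mi in zip(w, mask)], data, mask)
    # Phase 3 (dense): restore pruned connections from zero and retrain.
    w_final = train(w_sparse, data, dense_mask)
    return mask, w_sparse, w_final


# Toy data: each input activates one weight, so the targets are the ideal weights.
data = [([1.0, 0.0, 0.0, 0.0], 2.0), ([0.0, 1.0, 0.0, 0.0], 0.0),
        ([0.0, 0.0, 1.0, 0.0], -1.0), ([0.0, 0.0, 0.0, 1.0], 0.1)]
mask, w_sparse, w_final = dsd_train([0.0] * 4, data, sparsity=0.5)
print(mask, w_sparse, w_final)
```

The final weight vector has the same length as the initial one, which mirrors the point made above: the flow changes how the model is trained, not its architecture or its inference cost.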
Trained Ternary Quantization (TTQ): Deploying large neural network models can be difficult on mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values. This method causes very little accuracy degradation and can even improve the accuracy of some models. We highlight our trained quantization method, which learns both the ternary values and the ternary assignment. During inference, our models are nearly 16× smaller than full-precision models.
C. Zhu, S. Han, H. Mao, W. J. Dally, “Trained Ternary Quantization”, ICLR’17. |
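A minimal sketch of what a ternary layer looks like at inference time follows. It assumes that each weight is assigned to one of three values by comparing it with a threshold proportional to the largest weight magnitude, and that the positive and negative values `w_pos` and `w_neg` have already been learned; the function name `ternarize`, the threshold factor `t`, and all numbers are hypothetical.

```python
def ternarize(weights, w_pos, w_neg, t=0.05):
    """Map full-precision weights to the three values {-w_neg, 0, +w_pos}.

    Assumption: the assignment uses a threshold of t * max|w|; w_pos and
    w_neg stand in for scaling values that would be learned during training.
    """
    delta = t * max(abs(w) for w in weights)
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    values = [w_pos if c == 1 else -w_neg if c == -1 else 0.0 for c in codes]
    return codes, values


weights = [0.8, -0.02, 0.3, -0.6, 0.01, -0.9]
codes, values = ternarize(weights, w_pos=0.5, w_neg=0.75)

# Three states fit in 2 bits, versus 32 bits for a full-precision float.
size_ratio = 32 / 2
print(codes, values, size_ratio)
```

The 2-bit encoding is where the "nearly 16×" figure above comes from: 32 bits per weight shrink to 2, with a small overhead for storing the two scaling values per layer.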
SqueezeNet: Smaller CNN models are easier to deploy on mobile devices. SqueezeNet is a small CNN architecture that achieves AlexNet-level accuracy on ImageNet with 50× fewer parameters. Together with model compression techniques, we are able to compress SqueezeNet to less than 0.5MB (510× smaller than AlexNet), which fits entirely in on-chip SRAM, making it easier to deploy on embedded devices.
F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer, “SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and < 0.5MB Model Size”, arXiv’16. |
Pruning Winograd Convolution: Winograd’s minimal filtering algorithm and network pruning both reduce the operations in CNNs. Unfortunately, these two methods cannot be directly combined, because applying the Winograd transform fills in the sparsity in both the weights and the activations. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we prune the weights in the “Winograd domain” to exploit static weight sparsity. Second, we move the ReLU operation into the “Winograd domain” to improve the sparsity of the transformed activations. On CIFAR-10, our method reduces the number of multiplications in the VGG-nagadomi model by 10.2× with no loss of accuracy.
X. Liu, J. Pool, S. Han, W. J. Dally, “Efficient Sparse-Winograd Convolutional Neural Networks”, ICLR’18. |
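Why pruning has to happen in the Winograd domain can be seen in the smallest one-dimensional case, F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. The sketch below is an illustration under that assumption (the networks above use 2D convolutions); the function names are hypothetical.

```python
def transform_filter(g):
    """Winograd F(2,3) filter transform: 3 spatial taps -> 4 Winograd weights."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def winograd_f23(d, u):
    """Two outputs of a 3-tap filter over the 4 inputs d, given weights u.

    Multiplications by zero Winograd weights are skipped and not counted.
    """
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    m = [ui * vi if ui != 0.0 else 0.0 for ui, vi in zip(u, v)]
    multiplies = sum(1 for ui in u if ui != 0.0)
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]], multiplies


g = [1.0, 0.0, 0.0]      # spatially pruned filter: 2 of 3 taps are zero
u = transform_filter(g)  # the transform fills in most of that sparsity
d = [1.0, 2.0, 3.0, 4.0]
y, multiplies = winograd_f23(d, u)
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
print(u, y, direct, multiplies)
```

The filter has two zero taps out of three, yet its Winograd-domain form has only one zero out of four, so spatial pruning saves little once the transform is applied. Pruning the four Winograd weights directly is what turns sparsity into skipped multiplications.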
Deep Gradient Compression: Large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure. In this paper, we find that 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. DGC achieves a gradient compression ratio from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and that of DeepSpeech from 488MB to 0.74MB. Deep Gradient Compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet.
Y. Lin, S. Han, H. Mao, Y. Wang, W. J. Dally, “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, ICLR’18. |
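One way to cut gradient traffic of this kind is to send only the largest gradient entries each step and keep the rest in a local buffer until they grow large enough to send. The sketch below illustrates that idea under stated assumptions: selection is by magnitude with a fixed keep ratio, and the name `sparsify_gradient` and the toy values are hypothetical. It is a simplified illustration, not the full DGC algorithm.

```python
def sparsify_gradient(grad, residual, keep_ratio):
    """Select the largest accumulated gradient entries for transmission.

    Returns (sent, new_residual): sent maps index -> value to communicate,
    and new_residual holds the entries kept locally for later steps.
    """
    acc = [r + g for r, g in zip(residual, grad)]  # local accumulation
    k = max(1, int(len(acc) * keep_ratio))
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sent = {i: acc[i] for i in top}
    new_residual = [0.0 if i in sent else a for i, a in enumerate(acc)]
    return sent, new_residual


residual = [0.0, 0.0, 0.0, 0.0]
sent1, residual = sparsify_gradient([0.5, -0.125, 0.25, 2.0], residual, 0.25)
sent2, residual = sparsify_gradient([0.5, -0.125, 0.25, 0.0], residual, 0.25)
print(sent1, sent2, residual)
```

In the second step the entry at index 0 is sent with the value 1.0, the sum of both steps, so small gradients are delayed rather than dropped. With a keep ratio of 0.001 only 0.1% of the entries would be exchanged per step, which corresponds to the 99.9% redundancy figure above.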