Song Han

Associate Professor, MIT EECS

Efficient AI Computing

from TinyML to LargeML


Song Han is an associate professor at MIT EECS. He received his PhD degree from Stanford University. He proposed the “Deep Compression” technique including pruning and quantization that is widely used for efficient AI computing, and the “Efficient Inference Engine” that first brought weight sparsity to modern AI chips, which influenced NVIDIA’s Ampere GPU Architecture with Sparse Tensor Core. He pioneered the TinyML research that brings deep learning to IoT devices, enabling learning on the edge (appeared on MIT home page). His team’s work on hardware-aware neural architecture search (once-for-all network) enables users to design, optimize, shrink and deploy AI models to resource-constrained hardware devices, receiving the first place in many low-power computer vision contests in flagship AI conferences. Song received best paper awards at ICLR and FPGA, and faculty awards from Amazon, Facebook, NVIDIA, Samsung and SONY. Song was named to the “35 Innovators Under 35” list by MIT Technology Review for his contribution to the “deep compression” technique that “lets powerful artificial intelligence (AI) programs run more efficiently on low-power mobile devices.” Song received the NSF CAREER Award for “efficient algorithms and hardware for accelerated machine learning”, the IEEE “AIs 10 to Watch: The Future of AI” award, and the Sloan Research Fellowship.

Song’s cutting-edge research in efficient AI computing has profoundly influenced the industry. He was the cofounder of DeePhi (now part of AMD), and cofounder of OmniML (now part of NVIDIA).

Group website:
TinyML project:
LLM compression:

EfficientML course:

Twitter, Github, LinkedIn, Google Scholar, YouTube


  • an open course to teach efficient AI computing.
  • Lecture website:
  • Lecture videos, slides and homework are open to the public.
  • Streamed via every Tues/Thur 3:35-5pm EST.
  • Five hands-on labs: pruning, quantization, neural architecture search, LLM compression, LLM deployment

Research Interests

Industry impact: our efficient ML research has influenced and landed in many industry products, thanks to the close collaboration with our sponsors: Intel OpenVINO, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA FasterTransformer, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Microsoft NNI, SONY Neural Architecture Search Library, SONY Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool, Ford Trailer Backup Assist.

Open source projects with over 1K Github stars:
TSM: Temporal Shift Module for Efficient Video Understanding
Once for All: Train One Network and Specialize it for Efficient Deployment
BEVFusion: Multi-task Multi-sensor Fusion with Bird Eye View Representation
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
Data-Efficient GANs with DiffAugment
GAN Compression


  • Sloan Research Fellowship, 2023
  • Red Dot Award, 2022
  • Samsung Global Research Outreach (GRO) Award, 2021
  • NVIDIA Academic Partnership Award, 2021, 2020
  • IEEE “AIs 10 to Watch: The Future of AI” Award, 2020
  • NSF CAREER Award, 2020
  • MIT Technology Review list of 35 Innovators Under 35, 2019
  • SONY Faculty Award, 2017/2018/2020
  • Amazon Machine Learning Research Award, 2018/2019
  • Facebook Research Award, 2019
  • Best paper award, FPGA’2017
  • Best paper award, ICLR’2016

Competition Awards


We thank the generous sponsors of our research: ADI, Amazon, AMD, Apple, ARM, Cognex, Facebook, Ford, Google, Hyundai, IBM, Intel, Microsoft, MIT AI Hardware Program, MIT Microsystems Technology Lab, MIT-IBM Watson AI Lab, National Science Foundation, NVIDIA, Qualcomm, Samsung, Semiconductor Research Corporation, SONY, TI. 



  • July 2023: Congrats to Zhijian Liu and Hanrui Wang on being selected as Rising Stars in Machine Learning and Systems.
  • July 2023: We present SmoothQuant at ICML for LLM quantization, compression and acceleration. 
  • July 2023: We released TinyChat, a cutting-edge inference library for LLM quantization (W4A16) designed for minimal resource consumption and fast LLM inference on mobile GPU platforms. It allows deploying LLMs on edge devices such as NVIDIA Jetson Orin (13 tokens per second for LLaMA-2), empowering users with a real-time conversational experience on edge devices without the internet. The current release supports: LLaMA-2-7B/13B-chat; Vicuna; MPT-chat; Falcon-instruct.
  • July 2023: TorchSparse++: A Unified Framework for Efficient Inference and Training for Sparse Point Cloud on GPUs is accepted by MICRO’23. TorchSparse is a high-performance point cloud training and inference engine for GPUs, significantly accelerating sparse convolution computation, widely used for LiDAR perception in autonomous vehicles. Unlike dense 2D computation, point cloud convolution involves sparse and irregular computation patterns, necessitating dedicated inference system support with specialized high-performance kernels.  TorchSparse optimizes the critical bottlenecks of sparse convolution: data movement and irregular computation.
  • July 2023: Tiny Training Engine: Pocket-sized Engine for Efficient On-device Training is accepted by MICRO’23. We present a portable and efficient training engine that enables on-device learning on various edge devices. It is compilation first: the entire training graph (including forward, backward and optimization step) is derived at compile-time, which reduces the runtime overhead and brings opportunities for graph transformations. Secondly, it supports sparse back-propagation: It prunes the backward graph and sparsely updates the model with actual memory saving and latency reduction. It also integrates a rich set of training graph optimizations.
  • July 2023: Best University Demo Award at DAC’23 for “An Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination” in collaboration with Anantha Chandrakasan’s team.
  • June 2023: At the 50th ISCA, EIE was listed among the top-5 most cited papers in 50 years of ISCA, and as the second most cited ISCA paper in the past five years. Here’s our invited retrospective paper.
  • June 2023: We present FlatFormer at CVPR’23. FlatFormer is an efficient point cloud transformer. It trades spatial proximity for better computational regularity, bridging the 3-4x latency gap between point cloud transformers and sparse convolutional models. FlatFormer is the first point cloud transformer that achieves real-time performance on edge GPUs while achieving superior accuracy on large-scale benchmarks.
  • June 2023: we present SparseViT at CVPR’23: accelerating ViT by taking advantage of sparsity. SparseViT explores activation sparsity for window-based vision transformers (ViTs) by skipping computation for less important regions. SparseViT achieves speedups of 1.5X, 1.4X, and 1.3X compared to its dense counterpart in 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
  • June 2023: We present BEVFusion at ICRA’23 and the CVPR workshop on vision-centric and data-driven autonomous driving. BEVFusion facilitates efficient multi-task multi-sensor fusion by unifying camera, LiDAR, and radar features in a shared bird’s-eye view (BEV) space. We addressed an efficiency bottleneck by accelerating the key view transformation operator by 40 times. BEVFusion was the leading solution on three popular 3D perception benchmarks, including nuScenes, Argoverse, and Waymo, across different mapless driving tasks, such as object detection/tracking and map segmentation.
  • June 2023: Song presented “Efficient Deep Learning Computing with Sparsity” as a keynote at CVPR workshop on efficient computer vision.
  • June 2023: Zhijian presented “Efficient 3D Perception for Autonomous Vehicles” as a keynote at CVPR workshop on efficient computer vision.
  • April 2023: SmoothQuant: Accurate and Efficient Post Training Quantization for Large Language Models is accepted by ICML’23. SmoothQuant is a training-free method to enable low precision quantization for LLMs. Quantizing LLMs is quite different from quantizing vision models due to the large dynamic range of activations. Since matmul, X*W=Y, is linear, we can shift information between X and W. As such, we can balance the quantization difficulty across both matrices (intuition: 1×100=10×10). We demonstrate up to 1.56x speedup and 2x memory reduction compared with fp16 with negligible loss in accuracy.
    paper / code
  • March 2023: SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer is accepted by CVPR’23. SparseViT explores activation sparsity for window-based ViTs. The key idea is to skip computation for less important regions, resulting in a 1.5x speedup with no accuracy loss. 
  • March 2023: FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer is accepted by CVPR’23. The key idea is to trade spatial proximity for better computational regularity. FlatFormer bridges the 3-4x latency gap between point cloud transformers and sparse convolutional models.
  • Jan 2023: BEVFusion: Multi-task Multi-sensor Fusion with Unified Bird’s Eye View Representation is accepted by ICRA’23.
    paper / code / website / demo
  • Nov 2022: We released SmoothQuant: Accurate and Efficient Post Training Quantization for Large Language Models over 100 Billion Parameters. paper / code
  • Nov 2022: I gave a guest lecture at UPenn TinyML class about MCUNet: Tiny Deep Learning on IoT Devices. video
  • Sep 2022: On-Device Training under 256KB Memory is accepted by NeurIPS’22. paper / website / demo
  • Sep 2022: Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models is accepted by NeurIPS’22.
  • Sep 2022: I’m opening a new course: TinyML and Efficient Deep Learning:
  • Aug 2022: Congrats to Ji and Ligeng on receiving the Qualcomm Innovation Fellowship for their pioneering work: On-Device Training under 256KB Memory. website
  • Aug 2022:  TorchSparse: Efficient Point Cloud Inference Engine is presented at MLSys’22. The sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently. Point cloud has received increased attention due to its importance in autonomous driving. However, existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. TorchSparse is a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse optimizes for irregular computation and data movement. It applies adaptive matrix multiplication grouping to trade computation for better regularity. paper / code / slides / website
  • April 2022: Network Augmentation for Tiny Deep Learning is presented at ICLR’22. paper / code 
  • March 2022: A journal paper that summarizes our philosophies for mobile deep learning: Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications. We first present popular model compression methods, including pruning, factorization, quantization, as well as light-weight primitives. To reduce the manual design cost, we present the hardware-aware AutoML framework, including neural architecture search (ProxylessNAS, Once-for-all) and automated pruning (AMC) and quantization (HAQ). We then cover efficient on-device training to enable user customization based on the local data on mobile devices (TinyTL). Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video, and natural language processing by exploiting their spatial sparsity and temporal/token redundancy. Finally, to support all these algorithmic advancements, we introduce the efficient deep learning system design from both software and hardware perspectives. paper / code 
  • March 2022: Lite Pose: Efficient Architecture for Human Pose Estimation is accepted by CVPR’22. Lite Pose accelerates human pose estimation by up to 5x on Snapdragon 855, Jetson Nano and Raspberry Pi.
  • Feb 2022: QuantumNAS: Noise-Adaptive Search for Robust Quantum Circuits is presented at HPCA 2022. paper / qmlsys website / code
  • Feb 2022: QuantumNAT: Quantum Noise-Aware Training with Noise Injection, Quantization and Normalization is accepted by DAC’22. Due to the large quantum noises (errors), the performance of quantum AI models has a severe degradation on real quantum devices. We present QuantumNAT, a QNN-specific framework to perform noise-aware optimizations in both training and inference stages to improve robustness. We propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve the robustness against noise, we propose noise injection to the training process by inserting quantum error gates to QNN according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving the denoising effect of quantum noise. paper / website / code
  • Feb 2022: QOC: Quantum On-Chip Training with Parameter Shift and Gradient Pruning is accepted by DAC’22. In order to achieve scalable training of quantum AI models, the training process needs to be offloaded to real quantum machines instead of using exponential-cost classical simulators. One common approach to obtaining QNN gradients is parameter shift, whose cost scales linearly with the number of qubits. We present QOC, the first experimental demonstration of practical on-chip parameterized quantum circuit training with parameter shift. Nevertheless, we find that due to the significant quantum errors (noises) on real machines, gradients obtained from naive parameter shift have low fidelity and thus degrade the training accuracy. To this end, we further propose probabilistic gradient pruning to first identify gradients with potentially large errors and then remove them. Specifically, small gradients have larger relative errors than large ones, thus having a higher probability to be pruned. paper / website / code
  • Jan 2022: Network Augmentation for Tiny Deep Learning is accepted by ICLR’22. Training tiny models is different from training large models: rather than augmenting the data, we should augment the model, since tiny models tend to suffer from limited capacity. To alleviate this issue, NetAug augments the network (reverse dropout) instead of inserting noise into the dataset or the network. It puts the tiny model into larger models and encourages it to work as a sub-model of larger models to get extra supervision, in addition to functioning as an independent model. paper
  • Jan 2022: TorchSparse: Efficient Point Cloud Inference Engine is accepted by MLSys’22. The sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently; existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. We introduce TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: data movement and irregular computation. It optimizes the data orchestration by quantization and fused locality-aware memory access, reducing the memory movement cost by 2.7x. It also adopts adaptive MM grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. paper / code 
  • Dec 2021: MCUNet-v2: Memory Efficient Inference for Tiny Deep Learning is presented at NeurIPS’21. Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. The imbalanced memory distribution of CNNs exacerbates the issue: the first several blocks have an order of magnitude larger memory usage than the rest of the network. We propose patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, a naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond classification, such as detection. paper / website / slides / demo / demo2 / MIT News / TechTalks
  • Dec 2021: Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning is presented at NeurIPS’21. Federated learning suffers from high communication latency. We propose Delayed Gradient Averaging (DGA), which delays the averaging step to allow local computation to run ahead of communication. We theoretically prove that DGA attains a similar convergence rate as FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. DGA is implemented on a 16-node Raspberry Pi cluster. With both IID and non-IID partitions, we show that DGA can bring a 2.55× to 4.07× speedup. paper / website / slides / poster
  • Dec 2021: NAAS: Neural Accelerator Architecture Search is presented at DAC’21. We proposed a novel data-driven approach for AI-designed AI accelerators. Such a data-driven method can find knowledge that is difficult for humans to express explicitly, and it scales efficiently. Because the design spaces of hardware, compiler, and neural networks are tightly entangled, joint optimization is better than separate optimization. Given the huge design space, a data-driven approach is desirable. With the same runtime, machine learning methods can explore more data points than human designers. NAAS proposes a data-driven, automatic design space exploration of neural accelerator architectures that outperforms human design. paper / website / slides / video
  • Oct 2021: QuantumNAS: Noise-Adaptive Search for Robust Quantum Circuits to appear at HPCA’22. Quantum noise is the key challenge in Noisy Intermediate-Scale Quantum (NISQ) computers. We propose QuantumNAS (NAS: Noise-Adaptive Search), a comprehensive framework for noise-adaptive co-search of the variational circuit and qubit mapping. QuantumNAS decouples the circuit search and parameter training by introducing a novel SuperCircuit, followed by evolutionary co-search of SubCircuit and its qubit mapping. Finally, we perform iterative gate pruning and finetuning to remove redundant gates and reduce noise. QuantumNAS is the first to demonstrate over 95% 2-class, 85% 4-class, and 32% 10-class classification accuracy on real QC. It also achieves the lowest eigenvalue for VQE tasks. We open-source TorchQuantum for fast training of parameterized quantum circuits to facilitate future research. paper / qmlsys website / code
  • Oct 2021: PointAcc: Efficient Point Cloud Accelerator is presented at International Symposium on Microarchitecture (MICRO’21). Deep learning on point clouds plays a vital role in a wide range of applications such as autonomous driving. Compared to projecting the point cloud to 2D space, directly processing 3D point cloud yields higher accuracy and lower #MACs. However, the extremely sparse nature of point cloud poses challenges to hardware acceleration. PointAcc proposes a versatile sorting engine to determine the nonzero input-output pairs, streams the sparse computation with reconfigurable caching, and temporally fuses consecutive dense layers to reduce the memory footprint. Co-designed with light-weight neural networks, PointAcc outperforms the prior art with a 100X speedup and 9.1% higher accuracy for semantic segmentation. paper / website / slides / talk / lightning talk
  • July 2021: LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision is accepted by International Conference on Computer Vision (ICCV’21) paper 
  • July 2021: SemAlign: Annotation-Free Camera-LiDAR Calibration with Semantic Alignment Loss is accepted by International Conference on Intelligent Robots and Systems (IROS’21) paper
  • June 2021: Congrats to Zhijian and Yujun on receiving the Qualcomm Innovation Fellowship for the “Algorithm-Hardware Co-Design for Efficient LiDAR-Based Autonomous Driving” project.
  • June 2021: Congrats to Hanrui and Han on receiving the Qualcomm Innovation Fellowship for the “On-Device NLP Inference and Training with Algorithm-Hardware Co-Design” project.
  • April 2021: Once-for-All (OFA) Network set a world record in the open division of MLPerf Inference Benchmark: 1.078M inferences per second on 8 A100 GPUs. paper / website / Github
  • March 2021: HAQ: Hardware-Aware Automated Quantization with Mixed Precision is integrated by Intel OpenVINO Toolkit.  paper
  • Feb 2021: Efficient and Robust LiDAR-Based End-to-End Navigation is accepted by ICRA’21. We introduce Fast-LiDARNet, which is based on sparse GPU kernel optimization and hardware-aware neural architecture search, improving the speed from 5 fps to 47 fps; together with Hybrid Evidential Fusion, which directly estimates the uncertainty and fuses the control predictions, reducing the number of takeovers in road tests. paper
  • Feb 2021: Anycost GANs for Interactive Image Synthesis and Editing is accepted by CVPR’21. GANs are big. GANs are slow. It takes seconds to edit a single image on edge devices, prohibiting an interactive user experience. Anycost GANs can be executed at various computational cost budgets (up to 10× computation reduction) and adapt to a wide range of hardware and latency requirements. When deployed on edge devices, our model achieves 6-12× speedup, enabling interactive image editing on mobile devices. paper / website / video / code
  • Oct 2020: SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning appeared at HPCA’21 and was spotlighted by MIT News. Paper / Slides / Intro Video / Project Page
  • Jan 2021: IOS: Inter-operator Scheduler For CNN Acceleration is accepted by MLSys’21. Existing deep learning frameworks focus on optimizing intra-operator parallelization. However, a single operator cannot fully utilize the available parallelism on a GPU, especially under small batch sizes. We extensively study the parallelism between operators and propose Inter-Operator Scheduler (IOS) to automatically schedule the execution of multiple operators in parallel. paper / code / video / slides / poster
  • Dec 2020: MCUNet: Tiny Deep Learning on IoT Devices is presented at NeurIPS’20 as spotlight presentation. paper / website /  MIT News / Wired / Stacey on IoT / Morning Brew / IBM / Analytics Insight
  • Dec 2020: Tiny Transfer Learning: Reduce Activations, not Trainable Parameters for Efficient On-Device Learning is presented at NeurIPS’20. website / slides / code
  • Dec 2020: Differentiable Augmentation for Data-Efficient GAN Training is presented at NeurIPS’20. code / website / talk VentureBeat blog
  • Aug 2020: OnceForAll team received the first place in the Low-Power Computer Vision Challenge, mobile CPU detection track.
  • Aug 2020: OnceForAll team received the first place in the Low-Power Computer Vision Challenge, FPGA track.
  • July 2020: SPVNAS ranks first on SemanticKITTI.
  • July 2020: Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution is accepted by ECCV’20.
  • June 2020: Once-For-All Network (OFA) for on-device AI is highlighted by Qualcomm
  • June 2020: We open sourced Data-Efficient GAN Training with DiffAugment on Github. Covered by VentureBeat.
  • May 2020: HAT: Hardware-Aware Transformer for Efficient Natural Language Processing to appear at ACL’2020 paper / code / website. This is our second paper on efficient NLP on edge devices, together with Lite Transformer ICLR’20 paper / code / website / slides
  • April 2020: Slides for the ICLR’20 NAS workshop and TinyML webinar “AutoML for TinyML with Once-for-All Network” are available
  • April 2020: Once-For-All Network (OFA) is covered by MIT News and VentureBeat. Reducing the carbon footprint of artificial intelligence: MIT system cuts the energy required for training and running neural networks.
  • Mar 2020: Point-Voxel CNN for Efficient 3D Deep Learning is highlighted by NVIDIA Jetson Community Project Spotlight.
  • Mar 2020: Point-Voxel CNN for Efficient 3D Deep Learning is deployed on MIT Driverless, improving the 3D detection accuracy from 95% to 99.93%, improving the detection range from 8m to 12m, and reducing the latency from 2ms/object to 1.25ms/object. demo
  • Feb 2020: SpArch: Efficient Architecture for Sparse Matrix Multiplication appeared at International Symposium on High-Performance Computer Architecture (HPCA) 2020. Sparse Matrix Multiplication (SpMM) is an important primitive for many applications (graphs, sparse neural networks, etc). SpArch has a spatial merger array to perform parallel merge of the partial sum, and a Huffman Tree scheduler to determine the optimal order to merge the partial sums, reducing the DRAM access. paper / slides / website / 2min talk / full talk
  • Feb 2020: GAN Compression: Learning Efficient Architectures for Conditional GANs and APQ: Joint Search for Network Architecture, Pruning and Quantization Policy are accepted by CVPR’20.
  • Feb 2020: With our efficient model, the Once-for-All Network, our team is awarded the first place in the Low Power Computer Vision Challenge (both classification and detection track).
  • Jan 2020: Song received the NSF CAREER Award for “Efficient Algorithms and Hardware for Accelerated Machine Learning”.
  • Dec 2019: Once-For-All Network (OFA) is accepted by ICLR’2020. Train only once, specialize for many hardware platforms, from CPU/GPU to hardware accelerators. OFA decouples model training from architecture search. OFA consistently outperforms SOTA NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNet-V3) while reducing GPU hours and CO2 emission by orders of magnitude. In particular, OFA achieves a new SOTA 80.0% ImageNet top1 accuracy under the mobile setting (<600M FLOPs). Paper / Code / Poster / MIT News / Qualcomm News / VentureBeat
  • Dec 2019: Lite Transformer with Long Short Term Attention is accepted by ICLR’2020. We investigate the mobile setting for NLP tasks to facilitate the deployment of NLP models on edge devices. [Paper]
  • Nov 2019: AutoML for Architecting Efficient and Specialized Neural Networks to appear at IEEE Micro
  • Oct 2019: TSM is featured by MIT News / Engadget / NVIDIA News / MIT Technology Review
  • Oct 2019: Our team is awarded the first place in the Low Power Computer Vision Challenge, DSP track at ICCV’19 using the Once-for-all Network.
  • Oct 2019: Our winning solution to the Visual Wake Words Challenge is highlighted by Google. The technique is ProxylessNAS. demo / code
  • Oct 2019: Open source: the search code for ProxylessNAS is available on Github.
  • Oct 2019: Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos is accepted by NeurIPS workshop on Systems for ML. TSM, a compact model for video understanding, is hardware-friendly not only for inference but also for training. With TSM, we can scale up Kinetics training to 1536 GPUs and reduce the training time from 2 days to 15 minutes. TSM is highlighted at the opening remarks at AI Research Week hosted by the MIT-IBM Watson AI Lab. paper
  • Oct 2019: Distributed Training across the World is accepted by NeurIPS workshop on Systems for ML.
  • Oct 2019: Neural-Hardware Architecture Search is accepted by NeurIPS workshop on ML for Systems.
  • Sep 2019: Point-Voxel CNN for Efficient 3D Deep Learning is accepted by NeurIPS’19 as spotlight presentation. paper / demo / playlist / talk / slides / code / website
  • Sep 2019: Deep Leakage from Gradients is accepted by NeurIPS’19. paper / poster / code / website
  • July 2019: TSM: Temporal Shift Module for Efficient Video Understanding is accepted by ICCV’19. Video understanding is more computationally intensive than image understanding, making it harder to deploy on edge devices. Frames in the temporal dimension are highly redundant. TSM has the computation complexity of 2D convolution and achieves better temporal modeling ability than 3D convolution. TSM also enables low-latency, real-time video recognition (13ms latency on Jetson Nano and 70ms latency on Raspberry Pi 3). paper / demo / code / poster / industry integration@NVIDIA / MIT News / Engadget / MIT Technology Review / NVIDIA News / NVIDIA Jetson Developer Forum
  • June 2019: HAN Lab is awarded the first place in the Visual Wake-up Word Challenge@CVPR’19. The task is human detection on IoT devices with a tight computation budget: <250KB model size, <250KB peak memory usage, <60M MAC. The techniques are described in the ProxylessNAS paper. code / Raspberry Pi and Pixel 3 demo
  • June 2019: Song is presenting “Design Automation for Efficient Deep Learning by Hardware-aware Neural Architecture Search and Compression” at ICML workshop on On-Device Machine Learning & Compact Deep Neural Network Representations, CVPR workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications, CVPR workshop on Efficient Deep Learning for Computer Vision, UCLA, TI, and Workshop on Approximate Computing Across the Stack. paper / slides
  • June 2019: Open source. AMC: AutoML for Model Compression and Acceleration on Mobile Devices is available on Github. AMC uses reinforcement learning to automatically find the optimal sparsity ratio for channel pruning.
  • June 2019: Open source. HAQ: Hardware-aware Automated Quantization with Mixed Precision is available on Github.
  • May 2019: Song Han received Facebook Research Award.
  • April 2019: Defensive Quantization on MIT News: Improving Security as Artificial Intelligence Moves to Smartphones.
  • April 2019: Our manuscript of Design Automation for Efficient Deep Learning Computing is available on arXiv (accepted by the Micro journal). slides
  • March 2019: ProxylessNAS is covered by MIT News: Kicking Neural Network Design Automation into High Gear and IEEE Spectrum: Using AI to Make Better AI.
  • March 2019: HAQ: Hardware-aware Automated Quantization with Mixed Precision is accepted by CVPR’19 as oral presentation. HAQ leverages reinforcement learning to automatically determine the quantization policy (bit width per layer), and we take the hardware accelerator’s feedback in the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback (both latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. So far, ProxylessNAS [ICLR’19] => AMC [ECCV’18] => HAQ [CVPR’19] forms a pipeline of efficient AutoML.
  • Feb 2019: Song presented “Bandwidth-Efficient Deep Learning with Algorithm and Hardware Co-Design” at ISSCC’19 in the forum “Intelligence at the Edge: How Can We Make Machine Learning More Energy Efficient?”
  • Jan 2019: Song is appointed to the Robert J. Shillman (1974) Career Development Chair.
  • Jan 2019: “Song Han: Democratizing artificial intelligence with deep compression” by MIT Industry Liaison Program. article / video
  • Dec 2018: Congrats to Xiangning on receiving 2nd place in the feedback phase of the NeurIPS’18 AutoML Challenge: AutoML for Lifelong Machine Learning.
  • Dec 2018: Defensive Quantization: When Efficiency Meets Robustness is accepted by ICLR’19. Neural network quantization is becoming an industry standard to compress and efficiently deploy deep learning models. Is model compression a free lunch? No, if not treated carefully. We observe that the conventional quantization approaches are vulnerable to adversarial attacks. This paper aims to raise people’s awareness about the security of the quantized models, and we designed a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models. paper / MIT News
  • Dec 2018: Learning to Design Circuits appeared at NeurIPS workshop on Machine Learning for Systems (full version accepted by DAC’2020). Analog IC design relies on human experts to search for parameters that satisfy circuit specifications with their experience and intuitions, which is highly labor intensive and time consuming. This paper proposes a learning-based approach to size the transistors and help engineers shorten the design cycle. paper
  • Dec 2018: Our work on ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware is accepted by ICLR’19. Neural Architecture Search (NAS) is computation intensive. ProxylessNAS reduces GPU hours by 200x compared with NAS and GPU memory by 10x compared with DARTS, while directly searching on ImageNet. ProxylessNAS is hardware-aware. It can design specialized neural network architectures for different hardware, making inference fast. With >74.5% top-1 accuracy, the measured latency of ProxylessNAS is 1.8x faster than MobileNet-v2, the current industry standard for mobile vision. paper / code / demo / poster / MIT news / IEEE Spectrum / industry integration: @AWS @Facebook
  • Sep 2018: Song Han received Amazon Machine Learning Research Award.
  • Sep 2018: Song Han received SONY Faculty Award.
  • Sep 2018: Our work on AMC: AutoML for Model Compression and Acceleration on Mobile Devices is accepted by ECCV’18. This paper proposes a learning-based method to perform model compression, rather than relying on human heuristics and rule-based methods. AMC can automate the model compression process, achieve a better compression ratio, and also be more sample efficient. It takes less time and does better than rule-based heuristics. AMC compresses ResNet-50 by 5x without losing accuracy. AMC makes MobileNet-v1 2x faster with 0.4% loss of accuracy. paper / website
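
The SmoothQuant intuition in the April 2023 item above (shifting information between X and W so that both are equally easy to quantize, as in 1×100=10×10) can be illustrated with a minimal sketch. This is not the released SmoothQuant code: the function names, the toy matrices, and the choice alpha=0.5 are assumptions made for illustration, and the per-channel scale rule s_j = max|X_j|^alpha / max|W_j|^(1-alpha) follows our reading of the paper rather than anything stated on this page.

```python
# Illustrative sketch only; all names and values here are hypothetical.

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def abs_max(m):
    """Largest absolute value in a matrix (a proxy for quantization range)."""
    return max(abs(v) for row in m for v in row)

def smooth(x, w, alpha=0.5):
    """Rescale X and W per input channel j so that X' @ W' equals X @ W.

    Assumed rule: s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha),
    then X'[:, j] = X[:, j] / s_j and W'[j, :] = W[j, :] * s_j.
    """
    n_ch = len(w)
    s = []
    for j in range(n_ch):
        x_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        s.append((x_max ** alpha) / (w_max ** (1 - alpha)))
    x_s = [[row[j] / s[j] for j in range(n_ch)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_ch)]
    return x_s, w_s, s

# Toy activations: channel 1 is an outlier, roughly 100x larger than channel 0.
X = [[1.0, 100.0],
     [-1.0, 64.0]]
W = [[0.5, -1.0],
     [1.0, 0.25]]

X_s, W_s, s = smooth(X, W)
Y = matmul(X, W)        # original product
Y_s = matmul(X_s, W_s)  # product after smoothing

print("per-channel scales:", s)
print("activation range:", abs_max(X), "->", abs_max(X_s))
print("weight range:", abs_max(W), "->", abs_max(W_s))
```

With these toy values the outlier activation channel is scaled down while the matching weight row is scaled up by the same factor, so the product is unchanged up to floating-point rounding and the two matrices end up with comparable ranges.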


Ph.D. Stanford University. Advisor: Prof. Bill Dally 

B.S. Tsinghua University


Email: FirstnameLastname [at] mit [dot] edu 

Office: 38-344. I’m fortunate to be in Prof. Paul Penfield and Prof. Paul E. Gray’s former office.

PhD/intern recruiting email: please send your CV, research statement, pitch deck, and two reference contacts. I may not be able to reply to inquiry emails promptly if the above documents are incomplete. Select the ML+Sys track in the MIT PhD application system.