A Survey of Compression Methods for Efficient Model Inferencing

© 2025 by IJCTT Journal
Volume-73 Issue-2
Year of Publication: 2025
Authors: Dhivya Nagasubramanian
DOI: 10.14445/22312803/IJCTT-V73I2P105

How to Cite?

Dhivya Nagasubramanian, "A Survey of Compression Methods for Efficient Model Inferencing," International Journal of Computer Trends and Technology, vol. 73, no. 2, pp. 31-47, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I2P105

Abstract
The advent of Large Language Models (LLMs) has revolutionized the field of artificial intelligence, enabling a broad spectrum of applications across academic research and industrial domains. Central to this transformation is the rise of Transformer-based architectures, which have set new benchmarks in Natural Language Processing (NLP) tasks, including text generation, machine translation, and sentiment analysis. Despite their remarkable performance, however, the computational demands of these models present significant challenges, particularly for deployment in resource-constrained environments. Models like GPT-4, reportedly comprising upwards of 1.8 trillion parameters, require substantial processing power, memory, and storage, making them ill-suited for smaller platforms such as Internet of Things (IoT) devices and embedded systems.
This limitation creates a critical need for methods that make LLMs more efficient and deployable on edge devices, which often operate under strict computational constraints. Several promising techniques have emerged to address this challenge, particularly those focused on model compression. Approaches such as quantization, which reduces the precision of model weights and activations, offer potential avenues for shrinking model size and accelerating inference. This paper explores a range of model compression techniques, with particular emphasis on their applicability to LLMs. Our goal is to identify strategies that enhance the efficiency of LLMs, enabling their deployment on devices with limited resources. We further investigate the synergistic potential of combining multiple compression methods to optimize model performance. The ultimate aim is to contribute to democratizing AI by making state-of-the-art models more accessible for real-world applications across diverse devices.
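
As a concrete illustration of the precision-reduction idea described above, the short Python sketch below performs a minimal symmetric 8-bit post-training quantization of a single weight tensor. It is an illustrative example under simplifying assumptions (per-tensor scaling, no activation quantization, hypothetical function names), not the method of any particular work surveyed in the paper.

import numpy as np

def quantize_int8(weights):
    # Map float32 weights to int8 using one per-tensor scale (symmetric scheme):
    # the largest weight magnitude maps to +/-127.
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for computation or error analysis.
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32)     # stand-in for a layer's weight matrix
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
    print("storage per weight: 32 bits ->", 8 * q.itemsize, "bits")  # 4x reduction

Storing int8 values in place of float32 yields roughly a 4x reduction in weight memory at the cost of the small rounding error printed above; post-training quantization schemes of the kind surveyed here differ mainly in how such scales are chosen and refined.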

Keywords
Large Language Models, Neural Network Quantization, Model Compression, Quantization, Knowledge Distillation, Pruning.
