Backdoor Defense via Aggregation of Outsourced Models using Multi-Stage Knowledge Distillation

Heydari, Amirhossein; Mansouri, Azadeh; Mahmoudi-Aznaveh, Ahmad

doi:10.22042/isecure.2026.240527

Backdoor Defense via Aggregation of Outsourced Models using Multi-Stage Knowledge Distillation

Articles in Press

Document Type : Research Article

Authors

Amirhossein Heydari ¹

Azadeh Mansouri ¹

Ahmad Mahmoudi-Aznaveh ²

¹ Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran, Iran.

² Cyberspace Research Institute, Shahid Beheshti University, Tehran, Iran.

https://doi.org/10.22042/isecure.2026.240527

Abstract

Backdoor attacks pose a significant threat to deep learning systems by injecting hidden malicious behavior to the model while preserving high accuracy on clean data. Such attacks are particularly dangerous in scenarios where users rely on pre-trained models or outsource training to untrusted parties. In this work, we propose a practical defense strategy that assumes no knowledge of the backdoor trigger or the training process, relying on a small trusted clean dataset. Our method introduces a two-stage pipeline: First, we aggregate predictions from multiple potentially compromised models to train an intermediate Teacher-Aggregation (TA) model; then, we distill this knowledge into a compact light-weight student model. This multi-stage approach effectively alleviates backdoor effects while preserving clean accuracy. Experimental results on MNIST and CIFAR-10 demonstrate that our method significantly reduces the Attack Success Rate (ASR)—to approximately 0.1% on MNIST and 2.6% on CIFAR-10—outperforming baseline ensemble defenses. Furthermore, our lightweight student model is suitable for edge deployment, providing a generic and scalable defense that remains robust under minimal assumptions, making it well-suited for real-world applications in adversarial environments. Our code is available at: https://github.com/mr-pylin/backdoor-toolbox

Keywords

Backdoor Mitigation

Black-box Model Defense

Knowledge Distillation

Model Aggregation Strategy

Trigger Suppression

Untrusted Model Scenario

[1] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

[2] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.

[3] Shihao Zhao, Xingjun Ma, Xiang Zheng, James Bailey, Jingjing Chen, and Yu-Gang Jiang. Clean-label backdoor attacks on video recognition models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14443–14452, 2020.

[4] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16463–16472, 2021.

[5] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In European Conference on Computer Vision, pages 182–199. Springer, 2020.

[6] Tong Wang, Yuan Yao, Feng Xu, Shengwei An, Hanghang Tong, and Ting Wang. An invisible black-box backdoor attack through frequency domain. In European Conference on Computer Vision, pages 396–413. Springer, 2022.

[7] Qiuyu Duan, Zhongyun Hua, Qing Liao, Yushu Zhang, and Leo Yu Zhang. Conditional backdoor attack via jpeg compression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19823–19831, 2024.

[8] Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, and Shu-Tao Xia. Not all prompts are secure: A switchable backdoor attack against pre-trained vision transfomers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24431–24441, 2024.

[9] Shuo Wang, Surya Nepal, Carsten Rudolph, Marthie Grobler, Shangyu Chen, and Tianle Chen. Backdoor attacks against transfer learning with pre-trained deep learning models. IEEE Transactions on Services Computing, 15(3):1526– 1539, 2020.

[10] Sanghyun Hong, Michael-Andrei PanaitescuLiess, Yigitcan Kaya, and Tudor Dumitras. Quanti-zation: Exploiting quantization artifacts for achieving adversarial outcomes. Advances in Neural Information Processing Systems, 34:9303– 9316, 2021.

[11] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE symposium on security and privacy (SP), pages 707–723. IEEE, 2019.

[12] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. Abs: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1265–1282, 2019.

[13] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International symposium on research in attacks, intrusions, and defenses, pages 273–294. Springer, 2018.

[14] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. Strip: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th annual computer security applications conference, pages 113–125, 2019.

[15] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-backdoor learning: Training clean models on poisoned data. Advances in Neural Information Processing Systems, 34:14900–14912, 2021.

[16] Kota Yoshida and Takeshi Fujino. Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks. In Proceedings of the 13th ACM workshop on artificial intelligence and security, pages 117–127, 2020.

[17] Zonghao Ying and Bin Wu. Nba: defensive distillation for backdoor removal via neural behavior alignment. Cybersecurity, 6(1):20, 2023.

[18] Yinshan Li, Hua Ma, Zhi Zhang, Yansong Gao, Alsharif Abuadbba, Minhui Xue, Anmin Fu, Yifeng Zheng, Said F Al-Sarawi, and Derek Abbott. Ntd: Non-transferability enabled deep learning backdoor detection. IEEE Transactions on Information Forensics and Security, 19:104– 119, 2023.

[19] Boheng Li, Yishuo Cai, Haowei Li, Feng Xue, Zhifeng Li, and Yiming Li. Nearest is not dearest: Towards practical defense against quantizationconditioned backdoor attacks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24523–24533, 2024.

[20] Huaibing Peng, Huming Qiu, Hua Ma, Shuo Wang, Anmin Fu, Said F Al-Sarawi, Derek Abbott, and Yansong Gao. On model outsourcing adaptive attacks to deep learning backdoor defenses. IEEE Transactions on Information Forensics and Security, 19:2356–2369, 2024.

[21] Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, and Chao Shen. Backdoorbench: A comprehensive benchmark of backdoor learning. Advances in Neural Information Processing Systems, 35:10546–10559, 2022.

[22] Zhihuan Xing, Yuqing Lan, Yin Yu, Yong Cao, Xiaoyi Yang, Yichun Yu, and Dan Yu. Bdel: A backdoor attack defense method based on ensemble learning. In Pacific Rim International Conference on Artificial Intelligence, pages 221– 235. Springer, 2024.

[23] Zijie Zhang, Xinyuan Miao, Chenyu Zhou, Chenming Shang, Xi Chen, Xianglong Kong, Wei Huang, and Yi Cao. Bdekd: mitigating backdoor attacks in nlp models via ensemble knowledge distillation. Complex & Intelligent Systems, 11(9):1–17, 2025.

[24] Yifan Wang, Wei Fan, Keke Yang, Naji Alhusaini, and Jing Li. A knowledge distillation-based backdoor attack in federated learning. arXiv preprint arXiv:2208.06176, 2022.

[25] Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Bing Chen, and Weizhi Meng. Adfl: Defending backdoor attacks in federated learning via adversarial distillation. Computers & Security, 132:103366, 2023.

[26] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. arXiv preprint arXiv:2101.05930, 2021.

[27] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: visual explanations from deep networks via gradientbased localization. International journal of computer vision, 128:336–359, 2020.