DSRL-APT-2023: A New Synthetic Dataset for Advanced Persistent Threats

Document Type: Research Article

Authors

1 Department of Management, Science and Technology, Amirkabir University of Technology, Tehran, Iran

2 Department of Industrial and Systems Engineering, Tarbiat Modares University, Tehran, Iran

3 Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran

Abstract
Detecting Advanced Persistent Threats (APTs) is crucial, and a practical approach is to combine an intrusion detection system (IDS) with supervised machine learning algorithms. These algorithms need a balanced dataset with ample attack samples to learn and recognize attack patterns effectively. However, widely used APT datasets, such as DAPT2020 and SCVIC-APT-2021, are imbalanced, which limits the performance of machine learning-based IDSs. To address this challenge, we introduce DSRL-APT-2023, a new balanced synthetic APT dataset generated by a CTGAN model trained on the DAPT2020 dataset. We evaluate and compare six standard supervised machine learning algorithms (Decision Tree, Support Vector Machine, K-Nearest Neighbor, Logistic Regression, Random Forest, and Multi-Layer Perceptron) alongside an IDS called Intelligent Intrusion Detection System, which is based on tree-structured machine learning models. The evaluation measures attack detection on DSRL-APT-2023 and compares it with detection on DAPT2020 and SCVIC-APT-2021. Additionally, we assess the quality of the synthetic data generated by two prominent GANs, CopulaGAN and CTGAN; CTGAN produces slightly higher-quality tabular data. Our results show that, as measured by F1-score, the machine learning algorithms and the Intelligent IDS accurately detect attacks in the synthetic dataset.
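
The abstract's emphasis on class balance and on the F1-score can be illustrated with a minimal, standard-library sketch. It is not the paper's evaluation code: the label counts (95 benign flows, 5 attack flows) and the always-predict-benign classifier are hypothetical, chosen only to show how accuracy can look strong on an imbalanced dataset while the per-class F1 for the attack class collapses to zero.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from paired true and predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced labels (assumption, not taken from any dataset).
y_true = ["benign"] * 95 + ["attack"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["benign"] * 100

acc = accuracy(y_true, y_pred)
per_class = f1_per_class(y_true, y_pred)
macro = macro_f1(y_true, y_pred)

print(f"accuracy = {acc:.3f}")
print(f"F1 per class = {per_class}")
print(f"macro F1 = {macro:.3f}")
```

In this toy case the classifier never detects an attack, yet its accuracy is high because benign samples dominate; the attack-class F1 exposes the failure. This is the kind of distortion that a balanced dataset and F1-based reporting are meant to avoid.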

