Handling imbalanced data in intrusion detection systems using generative adversarial networks
Abstract: Machine learning-based intrusion detection has become more popular in the research community thanks to its capability to discover unknown attacks. Developing a good detection model for an intrusion detection system (IDS) using machine learning requires a large number of attack and normal data samples in the learning process. While normal data is relatively easy to collect, attack data is much rarer and harder to gather. Consequently, IDS datasets are often dominated by normal data, and machine learning models trained on those imbalanced datasets are ineffective at detecting attacks. In this paper, we propose a novel solution to this problem that uses generative adversarial networks to generate synthesized attack data for IDS. The synthesized attacks are merged with the original data to form the augmented dataset. Three popular machine learning techniques are trained on the augmented dataset. Experiments conducted on three common IDS datasets and one dataset of our own show that machine learning algorithms achieve better performance when trained on the dataset augmented by the generative adversarial networks than when trained on the original dataset or on datasets produced by other sampling techniques. A visualization technique was also used to analyze the properties of the data synthesized by the generative adversarial networks and by the other techniques.
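As a rough illustration of the augmentation step described in the abstract, the sketch below merges synthesized attack rows with the original data and checks the class counts before training. It is not the authors' code: the helper names (`class_counts`, `samples_needed`, `augment`), the toy data, the choice to top the attack class up to the size of the normal class, and the jitter-based stand-in generator are all assumptions made for illustration. A trained GAN generator would take the place of the stand-in.

```python
import numpy as np


def class_counts(y):
    """Return {label: number of rows} for a label vector."""
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))


def samples_needed(y, minority_label):
    """Rows to add so the minority class matches the largest class (assumed target)."""
    counts = class_counts(y)
    return max(counts.values()) - counts[minority_label]


def augment(X, y, X_syn, y_syn):
    """Merge synthesized rows with the original data to form the augmented dataset."""
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])


# Toy imbalanced data: 95 "normal" rows (label 0) and 5 "attack" rows (label 1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (95, 4)), rng.normal(3.0, 1.0, (5, 4))])
y = np.array([0] * 95 + [1] * 5)

n_new = samples_needed(y, minority_label=1)

# Stand-in generator (NOT a GAN): jitter around randomly chosen real attack rows.
attack_rows = X[y == 1]
picked = rng.integers(0, len(attack_rows), size=n_new)
X_syn = attack_rows[picked] + rng.normal(0.0, 0.1, (n_new, X.shape[1]))
y_syn = np.ones(n_new, dtype=int)

X_aug, y_aug = augment(X, y, X_syn, y_syn)
print(class_counts(y), "->", class_counts(y_aug))
```

The arrays `X_aug` and `y_aug` would then be passed to whichever classifier is being evaluated, while the test split stays untouched so that the scores reflect real traffic only.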
... techniques, we can see that both ACGAN and TomekSMOTE generate many samples that lie in the middle area of the minor classes. Conversely, SMOTE-SVM and ACGAN-SVM generate only samples that are near the borderline between classes. In the context of sampling techniques, samples near the borderline between classes often contribute more to the effectiveness of the classifiers than samples located in the center area. Moreover, Fig. 3 also shows that TomekSMOTE and ACGAN sometimes create samples that overlap with the samples of other classes. Consequently, these samples are difficult for the classifiers to separate correctly. Conversely, the samples generated by SMOTE-SVM and ACGAN-SVM do not overlap with other classes. This indicates that using SVM helps to remove the noisy samples generated by ACGAN and SMOTE, which partially explains the better performance of ACGAN-SVM compared to ACGAN and the others. Moreover, the superior performance of ACGAN-SVM and ACGAN over the other sampling techniques may be because the samples generated by ACGAN and ACGAN-SVM follow the original distribution better than those of the traditional techniques, as analyzed in Subsection VI.3.
Overall, the results in this section show that generative adversarial networks can generate meaningful samples for imbalanced IDS datasets. The classification algorithms trained on datasets augmented by ACGAN, and particularly by ACGAN-SVM, are often better than those trained on the original dataset and on the datasets obtained with some popular sampling techniques.
VII. SUMMARY
In this paper, we proposed a novel approach based on generative adversarial networks for addressing imbalanced datasets in IDS. Specifically, we proposed two techniques based on ACGAN and ACGAN-SVM to generate samples for the attack classes in IDS.
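The SVM-based filtering credited in the discussion with removing noisy generated samples can be sketched as follows. This excerpt does not spell out the selection rule used by ACGAN-SVM, so the rule here is an assumption: a candidate is kept only if an SVM trained on the original data assigns it to the intended class and its decision score lies within the margin. The function name `filter_borderline`, the `margin` parameter, and the hand-made 2-D data are hypothetical; only `SVC`, `predict`, and `decision_function` are real scikit-learn calls.

```python
import numpy as np
from sklearn.svm import SVC


def filter_borderline(svm, X_syn, target_label, margin=1.0):
    """Keep synthetic rows predicted as target_label whose absolute
    decision score is at most `margin` (assumed selection rule)."""
    predicted = svm.predict(X_syn)
    score = np.abs(svm.decision_function(X_syn))  # binary case: shape (n,)
    keep = (predicted == target_label) & (score <= margin)
    return X_syn[keep]


# Hand-made, linearly separable toy data: label 0 = normal, label 1 = attack.
X = np.array([
    [-2.0, 0.0], [-2.0, 1.0], [-2.0, -1.0], [-3.0, 0.0],  # normal
    [2.0, 0.0], [2.0, 1.0], [2.0, -1.0], [3.0, 0.0],      # attack
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Candidate attack samples, standing in for generator output (hypothetical values).
candidates = np.array([
    [0.5, 0.0],   # attack side, close to the boundary
    [3.0, 0.0],   # attack side, far from the boundary
    [-1.0, 0.0],  # falls on the normal side (overlaps the other class)
])
kept = filter_borderline(svm, candidates, target_label=1)
print(kept)
```

On this toy data only the first candidate survives: the second lies well inside the attack class and the third overlaps the normal class, which mirrors the two kinds of samples that the visualization attributes to ACGAN and TomekSMOTE but not to the SVM-filtered variants.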
The augmented datasets of ACGAN and ACGAN-SVM are then used as the training datasets for three popular classification algorithms: SVM, DT, and RF. The experiments were conducted on three common IDS datasets (NSL-KDD, UNSW-NB15, and CICIDS2017) and one dataset of our own, RAWDATA. The results show that the augmented datasets of ACGAN and ACGAN-SVM help machine learning algorithms improve accuracy on the imbalanced datasets, although the training processes of ACGAN and ACGAN-SVM are often slower than the traditional sampling approaches. We analysed the quality of the generated data of ACGAN and SMOTE-SVM and visualized the borderline synthesized samples of five tested sampling techniques. The visualization partially explains the better performance of ACGAN and particularly ACGAN-SVM compared to others.
There are a number of research areas for future work that arise from this paper. First, we would like to examine the effectiveness of other deep learning generative models, such as auto-encoders, in generating synthesized attacks for IDS. Second, the visualization technique has shed some light on the superior performance of ACGAN-SVM over other sampling techniques. However, this technique did not completely explain why ACGAN is also often better than SMOTE-SVM. We hypothesize that the data generated by ACGAN follows the original distribution better than that of SMOTE-SVM. In the future, we will study methods to quantify the distribution of the synthesized samples of these sampling techniques [41]. Last but not least, we want to extend this approach to other problems in security and in other areas such as anomaly detection.
Research and Development on Information and Communication Technology
ACKNOWLEDGEMENT
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2019.05.
REFERENCES
[1] M. Albayati and B.
Issac, “Analysis of intelligent classifiers and enhancing the detection accuracy for intrusion detection system,” Int. J. Comput. Intell. Syst., vol. 8, pp. 841–853, 2015.
[2] G. T. Nguyen, B. M. Nguyen, D. Tran, and L. Hluchý, “A heuristics approach to mine behavioural data logs in mobile malware detection system,” Data Knowl. Eng., vol. 115, pp. 129–151, 2018.
[3] X. Jing, Z. Yan, and W. Pedrycz, “Security data collection and data analytics in the internet: A survey,” IEEE Communications Surveys & Tutorials, vol. Accepted Manuscript, 2018.
[4] J. Hussain, S. Lalmuanawma, and L. Chhakchhuak, “A two-stage hybrid classification technique for network intrusion detection system,” Int. J. Comput. Intell. Syst., vol. 9, pp. 863–875, 2016.
[5] D. A. Effendy, K. Kusrini, and S. Sudarmawan, “Classification of intrusion detection system (ids) based on computer network,” 2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE), pp. 90–94, 2017.
[6] B. Xu, S. Chen, H. Zhang, and T. Wu, “Incremental k-nn svm method in intrusion detection,” in International Conference in Software Engineering and Service Science (ICSESS), 2017, pp. 712–717.
[7] A. Hadri, K. Chougdali, and R. Touahni, “Intrusion detection system using pca and fuzzy pca techniques,” 2016 International Conference on Advanced Communication Systems and Information Security (ACOSIS), pp. 1–7, 2016.
[8] A. R. Syarif and W. Gata, “Intrusion detection system using hybrid binary pso and k-nearest neighborhood algorithm,” in IEEE conference in Information and Communication Technology and System (ICTS), 2017, pp. 181–186.
[9] H. Hindy, D. Brosset, E. Bayne, A. Seeam, C. Tachtatzis, R. C. Atkinson, and X. J. A. Bellekens, “A taxonomy and survey of intrusion detection system design techniques, network threats and datasets,” CoRR, vol. abs/1806.03517, 2018.
[10] W. L. Al-Yaseen, Z. A. Othman, and M. Z. A.
Nazri, “Multi-level hybrid support vector machine and extreme learning machine based on modified k-means for intrusion detection system,” Expert Syst. Appl., vol. 67, pp. 296–303, 2017.
[11] A. S. Eesa, Z. Orman, and A. M. A. Brifcani, “A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems,” Expert Syst. Appl., vol. 42, pp. 2670–2679, 2015.
[12] B. W. Masduki, K. Ramli, F. A. Saputra, and D. Sugiarto, “Study on implementation of machine learning methods combination for improving attacks detection accuracy on intrusion detection system (ids),” 2015 International Conference on Quality in Research (QiR), pp. 56–64, 2015.
[13] R. K. Malaiya, D. Kwon, J. Kim, S. C. Suh, H. Kim, and I. Kim, “An empirical evaluation of deep learning for network anomaly detection,” in 2018 International Conference on Computing, Networking and Communications, ICNC, 2018, pp. 893–898.
[14] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, “A survey of deep learning-based network anomaly detection,” Cluster Computing, pp. 1–13, 2017.
[15] S. Rodda, “Network intrusion detection systems using neural networks,” in Information Systems Design and Intelligent Applications. Springer, 2018, pp. 903–908.
[16] S. Rezaei and X. Liu, “Deep learning for encrypted traffic classification: An overview,” IEEE communications magazine, vol. 57, no. 5, pp. 76–81, 2019.
[17] P. Wang, X. Chen, F. Ye, and Z. Sun, “A survey of techniques for mobile service encrypted traffic classification using deep learning,” IEEE Access, vol. 7, pp. 54 024–54 033, 2019.
[18] S. Rodda and U. S. R. Erothi, “Class imbalance problem in the network intrusion detection systems,” in 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). IEEE, 2016, pp. 2685–2688.
[19] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “Nsl-kdd data set for network-based intrusion detection systems,” 2009, accessed: 2018-04-10.
[20] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” in 2015 Military Communications and Information Systems Conference, MilCIS 2015, Canberra, Australia, November 10-12, 2015, 2015, pp. 1–6.
[21] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in International Conference on Information Systems Security and Privacy (ICISSP), 2018, pp. 108–116.
[22] L. Vu, B. C. Thanh, and Q. U. Nguyen, “A deep learning based method for handling imbalanced problem in network traffic classification,” in Proceeding of Symposium on Information and Communication Technology, 2017, pp. 333–339.
[23] A. A. Aburomman and M. B. I. Reaz, “A novel svm-knn-pso ensemble method for intrusion detection system,” Appl. Soft Comput., vol. 38, pp. 360–372, 2016.
[24] A. AminAburomman and M. B. IbneReaz, “A survey of intrusion detection systems based on ensemble and hybrid classifiers,” Computers & Security, vol. 65, pp. 135–152, 2017.
[25] E. Hodo, X. J. A. Bellekens, A. Hamilton, C. Tachtatzis, and R. C. Atkinson, “Shallow and deep networks intrusion detection system: A taxonomy and survey,” CoRR, vol. abs/1701.02145, 2017.
[26] Akashdeep, I. Manzoor, and N. Kumar, “A feature reduced intrusion detection system using ANN classifier,” Expert Syst. Appl., vol. 88, pp. 249–257, 2017.
[27] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory under-sampling for class-imbalance learning,” Sixth International Conference on Data Mining (ICDM’06), pp. 965–969, 2006.
[28] C. G. Cordero, E. Vasilomanolakis, A. Wainakh, M. Mühlhäuser, and S. Nadjm-Tehrani, “On generating network traffic datasets with synthetic attacks for intrusion detection,” arXiv preprint arXiv:1905.00304, 2019.
[29] L. Arnaboldi and C.
Morisset, “Generating synthetic data for real world detection of dos attacks in the iot,” in Federation of International Conferences on Software Technologies: Applications and Foundations. Springer, 2018, pp. 130–145.
Vol. 2020, No. 1, September
[30] M. A. Salama, H. F. Eid, R. A. Ramadan, A. Darwish, and A. E. Hassanien, “Hybrid intelligent intrusion detection scheme,” Soft Computer Industry Application, pp. 193–303, 2011.
[31] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term memory recurrent neural network classifier for intrusion detection,” in International Conference on Platform Technology and Service (PlatCon), 2016, pp. 1–5.
[32] V. Belenko, V. Chernenko, M. Kalinin, and V. Krundyshev, “Evaluation of gan applicability for intrusion detection in self-organizing networks of cyber physical systems,” in 2018 International Russian Automation Conference (RusAutoCon). IEEE, 2018, pp. 1–7.
[33] A. D. Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi, “Racing for unbalanced methods selection,” in Intelligent Data Engineering and Automated Learning - IDEAL 2013, 2013, pp. 24–31.
[34] C. Drummond and R. Holte, “C4.5 class imbalance, and cost sensitivity: why under-sampling beats over-sampling,” Workshop on Learning from Imbalanced Data Sets II, vol. 11, pp. 1–8, 2003.
[35] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer, “Smote: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[36] H. M. Nguyen, E. W. Cooper, and K. Kamei, “Borderline over-sampling for imbalanced data classification,” International Journal of Knowledge Engineering and Soft Data Paradigms (IJKESDP), vol. 3, no. 1, pp. 4–21, 2011.
[37] A. Namvar, M. Siami, F. Rabhi, and M. Naderpour, “Credit risk prediction in an imbalanced social lending environment,” Int. J. Comput. Intell. Syst., vol. 11, no. 1, pp. 925–935, 2018.
[38] Q. Wang, Z. Luo, J. Huang, Y. Feng, and Z.
Liu, “A novel ensemble method for imbalanced data learning: Bagging of extrapolation-smote SVM,” Comp. Int. and Neurosc., vol. 2017, pp. 1 827 016:1–1 827 016:11, 2017.
[39] J. Charlier, A. Singh, G. Ormazabal, R. State, and H. Schulzrinne, “Syngan: Towards generating synthetic network attacks using gans,” arXiv preprint arXiv:1908.09899, 2019.
[40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
[41] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial networks,” in Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[42] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 2642–2651.
[43] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 2226–2234.
[44] Winpcap, “The industry standard windows packet library,” https://www.winpcap.org/default.htm, online; accessed 20 December 2019.
[45] SQL map, “Automatic SQL injection and database takeover tool,” online; accessed 20 December 2019.
[46] Scapy library, “Packet crafting for Python2 and Python3,” https://scapy.net/, online; accessed 20 December 2019.
[47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
[48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E.
Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[49] G. Research, “Tensorflow tutorial,” https://www.tensorflow.org/, 2015, accessed: 2018-04-24.
[50] D. M. W. Powers, “Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation,” Journal of Machine learning Technologies, vol. 2, pp. 37–63, 2011.
Ly Vu received her B.Sc. and M.Sc. degrees in computer science from Le Quy Don Technical University (LQDTU), Vietnam and Inha University, Korea, respectively. She is currently pursuing the Ph.D. degree with LQDTU. She was a Lecturer with Le Quy Don Technical University. Her research interests include data mining, machine learning, deep learning, and network security.
Quang Uy Nguyen received B.Sc. and M.Sc. degrees in computer science from Le Quy Don Technical University (LQDTU), Vietnam and the PhD degree from University College Dublin, Ireland. Currently, he is a senior lecturer at LQDTU and the director of the Machine Learning and Applications research group at LQDTU. His research interests include Machine Learning, Computer Vision, Information Security, Evolutionary Algorithms and Genetic Programming.