Handling imbalanced data in intrusion detection systems using generative adversarial networks

Abstract: Machine learning-based intrusion detection has become more popular in the research community thanks to its capability to discover unknown attacks. To develop a good detection model for an intrusion detection system (IDS) using machine learning, a large number of attack and normal data samples are required in the learning process. While normal data can be relatively easy to collect, attack data is much rarer and harder to gather. Consequently, IDS datasets are often dominated by normal data, and machine learning models trained on those imbalanced datasets are ineffective in detecting attacks. In this paper, we propose a novel solution to this problem by using generative adversarial networks to generate synthesized attack data for IDS. The synthesized attacks are merged with the original data to form the augmented dataset. Three popular machine learning techniques are trained on the augmented dataset. The experiments conducted on three common IDS datasets and one dataset of our own show that machine learning algorithms achieve better performance when trained on the dataset augmented by the generative adversarial networks than when trained on the original dataset or on datasets produced by other sampling techniques. A visualization technique was also used to analyze the properties of the data synthesized by the generative adversarial networks and by the other techniques.

iques, we
can see that both ACGAN and TomekSMOTE generate
many samples that lie in the middle area of the minority
classes. Conversely, SMOTE-SVM and ACGAN-SVM only
generate samples that are near the borderline between
classes. In the context of sampling techniques, samples
near the borderline between classes often contribute
more to the effectiveness of the classifiers than
samples located in the center area.
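As an illustration of what "near the borderline" can mean in practice, the following minimal sketch keeps only those synthetic samples whose score under a linear decision function falls inside a margin band. The weights, bias, margin, and sample values are hypothetical and chosen by hand; this is not the selection procedure used in the experiments, only one plausible way to express the idea.

```python
# Hypothetical sketch: keep synthetic samples that lie close to a linear
# decision boundary. Weights and data are made up for illustration only.

def decision_value(x, weights, bias):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def select_borderline(samples, weights, bias, margin=1.0):
    """Return the samples whose absolute score is within the margin band."""
    return [x for x in samples
            if abs(decision_value(x, weights, bias)) <= margin]


# Assumed boundary: the line x0 == x1 (score is x0 - x1).
weights, bias = [1.0, -1.0], 0.0
synthetic = [[0.2, 0.1], [3.0, 0.0], [1.5, 1.0], [0.0, 4.0]]

borderline = select_borderline(synthetic, weights, bias)
print(borderline)
```

With a trained SVM, the score would come from the learned decision function rather than from hand-set weights; the margin of 1.0 is likewise an assumption of this sketch.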
Moreover, Fig. 3 also shows that TomekSMOTE and
ACGAN sometimes create samples that overlap
with the samples of other classes. Consequently, these
samples are difficult for the classifiers to separate
correctly. Conversely, the generated samples of SMOTE-SVM
and ACGAN-SVM do not overlap with other
classes. This suggests that using SVM helps to remove
the noisy samples generated by ACGAN and SMOTE, which
partially explains the better performance of
ACGAN-SVM compared to ACGAN and the others. Moreover,
the superior performance of ACGAN-SVM and ACGAN
over the other sampling techniques could be because the
samples generated by ACGAN and ACGAN-SVM follow
the original distribution better than those of the traditional
techniques, as analyzed in Subsection VI.3.
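The notion of a generated sample "overlapping" another class can be made concrete with a simple check. The sketch below is a stand-in that uses a nearest-neighbor test instead of an SVM: a synthetic attack sample is dropped when its closest original sample belongs to a different class. The function names and the toy data are assumptions made for illustration and do not come from the experiments.

```python
import math

# Hypothetical sketch: a nearest-neighbor overlap filter, used here as a
# simple stand-in for SVM-based removal of noisy synthetic samples.

def nearest_label(x, points, labels):
    """Label of the original sample closest to x (Euclidean distance)."""
    best = min(range(len(points)), key=lambda i: math.dist(x, points[i]))
    return labels[best]


def drop_overlapping(synthetic, target_label, points, labels):
    """Keep synthetic samples whose nearest original sample has target_label."""
    return [x for x in synthetic
            if nearest_label(x, points, labels) == target_label]


original = [[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8]]
labels = ["attack", "attack", "normal", "normal"]
synthetic_attacks = [[0.3, 0.1], [2.9, 3.1], [1.0, 0.5]]

kept = drop_overlapping(synthetic_attacks, "attack", original, labels)
print(kept)
```

In this toy example the second synthetic sample sits among the normal samples and is removed, while the other two are kept.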
Overall, the results in this section show that Generative
Adversarial Networks can generate meaningful samples
for imbalanced IDS datasets. The classification algorithms
trained on datasets augmented by ACGAN, and particularly
by ACGAN-SVM, often perform better than those trained
on the original dataset or on datasets obtained by
popular sampling techniques.
VII. SUMMARY
In this paper, we proposed a novel approach based on
generative adversarial networks for addressing imbalanced
datasets in IDS. Specifically, we proposed two techniques
based on ACGAN and ACGAN-SVM to generate samples
for the attack classes in IDS. The datasets augmented by
ACGAN and ACGAN-SVM are then used as the training
datasets for three popular classification algorithms: SVM,
DT, and RF. The experiments were conducted on three
common IDS datasets (NSL-KDD, UNSW-NB15, and
CICIDS2017) and one dataset of our own, RAWDATA.
The results show that the datasets augmented by ACGAN
and ACGAN-SVM help machine learning algorithms improve
their accuracy on the imbalanced datasets, although the training
processes of ACGAN and ACGAN-SVM are often slower
than those of the traditional sampling approaches. We analysed the
quality of the data generated by ACGAN and SMOTE-SVM
and visualized the borderline synthesized samples of
the five tested sampling techniques. The visualization partially
explains the better performance of ACGAN, and particularly
ACGAN-SVM, compared to the others.
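The augmentation step summarized above, merging synthesized attack samples with the original data before training, can be sketched as follows. The helper names, the toy feature vectors, and the use of a majority-to-minority count ratio as an imbalance measure are assumptions of this sketch, not details taken from the experiments.

```python
from collections import Counter

# Hypothetical sketch: merge synthetic attack samples into a training set
# and compare the class imbalance before and after augmentation.

def augment(features, labels, synthetic, synthetic_label):
    """Return new lists with the synthetic samples appended."""
    aug_x = features + synthetic
    aug_y = labels + [synthetic_label] * len(synthetic)
    return aug_x, aug_y


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy data: 8 normal samples and 2 attack samples.
features = [[float(i)] for i in range(10)]
labels = ["normal"] * 8 + ["attack"] * 2
synthetic = [[10.0 + i] for i in range(6)]  # 6 made-up attack samples

aug_x, aug_y = augment(features, labels, synthetic, "attack")
print(imbalance_ratio(labels), imbalance_ratio(aug_y))
```

The augmented lists would then be passed to a classifier in place of the original training set.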
There are a number of research areas for future work
that arise from this paper. First, we would like to examine
the effectiveness of other deep generative models,
such as auto-encoders, in generating synthesized attacks
for IDS. Second, the visualization technique has shed some
light on the superior performance of ACGAN-SVM over
other sampling techniques. However, this technique did
not completely explain why ACGAN is also often better
than SMOTE-SVM. We hypothesize that the data generated
by ACGAN follows the original distribution better than that of
SMOTE-SVM. In the future, we will study methods to
quantify the distribution of the synthesized samples of these
sampling techniques [41]. Last but not least, we want to
extend this approach to other problems in security and in
other areas such as anomaly detection.
Research and Development on Information and Communication Technology
ACKNOWLEDGEMENT
This research is funded by Vietnam National Foundation
for Science and Technology Development (NAFOSTED)
under grant number 102.05-2019.05.
REFERENCES
[1] M. Albayati and B. Issac, “Analysis of intelligent classifiers
and enhancing the detection accuracy for intrusion detection
system,” Int. J. Comput. Intell. Syst., vol. 8, pp. 841–853,
2015.
[2] G. T. Nguyen, B. M. Nguyen, D. Tran, and L. Hluchý, “A
heuristics approach to mine behavioural data logs in mobile
malware detection system,” Data Knowl. Eng., vol. 115, pp.
129–151, 2018.
[3] X. Jing, Z. Yan, and W. Pedrycz, “Security data collection
and data analytics in the internet: A survey,” IEEE Com-
munications Surveys & Tutorials, vol. Accepted Manuscript,
2018.
[4] J. Hussain, S. Lalmuanawma, and L. Chhakchhuak, “A two-
stage hybrid classification technique for network intrusion
detection system,” Int. J. Comput. Intell. Syst., vol. 9, pp.
863–875, 2016.
[5] D. A. Effendy, K. Kusrini, and S. Sudarmawan, “Classifica-
tion of intrusion detection system (ids) based on computer
network,” 2017 2nd International conferences on Informa-
tion Technology, Information Systems and Electrical Engi-
neering (ICITISEE), pp. 90–94, 2017.
[6] B. Xu, S. Chen, H. Zhang, and T. Wu, “Incremental k-
nn svm method in intrusion detection,” in International
Conference in Software Engineering and Service Science
(ICSESS), 2017, pp. 712–717.
[7] A. Hadri, K. Chougdali, and R. Touahni, “Intrusion detection
system using pca and fuzzy pca techniques,” 2016 Interna-
tional Conference on Advanced Communication Systems and
Information Security (ACOSIS), pp. 1–7, 2016.
[8] A. R. Syarif and W. Gata, “Intrusion detection system using
hybrid binary pso and k-nearest neighborhood algorithm,”
in IEEE conference in Information and Communication
Technology and System (ICTS), 2017, pp. 181–186.
[9] H. Hindy, D. Brosset, E. Bayne, A. Seeam, C. Tachtatzis,
R. C. Atkinson, and X. J. A. Bellekens, “A taxonomy
and survey of intrusion detection system design techniques,
network threats and datasets,” CoRR, vol. abs/1806.03517,
2018.
[10] W. L. Al-Yaseen, Z. A. Othman, and M. Z. A. Nazri, “Multi-
level hybrid support vector machine and extreme learning
machine based on modified k-means for intrusion detection
system,” Expert Syst. Appl., vol. 67, pp. 296–303, 2017.
[11] A. S. Eesa, Z. Orman, and A. M. A. Brifcani, “A novel
feature-selection approach based on the cuttlefish optimiza-
tion algorithm for intrusion detection systems,” Expert Syst.
Appl., vol. 42, pp. 2670–2679, 2015.
[12] B. W. Masduki, K. Ramli, F. A. Saputra, and D. Sugiarto,
“Study on implementation of machine learning methods
combination for improving attacks detection accuracy on
intrusion detection system (ids),” 2015 International Con-
ference on Quality in Research (QiR), pp. 56–64, 2015.
[13] R. K. Malaiya, D. Kwon, J. Kim, S. C. Suh, H. Kim,
and I. Kim, “An empirical evaluation of deep learning for
network anomaly detection,” in 2018 International Con-
ference on Computing, Networking and Communications,
ICNC, 2018, pp. 893–898.
[14] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J.
Kim, “A survey of deep learning-based network anomaly
detection,” Cluster Computing, pp. 1–13, 2017.
[15] S. Rodda, “Network intrusion detection systems using neural
networks,” in Information Systems Design and Intelligent
Applications. Springer, 2018, pp. 903–908.
[16] S. Rezaei and X. Liu, “Deep learning for encrypted traffic
classification: An overview,” IEEE communications maga-
zine, vol. 57, no. 5, pp. 76–81, 2019.
[17] P. Wang, X. Chen, F. Ye, and Z. Sun, “A survey of techniques
for mobile service encrypted traffic classification using deep
learning,” IEEE Access, vol. 7, pp. 54024–54033, 2019.
[18] S. Rodda and U. S. R. Erothi, “Class imbalance problem in
the network intrusion detection systems,” in 2016 Interna-
tional Conference on Electrical, Electronics, and Optimiza-
tion Techniques (ICEEOT). IEEE, 2016, pp. 2685–2688.
[19] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “NSL-KDD
data set for network-based intrusion detection systems,”
 2009, accessed: 2018-04-10.
[20] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive
data set for network intrusion detection systems (UNSW-
NB15 network data set),” in 2015 Military Communications
and Information Systems Conference, MilCIS 2015, Can-
berra, Australia, November 10-12, 2015, 2015, pp. 1–6.
[21] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward
generating a new intrusion detection dataset and intrusion
traffic characterization,” in International Conference on In-
formation Systems Security and Privacy (ICISSP), 2018, pp.
108–116.
[22] L. Vu, B. C. Thanh, and Q. U. Nguyen, “A deep learning
based method for handling imbalanced problem in network
traffic classification,” in Proceeding of Symposium on Infor-
mation and Communication Technology, 2017, pp. 333–339.
[23] A. A. Aburomman and M. B. I. Reaz, “A novel svm-knn-pso
ensemble method for intrusion detection system,” Appl. Soft
Comput., vol. 38, pp. 360–372, 2016.
[24] A. A. Aburomman and M. B. I. Reaz, “A survey of
intrusion detection systems based on ensemble and hybrid
classifiers,” Computers & Security, vol. 65, pp. 135–152,
2017.
[25] E. Hodo, X. J. A. Bellekens, A. Hamilton, C. Tachtatzis,
and R. C. Atkinson, “Shallow and deep networks intrusion
detection system: A taxonomy and survey,” CoRR, vol.
abs/1701.02145, 2017.
[26] Akashdeep, I. Manzoor, and N. Kumar, “A feature reduced
intrusion detection system using ANN classifier,” Expert
Syst. Appl., vol. 88, pp. 249–257, 2017.
[27] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory under-
sampling for class-imbalance learning,” Sixth International
Conference on Data Mining (ICDM’06), pp. 965–969, 2006.
[28] C. G. Cordero, E. Vasilomanolakis, A. Wainakh,
M. Mühlhäuser, and S. Nadjm-Tehrani, “On generating
network traffic datasets with synthetic attacks for intrusion
detection,” arXiv preprint arXiv:1905.00304, 2019.
[29] L. Arnaboldi and C. Morisset, “Generating synthetic data for
real world detection of dos attacks in the iot,” in Federation
of International Conferences on Software Technologies: Applications
and Foundations. Springer, 2018, pp. 130–145.
Vol. 2020, No. 1, September
[30] M. A. Salama, H. F. Eid, R. A. Ramadan, A. Darwish,
and A. E. Hassanien, “Hybrid intelligent intrusion detection
scheme,” Soft Computer Industry Application, pp. 193–303,
2011.
[31] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term
memory recurrent neural network classifier for intrusion de-
tection,” in International Conference on Platform Technology
and Service (PlatCon), 2016, pp. 1–5.
[32] V. Belenko, V. Chernenko, M. Kalinin, and V. Krundyshev,
“Evaluation of gan applicability for intrusion detection in
self-organizing networks of cyber physical systems,” in 2018
International Russian Automation Conference (RusAuto-
Con). IEEE, 2018, pp. 1–7.
[33] A. D. Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi,
“Racing for unbalanced methods selection,” in Intelligent
Data Engineering and Automated Learning - IDEAL 2013,
2013, pp. 24–31.
[34] C. Drummond and R. Holte, “C4.5, class imbalance, and cost
sensitivity: why under-sampling beats over-sampling,” Work-
shop on Learning from Imbalanced Data Sets II, vol. 11, pp.
1–8, 2003.
[35] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P.
Kegelmeyer, “Smote: Synthetic minority over-sampling tech-
nique,” Journal of Artificial Intelligence Research, vol. 16,
pp. 321–357, 2002.
[36] H. M. Nguyen, E. W. Cooper, and K. Kamei, “Borderline
over-sampling for imbalanced data classification,” Interna-
tional Journal of Knowledge Engineering and Soft Data
Paradigms (IJKESDP), vol. 3, no. 1, pp. 4–21, 2011.
[37] A. Namvar, M. Siami, F. Rabhi, and M. Naderpour, “Credit
risk prediction in an imbalanced social lending environ-
ment,” Int. J. Comput. Intell. Syst., vol. 11, no. 1, pp. 925–
935, 2018.
[38] Q. Wang, Z. Luo, J. Huang, Y. Feng, and Z. Liu, “A novel
ensemble method for imbalanced data learning: Bagging of
extrapolation-smote SVM,” Comp. Int. and Neurosc., vol.
2017, pp. 1827016:1–1827016:11, 2017.
[39] J. Charlier, A. Singh, G. Ormazabal, R. State, and
H. Schulzrinne, “Syngan: Towards generating synthetic net-
work attacks using gans,” arXiv preprint arXiv:1908.09899,
2019.
[40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,”
arXiv preprint arXiv:1701.07875, 2017.
[41] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio,
“Generative adversarial networks,” in Conference on Neural
Information Processing Systems (NIPS), 2014, pp. 2672–
2680.
[42] A. Odena, C. Olah, and J. Shlens, “Conditional image
synthesis with auxiliary classifier gans,” in Proceedings of
the 34th International Conference on Machine Learning,
ICML 2017, Sydney, NSW, Australia, 6-11 August 2017,
2017, pp. 2642–2651.
[43] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung,
A. Radford, and X. Chen, “Improved techniques for training
gans,” in Advances in Neural Information Processing Sys-
tems 29: Annual Conference on Neural Information Process-
ing Systems 2016, December 5-10, 2016, Barcelona, Spain,
2016, pp. 2226–2234.
[44] Winpcap, “The industry standard windows packet library,”
https://www.winpcap.org/default.htm, online; accessed 20
December 2019.
[45] SQL map, “Automatic SQL injection and database takeover
tool,”  online; accessed 20 December
2019.
[46] Scapy library, “Packet crafting for Python2 and Python3,”
https://scapy.net/, online; accessed 20 December 2019.
[47] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” CoRR, vol. abs/1412.6980, 2014.
[48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn:
Machine learning in Python,” Journal of Machine Learning
Research, vol. 12, pp. 2825–2830, 2011.
[49] G. Research, “Tensorflow tutorial,”
https://www.tensorflow.org/, 2015, accessed: 2018-04-
24.
[50] D. M. W. Powers, “Evaluation: from precision, recall and f-
measure to roc, informedness, markedness and correlation,”
Journal of Machine learning Technologies, vol. 2, pp. 37–63,
2011.
Ly Vu received her B.Sc. and M.Sc. degrees
in computer science from Le Quy
Don Technical University (LQDTU), Vietnam,
and Inha University, Korea, respectively.
She is currently pursuing the Ph.D.
degree at LQDTU. She was a Lecturer
with Le Quy Don Technical University.
Her research interests include data mining,
machine learning, deep learning, and network security.
Quang Uy Nguyen received the B.Sc. and
M.Sc. degrees in computer science from Le
Quy Don Technical University (LQDTU),
Vietnam, and the Ph.D. degree from University
College Dublin, Ireland. Currently, he is a
senior lecturer at LQDTU and the director
of the Machine Learning and Applications
research group at LQDTU. His research
interests include machine learning, computer vision, information
security, evolutionary algorithms, and genetic programming.