Detecting web attacks based on clustering algorithm and multi-branch CNN

This paper proposes and develops a web attack detection model that combines a clustering algorithm with a multi-branch convolutional neural network (CNN). The original feature set is clustered into corresponding groups of features. Each group of features is generalized in one branch of the multi-branch CNN to form a component feature vector. The component feature vectors are joined into a synthetic feature vector, which is fed into a fully connected layer for classification. Using cross-validation on the proposed model, the accuracy reaches 98.8%, the F1-score reaches 98.8%, and the improvement rate of accuracy is 1.479%.

... 94.13% accuracy. The Random Forest algorithm with only NLP-based features gives the best performance, with a 97.98% accuracy rate for the detection of phishing URLs [1]. In [2], the authors apply most of the common machine learning algorithms to phishing detection using hyperlink information, and the results show that the Logistic Regression algorithm has the highest accuracy (98.42%). In SQL Injection attack detection, the authors used the Naïve Bayes algorithm, which reached 93.3% [3]. In DoS and DDoS attack detection, the authors of [4] use an SVM algorithm based on web log traces.

Deep learning is known as a subset of machine learning, with outstanding performance in classification problems. Common deep learning models have also been used to detect several types of web attacks with great efficiency. Feng et al. (2018) proposed a novel neural network based on a classification method for the detection of phishing web pages using a Monte Carlo algorithm and the risk minimization principle. The CNN model [5] is used to detect website anomalies based on HTTP requests. The Stacked Auto Encoder (SAE) model [6] is applied for anomaly detection in a web application firewall. Other results include DoS attack detection based on a Restricted Boltzmann Machine [8] and the detection of code injection attacks on hybrid applications using a Hybrid Deep Learning Network (HDLN) that combines CNN and LSTM, which achieves an accuracy of over 97.5% [7].

In addition, some studies use a combination of machine learning/deep learning algorithms to classify attacks on websites with quite good results. An example is combining the neural network approach with reinforcement learning for phishing attack classification (Smadi, Aslam, and Zhang, 2018).

...tion for meaningful research results and good performance [12], [13].

In 2017, the multi-branch CNN was proposed by Amerini et al. to detect double JPEG image compression. It was then developed further by proposing another feature set, with relatively high accuracy (on average between 95% and 99%) [14]. In 2019, research groups continued to propose branching CNN architectures for multiple sclerosis lesion segmentation [15] and for myocardial infarction screening from ECG images [16]. The architecture is therefore used effectively in medicine. There are very few research results that use this architecture for the web attack detection problem [5].

Based on the above survey results, this paper proposes a new method for web attack detection based on the combination of the K-means clustering algorithm and a multi-branch CNN. Our method is developed, tested and evaluated in the following sections.

III. IDEA AND THE MATHEMATICAL MODEL

A. Basic idea

The key idea of our paper is to use a clustering algorithm to split the original feature set into subsets corresponding to clusters, and to feed these subsets to the branches of a CNN for classification. Each cluster is fed into one branch, which generalizes its features to create a component feature vector. The component feature vectors are joined to generate a synthetic feature vector. This vector is fed into a fully connected layer of the CNN for classification. Because the features within a cluster are the closest to one another under the clustering metric, it is more efficient to build a component feature vector for each cluster.
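The grouping step at the heart of this idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each feature is represented by its column of values across the samples (the paper does not specify the representation), it uses squared Euclidean distance, and the names `kmeans`, `split_by_cluster` and the toy data are hypothetical. The K-means algorithm itself is the one the paper describes in Section IV.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster equal-length tuples into k groups; returns one label per point."""
    rng = random.Random(seed)
    # Step 1: pick k distinct points as the initial centroids (seeded for repeatability).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # stop when the clusters no longer change
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_by_cluster(sample, labels, k):
    """Split one sample's feature values into k sub-vectors, one per feature cluster."""
    return [[x for x, lab in zip(sample, labels) if lab == c] for c in range(k)]


# Toy data (hypothetical): 4 samples (rows) x 4 features (columns).
samples = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 1, 10, 11],
]
# Assumption: a feature is described by its column of values across all samples.
feature_columns = [tuple(col) for col in zip(*samples)]
labels = kmeans(feature_columns, k=2)
parts = split_by_cluster(samples[3], labels, k=2)
print(labels, parts)
```

In the proposed method, each sub-vector in `parts` would be the input of one CNN branch, and the outputs of the branches would be joined into the synthetic feature vector.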
Journal of Science and Technology on Information security (Khoa học và Công nghệ trong lĩnh vực An toàn thông tin), No 2.CS (12) 2020
Fig. 1. Overall research model.

B. Building the mathematical model of the problem

Definition 1 – Component feature vector
A component feature vector is the feature vector generated by one branch of the CNN, as described by Equation (3).

Definition 2 – Synthetic feature vector
A synthetic feature vector is the feature vector created by joining the component feature vectors, as described by Equation (4).

As shown in Fig. 1, the original feature set is clustered by the K-means algorithm into K clusters, as shown in Equation (2). The overall mathematical model of the problem is described by Equations (1) to (5).

$$f : X \to Y \tag{1}$$

$$X = \bigcup_{i=1}^{K} X_i \tag{2}$$

$$v_i = f^{i}(X_i) \tag{3}$$

$$v = \bigcup_{i=1}^{K} v_i \tag{4}$$

$$f' : V \to Y \quad \text{and} \quad V = \{v\} \tag{5}$$

where $X$ is the original feature set, $Y$ is the set of classes, $X_i$ is the $i$-th cluster of features, $f^{i}$ is the mapping computed by the $i$-th CNN branch, and $f'$ is the classifier applied to the synthetic feature vector.

The features in each cluster are similar, so the convolution and filtering part of a CNN branch yields better generalized features. At the same time, each component feature vector is generated by its own CNN branch, so it also carries the characteristics of its cluster. Each component feature vector is denoted $v_i$. The synthetic vector $v$ is formed by combining the component feature vectors $v_i$.

Based on the overall model of the problem, the steps of building, analyzing, testing and evaluating the method are presented in detail in the following sections.

IV. FEATURE SET CLUSTERING BASED ON K-MEANS ALGORITHM

K-means is one of the most popular clustering algorithms. The K-means clustering algorithm computes the centroids and iterates until it finds
the optimal centroids. It assumes that the number of clusters is already known. In this paper, we use the K-means algorithm to cluster the original feature set into K subsets of features. The K-means algorithm is described as follows.

K-means algorithm:
Input:
 A set of features.
 Number of clusters K.
Output: K subsets of features
Algorithm:
 1 Initialize cluster centroids randomly (6)
 2 Put each point into the cluster which has the nearest centroid (7)
   Stop if clusters do not change from the previous step
 3 Update centroids (8), then return to step 2

V. EVALUATING THE METHOD

To evaluate the proposed method, we used K-fold cross-validation and measures such as Accuracy, Precision, Recall and F1-score. These measures are calculated using Equations (9), (10) and (11).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{9}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$

$$F_1\text{-}\mathrm{Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11}$$

where,
 TP is the number of patterns correctly classified as the attack state.
 FP is the number of patterns falsely classified as the attack state.
 TN is the number of patterns correctly classified as the normal state.
 FN is the number of patterns falsely classified as the normal state.

VI. EXPERIMENT

A. Experimental model

To evaluate the proposed method, we conducted experiments as shown in Fig. 2. In the experiments, the original feature set is clustered into three clusters; the original feature set is passed through a one-branch CNN, and each cluster is passed through one branch of a multi-branch CNN.

Fig. 2. Experimental model.

B. Experimental program and data

In this experiment, we implemented the CNN-based web attack detection program in Python, using the TensorFlow library. The two CNN structures implemented in the program are a one-branch CNN and a multi-branch CNN, described in Fig. 4 and Fig. 3, respectively. The multi-branch CNN has three branches corresponding to the three clusters, with 585, 835 and 223 elements. For our experiment, we use the dataset in [17].

Fig. 3. Experimental structure of CNN-multi-branches.

Fig. 4. Experimental structure of a CNN-1branch.

C. Feature conversion

In order to create the binary matrices that are input to a CNN, we convert the original feature set to a binary feature set, as shown in Fig. 5 and Fig. 6. Fig. 5 shows a part of the query string, used as a raw feature, having Xpath and XSS labels. Fig. 6 shows some binary features converted from raw features.

Fig. 5. A part of query string in the original feature set.

Fig. 6. A part of CNN feature set.

D. Experimental results and evaluation

The accuracy and related measurements obtained when experimenting on the three data sets with the CNN model by the cross-validation method are summarized in Table 1. The average improvement rate is 1.479%. The improvement of the proposed method when experimenting on 3 clusters is summarized in chart form in Fig. 7.

As shown in Table 2, compared with some machine learning models in the study [18], including SVM, PCA, etc., the proposed model has higher accuracy. At the same time, using the K-means algorithm to group the features also improves the accuracy. This is because after clustering we obtain groups of similar features, so the generalization of features in the convolution layers is more efficient.
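The measures used in this evaluation can be computed directly from the four counts TP, FP, TN and FN. The sketch below follows Equations (9) to (11); the accuracy formula (TP + TN) / (TP + FP + TN + FN) is the standard definition and is an addition here, since the text names Accuracy without writing its formula. The function name `classification_metrics`, the label encoding (1 = attack, 0 = normal) and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1-score and accuracy for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Equations (9) and (10); a zero denominator is reported as 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Equation (11): harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # Standard accuracy definition (assumption: not spelled out in the text).
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels (hypothetical): 1 = attack, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In K-fold cross-validation, these measures would be computed once per fold on the held-out part and then averaged.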
TABLE 1. EXPERIMENTAL RESULTS

                        Times
Models                  1                 2                 3                 4                 5                 Average
                        F1-Score  Acc     F1-Score  Acc     F1-Score  Acc     F1-Score  Acc     F1-Score  Acc     F1-Score  Acc
CNN-1branch             0.962     0.967   0.974     0.965   0.968     0.983   0.975     0.981   0.969     0.973   0.970     0.974
CNN-multi-branches      0.985     0.986   0.989     0.991   0.983     0.984   0.991     0.995   0.995     0.985   0.989     0.988
Improvement rate (%)    2.391     1.965   1.540     2.694   1.550     0.102   1.641     1.427   2.683     1.233   1.960     1.479
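The improvement rates in Table 1 are consistent with a relative improvement, (proposed - baseline) / baseline × 100. The paper does not state this formula, so it is an inference that the short check below tests against the table. It uses the second value of each pair in Table 1 (the column whose averages are 0.974 and 0.988, read here as accuracy because 0.988 is the accuracy reported for the CNN in Table 2). The names `improvement_rate`, `one_branch` and `multi_branch` are hypothetical.

```python
def improvement_rate(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100.0


def mean(values):
    return sum(values) / len(values)


# Per-run values copied from Table 1 (runs 1 to 5).
one_branch = [0.967, 0.965, 0.983, 0.981, 0.973]
multi_branch = [0.986, 0.991, 0.984, 0.995, 0.985]

# Improvement of each run, rounded to three decimals as in the table.
per_run = [round(improvement_rate(b, p), 3) for b, p in zip(one_branch, multi_branch)]

# Improvement of the averaged scores (computed from the unrounded means).
average = round(improvement_rate(mean(one_branch), mean(multi_branch)), 3)

print(per_run)  # [1.965, 2.694, 0.102, 1.427, 1.233]
print(average)  # 1.479
```

Note that the average is obtained from the unrounded means (0.9738 and 0.9882); starting from the rounded averages 0.974 and 0.988 gives about 1.437% instead.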
TABLE 2. COMPARING TO OTHER METHODS

Method    Naive bayes    PCA      AGGREGATE_ANY    Auto encoder    CNN
Acc.      0.941          0.933    0.906            0.737           0.988

Fig. 7. Comparison of CNN-1 branch and CNN-multi-branches

VII. CONCLUSION

The main contribution of this paper is to propose and develop a new method of web attack detection that combines clustering by the K-means algorithm with classification by a multi-branch CNN. The proposed method is evaluated using K-fold cross-validation with good results. Our method is better than the original one-branch CNN on both F1-score and accuracy.

Despite the positive results, this paper still has some limitations: the number of classes is small, the number of samples is limited, and the number of clusters is fixed. Therefore, we will continue to research and improve the methodology of the paper, including: experimenting with other machine learning/deep learning models; studying dynamic cluster numbers; and experimenting with other real data sets with a higher number of classes and more diverse forms of attacks.

REFERENCES

[1] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri, Machine learning based phishing detection from URLs, Expert Systems With Applications 117, 2019, pp. 345–357.
[2] Ankit Kumar Jain, B. B. Gupta, A Machine Learning based Approach for phishing detection using hyperlinks information, Springer-Verlag GmbH Germany, part of Springer Nature, 2018.
[3] Anamika Joshi, Geetha V, SQL Injection Detection using Machine Learning, 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014.
[4] Yuchun Tang, Zhenyu Zhong, Yuanchen He, System and Method for Detection of DoS Attacks, Apr. 25, 2013.
[5] Ming Zhang, Boyi Xu, Shuai Bai, Shuaibing Lu, and Zhechao Lin, A Deep Learning Method to Detect Web Attacks Using a Specially Designed CNN, ICONIP 2017, Part V, LNCS 10638, 2017, pp. 828–836.
[6] Ali Moradi Vartouni, Saeed Sedighian Kashi, Mohammad Teshnehlab, An Anomaly Detection Method to Detect Web Attacks Using Stacked Auto-Encoder, 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), 2018.
[7] Ruibo Yan, Xi Xiao, Guangwu Hu, Sancheng Peng, Yong Jiang, New deep learning method to
detect code injection attacks on hybrid applications, The Journal of Systems and Software 137, 2018, pp. 67–77.
[8] Yadigar Imamverdiyev, Fargana Abdullayeva, Deep Learning Method for Denial of Service Attack Detection Based on Restricted Boltzmann Machine, Mary Ann Liebert, Inc., Big Data, Volume 6 Number 2, 2018.
[9] Coenen, F., Goulbourne, G. and Leng, P., Tree Structures for Mining association Rules, Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, 2003, pp. 25-51.
[10] Asantha Thilina, Shakthi Attanayake, Sacith Samarakoon, Dahami Nawodya, Lakmal Rupasinghe, Nadith Pathirage, Tharindu Edirisinghe, Kesavan Krishnadeva, Intruder Detection Using Deep Learning and Association Rule Mining, IEEE International Conference on Computer and Information Technology, 2016.
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), 1996, pp. 226–231.
[12] Junhao Gan, Yufei Tao, DBSCAN revisited: Mis-Claim, Un-fixability and Approximation, SIGMOD 2015.
[13] Erich Schubert, Jorg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN, ACM Trans. Database Syst. 42, 3, Article 19, 2017.
[14] Bin Li, Hu Luo, Haoxin Zhang, Shunquan Tan, Zhongzhou Ji, A multi-branch convolutional neural network for detecting double JPEG compression, Arxiv, 2017.
[15] Shahab Aslani, Michael Dayan, Loredana Storelli, Massimo Filippi, Vittorio Murino, Maria A Rocca, Diego Sona, Multi-branch Convolutional Neural Network for Multiple Sclerosis Lesion Segmentation, Arxiv, April 2019.
[16] Pengyi Hao, Xiang Gao, Zhihe Li, Jinglin Zhang, Fuli Wu, Cong Bai, Multi-branch fusion network for Myocardial infarction screening from 12-lead ECG images, Computer Methods and Programs in Biomedicine 184, 2020.
[17] Web attack detection dataset: https://github.com/DuckDuckBug/cnn_waf
[18] Pan Yao, Sun Fangzhou, Teng Zhongwei, White Jules, Schmidt Douglas, Staples Jacob and Krause Lee, Detecting web attacks with end-to-end deep learning, Journal of Internet Services and Applications, 2019.

ABOUT THE AUTHOR

Pham Van Huong
Workplace: Academy of Cryptography Techniques
Email: huongpv@actvn.edu.vn
Education: Received Bachelor's degree in 2005, Master's degree in 2008 and PhD in 2015 in Information Technology from University of Engineering and Technology, VNU.
Recent research direction: IoT, AIoT, embedded software optimization and big data, deep learning for information security.

Le Thi Hong Van
Workplace: Academy of Cryptography Techniques
Email: lthvan@actvn.edu.vn
Education: Received Engineer's degree in 2009 and Master's degree in 2013 in Information Security from Academy of Cryptography Techniques.
Recent research direction: information security, cryptography, IoT and application of AI, machine learning for information security.

Pham Sy Nguyen
Workplace: Informatics center, The Government Office
Email: phamsynguyen@chinhphu.vn
Education: Received Engineer's degree in Information Security in 2013; received Master's degree in Information Security in 2016 from Academy of Cryptography Techniques.
Recent research direction: web hacking, malware detection, information security.