Anomaly detection system of web access using user behavior features
The growth and accessibility of the Internet and the explosion of personal
computing devices have driven robust growth in web applications,
especially for e-commerce and public services. Unfortunately, the vulnerabilities of these web services have also increased rapidly. This creates a
need to monitor user accesses to these services and to distinguish abnormal and malicious behaviors in the log data, in order to ensure both the
quality and the safety of these web services. This work presents
methods to build and develop a rule-based system that allows service
administrators to detect abnormal and malicious accesses to their web
services from web logs. The proposed method investigates characteristics of user behaviors in the form of HTTP requests and extracts efficient
features to precisely detect abnormal accesses. Furthermore, this report
proposes a way to collect and build datasets for applying machine learning techniques to generate detection rules automatically. The anomaly
detection system was tested and its performance evaluated on 4 different web sites with approximately one million log lines per day.
er web service access analysis system. This work uses security assessment tools to generate abnormal access patterns that can be used to build the model. In addition to the abnormal access patterns, normal ones are generated automatically by the web service structure scanning tool. For dynamic web sites, the scanning process is manually supported by the administrator. Thus, the system is provided with 2 sample data access files, one containing abnormal patterns and one containing normal patterns.

OWASP Zed Attack Proxy (ZAP) [18] is one of the most popular open source security tools and is actively maintained by a large user community. ZAP can help web service administrators to find security issues automatically, especially during the development and testing of applications. In this work, ZAP is used to generate abnormal access queries to the web services that the administrators want to monitor. ZAP is run in its maximum operating mode to cover as many types of access as possible (hijacking, XSS, SQL injection, etc.) and to maximize the number of generated samples. These samples are saved as semi-structured files for later processing (e.g., anomaly.csv). At the same time, ZAP generates reports that show specific security issues of the web services of interest. These are collected and saved in a separate file (e.g., blacklist.csv).

In addition to security analysis tools, ZAP provides a mechanism for scanning web service structures through its web spider and AJAX spider services. These tools collect information about the structure of service pages and save it in a text file (e.g., normal.csv). In this work, ZAP was used to collect web service access data for 4 test websites, with the following volumes: about 300,000 abnormal queries, 200,000 normal accesses, and about 6,000 web pages with security issues (blacklist).

In addition to the data generated through the ZAP toolkit, log data from the four test web sites is also used to generate the dataset.
Access by users with an HTTP code outside of the normal range (return code > 200) is considered an abnormal access due to an error. In addition, accesses in the log files that coincide with entries in the anomaly.csv and blacklist.csv files are considered abnormal. This way of identifying unusual accesses from the log file is not entirely straightforward, but it is still useful because, from an administrative point of view, access to a faulty web service should raise an alert.

Data from the web log files of the 4 test sites was sampled over 4 to 5 days for each site. The data collected included nearly 340,000 normal and nearly 56,000 abnormal accesses. Among the test sites, 1 site has an average data volume significantly lower (about 25%) than the remaining sites.

P. H. Duy, N. T. T. Thuy, N. N. Diep 127

The data collected from the log files, combined with the data generated by the ZAP toolkit, forms the dataset for building the classification model once duplicates have been eliminated. This dataset is quite balanced, including nearly 470,000 normal and 380,000 abnormal accesses.

4.4 Experiment and evaluation

4.4.1 Performance experiment

As mentioned above, an anomaly detection system for web access was developed based on Python 3.6 and the scikit-learn library. The dataset described above is divided in a ratio of 7:3 for training and testing, corresponding to 554,000 and 238,000 accesses respectively. These accesses are represented by TF-IDF features and used for machine learning with random forest learning. The parameters used in building the classification model are left at their default values. Test results show that the accuracy of the classification model is about 95%. Table 2 details the classification of normal and abnormal accesses by the learned model.

Table 2: Confusion matrix

            Normal      Abnormal
Normal      139,167     3,039
Abnormal    8,311       105,578

The results obtained from building the classification model are relatively good.
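As a quick check on Table 2, the overall accuracy can be recomputed from the four counts. The sketch below is only illustrative: the source does not state which axis of the table holds the actual classes and which the predicted ones, so treating rows as actual classes is an assumption, and the helper names are hypothetical. The accuracy itself does not depend on that choice.

```python
# Counts copied from Table 2. Keys are (row label, column label).
# Assumption: rows are the actual classes, columns the predicted ones;
# the source does not state the orientation.
confusion = {
    ("normal", "normal"): 139_167,
    ("normal", "abnormal"): 3_039,
    ("abnormal", "normal"): 8_311,
    ("abnormal", "abnormal"): 105_578,
}


def accuracy(matrix):
    """Share of samples on the diagonal; independent of axis orientation."""
    correct = sum(n for (row, col), n in matrix.items() if row == col)
    return correct / sum(matrix.values())


def row_agreement(matrix, label):
    """Share of one row that lies on the diagonal.

    This equals the recall of `label` only if rows are actual classes.
    """
    row_total = sum(n for (row, _), n in matrix.items() if row == label)
    return matrix[(label, label)] / row_total


print(f"accuracy: {accuracy(confusion):.4f}")
print(f"normal row agreement: {row_agreement(confusion, 'normal'):.4f}")
print(f"abnormal row agreement: {row_agreement(confusion, 'abnormal'):.4f}")
```

The accuracy computed this way is about 95.57%, which agrees with the figure of about 95% quoted for the model.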
The classification model obtained from the model building step was applied to the access data from the log files, producing the results shown in Table 3. The data in the table shows that at some points the number of user accesses is double the average at other times. The number of anomalies is not proportional to the user access volume. Figure 5 shows the percentage of hits checked (those that do not coincide with other accesses) and of anomalies among the queries tested. During the observed period, the number of anomalies fluctuated around 2% of the total number of queries examined. In particular, in the 3rd and 4th observation ranges, the rate of abnormalities increased sharply.

Several abnormal user accesses are shown in Table 4. Accesses 1 to 4 are relatively clear attacks on web services. Accesses 5 and 6 are not clearly attacks, or are attempts to browse the directory structure of a website. However, these behaviors still need to be monitored by administrators.

Table 3: Anomaly detection results

Total        Total check   Abnormal access
3,759,770    39,311        238
917,004      7,989         102,445
1,486,577    30,884        2,718
682,988      25,936        1,291
1,640,803    32,530        262
1,356,999    30,716        147
1,007,064    27,002        60
1,864,161    30,051        74
2,434,863    35,028        204
1,788,273    30,232        458
1,895,390    44,230        116
1,553,298    38,047        90

Figure 5: Percentage of prediction results

4.4.2 Run-time experiment

Although the TF-IDF model achieves better detection quality on the CSIC dataset, it performs about as well as the common-feature model on the dataset produced with the ZAP tool, with accuracies of 95.57% and 96.01% respectively. For the training phase, using the same machine and dataset, it takes almost 7 minutes to construct the TF-IDF model compared with about half a minute to build the common-feature model.
Therefore, it is more important and interesting to investigate the run-time of these two models during the testing phase (detecting anomalies). The two models, namely the TF-IDF and common-feature models, are used to detect anomalies in the datasets collected on 8 different days, and the dataset for each day is run 5 times. The average run-time of these models is recorded in seconds and illustrated in the table. As shown in the table, the common-feature model performs better provided that the number of log records is below 600,000, but it is considerably slower when the number of log records exceeds 1 million. Despite the simplicity of computing its features, the common-feature model cannot keep pace with the TF-IDF model as the data size increases.

Table 4: Abnormal access

No.  Abnormal access
1    /Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd&sid=1293&pageid=32306
2    /Default.aspx?sname=http%3a%2f%2fwww.google.com+&sid=1293&pageid=32306
3    /wps/wcm/connect/309b0a0042eaedc58881ccd8919db02e/HINHLON.bmp?MOD=AJPERES
4    /Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468
5    /public/upload nhieuanh/server/php/ index.php
6    /vanban.aspx?type=%2527%253e%253c%2500rhLvZ%253e

4.5 Discussions

Applying rule generation techniques to anomaly detection helps administrators easily visualize how the detection system works. Machine learning techniques using the decision tree algorithm allow anomaly detection rules to be developed quickly and efficiently. This technique is also one of the common and typical approaches for detecting anomalies, based on generating a classification tree of anomalous behaviors. The experimental results showed that using TF-IDF features to represent user behavior from access log data achieves good results compared to other representations.
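To make the TF-IDF representation of requests concrete, the following is a minimal sketch that uses only the Python standard library. It is not the system described in this paper: the tokenization (decode percent-escapes once, lowercase, split on non-alphanumeric characters) and the function names `tokenize` and `tfidf` are assumptions, since the source does not specify how requests are split into terms. It uses the plain weight tf x log(N / df); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values would differ.

```python
import math
import re
from collections import Counter
from urllib.parse import unquote


def tokenize(request):
    """Decode percent-escapes once, lowercase, split on non-alphanumerics.

    This tokenization scheme is an assumption made for illustration.
    """
    decoded = unquote(request).lower()
    return [t for t in re.split(r"[^0-9a-z]+", decoded) if t]


def tfidf(requests):
    """Return one {token: weight} dict per request, weight = tf * log(N / df)."""
    docs = [Counter(tokenize(r)) for r in requests]
    n_docs = len(docs)
    # Document frequency: in how many requests each token occurs.
    doc_freq = Counter(token for doc in docs for token in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in doc.items()
        })
    return vectors


# Two requests reconstructed from Table 4 plus one made-up benign request
# (the third entry is hypothetical and not taken from the paper's data).
requests = [
    "/Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd&sid=1293&pageid=32306",
    "/Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468",
    "/Default.aspx?sname=news&sid=4&pageid=468",
]
vectors = tfidf(requests)

# Tokens shared by every request carry no weight; rare tokens stand out.
for token in ("default", "passwd"):
    print(token, vectors[0][token])
```

In this toy corpus, tokens that appear in every request (such as `default`) receive zero weight, while tokens that appear in a single request (such as `passwd`) receive a positive weight. This is the property that lets a tree-based classifier split on unusual terms.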
The advantage of a decision-tree-based technique is that training is fast, but the effectiveness of the detection system depends on the quality of the dataset used to build the analytical model. The report proposes a way to build a dataset that meets the need for detecting anomalies and monitoring user access, based on the ZAP security tool. This dataset is stored in semi-structured files that include anomaly samples (anomaly.csv), normal samples (normal.csv), and blacklists (blacklist.csv). This greatly supports administrators who analyze and monitor web services with limited resources. The simple structure of the semi-structured files allows the administrator to append malicious or normal accesses. In other words, administrators or operators of web services can maintain a library of user access behaviors that suits their own needs.

Figure 6: Comparison of the run-time of the two models (TF-IDF and common-feature) during testing phase (detecting anomaly)

The proposed anomaly detection system uses a group of decision tree algorithms, namely random forests, chosen based on an assessment of training speed, performance and anomaly detection performance. On the other hand, other algorithms such as support vector machines (SVM) or deep learning techniques could also improve the detection performance of the resulting anomaly classification model. However, these techniques are limited by the complexity of deployment and by training time, and may require special hardware and software (especially deep learning techniques).

The proposed system has been used for an experimental analysis of a fairly large amount of data, up to millions of user accesses. The Python environment and the NoSQL MongoDB database combine quite well when handling such volumes. However, the proposed system is not really geared towards handling big data in the way the Apache Spark platform is.
Even so, large data processing platforms often provide toolkits that connect to the Python environment because of its popularity. It is therefore possible to integrate and extend the proposed system with large data processing platforms like Apache Spark.

5 Conclusion

With the increasing popularity of web services, administering and monitoring user behaviors becomes even more urgent to ensure the quality of service as well as the security of the web services. Anomaly detection in web services can range from detecting misuse by users to detecting malicious activity that degrades the quality of the website service or commits fraud.

The paper explores how to detect unusual accesses from log data on a web server based on automatic rule generation using random forest algorithms. User access to the services is represented by TF-IDF features because of their detection performance. In addition, the report presented a method to build and maintain a dataset, based on the ZAP security tool, for developing an anomaly classification model. Administrators can easily maintain and update the datasets according to their own management and supervision needs. Testing of the proposed anomaly detection system shows that it works relatively well, reaches 95% detection accuracy, and is capable of processing and monitoring query volumes of up to millions of user query records.

In the future, anomaly detection systems could be further investigated to incorporate more advanced machine learning algorithms to enhance detection performance. In addition, the analysis of anomalous access behavior could be more detailed, distinguishing categories such as XSS, SQL injection or hijacking instead of only normal and abnormal as at present. Integration with large data processing platforms is also a practical task to meet the administration and monitoring needs of large-scale web services.
References

[1] De Stefano, Claudio, Carlo Sansone, and Mario Vento. "To reject or not to reject: that is the question-an answer in case of neural classifiers." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30.1 (2000): 84-94.
[2] Barbara, Daniel, Ningning Wu, and Sushil Jajodia. "Detecting novel network intrusions using bayes estimators." Proceedings of the 2001 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2001.
[3] Fan, Wei, et al. "Using artificial anomalies to detect unknown and known network intrusions." Knowledge and Information Systems 6.5 (2004): 507-527.
[4] Helmer, Guy G., et al. "Intelligent agents for intrusion detection." 1998 IEEE Information Technology Conference, Information Environment for the Future (Cat. No. 98EX228). IEEE, 1998.
[5] Lee, Wenke, Salvatore J. Stolfo, and Philip K. Chan. "Learning patterns from unix process execution traces for intrusion detection." AAAI Workshop on AI Approaches to Fraud Detection and Risk Management. 1997.
[6] Salvador, Stan, Philip Chan, and John Brodie. "Learning States and Rules for Time Series Anomaly Detection." FLAIRS conference. 2004.
[7] Teng, Henry S., Kaihu Chen, and Stephen C. Lu. "Security audit trail analysis using inductively generated predictive rules." Sixth Conference on Artificial Intelligence for Applications. IEEE, 1990.
[8] Agrawal, Rakesh, and Ramakrishnan Srikant. "Mining sequential patterns." ICDE. Vol. 95. 1995.
[9] Mahoney, Matthew V., and Philip K. Chan. Learning rules for anomaly detection of hostile network traffic. 2003.
[10] Chan, Philip K., Matthew V. Mahoney, and Muhammad H. Arshad. A machine learning approach to anomaly detection. 2003.
[11] Tandon, Gaurav, and Philip K. Chan. "Weighting versus pruning in rule validation for detecting network and host anomalies." Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007.
[12] Chan, Gaik-Yee, Chien-Sing Lee, and Swee-Huay Heng. "Discovering fuzzy association rule patterns and increasing sensitivity analysis of XML-related attacks." Journal of Network and Computer Applications 36.2 (2013): 829-842.
[13] Ezeife, Christie I., Jingyu Dong, and Akshai K. Aggarwal. "SensorWebIDS: a web mining intrusion detection system." International Journal of Web Information Systems 4.1 (2008): 97-120.
[14] Breiman, Leo. "Random Forests." Machine learning 45.1 (2001): 5-32.
[15] Nguyen, Hai Thanh, et al. "Application of the generic feature selection measure in detection of web attacks." Computational Intelligence in Security for Information Systems. Springer, Berlin, Heidelberg, 2011. 25-32.
[16] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Introduction to information retrieval." An Introduction To Information Retrieval 151.177 (2008): 5.
[17] Giménez, Carmen Torrano, Alejandro Pérez Villegas, and Gonzalo Álvarez Marañón. "HTTP dataset CSIC 2010." Information Security Institute of CSIC (Spanish Research National Council) (2010).
[18] Bennetts, Simon. "OWASP Zed Attack Proxy." AppSec USA (2013).