Anomaly detection system of web access using user behavior features

The growth and accessibility of the Internet and the explosion of personal
computing devices have driven robust growth in web applications, especially
for e-commerce and public services. Unfortunately, the vulnerabilities of
these web services have also increased rapidly. This creates a need to monitor
user accesses to these services and to distinguish abnormal and malicious
behaviors in the log data, in order to ensure both the quality and the safety
of the services. This work presents methods to build and develop a rule-based
system that allows service administrators to detect abnormal and malicious
accesses to their web services from web logs. The proposed method investigates
the characteristics of user behaviors in the form of HTTP requests and extracts
efficient features to precisely detect abnormal accesses. Furthermore, this
report proposes a way to collect and build datasets for applying machine
learning techniques to generate detection rules automatically. The anomaly
detection system was tested and its performance evaluated on 4 different web
sites with approximately one million log lines per day.


er web service access analysis system. This work uses
security assessment tools to generate abnormal access patterns that can be
used to build the model. In addition to the abnormal access patterns, normal
ones are generated automatically by a tool that scans the web service structure.
For dynamic web sites, the scanning process is assisted manually by the
administrator. The system is thus provided with 2 sample access data files,
one containing abnormal patterns and one containing normal patterns.
OWASP Zed Attack Proxy (ZAP) [18] is one of the most popular open
source security tools and is actively maintained by a large user community.
ZAP can help web service administrators to automatically find security issues,
especially during the development and testing of applications.
In this work, ZAP is used to generate abnormal access queries to the web
services that the administrators want to monitor. ZAP is set to its maximum
operating mode in order to cover as many types of access as possible (hijacking,
XSS, SQL injection, etc.) and to generate the maximum number of samples. These
samples are saved as semi-structured files for later processing (e.g.,
anomaly.csv). At the same time, ZAP generates reports that show the specific
security issues of the web services of interest. These are collected and saved
in a separate file (e.g., blacklist.csv).
In addition to security analysis tools, ZAP provides a mechanism for scanning
web service structures through its web spider and AJAX spider services.
These tools collect information about the structure of the service pages
and save it in a text file (e.g., normal.csv). In this work, ZAP was used
to collect web service access data for 4 test websites, with the following
volumes: about 300,000 abnormal queries, 200,000 normal accesses, and about
6,000 web pages with security issues (the blacklist).
In addition to the data generated through the ZAP toolkit, log data from the
four test web sites is also used to generate the dataset. A user access with an
HTTP code outside the normal range (return code > 200) is considered an
abnormal access due to an error. In addition, accesses in the log files that
coincide with entries in the anomaly.csv and blacklist.csv files are also
considered abnormal. This way of identifying unusual accesses from the log file
is not entirely rigorous, but it is still useful because, from an administrative
point of view, accesses to a faulty web service should raise an alert.
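The labelling rule for log entries can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `label_access`, the in-memory sets, and the sample requests are assumptions made for the example. Only the rule itself (a return code above 200, or a match against anomaly.csv or blacklist.csv) comes from the text.

```python
def label_access(request, status, anomaly_requests, blacklist_requests):
    """Return 'abnormal' or 'normal' for one log entry.

    Rule taken from the text: a return code above 200 marks an error,
    and a request that also appears in anomaly.csv or blacklist.csv
    is treated as abnormal.
    """
    if status > 200:
        return "abnormal"
    if request in anomaly_requests or request in blacklist_requests:
        return "abnormal"
    return "normal"


# Hypothetical stand-ins for the contents of anomaly.csv and blacklist.csv.
anomaly_requests = {"/Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd"}
blacklist_requests = {"/admin/backup.zip"}

# Hypothetical (request, status) pairs parsed from a web log.
log_entries = [
    ("/index.html", 200),
    ("/missing-page", 404),
    ("/Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd", 200),
]
labels = [
    label_access(req, status, anomaly_requests, blacklist_requests)
    for req, status in log_entries
]
print(labels)
```

Keeping the two lookups as sets makes each check constant-time, which matters when a day of logs holds on the order of a million lines.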
Data from the web log files of the 4 test sites was sampled over 4 to 5 days
for each site. The collected data included nearly 340,000 normal and nearly
56,000 abnormal accesses. One of the test sites has an average data volume
significantly lower (about 25%) than that of the remaining sites.
P. H. Duy, N. T. T. Thuy, N. N. Diep 127
The data collected from the log files is combined with the data generated by
the ZAP toolkit and, after duplicates are eliminated, forms the dataset used to
build the classification model. This dataset is fairly balanced, with nearly
470,000 normal and 380,000 abnormal accesses.
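Combining the two sources and removing duplicates might look like the sketch below. The function name `merge_samples`, the (request, label) tuple layout, and the sample data are assumptions for illustration; the source does not describe how duplicates were actually detected.

```python
def merge_samples(*sources):
    """Merge (request, label) pairs from several sources, dropping exact duplicates."""
    seen = set()
    merged = []
    for source in sources:
        for sample in source:
            if sample not in seen:
                seen.add(sample)
                merged.append(sample)
    return merged


# Hypothetical samples standing in for the log-derived and ZAP-generated data.
from_logs = [("/index.html", "normal"), ("/a?id=1'--", "abnormal")]
from_zap = [("/a?id=1'--", "abnormal"), ("/about.html", "normal")]

dataset = merge_samples(from_logs, from_zap)
counts = {
    label: sum(1 for _, lab in dataset if lab == label)
    for label in ("normal", "abnormal")
}
print(counts)
```

Counting the labels after merging is a quick way to check how balanced the resulting dataset is.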
4.4 Experiment and evaluation
4.4.1 Performance experiment
As mentioned above, the anomaly detection system for web access was developed
with Python 3.6 and the scikit-learn library. The dataset described above is
divided in a ratio of 7:3 for training and testing, corresponding to about
554,000 and 238,000 accesses respectively. These accesses are represented by
TF-IDF features, and the model is trained using random forest learning. The
parameters used when building the classification model are left at their
default values. Test results show that the accuracy of the classification model
is about 95%. Table 2 details the classification of normal and abnormal accesses
by the learned model.
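To make the TF-IDF representation concrete, here is a small pure-Python sketch. It assumes a tokenization scheme (splitting requests on non-alphanumeric characters) and the textbook weighting tf * log(N / df); neither is confirmed by the source, and scikit-learn's `TfidfVectorizer` uses a smoothed variant of this formula. The sample requests are made up.

```python
import math
import re
from collections import Counter


def tokenize(request):
    """Split a request into lowercase alphanumeric tokens (assumed scheme)."""
    return [tok for tok in re.split(r"[^0-9a-zA-Z]+", request.lower()) if tok]


def tfidf_vectors(requests):
    """Return one {token: weight} dict per request, weight = tf * log(N / df)."""
    docs = [Counter(tokenize(r)) for r in requests]
    n_docs = len(docs)
    # Document frequency: in how many requests each token appears.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in doc.items()
        })
    return vectors


# Hypothetical requests: two ordinary ones and one carrying SQL keywords.
requests = [
    "/index.aspx?pageid=12",
    "/news.aspx?pageid=40",
    "/index.aspx?pageid=1 union select password",
]
vectors = tfidf_vectors(requests)
for tok, weight in sorted(vectors[2].items(), key=lambda kv: -kv[1]):
    print(f"{tok:10s} {weight:.3f}")
```

Tokens that occur in every request (here `aspx` and `pageid`) receive a weight of zero, while rare tokens such as `union` stand out, which is what lets a tree-based classifier split on them.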
Table 2: Confusion matrix
            Normal     Abnormal
Normal      139,167    3,039
Abnormal    8,311      105,578
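The reported accuracy can be checked directly from the counts in Table 2. The sketch below assumes that rows are actual classes and columns are predicted classes, which the source does not state; accuracy is unaffected by that assumption, but the precision and recall figures for the abnormal class would swap if the axes were the other way round.

```python
# Counts copied from Table 2. Assumption: keys are (actual, predicted).
confusion = {
    ("normal", "normal"): 139_167,
    ("normal", "abnormal"): 3_039,
    ("abnormal", "normal"): 8_311,
    ("abnormal", "abnormal"): 105_578,
}

total = sum(confusion.values())
correct = confusion[("normal", "normal")] + confusion[("abnormal", "abnormal")]
accuracy = correct / total

# Precision and recall for the "abnormal" class (depend on the axis assumption).
true_abnormal = confusion[("abnormal", "abnormal")]
predicted_abnormal = confusion[("normal", "abnormal")] + true_abnormal
actual_abnormal = confusion[("abnormal", "normal")] + true_abnormal
precision = true_abnormal / predicted_abnormal
recall = true_abnormal / actual_abnormal

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy comes out at roughly 95.6%, in line with the figure of about 95% given in the text.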
The results obtained from building the classification model are relatively
good. The resulting model was then applied to the access data obtained from the
log files, producing the results shown in Table 3.
The data in the table shows that at some points the number of user accesses is
double the average at other times. The number of abnormal accesses is not
proportional to the user access volume. Figure 5 shows the percentage of
accesses checked (those that do not coincide with other accesses) and the
percentage of anomalies among the queries checked. During the observed period,
the number of anomalies fluctuated around 2% of the total number of queries
examined. In the 3rd and 4th observation ranges in particular, the rate of
abnormalities increased sharply.
Several abnormal user accesses are shown in Table 4. Accesses 1 to 4
are relatively clear attacks on web services. Accesses 5 and 6 are less
clear-cut; they may be attempts to browse the directory structure of a website.
However, these behaviors still need to be monitored by administrators.
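Several of the requests in Table 4 are easier to read once their percent-encoding is undone. The sketch below is an illustration added here, not part of the described system; the helper name `fully_decode` and the round limit are assumptions. It decodes repeatedly because the parameter in access 6 is encoded twice.

```python
from urllib.parse import unquote


def fully_decode(value, max_rounds=5):
    """Percent-decode repeatedly until the string stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


samples = [
    "..%2f..%2f..%2fetc%2fpasswd",      # sname value from access 1
    "c%3A%2FWindows%2Fsystem.ini",      # sname value from access 4
    "%2527%253e%253c%2500rhLvZ%253e",   # type value from access 6
]
decoded = [fully_decode(s) for s in samples]
for raw, clear in zip(samples, decoded):
    print(f"{raw!r} -> {clear!r}")
```

Decoded, access 1 is a path traversal towards `/etc/passwd`, access 4 targets a Windows system file, and access 6 carries a quote, angle brackets, and a null byte hidden behind double encoding.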
128 Anomaly Detection System of Web Access ...
Table 3: Anomaly detection results
Total access    Total check    Abnormal
3,759,770       39,311         238
917,004         7,989          102,445
1,486,577       30,884         2,718
682,988         25,936         1,291
1,640,803       32,530         262
1,356,999       30,716         147
1,007,064       27,002         60
1,864,161       30,051         74
2,434,863       35,028         204
1,788,273       30,232         458
1,895,390       44,230         116
1,553,298       38,047         90
Figure 5: Percentage of prediction results
4.4.2 Run-time experiment
Although the TF-IDF model gives better detection quality on the CSIC dataset,
on the dataset built with the ZAP tool it performs about as well as the
common-feature model, with accuracies of 95.57% and 96.01% respectively. In the
training phase, on the same machine and dataset, it takes almost 7 minutes to
construct the TF-IDF model, compared with about half a minute for the
common-feature model. It is therefore more important and interesting to
investigate the run-time of the two models during the testing phase (anomaly
detection).
Table 4: Abnormal access
No.  Abnormal access
1    /Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd&sid=1293&pageid=32306
2    /Default.aspx?sname=http%3a%2f%2fwww.google.com+&sid=1293&pageid=32306
3    /wps/wcm/connect/309b0a0042eaedc58881ccd8919db02e/HINHLON.bmp?MOD=AJPERES
4    /Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468
5    /public/upload nhieuanh/server/php/ index.php
6    /vanban.aspx?type=%2527%253e%253c%2500rhLvZ%253e
The two models, TF-IDF and common-feature, are used to detect anomalies in
datasets collected on 8 different days, and each day's dataset is run 5 times.
The average run-time of each model is recorded in seconds and illustrated in
the table. As shown in the table, the common-feature model performs better
provided that the number of log records is below 600,000, but it is
considerably slower when the number of log records exceeds 1 million. Despite
the simplicity of computing its features, the common-feature model cannot keep
pace with the TF-IDF model as the data size increases.
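The measurement procedure (5 runs per dataset, averaged, in seconds) can be sketched as below. The helper `average_runtime` and the `toy_detector` stand-in are hypothetical; in the actual experiment the timed call would be the trained model's prediction over one day of log records.

```python
import time


def average_runtime(detect, records, runs=5):
    """Run detect(records) `runs` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        detect(records)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def toy_detector(records):
    """Hypothetical stand-in for a trained model: flags requests containing '..'."""
    return [".." in record for record in records]


# Hypothetical log records; real runs would use one day's dataset.
records = ["/index.html", "/a/../../etc/passwd"] * 1000
mean_seconds = average_runtime(toy_detector, records)
print(f"mean run-time over 5 runs: {mean_seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.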
4.5 Discussions
Applying rule generation techniques to anomaly detection helps administrators
easily visualize how the detection system works. Machine learning with the
decision tree algorithm allows anomaly detection rules to be developed quickly
and efficiently. It is also one of the common and typical techniques for
detecting anomalies, based on generating a classification tree of anomalous
behaviors. The experimental results showed that using TF-IDF features to
represent user behavior from access log data achieves good results compared to
other representations.
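How a learned decision tree turns into readable rules can be sketched with a hand-written tree. Everything in this example is hypothetical: the nested-dict layout, the feature names, and the thresholds are made up for illustration and are not taken from the trained model.

```python
def extract_rules(node, conditions=()):
    """Walk a binary decision tree and yield one (conditions, label) rule per leaf."""
    if "label" in node:
        yield list(conditions), node["label"]
        return
    feature, threshold = node["feature"], node["threshold"]
    yield from extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    yield from extract_rules(node["right"], conditions + (f"{feature} > {threshold}",))


# Hypothetical tree; feature names and thresholds are invented for this sketch.
tree = {
    "feature": "tfidf[passwd]",
    "threshold": 0.0,
    "left": {
        "feature": "tfidf[select]",
        "threshold": 0.1,
        "left": {"label": "normal"},
        "right": {"label": "abnormal"},
    },
    "right": {"label": "abnormal"},
}

rules = list(extract_rules(tree))
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN {label}")
```

Each root-to-leaf path becomes one IF ... THEN rule, which is the property that lets an administrator inspect why an access was flagged.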
Figure 6: Comparison of the run-time of the two models (TF-IDF and
common-feature) during testing phase (detecting anomaly)
The advantage of a decision-tree-based technique is that training is fast, but
the effectiveness of the detection system depends on the quality of the dataset
used to build the analytical model. The report proposes a way to build a
dataset that meets the need for anomaly detection and user access monitoring,
based on the ZAP security tool. This dataset is stored in semi-structured files
containing anomaly (anomaly.csv) and normal (normal.csv) samples, and
blacklists (blacklist.csv). This greatly supports the analysis and monitoring
of web services by administrators with limited resources. The simple structure
of the semi-structured files allows the administrator to append malicious or
normal accesses. In other words, administrators or operators of web services
can maintain a library of user access behaviors that suits their own needs.
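Maintaining such a library by appending to the sample files could look like the sketch below. The one-request-per-row layout and the helper names `append_samples` and `load_samples` are assumptions, since the source does not specify the column structure of anomaly.csv; the example writes to a temporary directory so it does not touch real data.

```python
import csv
import os
import tempfile


def append_samples(path, requests):
    """Append one request per row to a CSV file, creating the file if needed."""
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for request in requests:
            writer.writerow([request])


def load_samples(path):
    """Read back the first column of every non-empty row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle) if row]


workdir = tempfile.mkdtemp()
anomaly_path = os.path.join(workdir, "anomaly.csv")

# Two separate appends, as an administrator might do on different days.
append_samples(anomaly_path, ["/Default.aspx?sname=..%2f..%2fetc%2fpasswd"])
append_samples(anomaly_path, ["/vanban.aspx?type=%2527%253e"])

loaded = load_samples(anomaly_path)
print(loaded)
```

Using the `csv` module rather than plain string concatenation keeps requests that contain commas or quotes intact when they are read back.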
The proposed anomaly detection system uses a group of decision tree
algorithms, namely random forests, chosen on the basis of training speed,
performance, and anomaly detection performance. Other approaches, such as
support vector machines (SVM) or deep learning techniques, could also improve
the detection performance of the resulting classification model. However, these
techniques are limited by their deployment complexity and training time, and
may require special hardware and software (especially deep learning
techniques).
The proposed system has been evaluated experimentally on a fairly large amount
of data, up to millions of user accesses. The Python environment and the NoSQL
MongoDB database combine quite well when handling such volumes. However, the
proposed system is not really geared towards handling big data in the way that
a platform like Apache Spark is. Even so, large data processing platforms often
provide toolkits that connect to the Python environment because of its
popularity. It is therefore possible to integrate and extend the proposed
system with large data processing platforms such as Apache Spark.
5 Conclusion
With the increasing popularity of web services, administering them and
monitoring user behavior become ever more urgent in order to ensure the quality
of service as well as the security of the web services. Anomaly detection in
web services can range from detecting misuse by users to detecting malicious
activity that degrades the quality of the website's service or commits fraud.
The paper explores how to detect unusual accesses from log data on a web
server based on automatic rule generation using random forest algorithms. User
accesses to the services are represented by TF-IDF features because of their
detection performance. In addition, the report presents a method to build and
maintain a dataset for developing an anomaly classification model based on the
ZAP security tool. Administrators can easily maintain and update datasets
according to their own management and supervision needs. Testing of the
proposed anomaly detection system shows that it works relatively well, reaches
95% detection accuracy, and is capable of processing and monitoring query data
volumes of up to millions of user query records.
In the future, anomaly detection systems could be further investigated to
incorporate more advanced machine learning algorithms that enhance anomaly
detection performance. In addition, the analysis of anomalous access behavior
could be made more detailed, distinguishing categories such as XSS, SQL
injection, or hijacking instead of only normal and abnormal as at present.
Integration with large data processing platforms is also a practical task to
meet the administration and monitoring needs of large-scale web services.
References
[1] De Stefano, Claudio, Carlo Sansone, and Mario Vento. "To reject or not to reject: that
is the question-an answer in case of neural classifiers." IEEE Transactions on Systems,
Man, and Cybernetics, Part C (Applications and Reviews) 30.1 (2000): 84-94.
[2] Barbara, Daniel, Ningning Wu, and Sushil Jajodia. "Detecting novel network intrusions
using bayes estimators." Proceedings of the 2001 SIAM International Conference on
Data Mining. Society for Industrial and Applied Mathematics, 2001.
[3] Fan, Wei, et al. "Using artificial anomalies to detect unknown and known network
intrusions." Knowledge and Information Systems 6.5 (2004): 507-527.
[4] Helmer, Guy G., et al. "Intelligent agents for intrusion detection." 1998 IEEE
Information Technology Conference, Information Environment for the Future (Cat. No.
98EX228). IEEE, 1998.
[5] Lee, Wenke, Salvatore J. Stolfo, and Philip K. Chan. "Learning patterns from unix
process execution traces for intrusion detection." AAAI Workshop on AI Approaches
to Fraud Detection and Risk Management. 1997.
[6] Salvador, Stan, Philip Chan, and John Brodie. "Learning States and Rules for Time
Series Anomaly Detection." FLAIRS conference. 2004.
[7] Teng, Henry S., Kaihu Chen, and Stephen C. Lu. "Security audit trail analysis using
inductively generated predictive rules." Sixth Conference on Artificial Intelligence for
Applications. IEEE, 1990.
[8] Agrawal, Rakesh, and Ramakrishnan Srikant. "Mining sequential patterns." ICDE. Vol.
95. 1995.
[9] Mahoney, Matthew V., and Philip K. Chan. Learning rules for anomaly detection of
hostile network traffic. 2003.
[10] Chan, Philip K., Matthew V. Mahoney, and Muhammad H. Arshad. A machine learning
approach to anomaly detection. 2003.
[11] Tandon, Gaurav, and Philip K. Chan. "Weighting versus pruning in rule validation
for detecting network and host anomalies." Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery and data mining. ACM, 2007.
[12] Chan, Gaik-Yee, Chien-Sing Lee, and Swee-Huay Heng. "Discovering fuzzy association
rule patterns and increasing sensitivity analysis of XML-related attacks." Journal of
Network and Computer Applications 36.2 (2013): 829-842.
[13] Ezeife, Christie I., Jingyu Dong, and Akshai K. Aggarwal. "SensorWebIDS: a web
mining intrusion detection system." International Journal of Web Information Systems
4.1 (2008): 97-120.
[14] Breiman, Leo. "Random Forests." Machine Learning 45.1 (2001): 5-32.
[15] Nguyen, Hai Thanh, et al. "Application of the generic feature selection measure in
detection of web attacks." Computational Intelligence in Security for Information
Systems. Springer, Berlin, Heidelberg, 2011. 25-32.
[16] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Introduction to
information retrieval." An Introduction To Information Retrieval 151.177 (2008): 5.
[17] Giménez, Carmen Torrano, Alejandro Pérez Villegas, and Gonzalo Álvarez Marañón. "HTTP
dataset CSIC 2010." Information Security Institute of CSIC (Spanish Research
National Council) (2010).
[18] Bennetts, Simon. "OWASP Zed Attack Proxy." AppSec USA (2013).
