Representation model of requests to web resources, based on a vector space model and attributes of requests for HTTP protocol
In recent years, the number of incidents involving Web applications has tended to increase, driven by the growing number of mobile device users, the development of the Internet, and the expansion of its many services. This in turn raises the likelihood of attacks on users' mobile devices as well as on computer systems. Malicious code is often used to collect information about users and sensitive personal data, to gain access to Web resources, or to sabotage those resources. The purpose of this research is to improve the accuracy of detecting computer attacks on Web applications. The paper presents a model for representing Web requests, based on the vector space model and on the attributes of those requests under the HTTP protocol. Comparison with previously conducted studies allows us to estimate a detection accuracy of approximately 96% for Web applications when the KDD 99 data set is used for training and attack detection, together with vector-space-based query representation and classification based on a decision tree model.

Journal of Science and Technology on Information Security, No 2.CS (10) 2019

... the object belongs to the class of its single nearest neighbor.

In [17], the authors used a combined approach, a combination of the genetic algorithm [18] and the k-nearest neighbor classifier, to detect denial-of-service attacks. The goal of the genetic algorithm is to find the optimal weight vector $W = \{w_1, w_2, \ldots, w_n\}$, in which $w_i$ represents the weight of feature $i$, $1 \le i \le n$. For two feature vectors $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, the distance between them is calculated as follows:

... files used to analyze network packets, by an average of 40.3% in comparison with the standard module.

In [21], a comparative analysis of the capabilities of an artificial neural network and the decision tree method for detecting computer attacks is carried out. The researchers concluded that the artificial neural network is effective for generalization but not suitable for detecting new attacks, while decision trees are effective for both tasks.

5. Support vector machine

The initial data in the support vector machine method is a set of elements located in space. The dimension of the space corresponds to the number of classifying features, and their values determine the position of the elements (points) in the space.

The support vector machine method belongs to the linear classification methods. Two sets of points belonging to two different classes are separated by a hyperplane in the space. The hyperplane is constructed in such a way that the distances from it to the nearest instances of both classes (the support vectors) are maximal, which ensures the greatest classification accuracy.

The support vector machine method allows [22; 23]:
• obtaining a classification function with a minimum upper estimate of the expected risk (level of classification error);
• using a linear classifier to work with nonlinearly separable data.

III. MODEL FOR PRESENTING REQUESTS TO WEB RESOURCES, BASED ON THE VECTOR SPACE MODEL AND ATTRIBUTES OF REQUESTS VIA HTTP

The anomaly detection approach is based on the analysis of HTTP requests processed by the most common Web servers (for example, Apache or nginx) and is intended to be built into a Web Application Firewall (WAF). The WAF analyzes all requests coming to the Web server and makes decisions about their execution on the server (Fig. 1).

Fig. 1. WAF in Web Application Security System

A. Formation of feature space for our model

To set up the model for presenting requests to Web resources, the author formed a corresponding feature space, which made it possible to evaluate its adequacy from the standpoint of detecting computer attacks on Web applications.

Fig. 3 shows the main stages of analyzing an HTTP request received at the Web server input. We divided the data set into two parts: requests containing attacks and normal requests. In the learning process, we calculate all the necessary values, such as the expected value and the variance for normal queries; these values are then stored in a MySQL database for use in the attack detection process. The analysis is performed on the appropriate fields of the protocol so that they can later be represented in the vector space model. A number of attributes selected by the author are also analyzed and calculated. Thus, the proposed query representation model makes it possible to move from the text representation to a set of features of the vector space model for the corresponding protocol fields and query attributes.

The basic steps to form a model for each query are the following:
• Extracting and analyzing data: all incoming requests from the Web browser are analyzed.
• Transformation into a vector space model: text data are transformed into a vector representation using the TF-IDF algorithm [24], which allows estimating the weight of features for the entire text data array.
• Calculation of attribute values: the values of 8 attributes (5 from [25] and 3 proposed by the author) are calculated.

1. Extracting and analyzing data

Requests arrive at the Web server input via HTTP. An example of the contents of a GET request is shown in Fig. 2.
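As an illustration of the extraction step, the following minimal sketch splits a raw GET request into its method, path, query parameters, and headers. The sample request, the function name `parse_http_request`, and the layout of the returned dictionary are hypothetical and chosen only for illustration; the paper does not specify how its extraction is implemented.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_http_request(raw: str) -> dict:
    """Split a raw HTTP request into its main fields (illustrative only)."""
    # Headers and body are separated by an empty line.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: METHOD SP request-target SP HTTP-version
    method, target, version = lines[0].split(" ", 2)
    url = urlsplit(target)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "path": url.path,
        # keep_blank_values so that parameters such as "q=" are not dropped
        "params": parse_qsl(url.query, keep_blank_values=True),
        "version": version,
        "headers": headers,
        "body": body,
    }


# Hypothetical sample request, not taken from the paper's data set.
SAMPLE = (
    "GET /shop/search.jsp?item=book&page=2 HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0\r\n"
    "\r\n"
)

fields = parse_http_request(SAMPLE)
print(fields["method"], fields["path"], fields["params"])
```

Each extracted field (path, parameter names, parameter values, headers) can then be treated as a separate text source for the vector representation and for the attribute calculations.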
Fig. 2. Example of the content fields of an HTTP request (GET method)

2. Conversion to a Vector Space Model

To convert strings into a vector form that allows further application of machine learning methods, an approach based on the TF-IDF method was chosen [24].

TF-IDF is a statistical measure used to assess the importance of a word in the context of a document that is part of a document collection or corpus. The weight of a word is proportional to the number of uses of the word in the document and inversely proportional to the frequency of the word's use in other documents of the collection. The TF-IDF approach is applied to each request.

For each word $t$ in the query $d$ from the total set of queries $D$, the value tfidf is calculated according to the following expression:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \qquad (2)$$

The values of tf and idf are calculated in accordance with expressions (3) and (4) respectively, where $v$ denotes the words of the query $d$.

$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{\sum_{v \in d} \mathrm{count}(v, d)} \qquad (3)$$

$$\mathrm{idf}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \qquad (4)$$

Thus, after conversion, the query $d \in D$ is represented by a vector of dimension $|T|$, given by the set of weights $\{w_t\}_{t \in T}$, one for each term $t$ of the dictionary $T$.

Fig. 3. Analysis of incoming requests for Web applications within the framework of the proposed model

3. Calculation of attribute values

In [25], 5 basic attributes were proposed for building a system for detecting computer attacks on Web applications:
• The length of the request fields sent from the browser (A1).
• The distribution of characters in the request (A2).
• Structural inference (A3).
• Token finder (A4).
• Attribute order (A5).

The author proposes 3 additional attributes to improve the accuracy of attack detection.

The length of the request sent from the browser (A6)

An analysis of legitimate requests via the HTTP protocol showed that their length varies only slightly. In the event of an attack, however, the length of the data field may change significantly (for example, in the case of SQL injection or cross-site scripting).

Therefore, to estimate the limiting thresholds for changes in request length, two parameters are evaluated on the training set of legitimate data: the expected value $\mu$ and the variance $\sigma^2$. Using Chebyshev's inequality, we can bound the probability that a random variable takes a value far from its mean (expression (5)):

$$P(|x - \mu| > \tau) < \frac{\sigma^2}{\tau^2}, \qquad (5)$$

where $x$ is a random variable and $\tau$ is the threshold value of its change.

Accordingly, for any probability distribution with mean $\mu$ and variance $\sigma^2$, it is necessary to choose a value of $\tau$ such that a deviation of $x$ from the mean $\mu$ beyond this threshold results in blocking the query with the lowest level of errors of the first and second kind.

The attribute value is equal to the probability value from expression (5):

$$A_6 = P(|x - \mu| > \tau). \qquad (6)$$
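To make attribute A6 concrete, the sketch below estimates the mean and variance from a small set of legitimate request lengths and then uses the Chebyshev bound from expression (5) as the attribute value, taking the threshold to be the observed deviation of a request's length from the mean. That choice of threshold, the cap at 1, the function names, and the sample lengths are assumptions made for illustration; the paper states only that A6 equals the probability value in expression (5).

```python
from statistics import fmean, pvariance


def fit_length_model(lengths):
    """Estimate mean and variance of request length from legitimate traffic."""
    mu = fmean(lengths)
    sigma2 = pvariance(lengths, mu)
    return mu, sigma2


def a6_length_attribute(length, mu, sigma2):
    """Chebyshev upper bound sigma^2 / tau^2 with tau = |length - mu|.

    The bound is capped at 1 because it is used as a probability value
    (an assumption of this sketch).
    """
    tau = abs(length - mu)
    if tau == 0:
        return 1.0
    return min(1.0, sigma2 / tau ** 2)


# Hypothetical lengths of legitimate requests (training set).
legit_lengths = [40, 42, 38, 41, 39]
mu, sigma2 = fit_length_model(legit_lengths)

print(a6_length_attribute(41, mu, sigma2))   # typical length
print(a6_length_attribute(240, mu, sigma2))  # unusually long request
```

Under these assumptions, a request whose length is close to the training mean receives a value near 1, while an unusually long request receives a value near 0, which a classifier can treat as evidence of an anomaly.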
Appearance of new characters (A7)

From the training sample of legitimate requests, we select the non-repeating characters (including various encodings) in order to compose the set of symbols of the alphabet $A$. Then, each time a symbol $b \notin A$ appears in the query, the value of the counter $p_b$ for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the alphabet set:

$$A_7 = \frac{p_b}{|A|} \qquad (7)$$

The emergence of new keywords (A8)

From the training sample of legitimate queries, we select the non-repeating terms (words) $t$ in order to compose the set of terms of the dictionary $T$. Then, each time a word $t \notin T$ appears in the query, the counter value $p_t$ for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the set of terms of the dictionary:

$$A_8 = \frac{p_t}{|T|} \qquad (8)$$

IV. CONCLUSION

For testing the operation of machine learning methods, a data set from several data sources of system protection tools will be used, such as log files of the intrusion detection and prevention system, HTTP requests (GET and POST methods) of the Web application firewall, etc.

Fig. 4. An example of a complete dangerous HTTP request with the POST method

When analyzing a full HTTP request, the author focuses on the data in the red frame (Fig. 4). After the extraction process, the data are saved in the appropriate files (good_request.txt and bad_request.txt). The structure of these files is shown in Fig. 5.

Fig. 5. File of dangerous HTTP requests

A preliminary study allowed us to obtain an estimated accuracy of 96% for detecting attacks on Web applications on the data set [15], using the introduced query attributes, the query vector representation model, and a classifier based on decision trees. This allows us to conclude that it is possible to build an algorithm for detecting computer attacks on Web applications based on the proposed model for presenting requests to Web resources, which rests on the vector space model and is distinguished by its set of attributes of requests via HTTP.
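The vector representation of expressions (2) to (4) and the set-based attributes A7 and A8 of expressions (7) and (8) can be sketched together as follows. Whitespace tokenization, the toy queries, all function names, and the choice to count every occurrence of an unseen symbol or word are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(query):
    """Whitespace tokenizer (an assumption; the paper does not fix one)."""
    return query.split()


def tf(term, tokens):
    """Expression (3): term count divided by the total word count of the query."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus_tokens):
    """Expression (4): log of |D| over the number of queries containing the term."""
    containing = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / containing)


def tfidf_vector(tokens, corpus_tokens, dictionary):
    """Expression (2) evaluated for every term of the dictionary T."""
    return [
        tf(t, tokens) * idf(t, corpus_tokens) if t in tokens else 0.0
        for t in dictionary
    ]


def a7_new_characters(query, alphabet):
    """Expression (7): number of symbols outside A, divided by |A|."""
    p_b = sum(1 for ch in query if ch not in alphabet)
    return p_b / len(alphabet)


def a8_new_keywords(tokens, dictionary):
    """Expression (8): number of words outside T, divided by |T|."""
    p_t = sum(1 for t in tokens if t not in dictionary)
    return p_t / len(dictionary)


# Toy training set of legitimate queries (hypothetical).
legit = ["id 1 name bob", "id 2 name eve", "id 3 page home"]
corpus_tokens = [tokenize(q) for q in legit]
dictionary = sorted({t for tokens in corpus_tokens for t in tokens})
alphabet = {ch for q in legit for ch in q}

# Hypothetical incoming query; idf is taken from the training corpus D.
suspicious = "id 1 union select name"
tokens = tokenize(suspicious)
vector = tfidf_vector(tokens, corpus_tokens, dictionary)

print(dict(zip(dictionary, vector)))
print(a7_new_characters(suspicious, alphabet))
print(a8_new_keywords(tokens, dictionary))
```

The TF-IDF vector concatenated with the attribute values gives one fixed-length feature vector per request, which is the form a decision tree classifier expects as input.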
REFERENCES
[1]. Kaspersky Lab. Security report. - 2019. - (accessed: 15.04.2019). http://www.securelist.com/en/analysis/204792244/The-geography-of-cybercrime-Western-Europe-and-North-America.
[2]. A survey of intrusion detection techniques in cloud / C. Modi [et al.] // Journal of Network and Computer Applications. - Vol. 36, no. 1. - P. 42-57, 2013.
[3]. Khamphakdee N., Benjamas N., Saiyod S. Improving intrusion detection system based on snort rules for network probe attack detection // Information and Communication Technology (ICoICT), 2014 2nd International Conference on. - IEEE. - P. 69-74, 2014.
[4]. A stateful intrusion detection system for world-wide web servers / G. Vigna [et al.] // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE. - P. 34-43, 2003.
[5]. Sekar R. An Efficient Black-box Technique for Defeating Web Application Attacks // NDSS. - 2009.
[6]. Mutz D., Vigna G., Kemmerer R. An experience developing an IDS stimulator for the blackbox testing of network intrusion detection systems // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE. - P. 374-383, 2003.
[7]. Li X., Xue Y. BLOCK: a black-box approach for detection of state violation attacks towards web applications // Proceedings of the 27th Annual Computer Security Applications Conference. - ACM. - P. 247-256, 2011.
[8]. Saxena P., Sekar R., Puranik V. Efficient fine-grained binary instrumentation with applications to taint-tracking // Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. - ACM. - P. 74-83, 2008.
[9]. Браницкий А. А., Котенко И. В. Анализ и классификация методов обнаружения сетевых атак // Труды СПИИРАН. - Т. 2, № 45. - С. 207-244, 2016.
[10]. Heckerman D. A tutorial on learning with Bayesian networks // Innovations in Bayesian networks. - Springer. - P. 33-82, 2008.
[11]. Friedman N., Geiger D., Goldszmidt M. Bayesian network classifiers // Machine learning. - Vol. 29, no. 2-3. - P. 131-163, 1997.
[12]. Goldszmidt M. Bayesian network classifiers // Wiley Encyclopedia of Operations Research and Management Science. - 2010.
[13]. Barbara D., Wu N., Jajodia S. Detecting novel network intrusions using bayes estimators // Proceedings of the 2001 SIAM International Conference on Data Mining. - SIAM. - P. 1-17, 2001.
[14]. Нейросетевая технология обнаружения сетевых атак на информационные ресурсы / Ю. Г. Емельянова [и др.] // Программные системы: теория и приложения. - Т. 2, № 3. - С. 3-15, 2011.
[15]. A Detailed Analysis of the KDD CUP 99 Data Set / M. Tavallaee [et al.] // Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications. - Ottawa, Ontario, Canada: IEEE Press. - P. 53-58. - (CISDA'09). - URL: 1736481.17 36489, 2009.
[16]. Васильев В.И., Шарабыров И.В. Интеллектуальная система обнаружения атак в локальных беспроводных сетях // Вестник Уфимского государственного авиационного технического университета. - 2015. - Т. 19, 4 (70).
[17]. Su M.-Y. Real-time anomaly detection systems for Denial-of-Service attacks by weighted k-nearest-neighbor classifiers // Expert Systems with Applications. - Vol. 38, no. 4. - P. 3492-3498, 2011.
[18]. Lee C. H., Chung J. W., Shin S. W. Network intrusion detection through genetic feature selection // Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2006. SNPD 2006. Seventh ACIS International Conference on. - IEEE. - P. 109-114, 2006.
[19]. Intrusion detection with genetic algorithms and fuzzy logic / E. Ireland [et al.] // UMM CSci senior seminar conference. - P. 1-6, 2013.
[20]. Kruegel C., Toth T. Using decision trees to improve signature-based intrusion detection // Recent Advances in Intrusion Detection. - Springer. - P. 173-191, 2003.
[21]. Bouzida Y., Cuppens F. Neural networks vs.

ABOUT THE AUTHORS

Manh Thang Nguyen
Workplace: Information Technology Faculty, Academy of Cryptography Techniques.
Email: chieumatxcova@gmail.com
Education:
2005-2007: Student at the Military Technical Academy.
2007-2013: Student at the Applied Mathematics and Informatics Faculty, Lipetsk State Pedagogical University, Russian Federation.
2017-present: Post-graduate student at the Military Academy of the Federal Guard Service of the Russian Federation.
Current research: Computer networks, network security, machine learning and data mining.

D.S. Alexander Kozachok
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: alex.totrin@gmail.com
Education: received a PhD degree in Engineering Sciences from the Academy of Federal Guard Service of the Russian Federation in Dec. 2012.
Current research: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer.

