Representation model of requests to web resources, based on a vector space model and attributes of requests for HTTP protocol
In recent years, the number of incidents related to Web applications has tended to increase, driven by the growing number of mobile device users, the development of the Internet, and the expansion of its many services. This in turn raises the likelihood of attacks on users' mobile devices and on computer systems. Malicious code is often used to collect information about users and sensitive personal data, to gain access to Web resources, or to sabotage those resources. The purpose of this study is to improve the accuracy of detecting computer attacks on Web applications. The paper presents a model for representing Web requests, based on the vector space model and on the attributes of those requests under the HTTP protocol. A comparison with previously conducted studies allows us to estimate a detection accuracy of approximately 96% for Web applications when the KDD 99 dataset is used for training and attack detection, together with a vector-space representation of queries and classification based on a decision tree model.
Journal of Science and Technology on Information Security (Nghiên cứu Khoa học và Công nghệ trong lĩnh vực An toàn thông tin), No 2.CS (10) 2019

... the object belongs to the class of its single nearest neighbor.

In [17], the authors used a combined approach, a combination of the genetic algorithm [18] and the k-nearest neighbor classifier, to detect denial of service attacks. The goal of the genetic algorithm is to find the optimal weight vector, in which $w_i$ represents the weight of feature $i$, $1 \le i \le n$. For two feature vectors $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, the distance between them is calculated as follows: [...] (1)

... files used to analyze network packets, an average of 40.3% in comparison with the standard module.

In [21], a comparative analysis of the capabilities of an artificial neural network and the decision tree method for detecting computer attacks is carried out. The researchers concluded that the artificial neural network is effective for generalization but not suitable for detecting new attacks, while decision trees are effective for both tasks.

5. Support vector machine

The initial data in the support vector machine method is a set of elements located in space. The dimension of the space corresponds to the number of classifying features, and their values determine the position of the elements (points) in space.

The support vector machine method belongs to the linear classification methods. Two sets of points belonging to two different classes are separated by a hyperplane in space. The hyperplane is constructed so that the distances from it to the nearest instances of both classes (the support vectors) are maximal, which ensures high classification accuracy.

The support vector machine method allows [22; 23]:
• obtaining a classification function with a minimum upper estimate of the expected risk (level of classification error);
• using a linear classifier to work with data that are not linearly separable.

III. MODEL FOR PRESENTING REQUESTS TO WEB RESOURCES, BASED ON THE VECTOR SPACE MODEL AND ATTRIBUTES OF REQUESTS VIA HTTP

The anomaly detection approach is based on the analysis of HTTP requests processed by the most common Web servers (for example, Apache or nginx) and is intended to be built into a Web Application Firewall (WAF). The WAF analyzes all requests coming to the Web server and decides whether they are executed on the server (Fig. 1).

Fig. 1. WAF in Web Application Security System

A. Formation of feature space for our model

To define the model for presenting requests to Web resources, the author formed a corresponding feature space, which made it possible to evaluate its adequacy for the problem of detecting computer attacks on Web applications.

Fig. 3 shows the main stages of analyzing an HTTP request received at the Web server input. We divided the dataset into two parts: requests with information about attacks and normal requests. In the learning process, we calculate all the necessary values, such as the expected value and the variance for normal queries; these values are then stored in a MySQL database for use in the attack detection process. The analysis is performed on the appropriate fields of the protocol so that they can later be represented in the vector space model. A number of attributes selected by the author are also analyzed and calculated. Thus, the proposed query representation model allows moving from the text representation to the set of features of the vector space model for the corresponding protocol fields and query attributes.

Fig. 3. Analysis of incoming requests for Web applications within the framework of the proposed model

The basic steps to form a model for each query are the following:
• Extracting and analyzing data: all incoming requests from the Web browser are analyzed.
• Transformation into a vector space model: text data is transformed into a vector representation using the TF-IDF algorithm [24], which allows estimating the weight of features for the entire text data array.
• Calculation of attribute values: the values of 8 attributes proposed by the author are calculated.

1. Extracting and analyzing data

Requests via HTTP are received at the input of the Web server. An example of the contents of a GET request is shown in Fig. 2.

Fig. 2. Example of the content fields of HTTP request (GET method)

2. Conversion to a Vector Space Model

To convert strings into a vector form that allows further application of machine learning methods, an approach based on the TF-IDF method was chosen [24].

TF-IDF is a statistical measure used to assess the importance of words in the context of a document that is part of a document collection or corpus. The weight of a word is proportional to the number of uses of the word in the document and inversely proportional to the frequency of the word in the other documents of the collection. The TF-IDF approach is applied to each request.

For each word $t$ in a query $d$ from the collection of queries $D$, the value tfidf is calculated according to the following expression:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) \qquad (2)$$

The values of tf and idf are calculated in accordance with expressions (3) and (4) respectively, where $v$ ranges over the words in the query $d$:

$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{\sum_{v \in d} \mathrm{count}(v, d)} \qquad (3)$$

$$\mathrm{idf}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \qquad (4)$$

Thus, after the query $d \in D$ is converted into a vector representation of dimension $|T|$, it is described by the set of weights $\{w_t\}_{t \in T}$, one for each term $t$ from the dictionary $T$.

3. Calculation of attribute values

In [25], 5 basic attributes were proposed for building a system for detecting computer attacks on web applications:
• The length of the request fields sent from the browser (A1).
• The distribution of characters in the request (A2).
• Structural inference (A3).
• Token finder (A4).
• Attribute order (A5).

The author proposed to introduce 3 additional attributes to improve the accuracy of attack detection.

The length of the request sent from the browser (A6)

An analysis of legitimate requests via the HTTP protocol showed that their length varies only slightly. In the event of an attack, however, the length of the data field may change significantly (for example, in the case of SQL injection or cross-site scripting).

Therefore, to estimate the limiting thresholds for changes in request length, two parameters are evaluated on the training set of legitimate data: the expected value $\mu$ and the variance $\sigma^2$. Using Chebyshev's inequality, we can estimate the probability that a random variable will take a value far from its mean (expression (5)):

$$P(|x - \mu| > \tau) < \frac{\sigma^2}{\tau^2}, \qquad (5)$$

where $x$ is a random variable and $\tau$ is the threshold value of its change.

Accordingly, for any probability distribution with mean $\mu$ and variance $\sigma^2$, it is necessary to choose a value $\tau$ such that a deviation of $x$ from the mean $\mu$ beyond the threshold results in blocking the query with the lowest level of type I and type II errors. The attribute value is equal to the probability value from expression (5):

$$A6 = P(|x - \mu| > \tau). \qquad (6)$$

Appearance of new characters (A7)

From the training sample of legitimate requests, the non-repeating characters (including various encodings) are selected to compose the set of symbols of the alphabet $A$. When a symbol $b \notin A$ appears in the query, the counter $p_b$ for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the alphabet set:

$$A7 = \frac{p_b}{|A|} \qquad (7)$$

The emergence of new keywords (A8)

From the training sample of legitimate queries, the non-repeating terms (words) $t$ are selected to compose the set of terms of the dictionary $T$. When a word $t \notin T$ appears in the query, the counter $p$ for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the set of terms of the dictionary:

$$A8 = \frac{p}{|T|} \qquad (8)$$

IV. CONCLUSION

To test the operation of the machine learning methods, a data set drawn from several system protection tools will be used, such as log files of the intrusion detection and prevention system, HTTP requests (GET and POST methods) of the web application firewall, etc.

Fig. 4. An example of the complete dangerous HTTP request with the POST method

When analyzing a full HTTP request, the author focuses on the data in the red frame (Fig. 4). After the extraction process, the data are saved in the appropriate files (good_request.txt and bad_request.txt). The structure of these files is shown in Fig. 5.

Fig. 5. File of dangerous HTTP request

A preliminary study gave an estimated accuracy of 96% for detecting attacks on Web applications on the data set [15], using the introduced query attributes, the vector representation of queries, and a classifier based on decision trees. This allows us to conclude that it is possible to build an algorithm for detecting computer attacks on Web applications based on the proposed model for presenting requests to Web resources, which relies on the vector space model and is distinguished by its use of attributes of requests via HTTP.

REFERENCES

[1]. Kaspersky Lab. Security report. - 2019. - http://www.securelist.com/en/analysis/204792244/The-geography-of-cybercrime-Western-Europe-and-North-America (accessed: 15.04.2019).
[2]. A survey of intrusion detection techniques in cloud / C. Modi [et al.] // Journal of Network and Computer Applications. - Vol. 36, no. 1. - P. 42-57, 2013.
[3]. Khamphakdee N., Benjamas N., Saiyod S. Improving intrusion detection system based on snort rules for network probe attack detection // Information and Communication Technology (ICoICT), 2014 2nd International Conference On. - IEEE. - P. 69-74, 2014.
[4]. A stateful intrusion detection system for world-wide web servers / G. Vigna [et al.] // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE. - P. 34-43, 2003.
[5]. Sekar R. An Efficient Black-box Technique for Defeating Web Application Attacks // NDSS. - 2009.
[6]. Mutz D., Vigna G., Kemmerer R. An experience developing an IDS stimulator for the blackbox testing of network intrusion detection systems // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE. - P. 374-383, 2003.
[7]. Li X., Xue Y. BLOCK: a black-box approach for detection of state violation attacks towards web applications // Proceedings of the 27th Annual Computer Security Applications Conference. - ACM. - P. 247-256, 2011.
[8]. Saxena P., Sekar R., Puranik V. Efficient fine-grained binary instrumentation with applications to taint-tracking // Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. - ACM. - P. 74-83, 2008.
[9]. Браницкий А. А., Котенко И. В. Анализ и классификация методов обнаружения сетевых атак // Труды СПИИРАН. - Т. 2, № 45. - С. 207-244, 2016.
[10]. Heckerman D. A tutorial on learning with Bayesian networks // Innovations in Bayesian networks. - Springer. - P. 33-82, 2008.
[11]. Friedman N., Geiger D., Goldszmidt M. Bayesian network classifiers // Machine learning. - Vol. 29, no. 2-3. - P. 131-163, 1997.
[12]. Goldszmidt M. Bayesian network classifiers // Wiley Encyclopedia of Operations Research and Management Science. - 2010.
[13]. Barbara D., Wu N., Jajodia S. Detecting novel network intrusions using bayes estimators // Proceedings of the 2001 SIAM International Conference on Data Mining. - SIAM. - P. 1-17, 2001.
[14]. Нейросетевая технология обнаружения сетевых атак на информационные ресурсы / Ю. Г. Емельянова [и др.] // Программные системы: теория и приложения. - Т. 2, № 3. - С. 3-15, 2011.
[15]. A Detailed Analysis of the KDD CUP 99 Data Set / M. Tavallaee [et al.] // Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications. - Ottawa, Ontario, Canada: IEEE Press. - P. 53-58. - (CISDA'09). - URL: 1736481.1736489, 2009.
[16]. Васильев В.И., Шарабыров И.В. Интеллектуальная система обнаружения атак в локальных беспроводных сетях // Вестник Уфимского государственного авиационного технического университета. - 2015. - Т. 19, 4 (70).
[17]. Su M.-Y. Real-time anomaly detection systems for Denial-of-Service attacks by weighted k-nearest-neighbor classifiers // Expert Systems with Applications. - Vol. 38, no. 4. - P. 3492-3498, 2011.
[18]. Lee C. H., Chung J. W., Shin S. W. Network intrusion detection through genetic feature selection // Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2006. SNPD 2006. Seventh ACIS International Conference on. - IEEE. - P. 109-114, 2006.
[19]. Intrusion detection with genetic algorithms and fuzzy logic / E. Ireland [et al.] // UMM CSci senior seminar conference. - P. 1-6, 2013.
[20]. Kruegel C., Toth T. Using decision trees to improve signature-based intrusion detection // Recent Advances in Intrusion Detection. - Springer. - P. 173-191, 2003.
[21]. Bouzida Y., Cuppens F. Neural networks vs.

ABOUT THE AUTHORS

Manh Thang Nguyen
Workplace: Information Technology Faculty, Academy of cryptography techniques.
Email: chieumatxcova@gmail.com
Education:
2005-2007: Student at the Military Technical Academy.
2007-2013: Student at the Applied Mathematics and Informatics Faculty, Lipetsk State Pedagogical University, Russian Federation.
2017-present: Post-graduate student at the Military Academy of the Federal Guard Service of the Russian Federation.
Current research: Computer networks, network security, machine learning and data mining.

D.S. Alexander Kozachok
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: alex.totrin@gmail.com
Education: received a PhD degree in Engineering Sciences from the Academy of Federal Guard Service of the Russian Federation in Dec. 2012.
Current research: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer.
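The "Extracting and analyzing data" step of Section III can be illustrated with a small parser. This is a minimal sketch rather than the authors' implementation: the function name `parse_http_request`, the layout of the returned dictionary and the sample request are assumptions made for illustration, and only Python's standard library is used.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_http_request(raw):
    """Split a raw HTTP/1.x request into fields for later analysis.

    Hypothetical helper: the keys of the returned dict are illustrative
    and are not taken from the paper.
    """
    # Headers and body are separated by an empty line (CRLF CRLF).
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: METHOD SP request-target SP HTTP-version
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    parts = urlsplit(target)
    return {
        "method": method,
        "path": parts.path,
        # Keep empty parameter values, since they may matter for detection.
        "query": parse_qsl(parts.query, keep_blank_values=True),
        "version": version,
        "headers": headers,
        "body": body,
    }


# Assumed sample GET request, not the one shown in the paper's figure.
sample = (
    "GET /index.php?id=1&name=bob HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: sketch\r\n"
    "\r\n"
)
fields = parse_http_request(sample)
```

The query parameters and header values extracted this way are the text fields that the later steps would convert into vectors and attributes.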
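The TF-IDF weighting of expressions (2) to (4) can be sketched directly from its definition. The tokenizer, the function names and the three sample queries below are assumptions for illustration; the paper does not specify how queries are split into words.

```python
import math
import re
from collections import Counter


def tokenize(query):
    # Assumed tokenizer: lower-cased runs of letters, digits and underscores.
    return re.findall(r"[a-z0-9_]+", query.lower())


def tf(term, doc_tokens):
    # Expression (3): count of the term over the count of all words in the query.
    counts = Counter(doc_tokens)
    return counts[term] / sum(counts.values())


def idf(term, corpus_tokens):
    # Expression (4): log of |D| over the number of queries containing the term.
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)


def tfidf_vector(doc_tokens, corpus_tokens, dictionary):
    # Expression (2): one weight per term t of the dictionary T.
    return [tf(t, doc_tokens) * idf(t, corpus_tokens) for t in dictionary]


# Hypothetical collection D of three queries.
corpus = ["id=1&name=bob", "id=2&name=alice", "id=3 union select"]
corpus_tokens = [tokenize(q) for q in corpus]
# The dictionary T is built from the collection, so every term occurs
# in at least one query and idf never divides by zero.
dictionary = sorted({t for doc in corpus_tokens for t in doc})
vector = tfidf_vector(corpus_tokens[2], corpus_tokens, dictionary)
```

A term such as `id` that occurs in every query receives weight 0, while terms that occur only in the third query, such as `union`, receive a positive weight.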
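The length attribute A6 of expressions (5) and (6) can be sketched as follows. The paper does not state how the threshold is chosen, so this sketch assumes that it is set to the observed deviation of the request length from the mean and that the Chebyshev bound, capped at 1, is used as the attribute value. The function names and sample lengths are hypothetical.

```python
from statistics import mean, pvariance


def fit_length_model(legit_lengths):
    # Expected value (mu) and variance (sigma^2) of legitimate request lengths.
    return mean(legit_lengths), pvariance(legit_lengths)


def length_attribute(length, mu, sigma2):
    # Assumption: the threshold tau is the observed deviation |length - mu|.
    tau = abs(length - mu)
    if tau == 0:
        # No deviation from the mean: the bound is uninformative, use 1.
        return 1.0
    # Chebyshev bound sigma^2 / tau^2 from expression (5), capped at 1
    # because it stands in for a probability.
    return min(1.0, sigma2 / tau ** 2)


# Hypothetical training lengths of legitimate requests.
mu, sigma2 = fit_length_model([10, 12, 11, 9, 13])
typical = length_attribute(12, mu, sigma2)
suspicious = length_attribute(31, mu, sigma2)
```

Values close to 1 indicate a length consistent with the training data, and values close to 0 indicate a length far from what was observed for legitimate requests.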
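Attributes A7 and A8 from expressions (7) and (8) are simple ratios and can be sketched in a few lines. The tokenizer, the function names and the sample requests are assumptions, and each occurrence of an unseen symbol or term is counted, which is one reading of "the counter is increased by one".

```python
import re


def tokenize(request):
    # Assumed tokenizer: lower-cased runs of letters, digits and underscores.
    return re.findall(r"[a-z0-9_]+", request.lower())


def build_alphabet(legit_requests):
    # Alphabet A: the distinct characters seen in legitimate requests.
    return set("".join(legit_requests))


def build_dictionary(legit_requests):
    # Dictionary T: the distinct terms seen in legitimate requests.
    return {t for r in legit_requests for t in tokenize(r)}


def new_symbol_attribute(request, alphabet):
    # Expression (7): occurrences of symbols outside A, divided by |A|.
    p_b = sum(1 for ch in request if ch not in alphabet)
    return p_b / len(alphabet)


def new_term_attribute(request, dictionary):
    # Expression (8): occurrences of terms outside T, divided by |T|.
    p = sum(1 for t in tokenize(request) if t not in dictionary)
    return p / len(dictionary)


# Hypothetical training sample and incoming request.
legit = ["id=1&name=bob", "id=2&name=alice"]
alphabet = build_alphabet(legit)
dictionary = build_dictionary(legit)
incoming = "id=1' or '1'='1"
a7 = new_symbol_attribute(incoming, alphabet)
a8 = new_term_attribute(incoming, dictionary)
```

In this example the quotes, the spaces and the letter `r` are new symbols, and `or` is the only new term, so both attributes are greater than zero for the injected request.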