Classification of sequences generated by compression and encryption algorithms
... and sequences generated by compression and
encryption algorithms. The evaluation results lead
to the conclusion that the proposed feature space
can be used to identify the ZIP and RAR
compression algorithms and the AES and 3DES
encryption algorithms with an accuracy greater
than 95%.
Keywords—Identification of compression and
encryption algorithms, statistical testing of
information.
I. INTRODUCTION
According to reports from the Infowatch
Analytics Center [1-3], the number of
confidential data leaks is growing from year to
year. Fig.1 provides statistics on the leaks that
occurred in 2011-2018. The total damage from
leaks in 2013 amounted to more than US$7.5
billion. Equifax, the international credit
reference bureau, has already spent US$1.4
billion to remedy a single information leak that
occurred in 2017, and the amount continues
to grow because of repeated claims. The share of
This manuscript was received on April 22, 2019. It was commented
on July 30, 2019 and accepted on August 6, 2019 by the
first reviewer. It was commented on August 20, 2019 and
accepted on August 27, 2019 by the second reviewer.
Journal of Science and Technology on Information Security, No 2.CS (10) 2019

... of 2019, 80% of the traffic will be the same. Encryption can also be used to hide interactions with malware command servers and solve other problems. According to the Ponemon Institute report for 2016, almost half (41%) of attackers use encryption to bypass mechanisms for detecting their unauthorized activity. Security tools cannot inspect encrypted traffic (according to the Ponemon Institute, 64% of companies cannot detect malicious code in encrypted traffic) [10].

Existing security solutions are able to analyze the content of encrypted traffic, for example, by means of a man-in-the-middle attack, but because of the high cost of implementing such methods they are practically inapplicable in real conditions [10]. However, there is at least one way to analyze encrypted traffic without decrypting it: Cisco ETA (Encrypted Traffic Analytics), which uses network telemetry received from network equipment together with machine learning algorithms to classify encrypted traffic and to separate clean traffic from malicious traffic. The analysis uses not the data field of an encrypted packet but its header, from which the extended telemetry is obtained [10]:

1. from Netflow: addresses and ports of the source and destination (SrcIP, DstIP, SrcPort, DstPort), information on the protocol, the number of transferred packets and bytes;

... classify files the authors used the NIST statistical test suite, indicating encrypted data with an accuracy of more than 0.95.

To prevent information leakage, it is necessary to block the transmission of encrypted data, which determines the relevance of the problem of classifying encrypted, compressed and pseudo-random sequences (PRS).

III. PROBLEM STATEMENT

In general, the research problem is formulated as follows: it is necessary to map the original set of bit sequences X to the set of classes Y using the selected feature space.

To estimate the accuracy of the classification, the accuracy measure calculated by Formula (1) was used:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$

where TP is the number of objects correctly assigned to class i, TN is the number of objects correctly assigned to class j, FP is the number of false positives (type I errors), and FN is the number of false negatives (type II errors).

Formally, the classification task can be defined by the following expression:

$$F : X \to Y, \tag{2}$$

where X is the initial set of bit sequences to be classified and Y is the set of classes.

The classification accuracy must satisfy the condition presented in expression (3):

$$p(F(x_i) = y_j \mid i = j) \geq 0.95, \tag{3}$$

where p is the probability, F(x) is the function mapping the i-th file of the set X to a class, y is the class label from the set Y, and i, j are the indices in the sets X and Y, respectively.

To develop and evaluate the method for classifying encrypted, compressed and pseudo-random sequences, the following particular tasks were set:

1. To check the possibility of binary classification of sequences generated by encryption and compression algorithms.
2. To check the possibility of multi-class classification of sequences generated by encryption, compression, and PRS algorithms.

IV. TESTING THE POSSIBILITY OF A BINARY CLASSIFICATION OF SEQUENCES GENERATED BY COMPRESSION AND ENCRYPTION ALGORITHMS

4.1 Binary classification for features generated on the basis of NIST test results

In the course of the research it was assumed that the results of NIST tests could serve as a feature space for constructing a classifier that distinguishes pseudo-random sequences by source type.

To test whether the classification problem could be solved, the NIST SP 800-22rev1a statistical test suite for evaluating random and pseudo-random sequence generators in cryptography was used [17].

For the experiment, an initial sample of 1000 files, 600 KB each, containing text in Russian was converted by encryption and compression algorithms. The resulting sample of 4000 files was divided into 2 classes:

1. Encrypted by the AES and 3DES algorithms.
2. Compressed by the RAR and ZIP algorithms.

Next, the files from the sample were processed with the package of statistical tests; as a result, 188 features were obtained.

To assess the applicability of the selected feature space, cross-validation (with the number of subsamples chosen equal to 10) of machine learning algorithms with default parameters was used [18-23]: Decision Tree Classifier (DTC) and Random Forest Classifier (RFC).

The results of the experiments are presented in Table 1.

TABLE 1. THE RESULTS OF CROSS-VALIDATION OF MACHINE LEARNING ALGORITHMS IN THE NIST TEST FEATURE SPACE ON A SAMPLE OF COMPRESSED AND ENCRYPTED SEQUENCES

Algorithm        Accuracy value
Decision Tree    0.57508
Random Forest    0.642579

From the analysis of the results presented in Table 1, it is reasonable to conclude that the RFC has the greater accuracy, but the resulting accuracy of 0.642 in detecting sequence types does not allow constructing a classifier that meets the requirement presented in expression (3). Thus, it was concluded that the feature space based on the NIST test results does not allow solving the binary classification problem under the selected restrictions.

4.2 Binary classification on the feature space generated on the basis of the results of analyzing N length subsequence frequency

It was then assumed that the frequencies of independent bit subsequences of different lengths N (in bits), counted without taking into account the complete overlap of each subsequence, could be used as a feature space for the binary classification of bit sequences. For example, for the sequence S = "00011011" and N = 2 bits, the frequencies of occurrence of N-bit subsequences are presented in Table 2.

TABLE 2. EXAMPLE OF COUNTING THE FREQUENCIES OF BIT SUBSEQUENCES

Subsequence    Amount    Frequency
00             1         0.142857143
01             2         0.285714286
10             1         0.142857143
11             2         0.285714286

In the authors' works it was shown that for binary sequences it is enough to analyze half of all possible subsequences [24,25]. Using this assumption, the dimension of the feature space for N-bit subsequences was halved; the number of possible features under this assumption is presented in Table 3.

TABLE 3. THE DIMENSION OF THE FEATURE SPACE FOR N BIT LENGTH SUBSEQUENCES

Length of subsequence (N), bit    Amount of features on the assumption    Amount of features, total    Cross-validation time, min
4                                 8                                       16                           6
5                                 16                                      32                           11
6                                 32                                      64                           15
7                                 64                                      128                          22
8                                 128                                     256                          37
9                                 256                                     512                          69

In the experiments assessing the possibility of constructing binary classifiers, the RFC showed the highest accuracy; the results are presented in Table 4.

An average accuracy of more than 0.95 is achieved at N >= 8 bits.

The dependence of the average accuracy of classifying files with the RFC on the length of the subsequence, together with the time spent on extracting features from the training sample, is shown in Fig.2. A reasonable trade-off between accuracy and time is reached at a subsequence length of N = 9 bits: the accuracy of sequence classification increases significantly while the time spent on extracting features remains acceptable.

TABLE 4. ACCURACY OF DISTINGUISHING FILE TYPES WHEN USING RFC

Algorithm accuracy for length sequence N
File type                                     N=4     N=5     N=6     N=7     N=8     N=9     N=10    N=11
Time taken to retrieve features in minutes    6       11      15      22      37      69      135     268
AES/7-Z                                       0.821   0.843   0.834   0.835   0.938   0.993   0.998   1.000
AES/RAR                                       0.986   0.992   0.994   0.992   0.991   0.997   0.993   0.993
AES/ZIP                                       0.986   0.992   0.988   0.990   0.993   0.998   0.999   0.999
3DES/7-Z                                      0.834   0.846   0.865   0.865   0.947   0.993   0.998   1.000
3DES/RAR                                      0.984   0.990   0.991   0.993   0.991   0.996   0.994   0.994
3DES/ZIP                                      0.991   0.988   0.991   0.989   0.991   0.997   0.998   0.999
7-Z/ZIP                                       0.968   0.974   0.974   0.973   0.971   0.972   0.976   0.978
7-Z/RAR                                       0.948   0.960   0.964   0.963   0.964   0.983   0.996   0.999
RAR/ZIP                                       0.840   0.864   0.864   0.863   0.868   0.977   0.999   1.000
Mean value                                    0.929   0.939   0.941   0.940   0.962   0.990   0.995   0.996

Fig.2. Dependence of the file type classification accuracy at using the RFC on the subsequence length

The AES/3DES pair was removed from the tested sample. Fig.3 shows the dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the length of the subsequence N.

Because ciphers dissipate the statistics of the source data, the accuracy of the binary classification of sequences encrypted with the AES and 3DES algorithms is on average 0.512, which makes it impossible to construct a classifier for sequences of this type.

Fig.3. Dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the length of the subsequence

In the course of the research, the possibility of applying the results of NIST statistical tests as a feature space for classifying binary sequences generated by encryption and compression algorithms was verified. The accuracy of the RFC in distinguishing the sequences was 0.64, which does not satisfy the requirement specified in expression (3).

To solve the problem of binary sequence classification, a new feature space was proposed, formed by counting the frequencies of bit subsequences of different lengths. The classification accuracy of the RFC at a subsequence length of N = 9 bits is 0.99, which satisfies the requirement specified in expression (3).

REFERENCES

[1]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/reports.4.html (accessed: 30.05.2019).
[2]. INFOWATCH company group site. URL: https://www.infowatch.ru/sites/default/files/report/analytics/russ/infowatch_otchet_032014_smb_fin.pdf (accessed: 30.05.2019).
[3]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/leaks_monitoring/15678 (accessed: 30.05.2019).
[4]. X. Huang, Y. Lu, D. Li, M. Ma. A novel mechanism for fast detection of transformed data leakage // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 6, 2018. pp. 35926-35936.
[5]. Y. Miao, Z. Ruan, L. Pan, Y. Wang, J. Zhang, Y. Xiang. Automated Big Traffic Analytics for Cyber Security // Eprint arXiv:1804.09023, bibcode: 2018arXiv180409023M. 2018.
[6]. S. Miller, K. Curran, T. Lunney. Multilayer Perceptron Neural Network for Detection of Encrypted VPN Network Traffic // International Conference on Cyber Situational Awareness, Data Analytics and Assessment. 2018. ISBN: 978-1-5386-4565-9.
[7]. P. Wang, X. Chen, F. Ye, Z. Sun. A Survey of Techniques for Mobile Service Encrypted Traffic Classification Using Deep Learning // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 7, 2019. pp. 54024-54033. doi: 10.1109/ACCESS.2019.2912896
[8]. K. Demertzis, N. Tziritas, P. Kikiras, S.L. Sanchez, L. Iliadis. The Next Generation Cognitive Security Operations Center: Adaptive Analytic Lambda Architecture for Efficient Defense against Adversarial Attacks // Big Data and Cognitive Computing, 2019 3(6).
[9]. H. Zhang, C. Papadopoulos, D. Massey. Detecting encrypted botnet traffic // 16th IEEE Global Internet Symposium. 2013. p. 3453.
[10]. T. Radivilova, L. Kirichenko, D. Ageyev, M. Tawalbeh, V. Bulakh. Decrypting SSL/TLS Traffic for Hidden Threats Detection // IEEE 9th International Conference on Dependable Systems, Services and Technologies (DESSERT), 2018. ISBN: 978-1-5386-5903-8.
[11]. M. Piccinelli, P. Gubian. Detecting hidden encrypted volume files via statistical analysis // International Journal of Cyber-Security and Digital Forensics. Vol. 3(1). 2013. pp. 30-37.
[12]. NIST STS manual. URL: https://csrc.nist.gov/Projects/Random-Bit-Generation/ (accessed: 14.01.2019).
[13]. Toolkit for the transport layer security and secure sockets layer protocols. URL: (accessed: 14.01.2019).
[14]. Archive manager WinRAR. URL: (accessed: 14.01.2019).
[15]. Pedregosa F., et al. Scikit-learn: Machine Learning in Python // Journal of Machine Learning Research 12. 2011. pp. 2825-2830.
[16]. Breiman L., Friedman J., Olshen R., Stone C. Classification and Regression Trees // Wadsworth, Belmont, CA. 1984. 368 p. ISBN: 9781351460491.
[17]. Hastie T., Tibshirani R., Friedman J. Elements of Statistical Learning // Springer. 2009. pp. 587-601. ISBN: 978-0387848570.
[18]. L. Breiman, A. Cutler. Random Forests // URL: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm (accessed: 14.01.2019).
[19]. S. Raska. Python and machine learning // M.: DMK-Press. 2017. 418 p. ISBN: 978-5-97060-409-0.
[20]. L. Breiman. Random Forests // Journal Machine Learning 45(1). 2001. pp. 5-32.
[21]. M.Yu. Konyshev. Formation of probability distributions of binary vectors of the source of errors of a Markov discrete communication channel with memory using the method of "grouping probabilities" of error vectors. / M.Yu. Konyshev, A.Yu. Barabashov, K.E. Petrov, A.A. Dvilyansky // Industrial ACS and controllers. 2018. No. 3. P. 42-52.
[22]. M.Yu. Konyshev. A compression algorithm for a series of distributions of binary multidimensional random variables. / M.Yu. Konyshev, A.A. Dvilyansky, K.E. Petrov, G.A. Ermishin // Industrial ACS and controllers. 2016. No. 8. P. 47-50.

ABOUT THE AUTHORS

D.S. Alexander Kozachok
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: alex.totrin@gmail.com
The education process: received PhD. degree in Engineering Sciences in Academy of Federal Guard Service of the Russian Federation in Dec. 2012.
Research today: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer.

Spirin Andrey Andreevich
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: spirin_aa@bk.ru
The education process: graduated from the Academy of the Federal Guard Service of the Russian Federation in 2010.
Research today: Information security, information leakage prevention systems, statistical testing.
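As a small illustration of the accuracy measure of Formula (1) and the 0.95 requirement of expression (3) in Section III, the following Python sketch may help. The function names `accuracy` and `meets_requirement` and the confusion-matrix counts in the usage lines are hypothetical and chosen only for the example; they are not taken from the paper's experiments.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy as in Formula (1): (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def meets_requirement(acc: float, threshold: float = 0.95) -> bool:
    """Check an accuracy value against the threshold used in expression (3)."""
    return acc >= threshold


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    acc = accuracy(tp=50, tn=45, fp=3, fn=2)
    print(acc, meets_requirement(acc))
    # The Random Forest accuracy reported for the NIST feature space.
    print(meets_requirement(0.642579))
```

With the reported NIST-feature accuracy of 0.642579 the check fails, which is the reason the paper moves on to a different feature space.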
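The subsequence-frequency features of Section 4.2 can be sketched as follows. The paper does not spell out the counting rule, so this is only one possible reading, stated as an assumption: a window of N bits slides over the sequence one bit at a time, a window identical to the immediately preceding window is not counted (the "complete overlap" case), and the frequency is the count divided by the total number of windows. The function name `subsequence_frequencies` is hypothetical. Under this reading the sketch reproduces the example values given for S = "00011011" and N = 2.

```python
from collections import Counter


def subsequence_frequencies(bits: str, n: int) -> dict:
    """Return {subsequence: (count, frequency)} for n-bit windows of `bits`.

    Assumed rule (not confirmed by the source): a window equal to the
    immediately preceding window is skipped; the denominator is the total
    number of sliding windows.
    """
    if n <= 0 or n > len(bits):
        raise ValueError("n must be between 1 and len(bits)")
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")

    windows = [bits[i:i + n] for i in range(len(bits) - n + 1)]
    counts = Counter()
    previous = None
    for window in windows:
        if window != previous:  # skip a complete repeat of the previous window
            counts[window] += 1
        previous = window

    total = len(windows)
    return {w: (c, c / total) for w, c in sorted(counts.items())}


if __name__ == "__main__":
    for sub, (count, freq) in subsequence_frequencies("00011011", 2).items():
        print(sub, count, f"{freq:.9f}")
```

For the example sequence there are seven windows, and the printed counts and frequencies match the example table (1, 2, 1, 2 and 1/7 or 2/7). A plain sliding-window count without the skip rule would give a count of 2 for "00" instead.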
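The choice of subsequence length discussed in Section 4.2 can be checked against the reported numbers with a short sketch. The dictionaries below copy the "Mean value" row and the feature-retrieval times from the RFC accuracy table; the names `MEAN_ACCURACY`, `FEATURE_TIME_MIN` and `smallest_length_above` are hypothetical helpers introduced only for this illustration.

```python
from typing import Optional

# Mean RFC accuracy per subsequence length N, copied from the "Mean value" row.
MEAN_ACCURACY = {4: 0.929, 5: 0.939, 6: 0.941, 7: 0.940,
                 8: 0.962, 9: 0.990, 10: 0.995, 11: 0.996}

# Time taken to retrieve features, in minutes, per subsequence length N.
FEATURE_TIME_MIN = {4: 6, 5: 11, 6: 15, 7: 22,
                    8: 37, 9: 69, 10: 135, 11: 268}


def smallest_length_above(threshold: float, mean_accuracy: dict) -> Optional[int]:
    """Smallest N whose mean accuracy exceeds `threshold`, or None if there is none."""
    candidates = [n for n, acc in mean_accuracy.items() if acc > threshold]
    return min(candidates) if candidates else None


if __name__ == "__main__":
    n = smallest_length_above(0.95, MEAN_ACCURACY)
    print(n, FEATURE_TIME_MIN[n])
```

The smallest length whose mean accuracy exceeds 0.95 is N = 8, in line with the statement in the text; the authors' preference for N = 9 reflects the additional accuracy gained relative to the extra extraction time, which this sketch does not model.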