Classification of sequences generated by compression and encryption algorithms

...and sequences generated by compression and encryption algorithms. The evaluation results lead to the conclusion that the proposed feature space can be used to identify the ZIP and RAR compression algorithms and the AES and 3DES encryption algorithms with an accuracy greater than 95%.

Keywords: Identification of compression and encryption algorithms, statistical testing of information.

I. INTRODUCTION

According to reports from the Infowatch Analytics Center [1-3], the number of confidential data leaks is growing from year to year. Fig.1 provides statistics on the leaks that occurred in 2011-2018. The total damage from leaks in 2013 amounted to more than US$7.5 billion. Equifax, the international credit reference bureau, has already spent US$1.4 billion to eliminate one information leakage that occurred in 2017, and the amount has continued to grow due to repeated claims. The share of

This manuscript was received on April 22, 2019. It was commented on July 30, 2019 and accepted on August 6, 2019 by the first reviewer. It was commented on August 20, 2019 and accepted on August 27, 2019 by the second reviewer.

 

...of 2019, 80% of the traffic will be the same. Encryption can also be used to hide interactions with malware command servers and to solve other problems. According to the Ponemon Institute report for 2016, almost half (41%) of attackers use encryption to bypass the mechanisms that detect their unauthorized activity. Security tools cannot inspect encrypted traffic (according to the Ponemon Institute, 64% of companies cannot detect malicious code in encrypted traffic) [10].

Existing security solutions are able to analyze the content of encrypted traffic, for example by means of a man-in-the-middle attack, but because of the high cost of implementing such methods they are practically inapplicable in real conditions [10]. However, there is at least one way to analyze the content of encrypted traffic without decrypting it: Cisco ETA (Encrypted Traffic Analytics), which uses network telemetry received from the network equipment together with machine learning algorithms to classify encrypted traffic and, at the same time, to separate legitimate traffic from malicious traffic. The analysis uses not the data field of the encrypted packet but its header, from which the extended telemetry is obtained [10]:

1. from Netflow: addresses and ports of the source and destination (SrcIP, DstIP, SrcPort, DstPort), information on the protocol, the number of transferred packets and bytes;

...classify files, the authors used the NIST statistical test suite, which identified encrypted data with an accuracy of more than 0.95.

In order to prevent information leakage, it is necessary to block the transmission of encrypted data, which makes the problem of classifying encrypted, compressed and pseudo-random sequences (PRS) relevant.

III. PROBLEM STATEMENT

In general, the research problem is formulated as follows: it is necessary to map the original set of bit sequences X to the new set of classes Y by using the selected feature space.

To estimate the accuracy of the classification, the accuracy characteristic calculated by formula (1) was used:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$

where TP is the number of objects correctly assigned to class i, TN is the number of objects correctly assigned to class j, FP is the number of false positives (type I error), and FN is the number of false negatives (type II error).

In a formalized form, the classification task can be defined by the following expression:

$$F : X \rightarrow Y, \tag{2}$$

where X is the initial set of bit sequences to be classified and Y is the set of classes.
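As a small illustration of formula (1), the sketch below computes accuracy from the four confusion counts. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the paper's experiments.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy as in formula (1): correct decisions over all decisions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one classified object is required")
    return (tp + tn) / total


# Hypothetical confusion counts for a two-class problem (not from the paper).
tp, tn, fp, fn = 480, 470, 30, 20
acc = accuracy(tp, tn, fp, fn)
print(f"accuracy = {acc:.3f}")  # (480 + 470) / 1000
```

With these illustrative counts, 950 of 1000 objects are classified correctly.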
Journal of Science and Technology on Information Security, No 2.CS (10) 2019
The accuracy of the classification must satisfy the condition given in expression (3):

$$p(F(x_i) = y_j \mid i = j) \geq 0.95, \tag{3}$$

where p is the probability, F(x) is the function mapping the i-th file of the set X, y is the class label from the set Y, and i, j are the indices in the sets X and Y, respectively.

To develop and evaluate the method for classifying encrypted, compressed and pseudo-random sequences, the following particular tasks were set:

1. To check the possibility of binary classification of sequences generated by encryption and compression algorithms.
2. To check the possibility of multi-class classification of sequences generated by encryption, compression, and PRS algorithms.

IV. TESTING THE POSSIBILITY OF A BINARY CLASSIFICATION OF SEQUENCES GENERATED BY COMPRESSION AND ENCRYPTION ALGORITHMS

4.1 Binary classification for features generated on the basis of NIST test results

In the course of the research it was assumed that the results of the NIST tests could serve as a feature space for constructing a classifier that distinguishes pseudo-random sequences by source type.

To test whether the classification problem can be solved this way, the NIST SP 800-22rev1a statistical test suite, intended for evaluating random and pseudo-random sequence generators in cryptography, was used [17].

To conduct the experiment, an initial sample of 1000 files of 600 KB each, containing text in Russian, was transformed by encryption and compression algorithms. The resulting sample of 4000 files was divided into 2 classes:

1. Encrypted by the AES and 3DES algorithms.
2. Compressed by the RAR and ZIP algorithms.

Next, the files from the sample were processed with the package of statistical tests; as a result, 188 features were obtained.

To assess the applicability of the selected feature space, cross-validation (with the number of subsamples set to 10) of machine learning algorithms with default parameters was used [18-23]: Decision Tree Classifier (DTC) and Random Forest Classifier (RFC).

The results of the experiments are presented in Table 1.

TABLE 1. THE RESULTS OF CROSS-VALIDATION OF MACHINE LEARNING ALGORITHMS IN THE NIST TEST FEATURE SPACE ON A SAMPLE OF COMPRESSED AND ENCRYPTED SEQUENCES

Algorithm        Accuracy value
Decision Tree    0.57508
Random Forest    0.642579

From the analysis of the results presented in Table 1, it is reasonable to conclude that the RFC has the greater accuracy, but the resulting accuracy of 0.642 in detecting sequence types does not allow constructing a classifier that meets the requirement of expression (3). Thus, it was concluded that the feature space based on the NIST test results does not allow solving the binary classification problem under the selected restrictions.

4.2 Binary classification on the feature space generated on the basis of the results of analyzing N length subsequence frequency

It was then assumed that the feature space for binary classification of bit sequences could be formed from the frequencies of independent bit subsequences of different length N (in bits), without taking into account the complete overlap of each subsequence. For example, for the sequence S = "00011011" and N = 2 bits, the frequencies of occurrence of the N-bit subsequences are presented in Table 2.

TABLE 2. EXAMPLE OF COUNTING THE FREQUENCIES OF BIT SUBSEQUENCES

Subsequence    Amount    Frequency
00             1         0.142857143
01             2         0.285714286
10             1         0.142857143
11             2         0.285714286

In the authors' earlier work it was shown that for binary sequences it is enough to analyze half of all possible subsequences [24,25]. Using this assumption, the dimension of the feature space for N-bit subsequences was halved; the resulting number of features is presented in Table 3.

TABLE 3. THE DIMENSION OF THE FEATURE SPACE FOR N BIT LENGTH SUBSEQUENCES

Length of subsequence (N), bit    Number of features under the assumption    Total number of features    Cross-validation time, min
4                                 8                                          16                          6
5                                 16                                         32                          11
6                                 32                                         64                          15
7                                 64                                         128                         22
8                                 128                                        256                         37
9                                 256                                        512                         69

In the experiments assessing the possibility of constructing binary classifiers, the RFC showed the highest accuracy; the results are presented in Table 4.

An average accuracy of more than 0.95 is achieved at N ≥ 8 bits.

The dependence of the average accuracy of classifying files with the RFC on the length of the subsequence, together with the time spent on extracting features from the training sample, is shown in Fig.2. An acceptable trade-off between accuracy and time is reached at a subsequence length of N = 9 bits: in this case there is a significant increase in classification accuracy while the time spent on extracting features remains acceptable.

TABLE 4. ACCURACY OF DISTINGUISHING FILE TYPES WHEN USING RFC

RFC accuracy by subsequence length
File type pair                  N=4     N=5     N=6     N=7     N=8     N=9     N=10    N=11
Feature extraction time, min    6       11      15      22      37      69      135     268
AES/7-Z                         0.821   0.843   0.834   0.835   0.938   0.993   0.998   1.000
AES/RAR                         0.986   0.992   0.994   0.992   0.991   0.997   0.993   0.993
AES/ZIP                         0.986   0.992   0.988   0.990   0.993   0.998   0.999   0.999
3DES/7-Z                        0.834   0.846   0.865   0.865   0.947   0.993   0.998   1.000
3DES/RAR                        0.984   0.990   0.991   0.993   0.991   0.996   0.994   0.994
3DES/ZIP                        0.991   0.988   0.991   0.989   0.991   0.997   0.998   0.999
7-Z/ZIP                         0.968   0.974   0.974   0.973   0.971   0.972   0.976   0.978
7-Z/RAR                         0.948   0.960   0.964   0.963   0.964   0.983   0.996   0.999
RAR/ZIP                         0.840   0.864   0.864   0.863   0.868   0.977   0.999   1.000
Mean value                      0.929   0.939   0.941   0.940   0.962   0.990   0.995   0.996

Fig.2. Dependence of the file type classification accuracy when using the RFC on the subsequence length
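As a quick consistency check on the statement that the average accuracy exceeds 0.95 from N = 8 bits onward, the sketch below recomputes the column means for N = 7, 8 and 9 from the RFC accuracy table above and compares them with the 0.95 threshold of expression (3). The variable names are illustrative, and the per-pair check on the minimum value is an extra step added here, not one described in the paper.

```python
from statistics import mean

THRESHOLD = 0.95  # required accuracy from expression (3)

# RFC accuracy per file-type pair, copied from the table above
# (order: AES/7-Z, AES/RAR, AES/ZIP, 3DES/7-Z, 3DES/RAR, 3DES/ZIP,
#  7-Z/ZIP, 7-Z/RAR, RAR/ZIP).
accuracy_by_n = {
    7: [0.835, 0.992, 0.990, 0.865, 0.993, 0.989, 0.973, 0.963, 0.863],
    8: [0.938, 0.991, 0.993, 0.947, 0.991, 0.991, 0.971, 0.964, 0.868],
    9: [0.993, 0.997, 0.998, 0.993, 0.996, 0.997, 0.972, 0.983, 0.977],
}

# Mean over all pairs, rounded to three decimals as in the table.
mean_by_n = {n: round(mean(v), 3) for n, v in accuracy_by_n.items()}
# Does the mean exceed the threshold?
mean_passes = {n: m > THRESHOLD for n, m in mean_by_n.items()}
# Extra check (not in the paper): does every single pair exceed it?
all_pairs_pass = {n: min(v) > THRESHOLD for n, v in accuracy_by_n.items()}

for n in sorted(accuracy_by_n):
    print(f"N={n}: mean={mean_by_n[n]:.3f}, "
          f"mean above threshold={mean_passes[n]}, "
          f"all pairs above threshold={all_pairs_pass[n]}")
```

The recomputed means match the "Mean value" row. At N = 8 the mean is above the threshold although some individual pairs are not, while at N = 9 every listed pair is above it, which is consistent with the choice of N = 9.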
The AES/3DES pair was removed from the tested sample. Fig.3 shows the dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the length of the subsequence N.

Because ciphers dissipate the statistics of the source data, the accuracy of classifying sequences encrypted with the AES and 3DES algorithms is 0.512 on average, which makes it impossible to construct a classifier for sequences of this type.

In the course of the research, the possibility of using the results of the NIST statistical tests as a feature space for classifying binary sequences generated by encryption and compression algorithms was verified. The accuracy of the RFC in distinguishing the sequences was 0.64, which does not satisfy the requirement specified in expression (3).

To solve the problem of binary classification of bit sequences, a new feature space was proposed, formed by counting the frequencies of bit subsequences of different lengths. The classification accuracy of the RFC at a subsequence length of N = 9 bits is 0.99, which satisfies the requirement specified in expression (3).

Fig.3. Dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the length of the subsequence
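The frequency-based feature space can be illustrated with the paper's own worked example, S = "00011011" and N = 2. The sketch below is one possible reading of that example, not the authors' implementation: it assumes that occurrences of the same pattern that overlap each other are counted once (the behaviour of Python's `str.count`) and that each count is divided by the number of N-bit windows, len(S) - N + 1. Under these assumptions the published counts and frequencies are reproduced. The function name `subsequence_frequencies` is hypothetical.

```python
from itertools import product


def subsequence_frequencies(bits: str, n: int) -> dict[str, float]:
    """Frequency of every n-bit pattern in a bit string.

    Assumption: occurrences of one pattern that overlap each other are
    counted once (str.count semantics), and each count is divided by the
    number of n-bit windows, len(bits) - n + 1.
    """
    if n <= 0 or len(bits) < n:
        raise ValueError("need 0 < n <= len(bits)")
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    windows = len(bits) - n + 1
    # All 2**n patterns in lexicographic order: '00', '01', '10', '11', ...
    patterns = ("".join(p) for p in product("01", repeat=n))
    return {p: bits.count(p) / windows for p in patterns}


freqs = subsequence_frequencies("00011011", 2)
for pattern, freq in freqs.items():
    print(pattern, round(freq, 9))
```

The full vector has 2^N entries. The paper keeps only half of them, and which half is retained is not stated in this text, so the sketch returns all patterns.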
REFERENCES

[1]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/reports.4.html (accessed: 30.05.2019).
[2]. INFOWATCH company group site. URL: https://www.infowatch.ru/sites/default/files/report/analytics/russ/infowatch_otchet_032014_smb_fin.pdf (accessed: 30.05.2019).
[3]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/leaks_monitoring/15678 (accessed: 30.05.2019).
[4]. X. Huang, Y. Lu, D. Li, M. Ma. A novel mechanism for fast detection of transformed data leakage // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 6, 2018. pp. 35926-35936.
[5]. Y. Miao, Z. Ruan, L. Pan, Y. Wang, J. Zhang, Y. Xiang. Automated Big Traffic Analytics for Cyber Security // Eprint arXiv:1804.09023, bibcode: 2018arXiv180409023M. 2018.
[6]. S. Miller, K. Curran, T. Lunney. Multilayer Perceptron Neural Network for Detection of Encrypted VPN Network Traffic // International Conference on Cyber Situational Awareness, Data Analytics and Assessment. 2018. ISBN: 978-1-5386-4565-9.
[7]. P. Wang, X. Chen, F. Ye, Z. Sun. A Survey of Techniques for Mobile Service Encrypted Traffic Classification Using Deep Learning // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 7, 2019. pp. 54024-54033. doi: 10.1109/ACCESS.2019.2912896.
[8]. K. Demertzis, N. Tziritas, P. Kikiras, S.L. Sanchez, L. Iliadis. The Next Generation Cognitive Security Operations Center: Adaptive Analytic Lambda Architecture for Efficient Defense against Adversarial Attacks // Big Data and Cognitive Computing, 2019 3(6).
[9]. H. Zhang, C. Papadopoulos, D. Massey. Detecting encrypted botnet traffic // 16th IEEE Global Internet Symposium. 2013. p. 3453.
[10]. T. Radivilova, L. Kirichenko, D. Ageyev, M. Tawalbeh, V. Bulakh. Decrypting SSL/TLS Traffic for Hidden Threats Detection // IEEE 9th International Conference on Dependable Systems, Services and Technologies (DESSERT), 2018. ISBN: 978-1-5386-5903-8.
[11]. M. Piccinelli, P. Gubian. Detecting hidden encrypted volume files via statistical analysis // International Journal of Cyber-Security and Digital Forensics. Vol. 3(1). 2013. pp. 30-37.
[12]. NIST STS manual. URL: https://csrc.nist.gov/Projects/Random-Bit-Generation/ (accessed: 14.01.2019).
[13]. Toolkit for the transport layer security and secure sockets layer protocols. URL:  (accessed: 14.01.2019).
[14]. Archive manager WinRAR. URL:  (accessed: 14.01.2019).
[15]. Pedregosa F., et al. Scikit-learn: Machine Learning in Python // Journal of Machine Learning Research 12. 2011. pp. 2825-2830.
[16]. Breiman L., Friedman J., Olshen R., Stone C. Classification and Regression Trees // Wadsworth, Belmont, CA. 1984. 368 p. ISBN: 9781351460491.
[17]. Hastie T., Tibshirani R., Friedman J. Elements of Statistical Learning // Springer. 2009. pp. 587-601. ISBN: 978-0387848570.
[18]. L. Breiman, A. Cutler. Random Forests // URL: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm (accessed: 14.01.2019).
[19]. S. Raska. Python and machine learning // M.: DMK-Press. 2017. 418 p. ISBN: 978-5-97060-409-0.
[20]. L. Breiman. Random Forests // Journal Machine Learning 45(1). 2001. pp. 5-32.
[21]. M.Yu. Konyshev. Formation of probability distributions of binary vectors of the source of errors of a Markov discrete communication channel with memory using the method of "grouping probabilities" of error vectors. / M.Yu. Konyshev, A.Yu. Barabashov, K.E. Petrov, A.A. Dvilyansky // Industrial ACS and controllers. 2018. No. 3. P. 42-52.
[22]. M.Yu. Konyshev. A compression algorithm for a series of distributions of binary multidimensional random variables. / M.Yu. Konyshev, A.A. Dvilyansky, K.E. Petrov, G.A. Ermishin // Industrial ACS and controllers. 2016. No. 8. P. 47-50.
ABOUT THE AUTHORS

D.S. Alexander Kozachok
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: alex.totrin@gmail.com
The education process: received PhD. degree in Engineering Sciences in Academy of Federal Guard Service of the Russian Federation in Dec. 2012.
Research today: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer.

Spirin Andrey Andreevich
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: spirin_aa@bk.ru
The education process: graduated from the Academy of the Federal Guard Service of the Russian Federation in 2010.
Research today: Information security, information leakage prevention systems, statistical testing.