Classification of sequences generated by compression and encryption algorithms

and sequences generated by compression and encryption algorithms. The evaluation results lead to the conclusion that the proposed feature space can be used to identify the ZIP and RAR compression algorithms and the AES and 3DES encryption algorithms with an accuracy greater than 95%.

Keywords—Identification of compression and encryption algorithms, statistical testing of information.

I. INTRODUCTION

According to reports from the Infowatch Analytics Center [1-3], the number of confidential data leaks is growing from year to year. Fig. 1 provides statistics on the leaks that occurred in 2011-2018. The total damage from leaks in 2013 amounted to more than US$7.5 billion. Equifax, the international credit reference bureau, has already spent US$1.4 billion to deal with a single information leak that occurred in 2017, and the amount has continued to grow due to repeated claims. The share of

This manuscript was received on April 22, 2019. It was commented on July 30, 2019 and accepted on August 6, 2019 by the first reviewer. It was commented on August 20, 2019 and accepted on August 27, 2019 by the second reviewer.

of 2019, 80% of the traffic will be the same. Encryption can also be used to hide interactions with malware command servers and to solve other problems. According to the Ponemon Institute report for 2016, almost half (41%) of attackers use encryption to bypass the mechanisms that detect their unauthorized activity. Security tools cannot inspect encrypted traffic (according to the Ponemon Institute, 64% of companies cannot detect malicious code in encrypted traffic) [10].

The existing security solutions are able to analyze the content of encrypted traffic, for example, by means of a man-in-the-middle attack, but because these methods are expensive to implement, they are practically inapplicable in real conditions [10]. However, there is at least one way to analyze encrypted traffic without decrypting it: Cisco ETA (Encrypted Traffic Analytics), which uses network telemetry received from the network equipment together with machine learning algorithms to classify encrypted traffic and to separate clean traffic from malicious traffic. The analysis uses not the data field of the encrypted packet but its header, from which the extended telemetry is obtained [10]:

1. from Netflow: addresses and ports of the source and destination (SrcIP, DstIP, SrcPort, DstPort), information on the protocol, the number of transferred packets and bytes;

classify files the authors used the NIST statistical test suite, identifying encrypted data with an accuracy of more than 0.95.

In order to prevent information leakage, it is necessary to block the transmission of encrypted data, which makes it relevant to solve the problem of classifying encrypted, compressed and pseudo-random sequences (PRS).

III. PROBLEM STATEMENT

In general, the research problem is formulated as follows: it is necessary to map the original set of bit sequences X to the new set of classes Y by using the selected feature space.

To estimate the quality of the classification, the accuracy characteristic calculated by Formula (1) was used:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$

where TP is the number of objects correctly assigned to class i, TN is the number of objects correctly assigned to class j, FP is the number of false positives (type I errors), and FN is the number of false negatives (type II errors).

In a formalized form, the classification task can be defined by the following expression:

$$F : X \rightarrow Y, \tag{2}$$

where X is the initial set of bit sequences to be classified and Y is the set of classes.
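As a minimal sketch of Formula (1), the accuracy can be computed directly from the four counts. The function name and the example counts below are hypothetical and serve only to illustrate the arithmetic; they are not results from the paper.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy as in Formula (1): correctly classified objects / all objects."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one classified object is required")
    return (tp + tn) / total


# Hypothetical counts for 1000 classified files (illustration only).
print(accuracy(tp=480, tn=475, fp=25, fn=20))  # 0.955
```

With these made-up counts, 955 of 1000 objects are classified correctly, so the accuracy is 0.955.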
Journal of Science and Technology on Information Security, No 2.CS (10) 2019
The accuracy value in the classification must satisfy the condition presented in expression (3):

$$p(F(x_i) = y_j \mid i = j) \geq 0.95, \tag{3}$$

where p is the probability, F(x) is the function that maps the i-th file of the set X to a class, y_j is the class label from the set Y, and i, j are the indices in the sets X and Y, respectively.

To develop and evaluate the method for classifying encrypted, compressed and pseudo-random sequences, the following particular tasks were set:
1. To check the possibility of binary classification of sequences generated by encryption and compression algorithms.
2. To check the possibility of multi-class classification of sequences generated by encryption, compression, and PRS algorithms.

IV. TESTING THE POSSIBILITY OF A BINARY CLASSIFICATION OF SEQUENCES GENERATED BY COMPRESSION AND ENCRYPTION ALGORITHMS

4.1 Binary classification for features generated on the basis of NIST test results

In the course of the research it was assumed that the results of NIST tests might serve as a feature space for constructing a classifier that distinguishes pseudo-random sequences by source type.

To test the possibility of solving the classification problem, the NIST SP 800-22 Rev. 1a statistical test suite, which is intended for evaluating random and pseudo-random number generators for cryptographic applications, was used [17].

To conduct the experiment, an initial sample of 1000 files, 600 KB each, containing text in Russian, was converted by encryption and compression algorithms. The resulting sample of 4000 files was divided into 2 classes:
1. Encrypted by the AES and 3DES algorithms.
2. Compressed by the RAR and ZIP algorithms.

Next, the files from the sample were processed with the package of statistical tests; as a result, 188 features were obtained.

To assess the applicability of the selected feature space, cross-validation (with the number of subsamples set to 10) of machine learning algorithms with default parameters was used [18-23]: Decision Tree Classifier (DTC) and Random Forest Classifier (RFC).

The results of the experiments are presented in Table 1.

TABLE 1. THE RESULTS OF CROSS-VALIDATION OF MACHINE LEARNING ALGORITHMS IN THE NIST TEST FEATURE SPACE ON A SAMPLE OF COMPRESSED AND ENCRYPTED SEQUENCES

Algorithm        Accuracy value
Decision Tree    0.57508
Random Forest    0.642579

From the analysis of the results presented in Table 1, it is reasonable to conclude that the RFC has greater accuracy, but the resulting accuracy of detecting sequence types, approximately 0.64, does not allow constructing a classifier that meets the requirement presented in expression (3). Thus, it was concluded that the feature space based on the NIST test results does not allow solving the problem of binary classification under the selected restrictions.

4.2 Binary classification on the feature space generated on the basis of the results of analyzing N length subsequence frequency

Next, it was assumed that the feature space for binary classification of bit sequences could be formed from the frequencies of independent bit subsequences of different length N (in bits), counted without taking into account the complete overlap of each subsequence. For example, for the sequence S = "00011011" and N = 2 bits, the frequency of occurrence of N-bit subsequences is presented in Table 2. Each frequency in Table 2 equals the count divided by 7, which is the number of 2-bit window positions in the 8-bit sequence S.

TABLE 2. EXAMPLE OF COUNTING THE FREQUENCIES OF BIT SUBSEQUENCES

Subsequence    Amount    Frequency
00             1         0.142857143
01             2         0.285714286
10             1         0.142857143
11             2         0.285714286

The authors' earlier works showed that for binary sequences it is sufficient to analyze
half of all possible subsequences [24,25]. Using this assumption, the dimension of the feature space for subsequences of length N bits was halved; the resulting number of features is presented in Table 3.

TABLE 3. THE DIMENSION OF THE FEATURE SPACE FOR N BIT LENGTH SUBSEQUENCES

Length of subsequence (N), bit    Amount of features    Amount of features, total    Cross-validation time on assumption, min
4                                 8                     16                           6
5                                 16                    32                           11
6                                 32                    64                           15
7                                 64                    128                          22
8                                 128                   256                          37
9                                 256                   512                          69

In the course of the experiments to assess the possibility of constructing binary classifiers, the RFC showed the highest accuracy; the results are presented in Table 4. An average accuracy of more than 0.95 is achieved at N ≥ 8 bits.

The dependence of the average accuracy of classifying files with the RFC on the length of the subsequence and on the time spent on extracting features from the training sample is shown in Fig. 2. A sufficient ratio of accuracy and time is reached at a subsequence length of N = 9 bits: in this case the accuracy of sequence classification increases significantly while the time spent on extracting features remains acceptable.
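The frequency features of Section 4.2 can be sketched as follows. The counting rule is an assumption inferred from the worked example with S = "00011011" and N = 2: occurrences of a pattern that overlap an earlier occurrence of the same pattern are skipped, and each count is divided by the number of N-bit window positions. The function name is hypothetical, and the authors' actual implementation may differ.

```python
from itertools import product


def subsequence_frequencies(bits: str, n: int) -> dict:
    """Return {pattern: (count, frequency)} for every n-bit pattern.

    Assumed rule: matches of the same pattern that overlap an earlier
    match are not counted, and counts are normalized by the number of
    n-bit window positions in the sequence.
    """
    windows = len(bits) - n + 1  # number of n-bit window positions
    result = {}
    for pattern in ("".join(p) for p in product("01", repeat=n)):
        # str.count counts non-overlapping matches, so "000" holds "00" once.
        count = bits.count(pattern)
        result[pattern] = (count, count / windows)
    return result


for pattern, (count, freq) in subsequence_frequencies("00011011", 2).items():
    print(pattern, count, f"{freq:.9f}")
```

For S = "00011011" and N = 2 this prints the counts 1, 2, 1, 2 and the frequencies 0.142857143 and 0.285714286 given in the worked example. Halving the feature space, as described in the text, would keep only part of these patterns; which half is kept is not specified here, so the sketch returns all of them.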
TABLE 4. ACCURACY OF DISTINGUISHING FILE TYPES WHEN USING RFC

Accuracy of the algorithm for subsequence length N

File type                                  N=4     N=5     N=6     N=7     N=8     N=9     N=10    N=11
Time taken to retrieve features, min       6       11      15      22      37      69      135     268
AES/7-Z                                    0.821   0.843   0.834   0.835   0.938   0.993   0.998   1.000
AES/RAR                                    0.986   0.992   0.994   0.992   0.991   0.997   0.993   0.993
AES/ZIP                                    0.986   0.992   0.988   0.990   0.993   0.998   0.999   0.999
3DES/7-Z                                   0.834   0.846   0.865   0.865   0.947   0.993   0.998   1.000
3DES/RAR                                   0.984   0.990   0.991   0.993   0.991   0.996   0.994   0.994
3DES/ZIP                                   0.991   0.988   0.991   0.989   0.991   0.997   0.998   0.999
7-Z/ZIP                                    0.968   0.974   0.974   0.973   0.971   0.972   0.976   0.978
7-Z/RAR                                    0.948   0.960   0.964   0.963   0.964   0.983   0.996   0.999
RAR/ZIP                                    0.840   0.864   0.864   0.863   0.868   0.977   0.999   1.000
Mean value                                 0.929   0.939   0.941   0.940   0.962   0.990   0.995   0.996
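The relation between the accuracy table above and the 0.95 requirement of expression (3) can be checked with a short calculation. The sketch below copies the tabulated accuracies and finds the smallest N at which the mean over all pairs, and then the worst pair, reaches the threshold. The helper name first_length is hypothetical, and using the worst pair as a criterion is one possible reading, not something the paper states.

```python
from statistics import mean

LENGTHS = [4, 5, 6, 7, 8, 9, 10, 11]
THRESHOLD = 0.95  # requirement from expression (3)

# Accuracies copied from the table, one row per pair of file types.
ACCURACY = {
    "AES/7-Z":  [0.821, 0.843, 0.834, 0.835, 0.938, 0.993, 0.998, 1.000],
    "AES/RAR":  [0.986, 0.992, 0.994, 0.992, 0.991, 0.997, 0.993, 0.993],
    "AES/ZIP":  [0.986, 0.992, 0.988, 0.990, 0.993, 0.998, 0.999, 0.999],
    "3DES/7-Z": [0.834, 0.846, 0.865, 0.865, 0.947, 0.993, 0.998, 1.000],
    "3DES/RAR": [0.984, 0.990, 0.991, 0.993, 0.991, 0.996, 0.994, 0.994],
    "3DES/ZIP": [0.991, 0.988, 0.991, 0.989, 0.991, 0.997, 0.998, 0.999],
    "7-Z/ZIP":  [0.968, 0.974, 0.974, 0.973, 0.971, 0.972, 0.976, 0.978],
    "7-Z/RAR":  [0.948, 0.960, 0.964, 0.963, 0.964, 0.983, 0.996, 0.999],
    "RAR/ZIP":  [0.840, 0.864, 0.864, 0.863, 0.868, 0.977, 0.999, 1.000],
}


def first_length(stat):
    """Smallest N whose column statistic reaches THRESHOLD, else None."""
    for i, n in enumerate(LENGTHS):
        column = [row[i] for row in ACCURACY.values()]
        if stat(column) >= THRESHOLD:
            return n
    return None


print(first_length(mean))  # 8: the mean over all pairs reaches 0.95
print(first_length(min))   # 9: every individual pair reaches 0.95
```

The mean accuracy reaches 0.95 from N = 8, while every individual pair reaches it only from N = 9, which is consistent with N = 9 being singled out in the text.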
Fig. 2. Dependence of the file type classification accuracy on the subsequence length when using the RFC
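The evaluation protocol described in the text (10-fold cross-validation of a Random Forest classifier with default parameters on subsequence-frequency features) can be sketched with scikit-learn. Everything about the data here is an assumption made for illustration: random bytes stand in for ciphertext, zlib-compressed pseudo-text stands in for archives, the samples are small, and N = 4. The helper bit_features is hypothetical, and the printed score says nothing about the accuracies reported in the paper.

```python
import random
import zlib
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def bit_features(data: bytes, n: int) -> list:
    """Frequencies of all n-bit patterns (assumed non-overlapping count rule)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    windows = len(bits) - n + 1
    patterns = ("".join(p) for p in product("01", repeat=n))
    return [bits.count(p) / windows for p in patterns]


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
words = ["alpha", "beta", "gamma", "delta", "leak", "data", "file", "text"]

X, y = [], []
for _ in range(100):
    # Class 0: uniformly random bytes, a stand-in for encrypted data.
    noise = bytes(rng.getrandbits(8) for _ in range(2048))
    X.append(bit_features(noise, n=4))
    y.append(0)
    # Class 1: zlib-compressed pseudo-text, a stand-in for compressed data.
    text = " ".join(rng.choice(words) for _ in range(3000)).encode()
    X.append(bit_features(zlib.compress(text), n=4))
    y.append(1)

# Default parameters apart from the seed; cv=10 gives 10 subsamples.
model = RandomForestClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```

To approximate the experiment in the text, the synthetic samples would be replaced by features extracted from real AES, 3DES, RAR and ZIP files, and N would be varied from 4 to 11.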
The AES/3DES pair was removed from the tested sample. Fig. 3 shows the dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the subsequence length N.

Because ciphers dissipate the statistics of the source data, the accuracy of classifying sequences encrypted with the AES and 3DES algorithms is on average 0.512, which is close to the 0.5 expected from random guessing between two equally sized classes and makes it impossible to construct a classifier for sequences of this type.

In the course of the research, the possibility of applying the results of NIST statistical tests as a feature space for classifying binary sequences generated by encryption and compression algorithms was verified. The accuracy of the RFC in distinguishing the sequences was 0.64, which does not satisfy the requirement specified in expression (3).

To solve the problem of binary classification of bit sequences, a new feature space was proposed, formed by counting the frequencies of bit subsequences of different lengths. The classification accuracy of the RFC at a subsequence length of N = 9 bits is 0.99, which satisfies the requirement specified in expression (3).
Fig.3. Dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES
algorithms on the length of the subsequence
REFERENCES
[1]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/reports.4.html (accessed: 30.05.2019).
[2]. INFOWATCH company group site. URL: https://www.infowatch.ru/sites/default/files/report/analytics/russ/infowatch_otchet_032014_smb_fin.pdf (accessed: 30.05.2019).
[3]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/leaks_monitoring/15678 (accessed: 30.05.2019).
[4]. X. Huang, Y. Lu, D. Li, M. Ma. A novel mechanism for fast detection of transformed data leakage // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 6, 2018. pp. 35926-35936.
[5]. Y. Miao, Z. Ruan, L. Pan, Y. Wang, J. Zhang, Y. Xiang. Automated Big Traffic Analytics for Cyber Security // Eprint arXiv:1804.09023, bibcode: 2018arXiv180409023M. 2018.
[6]. S. Miller, K. Curran, T. Lunney. Multilayer Perceptron Neural Network for Detection of Encrypted VPN Network Traffic // International Conference on Cyber Situational Awareness, Data Analytics and Assessment. 2018. ISBN: 978-1-5386-4565-9.
[7]. P. Wang, X. Chen, F. Ye, Z. Sun. A Survey of Techniques for Mobile Service Encrypted Traffic Classification Using Deep Learning // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 7, 2019. pp. 54024-54033. doi: 10.1109/ACCESS.2019.2912896
[8]. K. Demertzis, N. Tziritas, P. Kikiras, S.L. Sanchez, L. Iliadis. The Next Generation Cognitive Security Operations Center: Adaptive Analytic Lambda Architecture for Efficient Defense against Adversarial Attacks // Big Data and Cognitive Computing, 2019 3(6).
[9]. H. Zhang, C. Papadopoulos, D. Massey. Detecting encrypted botnet traffic // 16th IEEE Global Internet Symposium. 2013. p. 3453.
[10]. T. Radivilova, L. Kirichenko, D. Ageyev, M. Tawalbeh, V. Bulakh. Decrypting SSL/TLS Traffic for Hidden Threats Detection // IEEE 9th International Conference on Dependable Systems, Services and Technologies (DESSERT), 2018. ISBN: 978-1-5386-5903-8.
[11]. M. Piccinelli, P. Gubian. Detecting hidden encrypted volume files via statistical analysis // International Journal of Cyber-Security and Digital Forensics. Vol. 3(1). 2013. pp. 30-37.
[12]. NIST STS manual. URL: https://csrc.nist.gov/Projects/Random-Bit-Generation/ (accessed: 14.01.2019).
[13]. Toolkit for the transport layer security and secure sockets layer protocols. URL: (accessed: 14.01.2019).
[14]. Archive manager WinRAR. URL: (accessed: 14.01.2019).
[15]. Pedregosa F., et al. Scikit-learn: Machine Learning in Python // Journal of Machine Learning Research 12. 2011. pp. 2825-2830.
[16]. Breiman L., Friedman J., Olshen R., Stone C. Classification and Regression Trees // Wadsworth, Belmont, CA. 1984. 368 p. ISBN: 9781351460491.
[17]. Hastie T., Tibshirani R., Friedman J. Elements of Statistical Learning // Springer. 2009. pp. 587-601. ISBN: 978-0387848570.
[18]. L. Breiman, A. Cutler. Random Forests // URL: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm (accessed: 14.01.2019).
[19]. S. Raska. Python and machine learning // M.: DMK-Press. 2017. 418 p. ISBN: 978-5-97060-409-0.
[20]. L. Breiman. Random Forests // Journal Machine Learning 45(1). 2001. pp. 5-32.
[21]. M.Yu. Konyshev. Formation of probability distributions of binary vectors of the source of errors of a Markov discrete communication channel with memory using the method of "grouping probabilities" of error vectors. / M.Yu. Konyshev, A.Yu. Barabashov, K.E. Petrov, A.A. Dvilyansky // Industrial ACS and controllers. 2018. No. 3. P. 42-52.
[22]. M.Yu. Konyshev. A compression algorithm for a series of distributions of binary multidimensional random variables. / M.Yu. Konyshev, A.A. Dvilyansky, K.E. Petrov, G.A. Ermishin // Industrial ACS and controllers. 2016. No. 8. P. 47-50.
ABOUT THE AUTHORS

D.S. Alexander Kozachok
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: alex.totrin@gmail.com
The education process: received PhD. degree in Engineering Sciences in Academy of Federal Guard Service of the Russian Federation in Dec. 2012.
Research today: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer.

Spirin Andrey Andreevich
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: spirin_aa@bk.ru
The education process: graduated from the Academy of the Federal Guard Service of the Russian Federation in 2010.
Research today: Information security, information leakage prevention systems, statistical testing.