Pseudorandom sequences classification algorithm

The number of information leaks caused by insiders is currently increasing. One channel that can lead to information leakage is the transmission of data in encrypted or compressed form, because modern data leak prevention (DLP) systems cannot detect signatures and information in this kind of data. This paper presents an algorithm for classifying sequences produced by encryption and compression algorithms. An array of frequencies of occurrence of binary subsequences of length N bits is used as the feature space. File headers and any other contextual information are not used to build the feature space. The presented algorithm classifies sequences with an accuracy of 0.98 and can be applied in DLP systems to prevent information leakage when information is transmitted in encrypted or compressed form.


... is formed by constructing all possible binary subsequences of a given bit length. The algorithm for constructing a feature space is shown in Fig. 1.
The resulting tuple of frequencies of occurrence of the bit-length subsequences forms the feature space used for further training and construction of the classifier.
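The feature-space construction described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the PRS is assumed to be given as a string of '0' and '1' characters, and the frequency is assumed to be the number of overlapping occurrences divided by the number of N-bit windows, M - N + 1. Equation (2) is not reproduced in this excerpt, but this assumption is consistent with the values 0.125 listed in Table II for s = 1011010001 and N = 3.

```python
from itertools import product


def build_feature_space(n_bits):
    """Return all binary subsequences of length n_bits (hypothetical helper)."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


def count_occurrences(p, v):
    """Count overlapping occurrences of subsequence v in bit string p."""
    n = len(v)
    return sum(1 for i in range(len(p) - n + 1) if p[i:i + n] == v)


def frequency_vector(p, features):
    """Frequency of each feature v in p.

    Assumption: frequency = occurrences / (M - N + 1), i.e. a sliding
    window with a step of one bit. This is inferred from Table II, not
    taken from equation (2).
    """
    m = len(p)
    fqv = []
    for v in features:
        windows = m - len(v) + 1
        n_v = count_occurrences(p, v)
        fqv.append(n_v / windows if windows > 0 else 0.0)
    return fqv


if __name__ == "__main__":
    features = build_feature_space(3)
    for v, f in zip(features, frequency_vector("1011010001", features)):
        print(v, f)
```

For s = 1011010001 and N = 3 this sketch yields a frequency of 0.125 for each of 000, 001, 010, and 011, in agreement with Table II.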
Fig. 1. Feature space building algorithm.
B. PRS Classification
The initial data for performing the PRS classification are: the PRS p, the classifier K, and the set of features V.
The PRS classification algorithm is shown in Fig. 2.
No 2.CS (12) 2020
Journal of Science and Technology on Information security
Fig. 2. PRS classification algorithm.
Step 1. Initialize the tuple FQV with empty values. Initialize the tuple State with empty values. Calculate the length $M_p$ of the sequence p in bits.
Step 2. For all features v from the tuple V execute:
    Calculate the length of the subsequence v and write the resulting value to a variable $N_v$.
    Calculate the number of occurrences of the subsequence v in the PRS p and write the resulting value to a variable $n_v$.
    Calculate the frequency of occurrence of the subsequence v in the PRS p by equation (2).
    Add the frequency of the subsequence v in the PRS p to the tuple FQV.
III. EXPERIMENTS
The following sets are used to evaluate the quality of the classifier:
TP (true positive) – the number of PRSs correctly assigned to the class $y_i \in Y$.
TN (true negative) – the number of PRSs correctly not assigned to the class $y_i \in Y$.
FP (false positive) – the number of PRSs incorrectly assigned to the class $y_i \in Y$, i.e. the number of false alarms (type I error).
FN (false negative) – the number of PRSs incorrectly not assigned to the class $y_i \in Y$, i.e. the number of missed targets (type II error).
To assess the quality of classification, we used the proportion of correct answers (accuracy), which is generally defined by equation (3):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
For a sample consisting of K PRS classes, the accuracy of the classifier is determined by equation (4):
$$\mathrm{Accuracy}_{total} = \frac{\sum_{i=1}^{K} \mathrm{Accuracy}_{y_i}}{K}, \tag{4}$$
where $\mathrm{Accuracy}_{y_i}$ is the accuracy for the class $y_i$.
To determine the accuracy for each class, the confusion matrix shown in Table I is constructed.
TABLE I. CONFUSION MATRIX FOR CLASSIFICATION OF 4 CLASSES OF PRS
                     Correct class
Predicted class    1      2      3      4
       1           T1     F12    F13    F14
       2           F21    T2     F23    F24
       3           F31    F32    T3     F34
       4           F41    F42    F43    T4
When performing a multi-class classification, the sets are calculated from the confusion matrix using equations (5):
$$
\begin{aligned}
TP_{y_i} &= T_{y_i} \\
TN_{y_i} &= \sum_{c=1}^{K} T_c - TP_{y_i} \\
FP_{y_i} &= \sum_{c=1}^{K} F_{y_i, c} \\
FN_{y_i} &= \sum_{c=1}^{K} F_{c, y_i}
\end{aligned}
\tag{5}
$$
where $y_i$ is the correct class of the PRS and $c$ is the class predicted by the classifier.
The accuracy used for choosing a classifier must meet the condition presented in equation (6):
$$\mathrm{Accuracy}_{total} \to 1 \tag{6}$$
To classify the PRSs, we propose to use an algorithm based on counting the number of binary subsequences of length N-1 bits in the studied PRS. As noted in [30], [31], for example, for the sequence s = 1011010001 the frequencies of occurrence of subsequences of length N = 3 bits are as given in Table II.
TABLE II. SUBSEQUENCE FREQUENCY COUNTING
Subsequence    Number    Frequency
000            1         0.125
001            1         0.125
010            1         0.125
011            1         0.125
To restore the distribution of binary sequences, it is sufficient to analyze half of all possible subsequences. Thus, the dimension of the feature space for subsequences of length N bits is defined by equation (7):
$$S = 2^{N-1} \tag{7}$$
To carry out the experiment, a sample of PRSs was formed, consisting of 16000 files of 4 classes obtained by transforming files containing meaningful text in Russian:
Encrypted by algorithms AES, 3DES, RC4, and Camellia in CBC mode [32] – 4000 files.
Archives RAR [33] – 4000 files.
Archives ZIP [33] – 4000 files.
Archives 7Z [33] – 4000 files.
The experiment was conducted in the Anaconda software environment [34].
Since the obtained frequencies of occurrence of subsequences of length N bits are quite small ($10^{-5} \ldots 10^{-6}$), the values were converted to a logarithmic scale to improve the accuracy of classification.
Machine learning algorithms [35] were used to construct and evaluate the classifiers: a decision tree classifier (DTC), a decision tree classifier on logarithmic values (DTCL), a random forest classifier (RFC), and a random forest classifier on logarithmic values (RFCL). The dependence of the PRS classification accuracy on the subsequence length N is shown in Fig. 3.
The obtained results indicate that it is possible to classify PRSs generated by encryption, compression algorithms, and pseudo-random number generators using the proposed algorithm with an accuracy greater than 0.95 for a 9-bit subsequence length.
Fig. 3. Accuracy of classification of 4 classes of PRS (horizontal axis: length of the subsequences N, bits).
During the experiments, 2 algorithms for constructing classifiers were used: the decision tree algorithm and the random forest algorithm. The decision tree algorithm showed a higher accuracy of PRS classification. To improve the accuracy of the classifier, the frequencies of occurrence of subsequences were converted to the logarithmic scale, which made it possible to achieve a PRS classification accuracy of 0.98.
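The accuracy values reported in this section follow equations (3) to (5). As a rough illustration, the sketch below computes them from a confusion matrix laid out like Table I (rows are predicted classes, columns are correct classes). The function names are hypothetical, the sums in equations (5) are assumed to run over the off-diagonal entries only (the diagonal holds the T values), and the example matrix is made-up illustrative data, not a result from the experiments.

```python
def per_class_counts(cm, i):
    """TP, TN, FP, FN for class i as in equations (5).

    cm[r][c] is the number of PRSs predicted as class r whose correct
    class is c (layout of Table I). Assumption: the sums over F skip
    the diagonal entry.
    """
    k = len(cm)
    tp = cm[i][i]
    tn = sum(cm[c][c] for c in range(k)) - tp
    fp = sum(cm[i][c] for c in range(k) if c != i)  # row i, off-diagonal
    fn = sum(cm[c][i] for c in range(k) if c != i)  # column i, off-diagonal
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Equation (3)."""
    return (tp + tn) / (tp + tn + fp + fn)


def accuracy_total(cm):
    """Equation (4): mean of the per-class accuracies."""
    k = len(cm)
    return sum(accuracy(*per_class_counts(cm, i)) for i in range(k)) / k


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    example = [
        [9, 1, 0, 0],
        [0, 8, 1, 0],
        [1, 0, 10, 0],
        [0, 1, 0, 9],
    ]
    for i in range(len(example)):
        print(i + 1, per_class_counts(example, i))
    print(accuracy_total(example))
```

With a purely diagonal matrix every per-class accuracy equals 1, which is the limiting case of condition (6).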
IV. CONCLUSION
Modern DLP systems are not able to detect encrypted or compressed data with high accuracy. As a result, when there is no information about the compression algorithm, a channel carrying data in encrypted or compressed form can be used to transmit confidential data. In this paper, we proposed a classification algorithm consisting of several stages: determining the most significant statistical features of random sequences on a training sample using the random forest algorithm, and performing the classification itself with the decision tree algorithm. The proposed algorithm for feature extraction and classification allowed us to increase the accuracy of classification of encrypted and compressed data to 0.98.
ACKNOWLEDGMENT
The reported study was funded by the Russian Ministry of Science (information security, project number 18/2020).
REFERENCES
[1] Data Breach Report: A Study on Global Data Leaks in H1 2018, InfoWatch, https://www.infowatch.ru/analytics/reports. (Access date: 14.01.2020).
[2] B.B. Mahesh, M.S. Bhanu, "Prevention of insider attacks by integrating behavior analysis with risk based access control model to protect cloud", Procedia Computer Science, Vol. 54, 2015, pp. 157-166.
[3] D. Kolevski, K. Michael, Cloud computing data breaches a socio-technical review of literature, 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), Greater Noida, India, 2015, pp. 1486-1495.
[4] S. Alneyadi, E. Sithirasenan, V. Muthukkumarasamy, Detecting Data Semantic: A Data Leakage Prevention Approach, IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, Vol. 1, 2015, pp. 910-917.
[5] S. Alneyadi, E. Sithirasenan, V. Muthukkumarasamy, Discovery of potential data leaks in email communications, 10th International Conference on Signal Processing and Communication Systems (ICSPCS), Gold Coast, Australia, 2016, pp. 1-10.
[6] X. Huang, Y. Lu, D. Li, M. Ma, A novel mechanism for fast detection of transformed data leakage, IEEE Access, Vol. 6, 2018, pp. 35926-35936.
[7] K. Kaur, I. Gupta, A. K. Singh, Comparative Evaluation of Data Leakage/Loss prevention Systems (DLPS), In Proc. 4th Int. Conf. Computer Science & Information Technology (CS & IT-CSCP), 2017, pp. 87-95.
[8] L. Cheng, F. Liu, D. Yao, Enterprise data breach: causes, challenges, prevention, and future directions, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 7, No. 5, 2017, pp. 1211.
[9] X. Shu, D. Yao, E. Bertino, Privacy-Preserving Detection of Sensitive Data Exposure, IEEE Transactions on Information Forensics and Security, Vol. 10, No. 5, 2015, pp. 1092-1103.
[10] F. Liu, X. Shu, D. Yao, A. R. Butt, Privacy-preserving scanning of big content for sensitive data exposure with MapReduce, Proceedings of the 5th ACM Conference on Data and Application Security and Privacy, 2015, pp. 195-206.
[11] X. Shu, J. Zhang, D. Yao, W. Feng, Rapid and parallel content screening for detecting transformed data exposure, Proceedings of the Third International Workshop on Security and Privacy in Big Data, 2015, pp. 191-196.
[12] Shu X., Zhang J., Yao D. D., Feng W. C., Fast Detection of Transformed Data Leaks, IEEE Transactions on Information Forensics and Security, Vol. 11, No. 3, 2016, pp. 528-542.
[13] Yu X., Tian Z., Qiu J., Jiang F., A data leakage prevention method based on the reduction of confidential and context terms for smart mobile devices, Wireless Communications and Mobile Computing, 2018. DOI: 10.1155/2018/5823439.
[14] X. Shu, D. Yao, E. Bertino, Privacy-Preserving Detection of Sensitive Data Exposure, IEEE Transactions on Information Forensics and Security, Vol. 10, No. 5, 2015, pp. 1092-1103.
[15] Shvartzshnaider Y., Pavlinovic Z., Balashankar A., Wies T., Subramanian L., Nissenbaum H., Mittal P., VACCINE: Using Contextual Integrity For Data Leakage Detection, The World Wide Web Conference, 2019, pp. 1702-1712.
[16] Kavitha T., Rajitha O., Thejaswi K., Muppalaneni N. B., Classification of encryption algorithms based on ciphertext using pattern recognition techniques, International conference on Computer Networks, Big data and IoT, 2018, pp. 540-545.
[17] C. Tan, Q. Ji, An approach to identifying cryptographic algorithm from ciphertext, 8th IEEE International Conference on Communication Software and Networks, 2016, pp. 19-23.
[18] C. Tan, Y. Li, S. Yao, A Novel Identification Approach to Encryption Mode of Block Cipher, 4th International Conference on Sensors, Mechatronics and Automation, Zhuhai, China, 2016. DOI: 10.2991/icsma-16.2016.101.
[19] C. Tan, X. Deng, L. Zhang, Identification of Block Ciphers under CBC Mode, Procedia Computer Science, Vol. 131, 2018, pp. 65-71.
[20] Ray P. K., Ojha S., Roy B. K., Basu A., Classification of Encryption Algorithms using Fisher's Discriminant Analysis, Defence Science Journal, Vol. 67, No. 1, 2017, pp. 59-65.
[21] Pan J., Encryption scheme classification: a deep learning approach, International Journal of Electronic Security and Digital Forensics, Vol. 9, No. 4, 2017, pp. 381-395.
[22] Wang W., Zhu M., Zeng X., Ye X., Sheng Y., Malware traffic classification using convolutional neural network for representation learning, International Conference on Information Networking (ICOIN), 2017, pp. 712-717.
[23] Wang W., Zhu M., Wang J., Zeng X., Yang Z., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics (ISI), 2017, pp. 43-48.
[24] Lotfollahi M., Siavoshani M. J., Zade R. S. H., Saberian M., Deep packet: A novel approach for encrypted traffic classification using deep learning, Soft Computing, 2017, pp. 1-14.
[25] Zhang J., Chen X., Xiang Y., Zhou W., Wu J., Robust network traffic classification, IEEE/ACM Transactions on Networking, Vol. 23, No. 4, 2015, pp. 1257-1270.
[26] Pacheco F., Exposito E., Gineste M., Baudoin C., Aguilar J., Towards the deployment of machine learning solutions in network traffic classification: a systematic survey, IEEE Communications Surveys & Tutorials, Vol. 21, No. 2, 2018, pp. 1988-2014.
[27] Hahn D., Apthorpe N., Feamster N., Detecting compressed cleartext traffic from consumer internet of things devices, arXiv preprint arXiv:1805.02722, 2018.
[28] Casino F., Choo K. K. R., Patsakis C., HEDGE: efficient traffic classification of encrypted and compressed packets, IEEE Transactions on Information Forensics and Security, Vol. 14, No. 11, 2019, pp. 2916-2926.
[29] Tang Z., Zeng X., Sheng Y., Entropy-based feature extraction algorithm for encrypted and non-encrypted compressed traffic classification, International Journal of ICIC, Vol. 15, No. 3, 2019.
[30] Khakpour A. R., Liu A. X., An information-theoretical approach to high-speed flow nature identification, IEEE/ACM Transactions on Networking, Vol. 21, No. 4, 2012, pp. 1076-1089.
[31] Konyshev M. U., Dvilyansky A. A., Barabashov A. Y., Petrov K. Y., Formation of probability distributions of binary vectors of the error source of a Markov discrete memory link using the method of "grouping probabilities" of error vectors, Industrial ACS and controllers, No. 3, 2018, p. 42.
[32] Konyshev M. U., Dvilyansky A. A., Petrov K. Y., Ermishin G. A., Algorithm for compression of a distribution series of binary multidimensional random variables, Industrial ACS and controllers, No. 8, 2016, pp. 47-50.
[33] Toolkit for the transport layer security and secure sockets layer protocols, (Access date: 14.01.2020).
[34] Archive manager WinRAR, (Access date: 14.01.2020).
[35] Programm environment Anaconda, https://www.anaconda.com/distribution/, (Access date: 14.01.2020).
[36] Breiman L., Classification and regression trees, Routledge, 2017, p. 358.
ABOUT THE AUTHORS
Alexander Kozachok
Workplace: Academy of the Federal Guard Service of the Russian Federation
Email: alex.totrin@gmail.com
Education: Received his PhD degree in Engineering Sciences from the Academy of the Federal Guard Service of the Russian Federation in December 2012; received his doctorate in Engineering Science in 2019.
Recent research direction: information security, unauthorized access protection, mathematical cryptography, theoretical problems of computer science.
Andrey Spirin
Workplace: Academy of the Federal Guard Service of the Russian Federation
Email: spirin_aa@bk.ru
Education: Postgraduate student at the Academy of the Federal Guard Service of the Russian Federation.
Recent research direction: information security, DLP systems, machine learning algorithms, classification of binary sequences.