Pseudorandom sequences classification algorithm

Currently, the number of information leaks caused by internal violators is steadily increasing. One of the channels that can lead to information leakage is the transmission of data in encrypted or compressed form, since modern data leak prevention (DLP) systems cannot detect signatures and sensitive information in this type of data. This paper presents an algorithm for classifying sequences produced by encryption and compression algorithms. An array of occurrence frequencies of binary subsequences of length N bits is used as the feature space. File headers or any other contextual information are not used to construct the feature space. The presented algorithm classifies sequences with an accuracy of 0.98 and can be applied in DLP systems to prevent information leakage when data is transmitted in encrypted or compressed form.


… is formed by constructing all possible binary subsequences of a given bit length. The algorithm for constructing a feature space is shown in Fig. 1.

Fig. 1. Features space building algorithm.
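Fig. 1 itself is not reproduced in this text-only extraction, so the following Python sketch shows one plausible reading of the feature-space construction described above: the feature set V is the set of all binary subsequences of a given bit length, and each feature value is that subsequence's occurrence frequency over the overlapping windows of the PRS. The names `build_feature_space`, `to_bit_string`, and `subsequence_frequencies` are illustrative, not taken from the paper.

```python
from itertools import product


def build_feature_space(n_bits):
    """Feature set V: all possible binary subsequences of the given bit length."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


def to_bit_string(data: bytes) -> str:
    """Represent a byte sequence as a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def subsequence_frequencies(bit_string: str, features):
    """Occurrence frequency of each feature among the overlapping N-bit windows."""
    n = len(features[0])
    windows = len(bit_string) - n + 1
    counts = {v: 0 for v in features}
    for i in range(windows):
        counts[bit_string[i:i + n]] += 1
    return {v: counts[v] / windows for v in features}


if __name__ == "__main__":
    V = build_feature_space(3)
    freqs = subsequence_frequencies("1011010001", V)
    print(freqs["101"])  # 2 occurrences in 8 windows -> 0.25
```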
The resulting tuple of frequency values of occurrence of bit-length subsequences is a characteristic space for further training and construction of the classifier.

B. PRS Classification

The initial data for performing the PRS classification are: PRS p, classifier K, set of the features V.

The PRS classification algorithm is shown in Fig. 2.
Fig. 2. PRS classification algorithm.

Step 1. Initialize the tuple $F_{Q,V}$ with empty values. Initialize the tuple State with empty values. Calculate the length $M_p$ of the sequence p in bits.

Step 2. For all features v from the tuple V execute:
- Calculate the length of the subsequence v and write the resulting value to the variable $N_v$.
- Calculate the number of occurrences of the subsequence v in the PRS p and write the resulting value to the variable $n_v$.
- Calculate the frequency of occurrence of the subsequence v in the PRS p by equation (2).
- Add the value of the frequency of the subsequence v in the PRS p to the tuple $F_{Q,V}$.
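Fig. 2 is likewise not reproduced in this extraction; the sketch below follows Steps 1 and 2 as worded above, under two assumptions: equation (2) (not shown here) is taken to be the occurrence count divided by the number of windows, and the trained classifier K is assumed to expose a scikit-learn-style predict() method. All identifiers are illustrative.

```python
def classify_prs(p: bytes, K, V):
    """Steps 1-2 of the PRS classification algorithm (one plausible reading).

    p -- the pseudorandom sequence under study, as raw bytes
    K -- a trained classifier with a scikit-learn-style predict() method (assumption)
    V -- the tuple of features: binary subsequences built as in Fig. 1
    """
    bits = "".join(f"{byte:08b}" for byte in p)   # M_p = len(bits) is the length of p in bits
    F_QV = []                                     # Step 1: the tuple of frequencies, initially empty
    for v in V:                                   # Step 2: one frequency value per feature v
        N_v = len(v)                              # length of the subsequence v
        windows = len(bits) - N_v + 1
        n_v = sum(bits[i:i + N_v] == v for i in range(windows))   # occurrences of v in p
        F_QV.append(n_v / windows)                # frequency of v in p (assumed form of equation (2))
    return K.predict([F_QV])[0]                   # the predicted class of the PRS
```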
III. EXPERIMENTS

The following sets are used to evaluate the quality of the classifier:

TP (true positive) – the number of correctly classified PRSs belonging to the class $y \in Y_i$.

TN (true negative) – the number of PRSs correctly recognized as not belonging to the class $Y_i$.

FP (false positive) – the number of PRSs incorrectly assigned to the class $y \in Y_i$, i.e. the number of false alarms (errors of the first kind).

FN (false negative) – the number of PRSs incorrectly not assigned to the class $y \in Y_i$, i.e. the number of missed detections (errors of the second kind).

To assess the quality of classification, we used the percentage of correct responses metric, which is generally defined by equation (3):

$Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN}$   (3)

For a sample consisting of K PRS classes, the percentage of correct answers of the classifier is determined by equation (4):

$Accuracy_{total} = \dfrac{\sum_{i=1}^{K} Accuracy_{y_i}}{K}$,   (4)

where $Accuracy_{y_i}$ is the percentage of correct responses for the class $y_i$.

To determine the percentage of correct responses for each class, the confusion matrix shown in Table I is constructed.

TABLE I. CONFUSION MATRIX FOR CLASSIFICATION OF 4 CLASSES OF PRS

Predicted class \ Correct class |  1  |  2  |  3  |  4
              1                 | T1  | F12 | F13 | F14
              2                 | F21 | T2  | F23 | F24
              3                 | F31 | F32 | T3  | F34
              4                 | F41 | F42 | F43 | T4

When performing multi-class classification, these sets are calculated from the confusion matrix using equations (5):
$TP_{y_i} = T_i$
$TN_{y_i} = \sum_{c=1}^{K} T_c - TP_{y_i}$
$FP_{y_i} = \sum_{c=1}^{K} F_{y_i, c}$   (5)
$FN_{y_i} = \sum_{c=1}^{K} F_{c, y_i}$

where $y_i$ is the correct class of the PRS and $c$ is the class predicted by the classifier.
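As a concrete reading of equations (3)–(5), the sketch below computes the per-class sets TP, TN, FP, FN and the overall accuracy from a K × K confusion matrix laid out as in Table I (rows correspond to the predicted class, columns to the correct class). The function names are illustrative and the example matrix is synthetic; it only demonstrates the computation.

```python
import numpy as np


def per_class_sets(cm):
    """TP, TN, FP, FN for each class, as defined in equation (5).

    cm[i, c] -- number of PRSs of correct class c predicted as class i (Table I layout).
    """
    tp = np.diag(cm).astype(float)        # TP_{y_i} = T_i
    tn = tp.sum() - tp                    # TN_{y_i} = sum_c T_c - TP_{y_i}
    fp = cm.sum(axis=1) - tp              # row sums of the off-diagonal F_{y_i,c}
    fn = cm.sum(axis=0) - tp              # column sums of the off-diagonal F_{c,y_i}
    return tp, tn, fp, fn


def total_accuracy(cm):
    """Mean of the per-class accuracies, equations (3) and (4)."""
    tp, tn, fp, fn = per_class_sets(cm)
    per_class = (tp + tn) / (tp + tn + fp + fn)   # equation (3) for each class y_i
    return per_class.mean()                       # equation (4)


if __name__ == "__main__":
    cm = np.array([[950, 10, 20, 5],              # synthetic 4-class confusion matrix
                   [15, 940, 25, 10],
                   [20, 30, 930, 15],
                   [15, 20, 25, 970]])
    print(round(total_accuracy(cm), 3))
```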
The value of the percentage of correct answers for choosing a classifier must meet the condition presented in equation (6):

$Accuracy_{total} \to 1$   (6)

To classify the PRSs, we propose to use an algorithm based on counting the number of binary subsequences of length N-1 bits in the studied PRS. In [30], [31] it is noted, for example, that for the sequence s = 1011010001 the frequencies of occurrence of subsequences of length N = 3 bits are as represented in Table II.

TABLE II. SUBSEQUENCES FREQUENCIES COUNTING

Subsequence | Number | Frequency
    000     |   1    |   0.125
    001     |   1    |   0.125
    010     |   1    |   0.125
    011     |   1    |   0.125

To restore the distribution of binary subsequences, it is sufficient to analyze half of all possible subsequences. Thus, the dimension of the feature space for subsequences of length N bits is defined by equation (7):

$S = 2^{N-1}$   (7)
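To make Table II and equation (7) concrete, the short verification sketch below recounts the N = 3 bit subsequences of s = 1011010001 and prints the resulting feature-space dimension; it is an illustration, not code from the paper.

```python
from collections import Counter

s = "1011010001"
N = 3
windows = [s[i:i + N] for i in range(len(s) - N + 1)]   # 8 overlapping 3-bit windows
counts = Counter(windows)

for v in ("000", "001", "010", "011"):                  # half of the 2**N possible subsequences
    print(v, counts[v], counts[v] / len(windows))       # matches Table II: count 1, frequency 0.125

S = 2 ** (N - 1)                                        # equation (7): dimension of the feature space
print("feature space dimension S =", S)                 # 4
```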
To carry out the experiment, a sample of PRS was formed consisting of 16000 files of 4 classes, obtained as a result of transformations of files containing meaningful text in Russian:

• Encrypted by the algorithms AES, 3DES, RC4, and Camellia in CBC mode [32] – 4000 files.
• Archives RAR [33] – 4000 files.
• Archives ZIP [33] – 4000 files.
• Archives 7Z [33] – 4000 files.

The experiment was conducted in the Anaconda software environment [34].

Since the obtained values of the frequency of occurrence of subsequences of length N bits are quite small ($10^{-5} \ldots 10^{-6}$), a transition to a logarithmic scale of values was made to improve the accuracy of classification (logarithmic values).

Machine learning algorithms were used to construct and evaluate the classifiers [35]: a decision tree classifier (DTC), a logarithmic decision tree classifier (DTCL), a random forest classifier (RFC), and a logarithmic random forest classifier (RFCL). The obtained values of the accuracy of the PRS classification as a function of the subsequence length N are shown in Fig. 3.

The obtained results indicate that it is possible to classify PRSs generated by encryption algorithms, compression algorithms, and pseudo-random number generators using the proposed algorithm with an accuracy greater than 0.95 for a 9-bit subsequence length.

Fig. 3. Accuracy for classification of 4 classes of PRS (horizontal axis: length of the subsequences N, bits).

During the experiments, 2 algorithms for constructing classifiers were used: the decision tree construction algorithm and the random forest construction algorithm. The decision tree construction algorithm showed a higher accuracy of the PRS classification. To improve the accuracy of the classifier, the values of the frequency of occurrence of subsequences were converted to the logarithmic scale, which made it possible to achieve a PRS classification accuracy of 0.98.
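The extraction does not name the machine-learning library used inside the Anaconda environment [34], [35], so the sketch below assumes scikit-learn (bundled with Anaconda) to illustrate the logarithmic rescaling of the small frequency values and the training of the logarithmic classifier variants (DTCL, RFCL). The feature matrix here is a synthetic placeholder, so the printed accuracies are meaningless; on the real 16000-file sample the paper reports up to 0.98.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X: (n_files, 2**(N-1)) matrix of subsequence occurrence frequencies,
# y: class labels (0=encrypted, 1=RAR, 2=ZIP, 3=7Z) -- placeholders for the real sample.
rng = np.random.default_rng(0)
X = rng.uniform(1e-6, 1e-5, size=(400, 2 ** 8))   # synthetic stand-in for N = 9 bit features
y = rng.integers(0, 4, size=400)

# Logarithmic scale: the raw frequencies are very small (~1e-5..1e-6),
# so the "logarithmic" classifier variants train on log-values instead.
X_log = np.log(X + 1e-12)                         # small offset guards against log(0)

X_tr, X_te, y_tr, y_te = train_test_split(X_log, y, test_size=0.25, random_state=0)

for name, clf in [("DTCL", DecisionTreeClassifier(random_state=0)),
                  ("RFCL", RandomForestClassifier(n_estimators=100, random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```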
IV. CONCLUSION

Modern DLP systems are not able to detect encrypted or compressed data with high accuracy, which makes it possible to use a data transmission channel in encrypted or compressed form, when there is no information about the compression algorithm, for transmitting confidential data. In this paper, we proposed a classification algorithm consisting of several stages: determining the most significant statistical features of random sequences on a training data sample using the random forest algorithm, and direct classification using the decision tree building algorithm. The proposed feature extraction and classification algorithm allowed us to increase the accuracy of classification of encrypted and compressed data to 0.98.

ACKNOWLEDGMENT

The reported study was funded by the Russian Ministry of Science (information security, project number 18/2020).

REFERENCES

[1] Data Breach Report: A Study on Global Data Leaks in H1 2018, InfoWatch, https://www.infowatch.ru/analytics/reports (Access date: 14.01.2020).
[2] B. B. Mahesh, M. S. Bhanu, "Prevention of insider attacks by integrating behavior analysis with risk based access control model to protect cloud", Procedia Computer Science, Vol. 54, 2015, pp. 157-166.
[3] D. Kolevski, K. Michael, Cloud computing data breaches: a socio-technical review of literature, 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), Greater Noida, India, 2015, pp. 1486-1495.
[4] S. Alneyadi, E. Sithirasenan, V. Muthukkumarasamy, Detecting Data Semantic: A Data Leakage Prevention Approach, IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, Vol. 1, 2015, pp. 910-917.
[5] S. Alneyadi, E. Sithirasenan, V. Muthukkumarasamy, Discovery of potential data leaks in email communications, 10th International Conference on Signal Processing and Communication Systems (ICSPCS), Gold Coast, Australia, 2016, pp. 1-10.
[6] X. Huang, Y. Lu, D. Li, M. Ma, A novel mechanism for fast detection of transformed data leakage, IEEE Access, Vol. 6, 2018, pp. 35926-35936.
[7] K. Kaur, I. Gupta, A. K. Singh, Comparative Evaluation of Data Leakage/Loss Prevention Systems (DLPS), In Proc. 4th Int. Conf. Computer Science & Information Technology (CS & IT-CSCP), 2017, pp. 87-95.
[8] L. Cheng, F. Liu, D. Yao, Enterprise data breach: causes, challenges, prevention, and future directions, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 7, No. 5, 2017, p. 1211.
[9] X. Shu, D. Yao, E. Bertino, Privacy-Preserving Detection of Sensitive Data Exposure, IEEE Transactions on Information Forensics and Security, Vol. 10, No. 5, 2015, pp. 1092-1103.
[10] F. Liu, X. Shu, D. Yao, A. R. Butt, Privacy-preserving scanning of big content for sensitive data exposure with MapReduce, Proceedings of the 5th ACM Conference on Data and Application Security and Privacy, 2015, pp. 195-206.
[11] X. Shu, J. Zhang, D. Yao, W. Feng, Rapid and parallel content screening for detecting transformed data exposure, Proceedings of the Third International Workshop on Security and Privacy in Big Data, 2015, pp. 191-196.
[12] X. Shu, J. Zhang, D. D. Yao, W. C. Feng, Fast Detection of Transformed Data Leaks, IEEE Transactions on Information Forensics and Security, Vol. 11, No. 3, 2016, pp. 528-542.
[13] X. Yu, Z. Tian, J. Qiu, F. Jiang, A data leakage prevention method based on the reduction of confidential and context terms for smart mobile devices, Wireless Communications and Mobile Computing, 2018. DOI: 10.1155/2018/5823439.
[14] X. Shu, D. Yao, E. Bertino, Privacy-Preserving Detection of Sensitive Data Exposure, IEEE Transactions on Information Forensics and Security, Vol. 10, No. 5, 2015, pp. 1092-1103.
[15] Y. Shvartzshnaider, Z. Pavlinovic, A. Balashankar, T. Wies, L. Subramanian, H. Nissenbaum, P. Mittal, VACCINE: Using Contextual Integrity For Data Leakage Detection, The World Wide Web Conference, 2019, pp. 1702-1712.
[16] T. Kavitha, O. Rajitha, K. Thejaswi, N. B. Muppalaneni, Classification of encryption algorithms based on ciphertext using pattern recognition techniques, International Conference on Computer Networks, Big Data and IoT, 2018, pp. 540-545.
[17] C. Tan, Q. Ji, An approach to identifying cryptographic algorithm from ciphertext, 8th IEEE International Conference on Communication Software and Networks, 2016, pp. 19-23.
[18] C. Tan, Y. Li, S. Yao, A Novel Identification Approach to Encryption Mode of Block Cipher, 4th International Conference on Sensors, Mechatronics and Automation, Zhuhai, China, 2016. DOI: 10.2991/icsma-16.2016.101.
[19] C. Tan, X. Deng, L. Zhang, Identification of Block Ciphers under CBC Mode, Procedia Computer Science, Vol. 131, 2018, pp. 65-71.
[20] P. K. Ray, S. Ojha, B. K. Roy, A. Basu, Classification of Encryption Algorithms using Fisher's Discriminant Analysis, Defence Science Journal, Vol. 67, No. 1, 2017, pp. 59-65.
[21] J. Pan, Encryption scheme classification: a deep learning approach, International Journal of Electronic Security and Digital Forensics, Vol. 9, No. 4, 2017, pp. 381-395.
[22] W. Wang, M. Zhu, X. Zeng, X. Ye, Y. Sheng, Malware traffic classification using convolutional neural network for representation learning, International Conference on Information Networking (ICOIN), 2017, pp. 712-717.
[23] W. Wang, M. Zhu, J. Wang, X. Zeng, Z. Yang, End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics (ISI), 2017, pp. 43-48.
[24] M. Lotfollahi, M. J. Siavoshani, R. S. H. Zade, M. Saberian, Deep packet: A novel approach for encrypted traffic classification using deep learning, Soft Computing, 2017, pp. 1-14.
[25] J. Zhang, X. Chen, Y. Xiang, W. Zhou, J. Wu, Robust network traffic classification, IEEE/ACM Transactions on Networking, Vol. 23, No. 4, 2015, pp. 1257-1270.
[26] F. Pacheco, E. Exposito, M. Gineste, C. Baudoin, J. Aguilar, Towards the deployment of machine learning solutions in network traffic classification: a systematic survey, IEEE Communications Surveys & Tutorials, Vol. 21, No. 2, 2018, pp. 1988-2014.
[27] D. Hahn, N. Apthorpe, N. Feamster, Detecting compressed cleartext traffic from consumer internet of things devices, arXiv preprint arXiv:1805.02722, 2018.
[28] F. Casino, K. K. R. Choo, C. Patsakis, HEDGE: efficient traffic classification of encrypted and compressed packets, IEEE Transactions on Information Forensics and Security, Vol. 14, No. 11, 2019, pp. 2916-2926.
[29] Z. Tang, X. Zeng, Y. Sheng, Entropy-based feature extraction algorithm for encrypted and non-encrypted compressed traffic classification, International Journal of ICIC, Vol. 15, No. 3, 2019.
[30] A. R. Khakpour, A. X. Liu, An information-theoretical approach to high-speed flow nature identification, IEEE/ACM Transactions on Networking, Vol. 21, No. 4, 2012, pp. 1076-1089.
[31] M. U. Konyshev, A. A. Dvilyansky, A. Y. Barabashov, K. Y. Petrov, Formation of probability distributions of binary vectors of the error source of a Markov discrete memory link using the method of "grouping probabilities" of error vectors, Industrial ACS and Controllers, No. 3, 2018, p. 42.
[32] M. U. Konyshev, A. A. Dvilyansky, K. Y. Petrov, G. A. Ermishin, Algorithm for compression of a distribution series of binary multidimensional random variables, Industrial ACS and Controllers, No. 8, 2016, pp. 47-50.
[33] Toolkit for the transport layer security and secure sockets layer protocols (Access date: 14.01.2020).
[34] Archive manager WinRAR (Access date: 14.01.2020).
[35] Program environment Anaconda, https://www.anaconda.com/distribution/ (Access date: 14.01.2020).
[36] L. Breiman, Classification and regression trees, Routledge, 2017, p. 358.
ABOUT THE AUTHORS

Alexander Kozachok
Workplace: Academy of the Federal Guard Service of the Russian Federation
Email: alex.totrin@gmail.com
Education: Received his PhD degree in Engineering Sciences at the Academy of the Federal Guard Service of the Russian Federation in December 2012; received his doctorate in Engineering Sciences in 2019.
Recent research direction: information security, unauthorized access protection, mathematical cryptography, theoretical problems of computer science.

Andrey Spirin
Workplace: Academy of the Federal Guard Service of the Russian Federation
Email: spirin_aa@bk.ru
Education: Postgraduate student at the Academy of the Federal Guard Service of the Russian Federation.
Recent research direction: information security, DLP systems, machine learning algorithms, classification of binary sequences.
