An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture
The Advanced Encryption Standard (AES) is currently considered as one of the best symmetric-key block
ciphers. The hardware implementation of the AES for hand-held mobile devices or wireless sensor network
nodes is always required to meet the strict constraints in terms of performance, power and cost. Coarse-grained
reconfigurable architectures are recently proposed as the solution that provides high flexibility, high performance
and low power consumption for the next-generation embedded systems. This paper presents a flexible, highperformance implementation of the AES algorithm on a coarse-grained reconfigurable architecture, called
MUSRA (Multimedia Specific Reconfigurable Architecture). First, we propose a hardware-software partitioning
method for mapping the AES algorithm onto the MUSRA. Second, the parallel and pipelining techniques are
considered thoughtfully to increase total computing throughput by efficiently utilizing the computing resources
of the MUSRA. Some optimizations at both loop transformation level and scheduling level are performed in
order to make better use of instruction-, loop- and task- level parallelism. The proposed implementation has been
evaluated by the cycle-accurate simulator of the MUSRA. Experimental results show that the MUSRA can be
reconfigured to support both encryption and decryption with all key lengths specified in the AES standard. The
performance of the AES algorithm on the MUSRA is better than that of the ADRES reconfigurable processor,
Xilinx Virtex-II, and the TI C64+ DSP.
Trang 1
Trang 2
Trang 3
Trang 4
Trang 5
Trang 6
Trang 7
Trang 8
Trang 9
Trang 10
Tải về để xem bản đầy đủ
Tóm tắt nội dung tài liệu: An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture
processed in parallel. 4.2. Scheduling Context Parser Context Memory Input DMA Output DMA Data Memory IN_FIFO OUT_FIFO GRF AMBA/CGRA Interface 1 2 3 4 3 RCA AMBA BUS CPUInstruction Memory Data Memory IRQC CDMAC DDMAC CGRA Figure 10. System-level cycle-accurate simulator for MUSRA. In this paper, we developed a system-level cycle-accurate simulator for experimentally evaluating and validating the implementation of an application on the MUSRA. The simulator is based on the LEON3 processor and the other IP cores from the Gaisler’s library [15] as shown in Figure 10. The LEON3 processor functions as the central processing unit (CPU) that takes charge of managing and scheduling all activities of the system. The external memory is used for communicating data between tasks on the CPU and tasks on the RCA. Cooperation between RCA, CPU and DMAs are synchronized by the interrupt mechanism. When the MUSRA finishes the assigned task, it generates an interrupt via IRQC (Interrupt Request Controller) unit to signal the CPU, and returns bus control to the CPU. In order to simulate, the C-software of the AES algorithm is compiled and loaded into the Instruction Memory of the simulator. Meanwhile, the plaintext file is copied into the Data Memory. Figure 11 shows the timing diagram of scheduling tasks on the different resources of the MUSRA-based system. Execution and data- flow of the MUSRA are reconfigured dynamically under controlling of the CPU. AddRoundKey() and the combination of MixColum() and AddRoundKey() are mapped onto the RCA and denoted as AddRoundKey() and Mix_Add(), respectively, in Figure 11. The other tasks including KeyExpansion() (i.e. Key Exp.) and the combination of SubBytes() and ShiftRows() (i.e. Sub_Shft()) are assigned to the CPU. After resetting, the operation of the system is briefly described as follows: ● Context Memory Initialization (i.e. CM Init. process in Figure 11): CPU writes the necessary control parameters and then grant bus control to CDMAC in Context Memory (i.e. phase (1) in Figure 10). CDMAC will copy a context from the instruction memory to context memory. At the same time, CPU executes Key Exp. function. ● Context Parser Initialization (i.e. PAR init. process in Figure 11): CPU writes the configuration words to the context parser. ● RCA Configuration and Data Memory Initialization: After configured, parser reads one proper context from the context memory, decode it and configure RCA (i.e. Conf. process in Figure 11). Concurrently, CPU initializes DDMAC that will copy data from the external data memory to the internal data memory (i.e. DM init. process in Figure 11). DDMAC is also used for writing the result back to the external data memory. H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 19 Copy to CM RCA PARSER DMA Boot R Processing Resource Time (Cycles) Encryption Copy data to/from DM Conf. AddRounKey(0) Mix_Add(1) Mix_Add(1) CM Init. PAR Init. DM Init. Sub_Shf(1) Sub_Shf(2) Key Exp. Sub_Shf (Nr) Conf. AddRounKey (Nr) Conf. Figure 11. Timing diagram of scheduling sub-tasks on resources of RPU. RCA Execution: RCA performs a certain task (e.g. AddRoundKey(), Mix_Add(),) right after it has been configured. 5. Experiment and Evaluation This section presents the simulating of the AES algorithm on the MUSRA platform that is modeled at different abstraction levels. The performance of the AES algorithm running on the MUSRA is compared with that of the ADRES reconfigurable processor [6], Xilinx Virtex-II (XC2V500) [11], and the TI C64+ DSP from Texas Instruments [3]. 5.1. Simulation Environment The environment for developing and verifying applications on the MUSRA has been built at the different abstract levels [10]. Firstly, the C-model is used for hardware/software partitioning and generating configuration contexts. C-Model is a software platform includes a set of C source files (.c, .h) to define the parameters and the functional model of the building blocks of MUSRA (Figure 12). Besides, C-model also offers several APIs for reading/writing data from/to a text file (.txt) to initialize or store the contents of the memory model of the C-model. The configuration information for the MUSRA is generated by the configuration Tools. Based on the C-model, it is easy to build the testbench programs to verify applications on the MUSRA architecture. The C-model has been developed in the Visual Studio IDE. C-model of MUSRA FIFO DMA RCA core Context Parser MUSRA Parameters API processing file // User application main () { //SW code here } { //code of HW task is removed //extract and generate data for MUSRA //grant parameter to MUSRA } { //Read data that are returned by HW //SW code } C++ code of application Initializing Context Memory Initializing Data Memory and GRF Fetching Context and Configuring RCA, DMAs Run RCA core: (1) Write data to IN_FIFO (2) Processing (3) Write result to OUT_FIFO Store data from OUT_FIFO to Memory OUT_DATA.txt IN_DATA.txt CONSTANT.txt Print data to file or screenScreen CONTEXT.txt HW/SW Partition Configuration ToolHW tasks Testbench Tools Figure 12. C-model of MUSRA. Secondly, a cycle-accurate RTL (Register Transfer Level) model, which is written in VHDL language, is used for evaluating the performance of the algorithm on the proposed architecture. Figure 13 shows an example of the construction of the testbench model for verifying the AES algorithm. Besides the RCA described at RTL, some other function blocks such as clock generator, address generator, data memory, and context memory... are described in the behavioral level. In order to simulate, it also needs the input data files includes "in_data.txt", "constant.txt" and H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 20 "context.txt" - these was created by the C- model of the MUSRA. DUT: RCA8x8_DATAPATH DATA MEMORY SUBSYSTEM MODEL FIFO_WIRE_IN GRF 4*8 bits FIFO_WIRE_IN (31 downto 0) GRF model PE_OUT(5)(0~3) ADDRESS GENERATOR 32 bits CLK GENERATOR CONTEXT MEMORY SUBSYSTEM MODEL 8 bits 7*8 bits In_Data.txt Constant.txt C o n te x t. tx t FIFO_WIRE_OUT Figure 13. RTL model of the MUSRA. Finally, the system-level cycle-accurate simulator (as shown in Figure 11) is used for hardware/software co-verifying and evaluating the performance of the whole algorithm. Both RTL model and the cycle-accurate simulator were developed by using the ModelSim EDA tool from Mentor Graphics. 5.2. Simulation Results and Evaluation Figure 14 shows the simulation results for the case of mapping Mix_Add() (i.e. DFG in Figure 9) on the MUSRA. After the latency of seven cycles (from 100ns to 220ns), RCA can calculate and output a column of four bytes (including pe_out(5)(0) to pe_out(5)(0)) of the status matrix every clock cycle. At the system level, the simulations are done for both encryption and decryption process on an input file of 300KB with key lengths of 128- and 256-bit. The simulation result shows that it take about 2.2 and 2.89 million cycles to perform the algorithm AES with 128- and 256-bit key lengths on the MUSRA, respectively. Table 2 summarizes the simulation results of the AES encryption and decryption algorithm with MUSRA, TI C64+ DSP, and ADRES, Xilinx Virtex-II (XC2V500). The TI C64+ DSP is one 64-bit digital signal processor targeted at the cryptography applications on embedded systems. The C- software of the AES algorithm that is optimized for 64-bit architecture just requires approximately 32 million instructions in total to complete the assigned task. The simulation shows that TI C64+ DSP can execute average 2.09 instructions per cycle, and therefore it takes about 15.2 million cycles to process its tasks. The ADRES [6] is a 32-bit reconfigurable architecture that tightly couples a VLIW processing core with an array of 4×4 reconfigurable cells (RCs). The reconfigurable RCs act as instruction issue slots of the VLIW core. The ADRES takes 3.6 million instructions in total to complete its task with 6.31 instructions per cycle in average. The Virtex-II (XC2V500) is a FPGA device from Xilinx. The authors in [11] proposed the SoC that includes a MicroBlaze processor and the programmable logic of the Xilinx Virtex-II for performing the AES algorithm. Their implementation shows that it requires about 250 cycles to encrypt or decrypt one state block. To evaluate the performance of the MUSRA, the C-software of the AES algorithm, which was optimized for the MUSRA architecture, is first executed on only the LEON3 processor. As shown in Table 2, it has to execute approximately 65.4 million instructions in total. The reason is that the proposed loop transformation increases the length of the C-software. However, when this C-software is executed on both of the LEON3 and the MUSRA, the total cycles is just 2.2 million, which is about 6.9 times, 2.2 times, and 1.6 times better than that of the TI C64+ DSP, Xilinx Virtex-II, and the ADRES. Our proposal achieves 29.71 instructions per cycle in average. The implementation by using Xilinx Virtex-II is slower than ours due to the inherent fine-grained architecture of FPGAs. There are two reasons that make our proposal better than the ADRES. Firstly, the MUSRA uses an 8×8 RCA compared with 4×4 one of the ADRES. Secondly, the AES algorithm is partitioned into hardware tasks and software H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 21 Figure 14. Simulation result with RTL model of MUSRA. Table 2. Performance of the AES algorithm on different platforms (using 128-bit key length) Platform Processing Elements Total Instructions Total Cycles Instructions per Cycle Cycles per Block TI C64+ DSP[3] 1 CPU + Coprocessor 32M 15.2M 2.09 N/A ADRES[6] 1 CPU + 4×4 RCs 23.2M 3.6M 6.31 N/A Xilinx Virtex-II [11] 1 CPU + FPGA N/A N/A N/A 250 Our proposal LEON3 1 CPU 65.4M 65.6M 1 3416 LEON3+MUSRA 1 CPU + 8×8 RCs 65.4M 2.2M 29.71 114 G tasks that are executed simultaneously on both LEON3 and MUSRA. It is difficult to exploit task-level parallelism on the ADRES due to tightly coupling between the VLIW processor with the RCA. 6. Conclusions In this paper, a detailed explanation for mapping the AES algorithm onto the MUSRA platform has been presented. Multi-level parallelism was exploited in order to improve the performance of the AES algorithm on the MUSRA. We first analyzed the source code of the AES algorithm and proposed the optimization solution to expose the instruction- level and the loop-level parallelism. Hardware/software partition and scheduling were also proposed to exploit the task-level parallelism. Our implementation has been simulated and verified by the cycle-accurate simulator of the MUSRA. Experimental results show that the performance of the AES algorithm on MUSRA is better than that of the ADRES reconfigurable processor, Xilinx Virtex-II, and the TI C64+ DSP. It is also easy to reconfigure the MUSRA to support both the encryption and decryption with all key lengths specified in the AES standard. In the future work, some aspects such as hardware/software partitioning, DFG extracting, and scheduling, etc., will continue to be optimized according to the architecture of the MUSRA to achieve a better performance. The proposed implementation also will be validated with the MUSRA prototype on FPGA platform. Acknowledgement This work has been supported by Vietnam National University, Hanoi under Project No. QG.16.33. References [1] NIST, “Announcing the advanced encryption standard (AES)”, Federal Information Processing Standards Publication, n. 197, H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 22 November 26, 2001. [2] Christophe Bobda, “Introduction to Reconfigurable Computing – Architectures, Algorithms, and Applications”, Springer, 2007. doi: 10.1007/978-1-4020-6100-4. [3] J. Jurely and H. Hakkarainen, “TI’s new ’C6x DSP screams at 1.600 MIPS. Microprocessor Report”, 1997. [4] V. Dao, A. Nguyen, V. Hoang and T. Tran, “An ASIC Implementation of Low Area AES Encryption Core for Wireless Networks”, in Proc. International Conference on Computing, Management and Telecommunications (ComManTel2015), pp. 99-112, December 2015. [5] devices/soc/zynq-7000.htm [6] Garcia, A., Berekovic M., Aa T.V., “Mapping of the AES cryptographic algorithm on a Coarse-Grain reconfigurable array processor”, International Conference on Application- Specific Systems, Architectures and Processors (ASAP 2008). [7] João M. P. Cardoso, Pedro C. Diniz: “Compilation Techniques for Reconfigurable Architectures”, Springer, 2009. [8] A. Shoa and S. Shirani, “Run-Time Reconfigurable Systems for Digital Signal Processing Applications: A Survey”, Journal of VLSI Signal Processing, Vol. 39, pp.213–235, Springer, 2005. [9] G. Theodoridis, D. Soudris and S. Vassiliadis, “A Survey of Coarse-Grain Reconfigurable Architectures and Cad Tools Basic Definitions, Critical Design Issues and Existing Coarse-grain Reconfigurable Systems”, Springer, 2008. [10] Hung K. Nguyen, Quang-Vinh Tran, and Xuan- Tu Tran, “Data Locality Exploitation for Coarse- grained Reconfigurable Architecture in a Reconfigurable Network-on-Chip”, The 2014 International Conference on Integrated Circuits, Design, and Verification (ICDV 2014). [11] Z. Alaoui Ismaili and A. Moussa, “Self-Partial and Dynamic Reconfiguration Implementation for AES using FPGA”, IJCSI International Journal of Computer Science Issues, Vol. 2, pp. 33-40, 2009. [12] Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng, “Improving Data Locality with Loop Transformations”, ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 18, Issue 4, July 1996, pp. 424 - 453. [13] S. Sohoni, and R. Min, et al. “A study of memory system performance of multimedia applications”. SIGMETRICS Performance 2001, pp. 206–215. [14] M. Zhu, L. Liu, S. Yin, et al., "A Cycle-Accurate Simulator for a Reconfigurable Multi-Media System," IEICE Transactions on Information and Systems, Vol. 93, pp. 3202-3210, 2010. [15] https://gcc.gnu.org/. [16] Gaisler Research, “GRLIB IP Core User’s Manual”, Version 1.3.0-b4133, August 2013. H
File đính kèm:
- an_efficient_implementation_of_advanced_encryption_standard.pdf