An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture

The Advanced Encryption Standard (AES) is currently considered as one of the best symmetric-key block

ciphers. The hardware implementation of the AES for hand-held mobile devices or wireless sensor network

nodes is always required to meet the strict constraints in terms of performance, power and cost. Coarse-grained

reconfigurable architectures are recently proposed as the solution that provides high flexibility, high performance

and low power consumption for the next-generation embedded systems. This paper presents a flexible, highperformance implementation of the AES algorithm on a coarse-grained reconfigurable architecture, called

MUSRA (Multimedia Specific Reconfigurable Architecture). First, we propose a hardware-software partitioning

method for mapping the AES algorithm onto the MUSRA. Second, the parallel and pipelining techniques are

considered thoughtfully to increase total computing throughput by efficiently utilizing the computing resources

of the MUSRA. Some optimizations at both loop transformation level and scheduling level are performed in

order to make better use of instruction-, loop- and task- level parallelism. The proposed implementation has been

evaluated by the cycle-accurate simulator of the MUSRA. Experimental results show that the MUSRA can be

reconfigured to support both encryption and decryption with all key lengths specified in the AES standard. The

performance of the AES algorithm on the MUSRA is better than that of the ADRES reconfigurable processor,

Xilinx Virtex-II, and the TI C64+ DSP.

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 1

Trang 1

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 2

Trang 2

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 3

Trang 3

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 4

Trang 4

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 5

Trang 5

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 6

Trang 6

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 7

Trang 7

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 8

Trang 8

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 9

Trang 9

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture trang 10

Trang 10

Tải về để xem bản đầy đủ

pdf 13 trang duykhanh 3100
Bạn đang xem 10 trang mẫu của tài liệu "An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture", để tải tài liệu gốc về máy hãy click vào nút Download ở trên

Tóm tắt nội dung tài liệu: An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture

An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture
processed in parallel. 
4.2. Scheduling 
Context 
Parser
Context 
Memory
Input DMA
Output DMA
Data 
Memory
IN_FIFO
OUT_FIFO
GRF
 AMBA/CGRA Interface
1
2
3
4
3
RCA
 AMBA BUS
CPUInstruction Memory
Data 
Memory
IRQC
CDMAC
DDMAC
CGRA
Figure 10. System-level cycle-accurate 
simulator for MUSRA. 
In this paper, we developed a system-level 
cycle-accurate simulator for experimentally 
evaluating and validating the implementation of 
an application on the MUSRA. The simulator is 
based on the LEON3 processor and the other IP 
cores from the Gaisler’s library [15] as shown 
in Figure 10. The LEON3 processor functions 
as the central processing unit (CPU) that takes 
charge of managing and scheduling all activities 
of the system. The external memory is used for 
communicating data between tasks on the CPU 
and tasks on the RCA. Cooperation between 
RCA, CPU and DMAs are synchronized by the 
interrupt mechanism. When the MUSRA 
finishes the assigned task, it generates an 
interrupt via IRQC (Interrupt Request 
Controller) unit to signal the CPU, and returns 
bus control to the CPU. In order to simulate, the 
C-software of the AES algorithm is compiled 
and loaded into the Instruction Memory of the 
simulator. Meanwhile, the plaintext file is 
copied into the Data Memory. 
Figure 11 shows the timing diagram of 
scheduling tasks on the different resources of 
the MUSRA-based system. Execution and data-
flow of the MUSRA are reconfigured 
dynamically under controlling of the CPU. 
AddRoundKey() and the combination of 
MixColum() and AddRoundKey() are mapped 
onto the RCA and denoted as AddRoundKey() and 
Mix_Add(), respectively, in Figure 11. The other 
tasks including KeyExpansion() (i.e. Key Exp.) 
and the combination of SubBytes() and ShiftRows() 
(i.e. Sub_Shft()) are assigned to the CPU. 
After resetting, the operation of the system 
is briefly described as follows: 
● Context Memory Initialization (i.e. CM 
Init. process in Figure 11): CPU writes the 
necessary control parameters and then grant bus 
control to CDMAC in Context Memory (i.e. 
phase (1) in Figure 10). CDMAC will copy a 
context from the instruction memory to context 
memory. At the same time, CPU executes Key 
Exp. function. 
● Context Parser Initialization (i.e. PAR 
init. process in Figure 11): CPU writes the 
configuration words to the context parser. 
● RCA Configuration and Data Memory 
Initialization: After configured, parser reads 
one proper context from the context memory, 
decode it and configure RCA (i.e. Conf. process 
in Figure 11). Concurrently, CPU initializes 
DDMAC that will copy data from the external 
data memory to the internal data memory (i.e. 
DM init. process in Figure 11). DDMAC is also 
used for writing the result back to the external 
data memory. 
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 19 
Copy to 
CM
RCA
PARSER
DMA
Boot
R
Processing 
Resource
Time (Cycles)
Encryption
Copy data to/from DM
Conf.
AddRounKey(0) Mix_Add(1) Mix_Add(1)
CM 
Init.
PAR 
Init.
DM 
Init.
Sub_Shf(1) Sub_Shf(2)
Key
Exp.
Sub_Shf
(Nr)
Conf.
AddRounKey
(Nr)
Conf.
Figure 11. Timing diagram of scheduling sub-tasks on resources of RPU. 
RCA Execution: RCA performs a certain 
task (e.g. AddRoundKey(), Mix_Add(),) right 
after it has been configured. 
5. Experiment and Evaluation 
This section presents the simulating of the 
AES algorithm on the MUSRA platform that is 
modeled at different abstraction levels. The 
performance of the AES algorithm running on 
the MUSRA is compared with that of the 
ADRES reconfigurable processor [6], Xilinx 
Virtex-II (XC2V500) [11], and the TI C64+ 
DSP from Texas Instruments [3]. 
5.1. Simulation Environment 
The environment for developing and 
verifying applications on the MUSRA has 
been built at the different abstract levels [10]. 
Firstly, the C-model is used for 
hardware/software partitioning and generating 
configuration contexts. C-Model is a software 
platform includes a set of C source files (.c, .h) 
to define the parameters and the functional 
model of the building blocks of MUSRA 
(Figure 12). Besides, C-model also offers 
several APIs for reading/writing data from/to a 
text file (.txt) to initialize or store the contents 
of the memory model of the C-model. The 
configuration information for the MUSRA is 
generated by the configuration Tools. Based on 
the C-model, it is easy to build the testbench 
programs to verify applications on the 
MUSRA architecture. The C-model has been 
developed in the Visual Studio IDE. 
C-model of MUSRA
FIFO DMA RCA core
Context Parser MUSRA Parameters API processing file 
// User application
main ()
{
 //SW code here
}
{
 //code of HW task is removed
 //extract and generate data for MUSRA
 //grant parameter to MUSRA
}
{
 //Read data that are returned by HW 
 //SW code
}
C++ code of application Initializing Context Memory
Initializing Data Memory and 
GRF
Fetching Context and 
Configuring RCA, DMAs
Run RCA core:
(1) Write data to IN_FIFO
(2) Processing 
(3) Write result to OUT_FIFO
Store data from OUT_FIFO 
to Memory
OUT_DATA.txt
IN_DATA.txt
CONSTANT.txt
Print data to file or screenScreen
CONTEXT.txt
HW/SW 
Partition
Configuration 
ToolHW 
tasks
Testbench
Tools
Figure 12. C-model of MUSRA. 
Secondly, a cycle-accurate RTL (Register 
Transfer Level) model, which is written in 
VHDL language, is used for evaluating the 
performance of the algorithm on the proposed 
architecture. Figure 13 shows an example of 
the construction of the testbench model for 
verifying the AES algorithm. Besides the RCA 
described at RTL, some other function blocks 
such as clock generator, address generator, 
data memory, and context memory... are 
described in the behavioral level. In order to 
simulate, it also needs the input data files 
includes "in_data.txt", "constant.txt" and 
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 
20 
"context.txt" - these was created by the C-
model of the MUSRA. 
DUT: RCA8x8_DATAPATH
DATA MEMORY 
SUBSYSTEM MODEL
FIFO_WIRE_IN GRF
4*8 bits
FIFO_WIRE_IN
(31 downto 0)
GRF model
PE_OUT(5)(0~3)
ADDRESS 
GENERATOR
32 
bits
CLK 
GENERATOR
CONTEXT 
MEMORY 
SUBSYSTEM 
MODEL
8 bits
7*8 bits
In_Data.txt Constant.txt
C
o
n
te
x
t.
tx
t
FIFO_WIRE_OUT
Figure 13. RTL model of the MUSRA. 
Finally, the system-level cycle-accurate 
simulator (as shown in Figure 11) is used for 
hardware/software co-verifying and evaluating 
the performance of the whole algorithm. Both 
RTL model and the cycle-accurate simulator 
were developed by using the ModelSim EDA 
tool from Mentor Graphics. 
5.2. Simulation Results and Evaluation 
Figure 14 shows the simulation results for 
the case of mapping Mix_Add() (i.e. DFG in 
Figure 9) on the MUSRA. After the latency of 
seven cycles (from 100ns to 220ns), RCA can 
calculate and output a column of four bytes 
(including pe_out(5)(0) to pe_out(5)(0)) of the 
status matrix every clock cycle. 
At the system level, the simulations are 
done for both encryption and decryption 
process on an input file of 300KB with key 
lengths of 128- and 256-bit. The simulation 
result shows that it take about 2.2 and 2.89 
million cycles to perform the algorithm AES 
with 128- and 256-bit key lengths on the 
MUSRA, respectively. 
Table 2 summarizes the simulation results 
of the AES encryption and decryption 
algorithm with MUSRA, TI C64+ DSP, and 
ADRES, Xilinx Virtex-II (XC2V500). 
The TI C64+ DSP is one 64-bit digital 
signal processor targeted at the cryptography 
applications on embedded systems. The C-
software of the AES algorithm that is optimized 
for 64-bit architecture just requires 
approximately 32 million instructions in total to 
complete the assigned task. The simulation 
shows that TI C64+ DSP can execute average 
2.09 instructions per cycle, and therefore it takes 
about 15.2 million cycles to process its tasks. 
The ADRES [6] is a 32-bit reconfigurable 
architecture that tightly couples a VLIW 
processing core with an array of 4×4 
reconfigurable cells (RCs). The reconfigurable 
RCs act as instruction issue slots of the VLIW 
core. The ADRES takes 3.6 million 
instructions in total to complete its task with 
6.31 instructions per cycle in average. 
The Virtex-II (XC2V500) is a FPGA 
device from Xilinx. The authors in [11] 
proposed the SoC that includes a MicroBlaze 
processor and the programmable logic of the 
Xilinx Virtex-II for performing the AES 
algorithm. Their implementation shows that it 
requires about 250 cycles to encrypt or decrypt 
one state block. 
To evaluate the performance of the 
MUSRA, the C-software of the AES algorithm, 
which was optimized for the MUSRA 
architecture, is first executed on only the 
LEON3 processor. As shown in Table 2, it has 
to execute approximately 65.4 million 
instructions in total. The reason is that the 
proposed loop transformation increases the 
length of the C-software. However, when this 
C-software is executed on both of the LEON3 
and the MUSRA, the total cycles is just 2.2 
million, which is about 6.9 times, 2.2 times, 
and 1.6 times better than that of the TI C64+ 
DSP, Xilinx Virtex-II, and the ADRES. Our 
proposal achieves 29.71 instructions per cycle 
in average. The implementation by using 
Xilinx Virtex-II is slower than ours due to the 
inherent fine-grained architecture of FPGAs. 
There are two reasons that make our proposal 
better than the ADRES. Firstly, the MUSRA 
uses an 8×8 RCA compared with 4×4 one of 
the ADRES. Secondly, the AES algorithm is 
partitioned into hardware tasks and software 
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 21 
Figure 14. Simulation result with RTL model of MUSRA. 
Table 2. Performance of the AES algorithm on different platforms (using 128-bit key length) 
Platform Processing Elements 
Total 
Instructions 
Total 
Cycles 
Instructions 
per Cycle 
Cycles 
per Block 
TI C64+ DSP[3] 1 CPU + Coprocessor 32M 15.2M 2.09 
N/A 
ADRES[6] 1 CPU + 4×4 RCs 23.2M 3.6M 6.31 N/A 
Xilinx Virtex-II [11] 1 CPU + FPGA N/A N/A N/A 250 
Our 
proposal 
LEON3 1 CPU 65.4M 65.6M 1 
3416 
LEON3+MUSRA 1 CPU + 8×8 RCs 65.4M 2.2M 29.71 114 
G 
tasks that are executed simultaneously on both 
LEON3 and MUSRA. It is difficult to exploit 
task-level parallelism on the ADRES due to 
tightly coupling between the VLIW processor 
with the RCA. 
6. Conclusions 
In this paper, a detailed explanation for 
mapping the AES algorithm onto the MUSRA 
platform has been presented. Multi-level 
parallelism was exploited in order to improve 
the performance of the AES algorithm on the 
MUSRA. We first analyzed the source code of 
the AES algorithm and proposed the 
optimization solution to expose the instruction-
level and the loop-level parallelism. 
Hardware/software partition and scheduling 
were also proposed to exploit the task-level 
parallelism. Our implementation has been 
simulated and verified by the cycle-accurate 
simulator of the MUSRA. Experimental results 
show that the performance of the AES 
algorithm on MUSRA is better than that of the 
ADRES reconfigurable processor, Xilinx 
Virtex-II, and the TI C64+ DSP. It is also easy 
to reconfigure the MUSRA to support both the 
encryption and decryption with all key lengths 
specified in the AES standard. 
In the future work, some aspects such as 
hardware/software partitioning, DFG extracting, 
and scheduling, etc., will continue to be 
optimized according to the architecture of the 
MUSRA to achieve a better performance. The 
proposed implementation also will be validated 
with the MUSRA prototype on FPGA platform. 
Acknowledgement 
This work has been supported by Vietnam 
National University, Hanoi under Project No. 
QG.16.33. 
References 
[1] NIST, “Announcing the advanced encryption 
standard (AES)”, Federal Information 
Processing Standards Publication, n. 197, 
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 
22 
November 26, 2001. 
[2] Christophe Bobda, “Introduction to 
Reconfigurable Computing – Architectures, 
Algorithms, and Applications”, Springer, 2007. 
doi: 10.1007/978-1-4020-6100-4. 
[3] J. Jurely and H. Hakkarainen, “TI’s new ’C6x 
DSP screams at 1.600 MIPS. Microprocessor 
Report”, 1997. 
[4] V. Dao, A. Nguyen, V. Hoang and T. Tran, “An 
ASIC Implementation of Low Area AES 
Encryption Core for Wireless Networks”, in Proc. 
International Conference on Computing, 
Management and Telecommunications 
(ComManTel2015), pp. 99-112, December 2015. 
[5] 
devices/soc/zynq-7000.htm 
[6] Garcia, A., Berekovic M., Aa T.V., “Mapping 
of the AES cryptographic algorithm on a 
Coarse-Grain reconfigurable array processor”, 
International Conference on Application-
Specific Systems, Architectures and Processors 
(ASAP 2008). 
[7] João M. P. Cardoso, Pedro C. Diniz: 
“Compilation Techniques for Reconfigurable 
Architectures”, Springer, 2009. 
[8] A. Shoa and S. Shirani, “Run-Time 
Reconfigurable Systems for Digital Signal 
Processing Applications: A Survey”, Journal of 
VLSI Signal Processing, Vol. 39, pp.213–235, 
Springer, 2005. 
[9] G. Theodoridis, D. Soudris and S. Vassiliadis, “A 
Survey of Coarse-Grain Reconfigurable 
Architectures and Cad Tools Basic Definitions, 
Critical Design Issues and Existing Coarse-grain 
Reconfigurable Systems”, Springer, 2008. 
[10] Hung K. Nguyen, Quang-Vinh Tran, and Xuan-
Tu Tran, “Data Locality Exploitation for Coarse-
grained Reconfigurable Architecture in a 
Reconfigurable Network-on-Chip”, The 2014 
International Conference on Integrated Circuits, 
Design, and Verification (ICDV 2014). 
[11] Z. Alaoui Ismaili and A. Moussa, “Self-Partial 
and Dynamic Reconfiguration Implementation for 
AES using FPGA”, IJCSI International Journal of 
Computer Science Issues, Vol. 2, pp. 33-40, 2009. 
[12] Kathryn S. McKinley, Steve Carr, Chau-Wen 
Tseng, “Improving Data Locality with Loop 
Transformations”, ACM Transactions on 
Programming Languages and Systems (TOPLAS), 
Volume 18, Issue 4, July 1996, pp. 424 - 453. 
[13] S. Sohoni, and R. Min, et al. “A study of memory 
system performance of multimedia applications”. 
SIGMETRICS Performance 2001, pp. 206–215. 
[14] M. Zhu, L. Liu, S. Yin, et al., "A Cycle-Accurate 
Simulator for a Reconfigurable Multi-Media 
System," IEICE Transactions on Information and 
Systems, Vol. 93, pp. 3202-3210, 2010. 
[15] https://gcc.gnu.org/. 
[16] Gaisler Research, “GRLIB IP Core User’s Manual”, 
Version 1.3.0-b4133, August 2013. 
H 

File đính kèm:

  • pdfan_efficient_implementation_of_advanced_encryption_standard.pdf