An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture
The Advanced Encryption Standard (AES) is currently considered as one of the best symmetric-key block
ciphers. The hardware implementation of the AES for hand-held mobile devices or wireless sensor network
nodes is always required to meet the strict constraints in terms of performance, power and cost. Coarse-grained
reconfigurable architectures are recently proposed as the solution that provides high flexibility, high performance
and low power consumption for the next-generation embedded systems. This paper presents a flexible, highperformance implementation of the AES algorithm on a coarse-grained reconfigurable architecture, called
MUSRA (Multimedia Specific Reconfigurable Architecture). First, we propose a hardware-software partitioning
method for mapping the AES algorithm onto the MUSRA. Second, the parallel and pipelining techniques are
considered thoughtfully to increase total computing throughput by efficiently utilizing the computing resources
of the MUSRA. Some optimizations at both loop transformation level and scheduling level are performed in
order to make better use of instruction-, loop- and task- level parallelism. The proposed implementation has been
evaluated by the cycle-accurate simulator of the MUSRA. Experimental results show that the MUSRA can be
reconfigured to support both encryption and decryption with all key lengths specified in the AES standard. The
performance of the AES algorithm on the MUSRA is better than that of the ADRES reconfigurable processor,
Xilinx Virtex-II, and the TI C64+ DSP.

Trang 1

Trang 2

Trang 3

Trang 4

Trang 5

Trang 6

Trang 7

Trang 8

Trang 9

Trang 10
Tải về để xem bản đầy đủ
Tóm tắt nội dung tài liệu: An efficient implementation of advanced encryption standard on the coarse - Grained reconfigurable architecture
processed in parallel.
4.2. Scheduling
Context
Parser
Context
Memory
Input DMA
Output DMA
Data
Memory
IN_FIFO
OUT_FIFO
GRF
AMBA/CGRA Interface
1
2
3
4
3
RCA
AMBA BUS
CPUInstruction Memory
Data
Memory
IRQC
CDMAC
DDMAC
CGRA
Figure 10. System-level cycle-accurate
simulator for MUSRA.
In this paper, we developed a system-level
cycle-accurate simulator for experimentally
evaluating and validating the implementation of
an application on the MUSRA. The simulator is
based on the LEON3 processor and the other IP
cores from the Gaisler’s library [15] as shown
in Figure 10. The LEON3 processor functions
as the central processing unit (CPU) that takes
charge of managing and scheduling all activities
of the system. The external memory is used for
communicating data between tasks on the CPU
and tasks on the RCA. Cooperation between
RCA, CPU and DMAs are synchronized by the
interrupt mechanism. When the MUSRA
finishes the assigned task, it generates an
interrupt via IRQC (Interrupt Request
Controller) unit to signal the CPU, and returns
bus control to the CPU. In order to simulate, the
C-software of the AES algorithm is compiled
and loaded into the Instruction Memory of the
simulator. Meanwhile, the plaintext file is
copied into the Data Memory.
Figure 11 shows the timing diagram of
scheduling tasks on the different resources of
the MUSRA-based system. Execution and data-
flow of the MUSRA are reconfigured
dynamically under controlling of the CPU.
AddRoundKey() and the combination of
MixColum() and AddRoundKey() are mapped
onto the RCA and denoted as AddRoundKey() and
Mix_Add(), respectively, in Figure 11. The other
tasks including KeyExpansion() (i.e. Key Exp.)
and the combination of SubBytes() and ShiftRows()
(i.e. Sub_Shft()) are assigned to the CPU.
After resetting, the operation of the system
is briefly described as follows:
● Context Memory Initialization (i.e. CM
Init. process in Figure 11): CPU writes the
necessary control parameters and then grant bus
control to CDMAC in Context Memory (i.e.
phase (1) in Figure 10). CDMAC will copy a
context from the instruction memory to context
memory. At the same time, CPU executes Key
Exp. function.
● Context Parser Initialization (i.e. PAR
init. process in Figure 11): CPU writes the
configuration words to the context parser.
● RCA Configuration and Data Memory
Initialization: After configured, parser reads
one proper context from the context memory,
decode it and configure RCA (i.e. Conf. process
in Figure 11). Concurrently, CPU initializes
DDMAC that will copy data from the external
data memory to the internal data memory (i.e.
DM init. process in Figure 11). DDMAC is also
used for writing the result back to the external
data memory.
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 19
Copy to
CM
RCA
PARSER
DMA
Boot
R
Processing
Resource
Time (Cycles)
Encryption
Copy data to/from DM
Conf.
AddRounKey(0) Mix_Add(1) Mix_Add(1)
CM
Init.
PAR
Init.
DM
Init.
Sub_Shf(1) Sub_Shf(2)
Key
Exp.
Sub_Shf
(Nr)
Conf.
AddRounKey
(Nr)
Conf.
Figure 11. Timing diagram of scheduling sub-tasks on resources of RPU.
RCA Execution: RCA performs a certain
task (e.g. AddRoundKey(), Mix_Add(),) right
after it has been configured.
5. Experiment and Evaluation
This section presents the simulating of the
AES algorithm on the MUSRA platform that is
modeled at different abstraction levels. The
performance of the AES algorithm running on
the MUSRA is compared with that of the
ADRES reconfigurable processor [6], Xilinx
Virtex-II (XC2V500) [11], and the TI C64+
DSP from Texas Instruments [3].
5.1. Simulation Environment
The environment for developing and
verifying applications on the MUSRA has
been built at the different abstract levels [10].
Firstly, the C-model is used for
hardware/software partitioning and generating
configuration contexts. C-Model is a software
platform includes a set of C source files (.c, .h)
to define the parameters and the functional
model of the building blocks of MUSRA
(Figure 12). Besides, C-model also offers
several APIs for reading/writing data from/to a
text file (.txt) to initialize or store the contents
of the memory model of the C-model. The
configuration information for the MUSRA is
generated by the configuration Tools. Based on
the C-model, it is easy to build the testbench
programs to verify applications on the
MUSRA architecture. The C-model has been
developed in the Visual Studio IDE.
C-model of MUSRA
FIFO DMA RCA core
Context Parser MUSRA Parameters API processing file
// User application
main ()
{
//SW code here
}
{
//code of HW task is removed
//extract and generate data for MUSRA
//grant parameter to MUSRA
}
{
//Read data that are returned by HW
//SW code
}
C++ code of application Initializing Context Memory
Initializing Data Memory and
GRF
Fetching Context and
Configuring RCA, DMAs
Run RCA core:
(1) Write data to IN_FIFO
(2) Processing
(3) Write result to OUT_FIFO
Store data from OUT_FIFO
to Memory
OUT_DATA.txt
IN_DATA.txt
CONSTANT.txt
Print data to file or screenScreen
CONTEXT.txt
HW/SW
Partition
Configuration
ToolHW
tasks
Testbench
Tools
Figure 12. C-model of MUSRA.
Secondly, a cycle-accurate RTL (Register
Transfer Level) model, which is written in
VHDL language, is used for evaluating the
performance of the algorithm on the proposed
architecture. Figure 13 shows an example of
the construction of the testbench model for
verifying the AES algorithm. Besides the RCA
described at RTL, some other function blocks
such as clock generator, address generator,
data memory, and context memory... are
described in the behavioral level. In order to
simulate, it also needs the input data files
includes "in_data.txt", "constant.txt" and
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22
20
"context.txt" - these was created by the C-
model of the MUSRA.
DUT: RCA8x8_DATAPATH
DATA MEMORY
SUBSYSTEM MODEL
FIFO_WIRE_IN GRF
4*8 bits
FIFO_WIRE_IN
(31 downto 0)
GRF model
PE_OUT(5)(0~3)
ADDRESS
GENERATOR
32
bits
CLK
GENERATOR
CONTEXT
MEMORY
SUBSYSTEM
MODEL
8 bits
7*8 bits
In_Data.txt Constant.txt
C
o
n
te
x
t.
tx
t
FIFO_WIRE_OUT
Figure 13. RTL model of the MUSRA.
Finally, the system-level cycle-accurate
simulator (as shown in Figure 11) is used for
hardware/software co-verifying and evaluating
the performance of the whole algorithm. Both
RTL model and the cycle-accurate simulator
were developed by using the ModelSim EDA
tool from Mentor Graphics.
5.2. Simulation Results and Evaluation
Figure 14 shows the simulation results for
the case of mapping Mix_Add() (i.e. DFG in
Figure 9) on the MUSRA. After the latency of
seven cycles (from 100ns to 220ns), RCA can
calculate and output a column of four bytes
(including pe_out(5)(0) to pe_out(5)(0)) of the
status matrix every clock cycle.
At the system level, the simulations are
done for both encryption and decryption
process on an input file of 300KB with key
lengths of 128- and 256-bit. The simulation
result shows that it take about 2.2 and 2.89
million cycles to perform the algorithm AES
with 128- and 256-bit key lengths on the
MUSRA, respectively.
Table 2 summarizes the simulation results
of the AES encryption and decryption
algorithm with MUSRA, TI C64+ DSP, and
ADRES, Xilinx Virtex-II (XC2V500).
The TI C64+ DSP is one 64-bit digital
signal processor targeted at the cryptography
applications on embedded systems. The C-
software of the AES algorithm that is optimized
for 64-bit architecture just requires
approximately 32 million instructions in total to
complete the assigned task. The simulation
shows that TI C64+ DSP can execute average
2.09 instructions per cycle, and therefore it takes
about 15.2 million cycles to process its tasks.
The ADRES [6] is a 32-bit reconfigurable
architecture that tightly couples a VLIW
processing core with an array of 4×4
reconfigurable cells (RCs). The reconfigurable
RCs act as instruction issue slots of the VLIW
core. The ADRES takes 3.6 million
instructions in total to complete its task with
6.31 instructions per cycle in average.
The Virtex-II (XC2V500) is a FPGA
device from Xilinx. The authors in [11]
proposed the SoC that includes a MicroBlaze
processor and the programmable logic of the
Xilinx Virtex-II for performing the AES
algorithm. Their implementation shows that it
requires about 250 cycles to encrypt or decrypt
one state block.
To evaluate the performance of the
MUSRA, the C-software of the AES algorithm,
which was optimized for the MUSRA
architecture, is first executed on only the
LEON3 processor. As shown in Table 2, it has
to execute approximately 65.4 million
instructions in total. The reason is that the
proposed loop transformation increases the
length of the C-software. However, when this
C-software is executed on both of the LEON3
and the MUSRA, the total cycles is just 2.2
million, which is about 6.9 times, 2.2 times,
and 1.6 times better than that of the TI C64+
DSP, Xilinx Virtex-II, and the ADRES. Our
proposal achieves 29.71 instructions per cycle
in average. The implementation by using
Xilinx Virtex-II is slower than ours due to the
inherent fine-grained architecture of FPGAs.
There are two reasons that make our proposal
better than the ADRES. Firstly, the MUSRA
uses an 8×8 RCA compared with 4×4 one of
the ADRES. Secondly, the AES algorithm is
partitioned into hardware tasks and software
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 21
Figure 14. Simulation result with RTL model of MUSRA.
Table 2. Performance of the AES algorithm on different platforms (using 128-bit key length)
Platform Processing Elements
Total
Instructions
Total
Cycles
Instructions
per Cycle
Cycles
per Block
TI C64+ DSP[3] 1 CPU + Coprocessor 32M 15.2M 2.09
N/A
ADRES[6] 1 CPU + 4×4 RCs 23.2M 3.6M 6.31 N/A
Xilinx Virtex-II [11] 1 CPU + FPGA N/A N/A N/A 250
Our
proposal
LEON3 1 CPU 65.4M 65.6M 1
3416
LEON3+MUSRA 1 CPU + 8×8 RCs 65.4M 2.2M 29.71 114
G
tasks that are executed simultaneously on both
LEON3 and MUSRA. It is difficult to exploit
task-level parallelism on the ADRES due to
tightly coupling between the VLIW processor
with the RCA.
6. Conclusions
In this paper, a detailed explanation for
mapping the AES algorithm onto the MUSRA
platform has been presented. Multi-level
parallelism was exploited in order to improve
the performance of the AES algorithm on the
MUSRA. We first analyzed the source code of
the AES algorithm and proposed the
optimization solution to expose the instruction-
level and the loop-level parallelism.
Hardware/software partition and scheduling
were also proposed to exploit the task-level
parallelism. Our implementation has been
simulated and verified by the cycle-accurate
simulator of the MUSRA. Experimental results
show that the performance of the AES
algorithm on MUSRA is better than that of the
ADRES reconfigurable processor, Xilinx
Virtex-II, and the TI C64+ DSP. It is also easy
to reconfigure the MUSRA to support both the
encryption and decryption with all key lengths
specified in the AES standard.
In the future work, some aspects such as
hardware/software partitioning, DFG extracting,
and scheduling, etc., will continue to be
optimized according to the architecture of the
MUSRA to achieve a better performance. The
proposed implementation also will be validated
with the MUSRA prototype on FPGA platform.
Acknowledgement
This work has been supported by Vietnam
National University, Hanoi under Project No.
QG.16.33.
References
[1] NIST, “Announcing the advanced encryption
standard (AES)”, Federal Information
Processing Standards Publication, n. 197,
H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22
22
November 26, 2001.
[2] Christophe Bobda, “Introduction to
Reconfigurable Computing – Architectures,
Algorithms, and Applications”, Springer, 2007.
doi: 10.1007/978-1-4020-6100-4.
[3] J. Jurely and H. Hakkarainen, “TI’s new ’C6x
DSP screams at 1.600 MIPS. Microprocessor
Report”, 1997.
[4] V. Dao, A. Nguyen, V. Hoang and T. Tran, “An
ASIC Implementation of Low Area AES
Encryption Core for Wireless Networks”, in Proc.
International Conference on Computing,
Management and Telecommunications
(ComManTel2015), pp. 99-112, December 2015.
[5]
devices/soc/zynq-7000.htm
[6] Garcia, A., Berekovic M., Aa T.V., “Mapping
of the AES cryptographic algorithm on a
Coarse-Grain reconfigurable array processor”,
International Conference on Application-
Specific Systems, Architectures and Processors
(ASAP 2008).
[7] João M. P. Cardoso, Pedro C. Diniz:
“Compilation Techniques for Reconfigurable
Architectures”, Springer, 2009.
[8] A. Shoa and S. Shirani, “Run-Time
Reconfigurable Systems for Digital Signal
Processing Applications: A Survey”, Journal of
VLSI Signal Processing, Vol. 39, pp.213–235,
Springer, 2005.
[9] G. Theodoridis, D. Soudris and S. Vassiliadis, “A
Survey of Coarse-Grain Reconfigurable
Architectures and Cad Tools Basic Definitions,
Critical Design Issues and Existing Coarse-grain
Reconfigurable Systems”, Springer, 2008.
[10] Hung K. Nguyen, Quang-Vinh Tran, and Xuan-
Tu Tran, “Data Locality Exploitation for Coarse-
grained Reconfigurable Architecture in a
Reconfigurable Network-on-Chip”, The 2014
International Conference on Integrated Circuits,
Design, and Verification (ICDV 2014).
[11] Z. Alaoui Ismaili and A. Moussa, “Self-Partial
and Dynamic Reconfiguration Implementation for
AES using FPGA”, IJCSI International Journal of
Computer Science Issues, Vol. 2, pp. 33-40, 2009.
[12] Kathryn S. McKinley, Steve Carr, Chau-Wen
Tseng, “Improving Data Locality with Loop
Transformations”, ACM Transactions on
Programming Languages and Systems (TOPLAS),
Volume 18, Issue 4, July 1996, pp. 424 - 453.
[13] S. Sohoni, and R. Min, et al. “A study of memory
system performance of multimedia applications”.
SIGMETRICS Performance 2001, pp. 206–215.
[14] M. Zhu, L. Liu, S. Yin, et al., "A Cycle-Accurate
Simulator for a Reconfigurable Multi-Media
System," IEICE Transactions on Information and
Systems, Vol. 93, pp. 3202-3210, 2010.
[15] https://gcc.gnu.org/.
[16] Gaisler Research, “GRLIB IP Core User’s Manual”,
Version 1.3.0-b4133, August 2013.
H
File đính kèm:
an_efficient_implementation_of_advanced_encryption_standard.pdf

