
A High Throughput and Secure Authentication-Encryption AES-CCM Algorithm On Asynchronous Multicore Processor

1) The document proposes an Authentication-Encryption AES-CCM algorithm implemented on an asynchronous multicore processor to achieve high throughput and security. 2) It employs matrix multiplication to transform 16 plaintexts into 1, improving authentication speed by 32x collectively. It reschedules AES encryptions across cores to compensate physical leakages. 3) Intermediate values are propagated asynchronously between cores to randomize physical leakages, further enhancing security against side-channel attacks. It also proposes a key adjusting technique to protect keys against pattern-based attacks.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIFS.2018.2869344, IEEE Transactions on Information Forensics and Security

A High Throughput and Secure Authentication-Encryption AES-CCM Algorithm on Asynchronous Multicore Processor
Ali Akbar Pammu, Student Member, IEEE, Weng-Geng Ho, Member, IEEE, Ne Kyaw Zwa Lwin,
Kwen-Siong Chong, Senior Member, IEEE and Bah-Hwee Gwee, Senior Member, IEEE

Abstract: We propose an Authentication based Matrix-transformation cum Parallel-encryption implemented on an asynchronous Multicore Processor (AMP-MP) to achieve a high throughput and yet secure Advanced Encryption Standard based on Counter with Chaining Mode (AES-CCM). There are four main features in our proposed AMP-MP. First, we employ the matrix multiplication in GF(2^8) computation to transform the 16 plaintexts into 1 plaintext, hence improving the authentication speed by 32× collectively at the transmitter and receiver. Second, we reschedule the operations of 3 AES encryptions in 3 different cores such that their physical leakages are compensated and equalized, thus reducing the correlation of physical leakage with the processed data by >3×. Third, the intermediate values of AES-CCM are propagated asynchronously between different cores to randomize the physical leakages with the processed data, thereby further enhancing the security of AES-CCM against the SCA by another 3×. Fourth, we propose a key adjusting technique based on S-Box byte-key transformation to protect the key against pattern-based attack. Our proposed AMP-MP is realized on an 8-bit asynchronous 9-core processor fabricated based on 65nm CMOS process. The experimental results show that the throughput of the authentication is 13.54Gbps while the throughput for both authentication and encryption collectively is 8.32Gbps, which are 17× and 70× faster than the reported counterparts respectively. Based on power dissipation and EM SCA on our proposed AMP-MP, the secret key is unrevealed at 5×10^5 traces, which is ~17× more secure than the standard ASIC AES-CCM implementation.

Index Terms: Authentication, Encryption, AES-CCM, Multicore, Asynchronous Circuit, key adjusting technique

The paper was submitted for review in February 2018. This work was supported in part by the Agency for Science, Technology and Research, Singapore under SERC 2013 Public Sector Research Funding, Grant No: SERC1321202098.
Ali Akbar Pammu*, Weng-Geng Ho, Ne Kyaw Zwa Lwin, Kwen-Siong Chong and Bah-Hwee Gwee are with Virtus, IC Design of Excellent Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. (E-mail: *ali1@ntu.edu.sg)

I. INTRODUCTION

Internet Protocol security (IPsec) is a vigorous network-layer protocol to enhance the security level of digital communication systems and it encompasses three processes: key management, authentication and confidentiality [1]. The key management, which is based on the Random Number Generator (RNG) module, establishes secure key distribution using the encryption algorithm and key updates between the transmitter and the receiver before exchanging the messages (plaintexts). As depicted in Fig. 1, the authentication process verifies the originality of the plaintext while the confidentiality is to ensure the security of the messages by performing the encryption-decryption process at the transmitter and receiver respectively. At the transmitter, the Message Authentication Code (MAC) is generated based on the plaintext and secret key through an authentication algorithm. The MAC is used to validate the plaintext upon receipt at the receiver. The plaintext is authenticated when the MAC of the plaintext at the receiver (MACR) matches the transmitted MAC (MACT). On the other hand, the plaintext is encrypted into ciphertext to protect the confidentiality of the plaintext against an adversary.

The authentication and confidentiality can be performed simultaneously using a symmetric encryption algorithm such as Advanced Encryption Standard (AES) [1], which is widely used for encryption. The authentication adopts the chaining mode operation while the confidentiality (encryption) is based on the counter mode operation as depicted in Fig. 2(a). In Fig. 2(a), the AES is used for encryption on the Counter with Chaining Mode (CCM) algorithm, abbreviated as AES-CCM [2]. Thus, the AES-CCM can provide two attributes simultaneously: the high assurance of authentication and the confidentiality of the messages during the communication.

The AES-CCM algorithm is largely employed in standard protocols of vehicular communication, i.e. Controller Area Network Bus (CAN-Bus) [3] and FlexRay [4], to secure the data communications against various attacks, such as Man-in-the-Middle Attack (MMA) [5] and Side Channel Attack (SCA) [6]. In addition, the applications of the AES-CCM algorithm are equally propitious for Internet of Things (IoT) and wearable devices with the employment of a low power and small area AES accelerator (gate counts < 2×10^3) [7] in the AES-CCM.

Besides the aforementioned advantages, the AES-CCM algorithm implementation remains challenging, particularly on achieving the high throughput for communication systems and the high security against physical hardware attacks. Further, the throughput and security are somewhat limited in a synchronous platform. Various techniques have been reported to overcome the said challenges of the AES-CCM, which can largely be categorized into two groups: hardware FPGA and software approaches. The hardware FPGA approaches are pipeline-reconfigurable AES-CCM in FPGA implementation [8], memoryless AES-CCM implementation [9], single-core reconfigurable AES [10], Unified Data Authentication Encryption [11] and Ultra-Low Power AES-CCM IP core [12]. The maximum throughput achieved by the reported techniques [8]-[12] is 3.71Gbps, whereas the need for future communication technologies will be much higher (i.e. >3×). In addition, due to the placement and routing optimization in an FPGA, the physical leakage such as the power dissipation during the encryption process can be vulnerable to SCA [13], where the secret key is breakable at 400 traces. In the reported software implementations of AES-CCM [14], the iterative round computation of AES requires >10^3 clock cycles for each round [6], which is much higher than the FPGA, which requires only one clock cycle for each round computation of the AES algorithm.

In this paper, we propose an Authentication based Matrix-transformation cum Parallel-encryption implemented in

1556-6013 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1: The Message Authentication Code (MAC) is generated at the Transmitter and verified at the Receiver (MACT: generated at Transmitter; MACR: generated at Receiver)

Fig. 2: AES-CCM Architecture based on (a) Conventional (b) Our proposed AMP-MP
Multicore Processor (AMP-MP) to achieve comprehensive performance, high throughput and secure AES-CCM, yet low power dissipation overhead as depicted in Fig. 2(b). Our proposed AMP-MP comprises four main features as follows:

First, we propose to use the matrix multiplication in GF(2^8) computation for authentication to transform each message of 16 plaintexts into 1 plaintext, which is inherently irreversible [2], so as to break the dependency between the authentication input and the messages. In Fig. 2(a), the authentication input (e.g. B0) of conventional AES-CCM is operated sequentially (in chaining mode) to generate the authentication key (Y). With our proposed matrix multiplication in GF(2^8) delineated in Fig. 2(b), the 16 authentication inputs (e.g. B0-B15) are transformed into one authentication key (e.g. Y0). In other words, the transformation, such as in a hash function [15], aims to reduce the length of the authentication input. Consequently, the speed of the authentication process can be increased by ≥32× for both the transmitter and receiver collectively when compared to conventional implementations.

Second, we reschedule the operations of 3 AES encryptions in 3 different cores such that their physical leakages, i.e. power dissipation and Electromagnetic (EM) emanation, are compensated and equalized, thus reducing the correlation of the physical leakage with the processed data by ≥3×.

Third, the intermediate result of the AES-CCM is propagated asynchronously within the cores of the multicore processor, so as to randomize the physical leakage information with processed data. Therefore, the security feature of the AES-CCM can be further enhanced against SCA by another 3×. Furthermore, the power dissipation and EM in asynchronous circuits can be significantly reduced by >2× through reducing the spurious switching due to the clock-less circuit protocol.

Fourth, the leakage characteristics of the secret key based on Hamming Weight (HW) and Hamming Distance (HD) are observable in the key management (an IPsec function), due to the limitation-randomization of True-RNG (TRNG) [16] on key updates [17]. Hence, we further propose a key adjusting technique by leveraging the function of the AES algorithm (i.e. S-Box) and byte circular-shifting when similar patterns of HW and HD of the secret key are detected. Hence, the secret key is protected against the key updates. The fourth feature provides the highest level of security protection of AES-CCM if the SCA can successfully reveal all the secret keys during key updates and subsequently the similar pattern of the secret key is revealed. Furthermore, our proposed AMP-MP is realized on an 8-bit 9-core asynchronous processor and fabricated based on 65nm CMOS process.

The experimental results show that the throughput of the authentication process of AES-CCM based on our proposed AMP-MP implementation is 13.54Gbps while the throughput for authentication and encryption collectively is 8.32Gbps. The multicore based on the asynchronous-logic exhibits uniform distribution of power variance (σpower < 20%) for AES operation, which can reduce the correlation in SCA by >2×. The power dissipation of the authentication process based on our proposed AMP-MP is 307µW, while the overall authentication and encryption process dissipates 311µW to perform 256 plaintexts, ~2× lower than a reported counterpart [11]. Based on SCA evaluation, single (power) and multi-channel (both power and EM) attacks [18], the secret key is unrevealed even after 5×10^5 plaintext measurements. Furthermore, we evaluate the robustness of our proposed AMP-MP against collision attacks [1] with 2^16 sets of plaintext. The experimental results show that the MAC of our proposed AMP-MP has zero collisions. In addition, with our proposed key adjusting technique on 256 randomly generated secret keys, the 4 similar patterns within the 256 secret keys are completely removed.

The remainder of this paper is organized as follows. A review of the AES-CCM is given in Section II. The proposed AMP-MP and its implementation in asynchronous-multicore processors are elaborated in Section III. The experimental results and comparisons of the performance with various reported AES-CCM implementations are presented in Section IV. Evaluation of our proposed AMP-MP based on SCA is presented in Section V. Finally, the paper is concluded in Section VI.


II. REVIEW OF AES-CCM

The operation of AES-CCM, which simultaneously provides two essential processes, data authentication and encryption, to secure digital communication systems, is based on IEEE Standard 802.11i [19] for wireless local area networks. There are 3 inputs to the AES-CCM: (1) The Data which will be both authenticated and encrypted, named as plaintext (P), bit-length: PL; (2) The Associated-Data (A) with bit-length: AL, which is a header of P and will be authenticated but not encrypted; and (3) A unique value generated randomly by TRNG for every block of plaintext and independent of the secret key, denoted as a Nonce (N), bit-length: NL. The N is concatenated with A and P in the sequence N, A, P to form a sequence of blocks B (i.e. B0, B1, …, Br) for the authentication process as depicted in Fig. 3. The authentication output (T) is generated with the bit-length (TL), where the bit range of the Most Significant Byte (MSB) is from 32-to-64 bits [2], 32 < TL < 64.

Fig. 3: Authentication process

The bit-length for each block B of authentication input is 128 bits and the first block (B0) is assigned exclusively for the Nonce, thus NL < 128 bits. The number of blocks (r) for an authentication input depends on NL, AL and PL, and the authentication process is based on Cipher Block Chaining (CBC) [1], i.e. chaining mode. In the operation as depicted in Fig. 2(a), the authentication output of the previous block (BN-1) will be mixed (e.g. XOR) with the input of the current block (BN), where 0 < N < r. In hardware implementations, the number of clock cycles (Clk) required to perform authentication of AES-CCM depends on r, Clk ~ r. Therefore, the bigger the value of r, the longer the authentication process will take. The CBC in AES-CCM could leak information of the secret key (by means of SCA [6]) by correlating the ciphertext and measured EM emanation during the authentication.

In the encryption process as depicted in Fig. 4, the generation of ciphertext and MAC, in parallel or Counter (Ctr) mode encryption, is performed on both the plaintext and the authentication key. The Ctr block, which consists of 128 bits for each block, includes the value of N. The number of Ctr blocks is m, which depends on r, where 0 ≤ r ≤ m, (Ctrm). The input for AES is the Ctrm value and the output of the AES encryption (Sm) is XOR-ed with plaintext, except for Ctr0 which is XOR-ed with T. The ciphertext generated based on the plaintext is appended with the ciphertext generated based on the authentication message, MAC. It is worthwhile to note that the AES-CCM employs only the forward AES encryption [1] to generate MACT and MACR, hence reducing the area overhead.

Fig. 4: Counter mode-based encryption to generate ciphertext and MAC

III. PROPOSED AUTHENTICATION BASED MATRIX-TRANSFORMATION CUM PARALLEL-ENCRYPTION IN MULTICORE PROCESSOR (AMP-MP) ON AES-CCM

The four main features of our proposed AMP-MP on AES-CCM are elaborated as follows.

A. Proposed matrix multiplication in GF(2^8)

To achieve high throughput of the authentication-encryption in AES-CCM, Fig. 5 depicts the schematic flow of the matrix multiplication for reducing the inputs for authentication from 256 blocks to 16 blocks. The input raw data in this operation (password, security code and message (plaintext)) are sorted according to the conventional order: N, A, P. The sorted data is then formatted to form blocks of data (i.e. B0 to B255), in which one block consists of 16 bytes of data. The N corresponds to B0 and is appended with 0-valued bytes if NL < 16 bytes. Based on the conventional implementation [1], the A can be formatted in two or more blocks if AL > 16 bytes. In our proposed matrix multiplication, we employ 16 bytes for AL, hence it is assigned only to one block (i.e. B1) of the authentication input. Finally, the P is assigned to the remaining blocks (i.e. B2-B255).

The block B, in the format of a row vector, is the input for matrix multiplication. The row vector is equivalent to a 1 × 256 matrix in which one block (i.e. B0) is treated as a vector element. The input of the matrix multiplication is reformatted and transposed into a 16×16 square matrix and subsequently multiplied with a constant-column matrix (constant vector) based on GF(2^8) computation to generate a 1 × 16 vector matrix of B’. The block B’ is the input for the authentication process that generates T (based on Yr), so the authentication input has been reduced from 256 to 16 blocks through matrix multiplication with the constant vector. The multiplication process in GF(2^4) is equivalent to the computation in one of the AES operations, the Mix Column operation, which uses a constant square matrix (4×4) to be multiplied with 16 bytes of the intermediate result of the AES round computation. Our proposed matrix multiplication has a similar principle to a length reduction technique on the plaintext (i.e. using the hash function [15]). The fundamental difference between our proposed matrix multiplication and the conventional AES-CCM is a different T value. In order to obtain the correct T and successfully validate the plaintext, the pre-requisite of our proposed matrix multiplication is that it be implemented in both transmitter and receiver.

Fig. 5: Schematic flow of our proposed matrix multiplication in GF(2^8)
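The sorting, formatting and reformatting steps of Fig. 5 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `format_blocks` and `to_square` are hypothetical, and zero-padding A and P (the text only states zero-padding for N) is an assumption made so that exactly 256 blocks are always produced.

```python
BLOCK_BYTES = 16   # one block B is 128 bits
NUM_BLOCKS = 256   # B0 .. B255


def format_blocks(nonce, assoc, plaintext):
    """Build B0..B255 in the order N, A, P.

    B0 holds the nonce, B1 the associated data, B2..B255 the plaintext.
    Short inputs are zero-padded (an assumption for A and P).
    """
    if len(nonce) > BLOCK_BYTES or len(assoc) > BLOCK_BYTES:
        raise ValueError("N and A must each fit in one 16-byte block")
    capacity = (NUM_BLOCKS - 2) * BLOCK_BYTES
    if len(plaintext) > capacity:
        raise ValueError("plaintext does not fit in B2..B255")
    b0 = nonce.ljust(BLOCK_BYTES, b"\x00")
    b1 = assoc.ljust(BLOCK_BYTES, b"\x00")
    padded = plaintext.ljust(capacity, b"\x00")
    p_blocks = [padded[i:i + BLOCK_BYTES]
                for i in range(0, capacity, BLOCK_BYTES)]
    return [b0, b1] + p_blocks


def to_square(blocks):
    """Arrange 256 blocks as a 16x16 matrix of blocks.

    Row i is B_i, B_{i+16}, ..., B_{i+240}, matching the layout
    drawn in Fig. 5 (first row B0, B16, ..., B240).
    """
    if len(blocks) != NUM_BLOCKS:
        raise ValueError("expected exactly 256 blocks")
    return [[blocks[i + 16 * j] for j in range(16)] for i in range(16)]
```

For example, `to_square(format_blocks(nonce, header, message))[0]` would hold the 16 blocks that are later combined into B’0.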


Fig. 6: Matrix multiplication in GF(2^8)

The detailed operation of the matrix multiplication is explained as follows. Consider the polynomial in GF(2^8), which has the form f(x) = a7x^7 + a6x^6 + a5x^5 + a4x^4 + a3x^3 + a2x^2 + a1x + a0 and represents one byte of data in block B. The f(x) is subsequently multiplied with x, which is the representation for a one-byte constant vector, as expressed in (1).

$$x \times f(x) = (a_7x^8 + a_6x^7 + a_5x^6 + a_4x^5 + a_3x^4 + a_2x^3 + a_1x^2 + a_0x) \bmod m(x) \tag{1}$$

where m(x) is the irreducible polynomial of GF(2^8) used in one of the AES functions [1], Mix Column, as shown in (2) as follows:

$$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{2}$$

The reduction of a7x^8 modulo m(x) is shown in (3) as follows:

$$a_7x^8 \bmod m(x) = a_7\big((x^8 + x^4 + x^3 + x + 1) - x^8\big) = a_7(x^4 + x^3 + x + 1) \tag{3}$$

If a7 = 0, the result of x × f(x) is a polynomial of degree less than 8 and the a7 term vanishes. Otherwise, the polynomial degree is reduced modulo m(x), which is expressed in (4) as follows:

$$x \times f(x) = (a_6x^7 + a_5x^6 + a_4x^5 + a_3x^4 + a_2x^3 + a_1x^2 + a_0x) + a_7(x^4 + x^3 + x + 1) \tag{4}$$

As a summary, the overall bitwise operation for each byte in the matrix multiplication can be expressed in (5) as follows:

$$x \times f(x) = \begin{cases} (a_6a_5a_4a_3a_2a_1a_0 0), & a_7 = 0 \\ (a_6a_5a_4a_3a_2a_1a_0 0) \oplus (00011011), & a_7 = 1 \end{cases} \tag{5}$$

The constant vector, which is used in the multiplication of each byte in block B, consists of 16 byte-sized vector elements. We use a generator of a finite field GF(2^4) to generate the constant vector elements, where the generator for GF(2^4) is based on the polynomial equation g^4 + g^3 + g^2 + g + 1. The list of possible values generated by GF(2^4) computations is tabulated in Table I as follows:

TABLE I: A GENERATOR OF GF(2^4) FOR CONSTANT VECTOR

Polynomial             Binary   Hexadecimal
0                      0000     0
1                      0001     1
g                      0010     2
g + 1                  0011     3
g^2                    0100     4
g^2 + 1                0101     5
g^2 + g                0110     6
g^2 + g + 1            0111     7
g^3                    1000     8
g^3 + 1                1001     9
g^3 + g                1010     A
g^3 + g + 1            1011     B
g^3 + g^2              1100     C
g^3 + g^2 + 1          1101     D
g^3 + g^2 + g          1110     E
g^3 + g^2 + g + 1      1111     F

The corresponding bytes in block B (i.e. B0) are multiplied by a constant vector (i.e. the default sequence is 0 - F) and summed up as a byte component in B’0. The computation of the multiplication is expressed in (6) as follows:

$$b'_{0,1} = \big((b_{01} \cdot 0) + (b_{02} \cdot 1) + \cdots + (b_{016} \cdot F)\big) \bmod m(x) \tag{6}$$

where b’0,1 = first byte of block B’0 and b01 = first byte of block B0. For instance, the block B’0 (16 bytes) is generated from the matrix multiplication of B0, B16 ··· B240 (256 bytes) with the constant vector, and B’1 (16 bytes) is generated from the matrix multiplication of B1, B17 ··· B241 (256 bytes) with the constant vector. The 256-to-16 byte reduction is inherently first preimage resistant [1], in that the input is unpredictable based on the output and the values of the constant vector. The flow of the matrix multiplication over GF(2^8) for 256 blocks of B (e.g. plaintext) to generate 16 blocks of B’ for authentication is depicted in Fig. 6. The matrix multiplication is implemented in both transmitter and receiver to obtain the valid authentication tag. With our proposed matrix multiplication in GF(2^8), the authentication process, for one-time communication from the transmitter to the receiver, can be made >32× faster than the conventional AES-CCM implementation. The generator for GF(2^4) will randomize the byte sequences for every specific length of plaintext (e.g. 256 plaintexts) to provide another security layer in authentication. In this context, the sequence of the numbers is not fixed and is unpredictable by an adversary. The sequence of constant vector elements is randomized with 16! ≈ 2.09×10^13 possibilities, thus securing against preimage attacks.

Fig. 7 depicts the circuit implementation of our proposed matrix multiplication. The sequences of the pre-stored constant vector are randomized based on a TRNG which is initiated with a random binary Seed input. The randomized sequence is propagated to the binary multiplication module and to Nonce_R, through which the additional security feature is appended to the Nonce for authentication. It is worthwhile to note that the randomization of the sequence is performed once during the authentication and encryption. In other words, randomization of the sequence will not affect the throughput of the authentication process. The 128-bit input B is multiplied with the 64-bit randomized constant vector by referring to (6), where the output is restricted by GF(2^8). If the value of a7 is high, the multiplication result is XOR-ed with the binary GF(2^8) value 00011011. Otherwise, the result of the multiplication is made transparent to the output. The multiplication process is continuously performed until the 16-byte register of B’0 is fully occupied by the multiplication result and subsequently the Finish signal is triggered to 1.

Fig. 7: Circuit implementation of matrix multiplication
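The byte-level arithmetic of (1)-(6) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' circuit: the names `xtime`, `gf_mul` and `reduce_blocks` are hypothetical, and the reading that block B’i is the GF(2^8) sum (XOR) of B_i, B_{i+16}, ..., B_{i+240}, each scaled by one constant-vector element, is an assumption taken from the layout of Fig. 6.

```python
def xtime(a):
    """Multiply one byte by x in GF(2^8), following (5)."""
    a <<= 1
    if a & 0x100:      # a7 was 1: reduce modulo m(x)
        a ^= 0x11B     # clears bit 8 and XORs in 00011011
    return a & 0xFF


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add using xtime."""
    result = 0
    while b:
        if b & 1:
            result ^= a    # addition in GF(2^8) is XOR
        a = xtime(a)
        b >>= 1
    return result


def reduce_blocks(blocks, const=tuple(range(16))):
    """Reduce 256 blocks of 16 bytes to 16 blocks B'0..B'15.

    Assumed reading of Fig. 6: byte k of B'_i is the XOR over j of
    gf_mul(byte k of B_{i+16j}, const[j]). The default `const` is the
    sequence 0..F; the paper randomizes this sequence with a TRNG.
    """
    if len(blocks) != 256 or len(const) != 16:
        raise ValueError("expected 256 blocks and 16 constants")
    reduced = []
    for i in range(16):
        acc = bytearray(16)
        for j in range(16):
            block = blocks[i + 16 * j]
            for k in range(16):
                acc[k] ^= gf_mul(block[k], const[j])
        reduced.append(bytes(acc))
    return reduced
```

Under the default sequence the first constant is 0, so in this sketch the blocks B0..B15 do not contribute to the output; a permuted sequence changes which column of the matrix is multiplied by 0.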


B. Proposed Parallel AES with Rescheduling the operations

Fig. 8 depicts the circuit implementation of the AES-CCM authentication-encryption in a 9-core synchronous-logic 8051 microcontroller [20] with Asynchronous Network on Chip (ANoC) protocol. To optimize the performance, 3 cores are used for parallel encryption, 2 cores are used for authentication and the other 4 cores are assigned for receiver mode AES-CCM. The Input/output (I/O) interface of 12 ports is grouped into four blocks (North-East-South-West) as depicted in Fig. 8(a). The N, A and secret key are propagated through North (I/O 0-2) while the plaintext and ciphertext are interfaced to West (I/O 9-11) and East (I/O 3-5) respectively. The 32-bit control signal for the multicore processor consists of three signals: Address_data determines the address location of the data in the respective core, Rec_core is the reconfiguration signal that controls the flow of data from the ANoC to the core during the AES encryption, and En_core is the enable signal that switches the respective core on and off. The data valid signal (Dvld) is activated when the ciphertext has been completed and the finish signal (Finish) is high when the encryption of the total plaintexts (i.e. 254 plaintexts) is complete.

The main objective of having 3 cores for parallel-pipeline encryption is to further achieve >3× higher encryption throughput and EM compensation of AES concurrently. The performance of parallel-pipeline encryption is analyzed based on Multiple Instructions and Multiple Data (MIMD) mode [21]. In this analysis, two operations are defined: the vector and the scalar operations. The vector operation involves the data transfer protocol between cores while the scalar operation operates exclusively in a core. The fraction of vector operation is F while the scalar is (1 - F), and their throughputs are denoted as bv and bs for vector and scalar operations respectively. The bv depends on bs in terms of the number of cores n used for parallel-pipeline (bv = n·bs) and the total throughput (btot) is determined based on the parallel current flow formula as expressed in (7).

$$\frac{1}{b_{tot}} = \frac{F}{b_s} + \frac{1 - F}{n b_s} = \frac{nF - F + 1}{n b_s} = \frac{1 + (n - 1)F}{n b_s}; \qquad b_{tot} = \frac{n b_s}{1 + (n - 1)F} \tag{7}$$

The analysis of the EM compensation of parallel-pipeline is explained as follows. The EM flux (Φx) of the EM field (Bx) generated in core x is linearly proportional to the dissipated current (I) and the core area (A) as expressed in (8), where µ, l and r are constants: the permeability of the conductor, the length and the radius of the measured point respectively.

$$\Phi_x = \vec{B}_x \cdot A = \frac{\mu \cdot I_x \cdot l \cdot A}{2\pi r^2} \tag{8}$$

With vector operations performed during the encryption, the Φx at an individual core is randomized for different instructions (round iterations) of AES. Since EM flux is a vector quantity, the magnitudes cancel each other when two EM fluxes are generated by two different cores with opposite directions (i.e. Φx + Φy ≈ 0). Furthermore, the AES encryption is reconfigured (the operations are rescheduled) within three cores based on MIMD such that the critical rounds (e.g. first and last round of AES-128) are performed at different cores. This is to de-correlate the leakage information of EM emanation from the processed data in the same core encryption (e.g. Core_x) as expressed in (9).

$$\Phi_{x,total} = \Phi_{x,first} + \Phi_{x,middle} + \Phi_{x,last} \xrightarrow{\text{uncorrelated}} \text{processed\_data} \tag{9}$$

Fig. 8: Design of ANoC based 9-Core processor (a) Overall architecture of multicore and (b) Parallel and rescheduling three AES within three cores
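The throughput relation in (7) can be evaluated numerically with a short Python sketch. The function name `total_throughput` and its argument names are hypothetical; the formula is taken as printed in (7), where the fraction F is paired with the single-core throughput bs and the remaining fraction (1 - F) with n·bs.

```python
def total_throughput(n, frac, b_s):
    """Evaluate b_tot = n * b_s / (1 + (n - 1) * F) from (7).

    n     : number of cores in the parallel pipeline
    frac  : the fraction F used in (7), between 0 and 1
    b_s   : single-core (scalar) throughput, in any consistent unit
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= frac <= 1.0:
        raise ValueError("F must lie in [0, 1]")
    if b_s <= 0:
        raise ValueError("b_s must be positive")
    return n * b_s / (1 + (n - 1) * frac)
```

With n = 3, the expression ranges from 3·bs at F = 0 down to bs at F = 1, which is the sense in which three cores can give up to a 3× gain under this model.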


The four main functions of the AES algorithm (Add Round Key, S-Box, Shift Row and Mix Column), which are composed of static and vector operations, are assigned to individual cores. To reduce the operation delay and the static-operation complexity, the secret key is expanded (rounds 1 to 10 of the secret key) only at Core_1 and the result is propagated to Core_2 and Core_3. Each core encrypts its Ctr with the secret key up to the 9th round and transmits the intermediate result (i.e. int. plaintext_0) to the next core for the last-round operation. For instance, Core_1 receives plaintext_0 (TL) and generates ciphertext_1 as its output, Core_2 receives plaintext_1 and generates ciphertext_2, and Core_3 receives plaintext_2 and generates ciphertext_0.

C. Proposed novel asynchronous data flow

The main advantage of asynchronous circuits as a countermeasure against SCA is their ability to randomize the time of occurrence of the sensitive leakage information. Compared with their synchronous counterparts, asynchronous circuits reduce the correlation of the physical leakage with the processed data. Furthermore, owing to its clock-less architecture, an asynchronous circuit dissipates lower energy [18] and produces lower EM emanation than a synchronous circuit. In this paper, we propose to implement a novel asynchronous architecture, the Sense Amplifier Half Buffer (SAHB) [21], in the ANoC protocol to achieve the fastest data flow between cores and yet a high resistance of AES-CCM against SCA (i.e. a lower correlation coefficient).

Fig. 9 depicts the SAHB circuit architecture used in the ANoC. It features a dual-rail logic interface, where each signal is accompanied by a complementary logical signal (indicated by the prefix n, e.g. nDin). The dual-rail feature can achieve high robustness, high speed and low energy dissipation [20]. The data flow is based on a handshake protocol, where the acknowledge signal of the right channel (Rack and nRack) is activated when the output (Dout and nDout) is completed. The left acknowledge signal (Lack and nLack) is activated when the SAHB block is ready to receive the input (Din and nDin). The SAHB embodies an evaluation block and a Sense Amplifier (SA) to accommodate latency during the parallel AES encryption; the SA is based on a cross-coupled latch with positive feedback to speed up and latch the output. Finally, the SA is tightly coupled to reduce the number of switching nodes, hence reducing the dynamic power dissipation during authentication and encryption.

[Figure 9 graphic not reproduced: SAHB circuit block with left channel {Din, nDin}, {Lack, nLack} and right channel {Dout, nDout}, {Rack, nRack}, together with the SAHB cell template (VDD, Din/nDin, Dout/nDout, Lack/nLack, Rack).]
Fig. 9: Main channel interface (L and R) of SAHB circuit and cell template

D. Proposed key adjusting technique

Key updating is one of the countermeasures against SCA [17]: the secret key is changed periodically based on time and message size, e.g. every 65 minutes or every 500 megabytes of plaintext. The disadvantage of this method is that it compromises the throughput, because additional time is required by both the key exchange mechanism and the public key management system to update the AES secret key for both the transmitter and the receiver. Based on the observed key patterns, the adversary (i.e. MMA) can predict the subsequent secret key without performing the attack (i.e. SCA). To further protect the AES-CCM implementation against pattern leakage, we propose a key adjusting technique for AES, performed by circularly shifting the 128-bit secret key (on an 8-bit basis) and transforming the value with the non-linear S-Box function to make the subsequent key unpredictable. Fig. 10 depicts the circuit implementation of the proposed key adjusting technique.

Our proposed key adjusting technique is activated when the HW and HD patterns are detected (HW_dctd and HD_dctd). The HW pattern is based on the number of one bits ('1') in each byte of the secret key, while the HD pattern is detected by observing the bit changes between two consecutive secret keys, using XOR logic for each bit. Each 8-bit input of the secret key (Data_in) can be circulated up to 8× until the pattern is fully randomized (in Data_out).

[Figure 10 graphic not reproduced: 8-bit Data_in feeding an 8-bit circular shift register (D0/Q0 to D7/Q7, Clk, Buffer) controlled by HD_dctd and HW_dctd, followed by an S-Box (R_in, R_out) producing the 8-bit Data_out.]
Fig. 10: Circuit schematic of key adjusting technique for each 8-bit Data_in

The key adjusting technique precedes the key expansion. To illustrate the security point of view, suppose the adversary successfully eavesdrops and reveals the pattern of the secret key through SCA on the key management system. Even if the adversary predicts the subsequent key (in HW and HD), the ciphertext cannot be recovered, because the key has been corrected by the key adjusting technique. The synchronization of the adjusted key between the transmitter and the receiver can be performed by two methods. In the first, the key is partitioned into two partitions and encrypted using the previous key; the second uses asymmetric cryptographic algorithms (i.e. RSA, Diffie-Hellman or ECC) [1]. The validation of the authentication key is performed with < 0.3% latency overhead.

IV. EXPERIMENTAL RESULTS IN ASYNCHRONOUS MULTICORE

Our proposed AMP-MP is implemented in a multicore ANoC that embodies the SAHB architecture. We adopt a full-custom approach based on a 65nm CMOS process, where the input and output interfaces are exactly as portrayed in Fig. 8. The microphotograph of the multicore ANoC, leveraging the SAHB architecture, is depicted in Fig. 11; it occupies an area of 0.105mm^2. Based on the 9-core processor with our proposed AMP-MP, we assign three cores (Core_1 to Core_3) for parallel encryption and two cores (Core_4 and Core_5) for authentication, while the other four cores (Core_6 to Core_9) are reserved for the decryption process (receiver mode).

[Figure 11 graphic not reproduced: (a) microphotograph of Core_1 to Core_9, each with its ANoC; (b) a unit core showing the 8051 microprocessor, RAM, XRAM, ROM, the ANoC interface and the SAHB-based ANoC.]
Fig. 11: Microphotograph of fabricated multicore (a) 9-Core with respective ANoC (b) A unit core interfaced with an ANoC embodying the SAHB


To obtain the optimum performance for parallel computations, we assign the maximum number of cores for parallel encryption/decryption. In addition, we set the supply voltages (VDD_ANoC and VDD_Core) and the operating frequency (fCore) to 0.8V, 1.2V and 100MHz respectively for the optimum performance of the asynchronous multicore. To evaluate the efficacy of our proposed AMP-MP on the multicore ANoC, we perform three experiments based on the main features of AMP-MP implemented in AES-CCM, and compare the results with the conventional implementation and the reported techniques.

A. Matrix multiplication over GF(2^8) and authentication

We randomly generate 4,096 bytes (as hex numbers) as the Raw-Data input, which are subsequently sorted into 254 plaintexts, a 6-byte N and a 12-byte A. The Nonce_R (64 bits/8 bytes) is generated by the sequence randomization module and then appended to N before formatting into a series of blocks (256 blocks B). The matrix multiplication in GF(2^8) is performed after reformatting and transposing the 256 blocks B, as depicted in Fig. 5. We measure the power dissipation during the matrix multiplication and the authentication of the authentication input blocks B (i.e. plaintexts) with an oscilloscope at a fixed sampling rate of 2.5GSamples/second. The measured power dissipation profile is depicted in Fig. 12, where the first 16 peaks represent the power dissipated during the matrix multiplication in GF(2^8) and the higher peaks are generated during the authentication process. The 256 blocks of B are transformed into 16 blocks of B' (16 peaks) in 2.079μs with an average dissipated power of 357μW, followed by the authentication process, which is performed in Core_4 and Core_5 through the ANoC in 0.34μs with an average power dissipation of 631μW. The overall throughput of the matrix multiplication for 256 blocks B followed by the authentication process is 13.54Gbps in 2.42μs, with an average power dissipation of ~500μW. The additional 76% power during authentication is dissipated by the overflowing data in the ANoC that enables the continuity of parallel AES in two cores. The continuity of encryption is realized by propagating the plaintext to the AES for every round computation without waiting for the last-round AES computation.

[Figure 12 graphic not reproduced: power dissipation (μW) versus time (μs, 0.0 to 2.5); 357μW level during the first 2.079μs and 631μW level during the following 0.34μs.]
Fig. 12: Power dissipation of matrix multiplication and authentication process

Comparing the throughput performance under the same parameters (i.e. clock, supply voltage and single core processor) and platform, our proposed AMP-MP on AES-CCM is 35.91× faster than the conventional implementation. In this comparison, the conventional implementation [14] dissipates 721μW and requires 86.7μs to both authenticate and encrypt 256 blocks B.

We further investigate the preimage resistance of our proposed input plaintext (message) transformation by performing a second preimage attack experiment on the MAC of 2^16 (65,536) sets of input plaintext. The second preimage resistance is based on the weak collision attack, where it is infeasible to obtain another 16 input plaintexts of B11 which have the same output transformation B' as the original 16 plaintexts of B1. Each set of input plaintext comprises 16 blocks of input plaintext as authentication input B0-B15. The MACs of two different authentication inputs (i.e. Bx0-Bx15 and By0-By15) are compared to investigate the collision. Table II tabulates the performance of the first (1st) and second (2nd) preimage attacks based on the MAC of our proposed message transformation. The performance is compared with two reported hash functions [1], SHA-256 and SHA-512, which are commonly used for message authentication (i.e. MAC).

TABLE II: THE 1ST AND 2ND PREIMAGE ATTACKS BASED ON THE MAC OF 2^16 SETS OF INPUT PLAINTEXT (MESSAGE)
Algorithm                              1st    2nd (coll.)
SHA-256                                Yes    No (3)
SHA-512                                Yes    No (1)
Our proposed message transformation    Yes    Yes (0)

As tabulated in Table II, no collision occurs in the MAC of our proposed input plaintext transformation over 65,536 sets of input plaintext, as indicated by the number of collisions (coll.) being 0. On the other hand, the numbers of collisions for the SHA-256 and SHA-512 algorithms are 3 and 1 respectively. This is partly due to the "birthday paradox" [5] among the 2^16 sets of input plaintext. The collisions can be analyzed based on 2^2,048 bits (2^16 sets of input), which is considered a large search space in which to identify the birthday paradox. Based on our analysis of second preimage attacks, it is worthwhile to note that our proposed matrix multiplication in GF(2^8) is also secured against strong collision attacks, as no pairs of input plaintext are found to collide in MAC.

B. Parallel AES with rescheduling the operations in ANoC

Fig. 13 depicts the power dissipation profile measured from the three encryption cores generating the ciphertext and MAC through the ANoC. The parallel encryption requires 3.26μs and dissipates 307μW average power to encrypt 256 Ctr(s) and XOR them with the plaintexts. The sudden impulsive spike of power dissipation is generated when loading the 255 plaintexts into XRAM before the encryption process. With the rescheduling mode in the three-core ANoC, the throughput of the parallel 3-AES encryption is 10.05Gbps. On the other hand, the conventional AES encryption in a single core requires 20.94μs to encrypt 256 plaintexts, for a throughput of 1.56Gbps. As a result, the parallel encryption with rescheduled AES speeds up the encryption process by 6.42×. It is worthwhile to note that the encryption process can be performed in parallel with authentication (5 cores), hence the overall throughput is mainly determined by encryption, which dissipates 315μW nominal power during 3.93μs; thus the throughput is 8.32Gbps.

[Figure 13 graphic not reproduced: power dissipation (μW) versus time (μs, 0.0 to 5.0); annotations "Loading the plaintext into XRAM" and "Parallel encryption in three cores"; 307μW level; time markers at 3.261μs and 3.934μs.]
Fig. 13: Power dissipation of parallel encryption in three cores which dissipates power 307μW in 3.26μs
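The throughput figures quoted in Sections IV-A and IV-B can be cross-checked with a short calculation. The sketch below assumes that every block (B, Ctr or plaintext) is one 128-bit AES block and that throughput is simply the number of bits processed divided by the measured time; the function name `throughput_gbps` is illustrative and does not come from the paper.

```python
BLOCK_BITS = 128  # assumed size of one AES block in bits


def throughput_gbps(blocks: int, time_us: float) -> float:
    """Throughput in Gbps for `blocks` 128-bit blocks processed in `time_us` microseconds."""
    bits = blocks * BLOCK_BITS
    seconds = time_us * 1e-6
    return bits / seconds / 1e9


# Timings reported in the text for 256 blocks.
auth = throughput_gbps(256, 2.42)            # matrix multiplication + authentication
parallel_enc = throughput_gbps(256, 3.26)    # three-core rescheduled AES
single_enc = throughput_gbps(256, 20.94)     # conventional single-core AES
overall = throughput_gbps(256, 3.93)         # encryption running alongside authentication
speedup = 20.94 / 3.26                       # single-core time over three-core time

for label, value in [
    ("authentication", auth),
    ("parallel encryption", parallel_enc),
    ("single-core encryption", single_enc),
    ("overall", overall),
]:
    print(f"{label:24s} {value:6.2f} Gbps")
print(f"{'encryption speed-up':24s} {speedup:6.2f} x")
```

Under these assumptions the computed values agree with the reported 13.54Gbps, 10.05Gbps, 1.56Gbps and 6.42× to the quoted precision; the overall figure comes out within a few hundredths of the reported 8.32Gbps, which may reflect rounding of the measured time.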


We further analyze the optimum throughput performance for both parallel encryption and decryption (to demonstrate the transmission and receiver modes of AES-CCM). The configuration is based on 6 cores of the multicore, while the other 3 cores are assigned to the authentication of encryption-decryption. Table III tabulates the throughput performance of the 6-core configurations for parallel encryption and decryption. The optimum throughput performance is achieved with 3 cores each for parallel encryption and decryption, where the throughputs are 8.32Gbps and 8.27Gbps respectively.

TABLE III: THE THROUGHPUT PERFORMANCE OF PARALLEL ENCRYPTION AND PARALLEL DECRYPTION IN 6 CORES CONFIGURATION
Cores (Encryption)    Cores (Decryption)    Throughput, Encryption (Gbps)    Throughput, Decryption (Gbps)
1                     5                     2.74                             14.86
2                     4                     5.17                             11.02
3                     3                     8.32                             8.27
4                     2                     11.24                            4.91
5                     1                     15.33                            2.53

For the SCA evaluation, we measure the EM emanation of our proposed AMP-MP during authentication and encryption in AES-CCM and compare it with a single core synchronous implementation @100MHz, as depicted in Fig. 14. The single core implementation generates an EM emanation of 132 dB (in μA/m dB), which can potentially leak information about the AES-CCM secret key through SCA. With our proposed AMP-MP in multicore, the EM emanation of the ANoC with dual-rail SAHB is reduced to 93 dB (a 29.5% reduction), which shows lower data dependency (noisier) than the synchronous counterpart during authentication-encryption in AES-CCM. Besides rescheduling 3 AES encryptions over 3 cores, the expanded keys are distributed to the 3 cores over random routes through the ANoC protocol. With random-route key distribution, the design is further secured against direct attacks on the EM emanation with only 0.21% latency overhead. In addition, the total EM generated from the reconfiguration of the three cores in the AES operation in (9) can improve the resistance level against SCA by breaking the correlation of the EM emanation with the processed data.

[Figure 14 graphic not reproduced: EM magnitude (μA/m dB) versus frequency (MHz, 0 to 1,600); curves "Single core synchronous implementation" (peak 132) and "Our proposed AMP-MP in multicore ANoC" (peak 93).]
Fig. 14: EM emanation of our proposed AMP-MP in multicore and single core synchronous implementation

To further evaluate the robustness of the ANoC in terms of leakage distribution against SCA, we determine the variance of 256 aligned power dissipation traces for the conventional single core synchronous implementation and for our proposed AMP-MP in multicore ANoC. Fig. 15 depicts the variance of two consecutive operations, S-Box and Add Round Key, which involve the value of the secret key. It shows that the power variance of AES in the ANoC implementation is equally distributed (uniform) as a result of the dual-rail asynchronous SAHB architecture. Compared with the single core implementation (dashed line), which leaks information at the highest variance (80% probability of the key being leaked at a particular sampling point: 10), our proposed AMP-MP in multicore ANoC implementation is secured against SCA with a leakage probability of < 20%.

[Figure 15 graphic not reproduced: variance (0.1 to 0.8) versus sampling points (0 to 50); annotations "Leaking the secret key through SCA" and "Equal distribution of variance power dissipation"; a 3.261μs marker.]
Fig. 15: Comparison of the variance power dissipation during the encryption

With a lower leakage probability (i.e. < 20%), the adversary will require a higher number of traces to reveal the secret key through SCA, even with a Higher Order (HO) attack [25]. This is mainly because the leakage information of the secret key is sparse and randomized across the sampling points, with a lower leakage probability. With an HO attack, the number of traces required to reveal the secret key will be only marginally reduced, at the cost of additional computational resources to perform the SCA.

C. Key adjusting technique

The fundamental idea of the key updating system is to restrict the use of the same secret key to a single communication. Thereafter, the secret key has to be updated to further thwart the adversary from deciphering the correct ciphertext. To demonstrate the pattern-based attack against the key updating system, we measured the EM leakage emanations of stateful key updating [17] for synchronous applications. The secret key is updated based on the master key (the initial secret key) to determine the subsequent secret keys. Fig. 16 depicts the EM emanation of the key updating system over 10^4 sampling points. It shows that six similar patterns are detected in Fig. 16(a), which can lead to future computational attacks on the subsequent secret key. However, with our proposed key adjusting technique, in Fig. 16(b) four similar patterns of the key updating system are omitted, hence it is secured against pattern-based attacks. The additional ~22.4% of EM noise (from 0.56 to 0.68 μA/m dB) is generated by the shift register circuit.

[Figure 16 graphic not reproduced: panels (a) and (b), EM magnitude (μA/m dB, -0.2 to 1.0) versus sampling points (×10^3, 0 to 10); marked level 0.56 μA/m dB in (a) and 0.68 μA/m dB in (b).]
Fig. 16: The EM emanation patterns are detected based on (a) reported key updating system and with (b) our proposed key adjusting technique the patterns are omitted
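To make the quantities behind the pattern detection concrete, the following sketch computes the per-byte Hamming weight (HW) of a key and the per-byte Hamming distance (HD) between two consecutive keys, and applies one possible reading of the adjusting step (an 8-bit circular shift followed by the AES S-Box). This is only an illustration under stated assumptions: the paper does not specify the exact detection criterion or the number of shifts, and the names `hw_profile`, `hd_profile` and `adjust_byte` are hypothetical. The S-Box is computed from its standard definition (multiplicative inverse in GF(2^8) followed by the affine transform) rather than from a lookup table.

```python
def hamming_weight(x: int) -> int:
    """Number of '1' bits in a non-negative integer."""
    return bin(x).count("1")


def hw_profile(key: bytes) -> list:
    """Per-byte Hamming weight of a key."""
    return [hamming_weight(b) for b in key]


def hd_profile(prev_key: bytes, next_key: bytes) -> list:
    """Per-byte Hamming distance between two consecutive keys (XOR, then count bits)."""
    return [hamming_weight(a ^ b) for a, b in zip(prev_key, next_key)]


def rotl8(b: int, n: int) -> int:
    """Circular left shift of an 8-bit value by n positions."""
    n %= 8
    return ((b << n) | (b >> (8 - n))) & 0xFF


def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def aes_sbox(x: int) -> int:
    """AES S-Box: inverse in GF(2^8) (0 maps to 0), then the affine transform."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):  # x^254 equals x^-1 in GF(2^8)
            inv = gf_mul(inv, x)
    out = 0x63
    for i in range(5):
        out ^= rotl8(inv, i)
    return out


def adjust_byte(data_in: int, shifts: int) -> int:
    """Hypothetical adjusting step: circular shift, then S-Box (an assumption, not the paper's exact circuit)."""
    return aes_sbox(rotl8(data_in, shifts))


if __name__ == "__main__":
    k0 = bytes(range(16))
    k1 = bytes((b + 1) & 0xFF for b in k0)
    print("HW profile:", hw_profile(k0))
    print("HD profile:", hd_profile(k0, k1))
    print("adjusted  :", [adjust_byte(b, 3) for b in k0])
```

A circular shift on its own leaves the Hamming weight of a byte unchanged, so in this sketch it is the non-linear S-Box stage that alters the HW profile of the adjusted key.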


To further evaluate the robustness of our key adjusting technique, we adopt a TRNG to generate the random key during the key updates. The randomness of the bit sequence is obtained by leveraging a de-correlator [16] based on a circular shift register, which ensures that the key value is uniformly distributed and independent, while the XOR gate restricts the value to GF(2^8) computation. We randomly generate 256 secret keys with the TRNG to simulate 256× key updates [17] and observe the HD and HW patterns. It turns out that, although the values are independent and unpredictable, the HD and HW patterns are repeatable, predictable and recurrent, as depicted in Fig. 17. Two HW patterns are detected at 1-to-3 and 6-to-8, while two HD patterns are detected at 0-to-2 and 5-to-6, giving 4 detected patterns collectively, as depicted in Fig. 17(a). Upon detecting the patterns, we apply our proposed key adjusting technique, shown in Fig. 10, during the key updates, and hence the similar patterns are completely removed, as depicted in Fig. 17(b).

[Figure 17 graphic not reproduced: panels (a) and (b), "HD and HW Distribution of 256 secret key", frequency of occurrences (0 to 45) versus value (0 to 8), with HD Distribution and HW Distribution series; (a) annotated "HW pattern is detected" and "HD pattern is detected"; (b) annotated "After applying key adjusting technique the HD and HW patterns are undetected".]
Fig. 17: HD and HW distribution of 256 secret key (a) HD and HW patterns are detected and (b) Performing key adjusting technique to remove patterns

Table IV tabulates the measurement results for variance distribution, time, power dissipation, EM emanation and number of cores used during the key adjusting technique. It shows that, when both patterns are detected, the distributions of HD and HW are highest, which makes the secret key easy to reveal and predict. The latency is 0.12μs, with 41.35μW and 7.6dB for power and EM respectively. In this experiment, the resulting overheads are 4.95%, 6.33% and 8.17% for latency, power and EM respectively, which is proportional to the number of cores used for key adjustment. The brute-force method is not applicable to attacking the key adjustment process, since the key is corrected randomly over 128 bits. The adversary would still have to search 2^128 possible keys to predict the correct key.

TABLE IV: MEASUREMENTS OF TIME, POWER AND EM OF KEY ADJUSTING
Pattern detection (HD)    Pattern detection (HW)    Distribution σ (HD)    Distribution σ (HW)    Time (μs)    Power (μW)    EM (dB)    Core(s)
0                         0                         0.32                   0.37                   0.00         0.00          0.00       1
0                         1                         0.87                   0.29                   0.08         10.33         4.7        1
1                         0                         0.14                   0.86                   0.06         34.21         5.1        1
1                         1                         0.93                   0.84                   0.12         41.35         7.6        2

V. EXPERIMENTAL RESULTS ON SCA

We measure the Success Rate (SR) of the SCA leakage information based on Test Vector Leakage Assessment (TVLA) [22] to detect the presence of leakage information in the measurements (power or EM) and to determine the leakage probability. The TVLA values, computed with Welch's t-test, are generally determined before quantifying the SCA traces [6]. Fig. 18 depicts the SR of the power and EM analyses for the single core and multicore implementations, where the multi-leakage measurements are the power and EM obtained from the asynchronous multicore implementation.

[Figure 18 graphic not reproduced: Success Rate (%, 0 to 100) versus traces (Power and EM, ×10^4, 0 to 20).]
Fig. 18: The SR of power and EM analyses based on single core and multicore implementations

The SR of the leakage measurements shows that the conventional AES in the single core implementation has a 70% probability of leaking the secret key at 2×10^4 power traces (i.e. 70% of the secret keys can be revealed at 2×10^4 traces). In contrast, when AES is implemented in the asynchronous multicore, the SR is reduced to ~28% even after 2×10^5 traces. We further determine the SR of the multi-leakage measurements, where two leakage measurements (power and EM) are combined for SCA. The SR of the multi-leakage measurements saturates at 50% after 2×10^5 traces. The result implies that the leakage information of the secret key can be revealed with at most 50% success at 2×10^5 traces for our proposed asynchronous multicore implementation.

In addition, the SR of the key adjusting technique is also determined based on TVLA, with reference to the four scenarios in Table IV. The SRs of the key adjusting technique when implemented in 2 cores, for power dissipation and EM emanation measurements, are 14.87% and 9.32% respectively. The 2 cores refer to the highest distribution of HD and HW. The lower SR is obtained owing to the random delay from the ANoC and the low Signal-to-Noise Ratio (SNR) of the leakage measurement generated by the operation of the shift register circuits. Therefore, the information of the secret key is secured both in the sampling points and in the amplitude of the measurements.

To further quantify the resistance of our proposed AMP-MP against SCA, we analyze both Correlation Power Analysis (CPA) and Correlation EM Analysis (CEMA) attacks. Our attack point focuses on Core_1 to Core_5, where the authentication and encryption of AES-CCM are performed. The SCA evaluation consists of two parts: Single Channel [13], which measures only one physical leakage (i.e. power dissipation), and Multi-Channel [17], which employs more than one physical leakage (i.e. both power dissipation and EM emanation) in the computation of the correlation to reveal the secret key. In these SCA evaluations, we targeted two leakage functions: first round decryption (i.e. inverse S-Box and Add Round Key) and last round encryption (i.e. S-Box, Shift Row and Add Round


Key). The two leakage functions are based on the HW leakage model to determine the intermediate values (i.e. processed data). Finally, the secret key is analyzed based on the correlation between the leakage measurements and the processed data.

A. Single Channel SCA

Figs. 19(a) and (b) depict the CPA plots of correlation versus key candidates at 5×10^4 power traces for the single core and multicore ANoC implementations respectively. In the single core implementation, the secret key 176 is successfully revealed at 39,000 traces with a correlation of 0.29 and an SNR of 4.4 dB, as depicted in Fig. 19(a). In contrast, by activating the ANoC between Core_4 and Core_5, the correlation is reduced by 17× (0.29 → 0.017) with an SNR of 0.96 dB, as depicted in Fig. 19(b). The reduction in correlation is mainly due to the dual-rail asynchronous SAHB in the ANoC, which equally distributes the variance of the leakage information in the time domain.

[Figure 19 graphic not reproduced: panels (a) and (b), correlation versus key candidates (0 to 250); (a) "Highest correlation at 176" with a peak of 0.29; (b) "The correlation is lower at 176" with a value of 0.017.]
Fig. 19: CPA attack on AES-CCM based on (a) Single core and (b) Multi-core with ANoC protocol

B. Multi-Channel SCA

We measure 5×10^5 power dissipation and EM traces concurrently, as two physical leakage channels, during the AES-CCM encryption operations with our proposed AMP-MP in the three-core ANoC, and plot the SCA results (CPA and CEMA) as depicted in Fig. 20. It shows that, for the single channel attack using CPA (black) and the multi-channel attack using both CPA and CEMA concurrently (red) [17], the secret key is still unrevealed with 5×10^5 measurements of both power dissipation and EM emanation. The correlation of the secret key remains lower (<0.2) than that of other key candidates at 5×10^5 measurements. For comparison, the experimental results on SCA show that the resistance against SCA has been increased by >12.82×. The comparison result is determined based on the conventional AES implementation depicted in Fig. 19(a) and the multi-channel attacks in Fig. 20. Eventually, based on our proposed AMP-MP in AES-CCM, the secret key is secured against multi-channel attacks and the adversary will not be able to reveal the secret key.

[Figure 20 graphic not reproduced: correlation (-0.4 to 0.6) versus traces (Power and EM, ×10^5, 0 to 5); curves "Single-channel based attack" and "Multi-channel based attack"; annotation "Secret key is unrevealed".]
Fig. 20: The plot of Single-channel (black) and Multi-channel result (red) based on CPA and CEMA which are unrevealed at 5×10^5 measurements

For completeness, we compare the performance of our proposed AMP-MP with the reported techniques on AES-CCM and with other Authentication-Encryption with Associated Data (AEAD) algorithms. Table V tabulates a comparison of our proposed AMP-MP with the conventional implementation and various reported techniques on AES-CCM. Based on the multicore ANoC, running @100MHz with 5 cores, our proposed AMP-MP features a throughput of 8.32Gbps, which is 2× to 70× faster than the reported techniques during authentication-encryption. In addition, regarding the security level of the secret key against SCA, both single and multi-channel attacks are unsuccessful at 5×10^5 trace measurements, as indicated by the lower correlation with the processed data (<0.29) compared with other key candidates.

In addition to the comparisons with reported AES-CCM implementations, we further compare the performance of our proposed AMP-MP with AEAD algorithms. Three AEAD algorithms [23] have been reported as highly efficient: Deoxys, NORX and CLOC. Table VI tabulates the performance of our proposed AMP-MP in AES-CCM compared with these three AEAD algorithms. The area is compared in terms of Kilo-Gate Equivalent (KGE), which indicates the active area used for the algorithm implementation. The efficiency parameter used in this comparison is the throughput per area, in Gbps/KGE.

TABLE VI: THE PERFORMANCE OF OUR PROPOSED AMP-MP COMPARED WITH OTHER THREE OPTIMIZED AEAD ALGORITHMS
Algorithms              Area (KGE)       Max. Freq. (MHz)    Throughput (Gbps)    Efficiency (Gbps/KGE)
Deoxys*                 59.53 (1.2×)     847                 7.22 (1.2×)          0.12 (1.33×)
NORX*                   70.13 (1.3×)     757                 83.11 (0.1×)         1.18 (0.13×)
CLOC*                   67.09 (1.3×)     746                 2.85 (2.9×)          0.04 (4.00×)
Our proposed AMP-MP#    51.31 (1.0×)     100                 8.32 (1.0×)          0.16 (1.00×)
* the optimized result is obtained by running it at maximum frequency and high-power dissipation using TSMC 65nm technology [23]
# implemented using Global Foundries 65nm CMOS technology

TABLE V: COMPARISON OF OUR PROPOSED AMP-MP AES-CCM WITH THE REPORTED AES-CCM TECHNIQUES
Technique                       Architecture               AES core(s)    Speed (MHz)    Throughput (Gbps)    Platform               SCA resistant (Single-Channel)    SCA resistant (Multi-Channel)
AES control unit FPGA [8]       Three control modules      2              100.00         1.05 (8×)            Spartan 3 3S4000       Yes                               N/A
Parallel two AES [9]            Reordering iterations      2              264.00         2.69 (3×)            CMOS SAED 90nm         N/A                               N/A
Open System Inter AES [10]      Interconnection AES        2              152.42         1.95 (4×)            XC4VLX40               N/A                               N/A
Unified data [11]               Redundancy Check           1              341.58         3.71 (2×)            XC7Z020                Yes                               N/A
Ultra-low power AES [12]        8-bit AES core enc.        1              149.00         0.12 (70×)           ASIC 65nm CMOS         N/A                               N/A
Conventional AES-CCM [14]       CCM and CBC                1              100.00         0.47 (17×)           XC7S75                 Yes                               N/A
Our Proposed AMP-MP             Mat. Mult GF(2^8)          5              100.00         8.32 (1×)            Multicore ANoC 65nm    Yes                               Yes
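The derived columns of Tables V and VI follow from simple ratios, which the sketch below recomputes from the tabulated area and throughput values. It assumes that efficiency is throughput divided by area (as stated in the text) and that the bracketed factors in Table V are the AMP-MP throughput divided by each reported throughput; the dictionary and function names are illustrative only.

```python
# (area in KGE, throughput in Gbps), copied from Table VI
TABLE_VI = {
    "Deoxys": (59.53, 7.22),
    "NORX": (70.13, 83.11),
    "CLOC": (67.09, 2.85),
    "AMP-MP": (51.31, 8.32),
}

# throughput in Gbps, copied from Table V
TABLE_V = {
    "[8]": 1.05,
    "[9]": 2.69,
    "[10]": 1.95,
    "[11]": 3.71,
    "[12]": 0.12,
    "[14]": 0.47,
}

AMP_MP_GBPS = 8.32


def efficiency(area_kge: float, throughput_gbps: float) -> float:
    """Throughput per unit area in Gbps/KGE."""
    return throughput_gbps / area_kge


def speedup_over(reported_gbps: float) -> float:
    """Assumed meaning of the bracketed factor in Table V: AMP-MP throughput over reported throughput."""
    return AMP_MP_GBPS / reported_gbps


eff = {name: efficiency(area, tput) for name, (area, tput) in TABLE_VI.items()}
speedups = {ref: speedup_over(tput) for ref, tput in TABLE_V.items()}

for name, value in eff.items():
    print(f"{name:8s} {value:.4f} Gbps/KGE")
for ref, value in speedups.items():
    print(f"{ref:5s} {value:.2f}x")
```

The recomputed efficiencies match the tabulated 0.12, 1.18, 0.04 and 0.16 Gbps/KGE to within the table's two-decimal precision, and the recomputed throughput ratios fall within about one unit of the whole-number factors listed in Table V.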


The NORX is an online cipher which computes the Pham, “A compact, ultra-low power AES-CCM IP core for wireless
body area networks,” in 2016 IFIP/IEEE International Conference on
intermediate states and authentication tag based on the secret
Very Large Scale Integration (VLSI-SoC), 2016, pp. 1–4.
key [24]. In Table VI, the NORX outperforms our proposed [13] A. A. Pammu, K.-S. Chong, K. Z. L. Ne, and B.-H. Gwee, “High
AMP-MP in asynchronous multicore in terms of throughput Secured Low Power Multiplexer-LUT Based AES S-Box
(0.1×) and efficiency (0.13×). However, the NORX is still Implementation,” in 2016 International Conference on Information
Systems Engineering (ICISE), 2016, pp. 3–7.
vulnerable against SCA (i.e. CPA decryption attacks) [24], [14] J. H. Yoo, “Fast software implementation of AES-CCM on
particularly when the target of attack is at the output of the multiprocessors,” in Lecture Notes in Computer Science (including
authentication process (i.e. MAC) where the secret key is subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
reused two times. Hence, the secret key of NORX Bioinformatics), 2011, vol. 7017 LNCS, no. PART 2, pp. 300–311.
[15] K. Atighehchi and R. Rolland, “Optimization of Tree Modes for Parallel
implementation is noticeable during the authentication process Hash Functions: A Case Study,” IEEE Transactions on Computers, vol.
at the leakage measurements. 66, no. 9, pp. 1585–1598, 2017.
[16] S. K. Mathew et al., “μrNG: A 300-950 mV, 323 Gbps/W All-Digital
VI. CONCLUSIONS Full-Entropy True Random Number Generator in 14 nm FinFET
We have proposed an Authentication based Matrix- CMOS,” IEEE Journal of Solid-State Circuits, vol. 51, no. 7, pp. 1695–
1704, 2016.
transformation cum Parallel-encryption implemented on [17] M. Taha and P. Schaumont, “Key updating for leakage resiliency with
Multicore Processor to achieve a comprehensive performance, application to AES modes of operation,” IEEE Transactions on
high throughput and secure AES-CCM, yet low overheads of Information Forensics and Security, vol. 10, no. 3, pp. 519–528, 2015.
the power dissipation and the latency. The high throughput of [18] W. Yang, Y. Zhou, Y. Cao, H. Zhang, Q. Zhang, and H. Wang, “Multi-
Channel Fusion Attacks,” IEEE Transactions on Information Forensics
8.32Gbps is achieved by implementing matrix multiplication and Security, vol. 12, no. 8, pp. 1757–1771, 2017.
in GF(28) computation which transforms 16 plaintexts to 1 [19] I. 802. 1. W. Group, “IEEE Standard for High Data Rate Wireless Multi-
plaintext for authentication process. While the highly Media Networks,” IEEE Std 802.15.3-2016 (Revision of IEEE Std
resistance of SCA (>5×105 traces) and low overheads are 802.15.3-2003, vol. 2016. pp. 1–510, 2016.
[20] K. L. Chang, J. S. Chang, B. H. Gwee, and K. S. Chong, “Synchronous-
achieved concurrently by leveraging the multicore ANoC with logic and asynchronous-logic 8051 microcontroller cores for realizing
our propose AMP-MP. The additional security feature, key the internet of things: A comparative study on dynamic voltage scaling
adjusting technique, is implemented to further secure the AES- and variation effects,” IEEE Journal on Emerging and Selected Topics
CCM against future computational attacks and the leakage in Circuits and Systems, vol. 3, no. 1, pp. 23–34, 2013.
[21] K. S. Chong, W. G. Ho, T. Lin, B. H. Gwee, and J. S. Chang, “Sense
patterns of HW and HD during key updates and MMA. With Amplifier Half-Buffer (SAHB) A Low-Power High-Performance
all these advantageous, our proposed AMP-MP in AES-CCM Asynchronous Logic QDI Cell Template,” IEEE Transactions on Very
is suitable for secured IoT applications. Large Scale Integration (VLSI) Systems, vol. 25, no. 2, pp. 402–415,
2017.
[22] “Leak Me If You Can: Does TVLA Reveal Success Rate?”, : [Online].
REFERENCES Available: https://eprint.iacr.org/2016/1152.pdf
[1] W. Stallings, CRYPTOGRAPHY AND NETWORK SECURITY [23] Kumar, S., Yahya, J. H., Khairallah, M., Elmohr, M. A., Chattopadhyay,
PRINCIPLES AND PRACTICE, Fifth Edit., vol. 139, no. 3. Pearson A., “A Comprehensive Performance Analaysis of Hardware
Education, Inc, 2011. Implementations of CAESAR Candidates”, 2018. [Online]. Available:
[2] M. J. Dworkin, “Recommendation for Block Cipher Modes of https://eprint.iacr.org/2017/1261.pdf
Operation: The CMAC Mode for Authentication,” NIST Special [24] Vaudenay, S., Vizàr, D., “Under Pressure: Security of Caesar Candidates
Publication 800-38B, 2007. Beyond Their Guarantees”, 2018. [Online]. Available:
[3] W. Prodanov, M. Valle, and R. Buzas, “A Controller Area Network Bus https://eprint.iacr.org/2017/1147.pdf
Transceiver Behavioral Model for Network Design and Simulation,” [25] Gierlichs, B., Batina, L.,Verbauhede, I., “Revisiting Higher-Order DPA
IEEE Transactions on Industrial Electronics, vol. 56, no. 9, pp. 3762– Attacks: multivariate mutual information analysis”, [Online]:
3771, 2009. https://eprint.iacr.org/2009/228.pdf
[4] C. C. Wang, C. L. Chen, G. N. Sung, C. L. Wang, and C. Y. Juan, “A
FlexRay transceiver design with bus guardian for in-car networking
systems compliant with FlexRay standard,” Journal of Signal Ali Akbar Pammu (S’15) received the B.Eng.
Processing Systems, vol. 74, no. 2, pp. 221–233, 2014. (Hons.) degree in electrical and electronic
[5] M. Conti, N. Dragoni, and V. Lesyk, “A Survey of Man in the Middle engineering from Nanyang Technological
Attacks,” IEEE Communications Surveys and Tutorials, vol. 18, no. 3, University (NTU), Singapore, in 2014. He was a
pp. 2027–2051, 2016. recipient of the NTU Graduate Research
[6] S. Mangard, E. Oswald, and T. Popp, Power Analysis attacks: Revealing Scholarship. He was awarded as the best presenter
the secrets of smart cards. 2007. in ICISE2016 conference, California USA and the
[7] S. K. Mathew et al., “340 mV–1.1 V, 289 Gbps/W, 2090-Gate NanoAES best student paper award in ISIC2016 conference,
Hardware Accelerator With Area-Optimized Encrypt/Decrypt GF(24)2 Singapore.
Polynomials in 22 nm Tri-Gate CMOS,” IEEE Journal of Solid-State Mr. Ali is currently a PhD student with the Hardware Assurance Team,
Circuits, vol. 50, no. 4, pp. 1048–1058, 2015. Temasek Laboratories @ NTU, Singapore. His current research interests
[8] E. López-Trejo, F. Rodríguez-Henríquez, and A. Díaz-Pérez, “An include digital hardware security, profiling leakage measurements for SCA by
FPGA Implementation of CCM Mode Using AES,” in Information leveraging machine learning algorithms, fastest leakage assessment SCA and
Security and Cryptology-ICISC 2005., 2006, pp. 322–334. countermeasure SCA.
[9] K. Nguyen, L. Lanante, Y. Nagao, M. Kurosaki, and H. Ochi,
“Implementation of 2.6 Gbps super-high speed AES-CCM security Weng-Geng Ho (S’10-M’16) received the
protocol for IEEE 802.11i,” in 13th International Symposium on B.Eng. (Hons.) and Ph.D. degrees in electrical
Communications and Information Technologies: Communication and and electronic engineering from Nanyang
Information Technology for New Life Style Beyond the Cloud, ISCIT Technological University (NTU), Singapore, in
2013, 2013, pp. 669–673. 2009 and 2016 respectively.
[10] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido, and M. Morales- Dr. Ho is currently a Research Scientist with
Sandoval, “FPGA Implementation and Performance Evaluation of AES- the Hardware Assurance Team, Temasek
CCM Cores for Wireless Networks,” in 2008 International Conference Laboratories @ NTU, Singapore. His current
on Reconfigurable Computing and FPGAs, 2008, pp. 421–426. research interests include low power secured
[11] Y. Wang, J. An, and Y. Ha, “Unified Data Authenticated Encryption for memory design, digital VLSI design,
Vehicular Communication,” in 2016 IEEE International Midwest asynchronous-logic circuit design, NoC-based
Symposium on Circuits and System (MWSCAS), Abu Dhabi, UAE, 2016, multicore platform design and side-channel-
no. October, pp. 16–19. attack countermeasures. Dr. Ho was a recipient of the NTU Graduate Research
[12] Van-Phuc Hoang, Thi-Thanh-Dung Phan, Van-Lan Dao, and Cong-Kha Scholarship.


Ne Kyaw Zwa Lwin received the B.Eng. and
M.Sc. degrees in electrical and electronic
engineering from Nanyang Technological
University (NTU), Singapore, in 2011 and 2014
respectively.
He is currently a Research Associate with the
School of Electrical and Electronic
Engineering, NTU. His current research
interests include hardware security, space-grade
resilient circuits and systems, and digital VLSI
design.

Kwen-Siong Chong (S’03-M’09-SM’13)
received the B.Eng., M.Phil. and Ph.D. degrees
in electrical and electronic engineering from
Nanyang Technological University (NTU),
Singapore, in 2001, 2002, and 2007
respectively.
He is presently a Senior Research Scientist
with Temasek Laboratories @ NTU, Singapore.
He was a visiting researcher at the Nara Institute of
Science and Technology, Japan, in 2010, and at
the University of Michigan, USA, in 2012. He
is or has been principal investigator (PI), co-PI, and
collaborator on several research projects, including projects supported
by the National Science Foundation (Singapore), Defense Advanced Research
Projects Agency (USA), Ministry of Education (Singapore), and Public Sector
Research Funding (Singapore). His research interests include hardware
security, space-grade resilient circuits and systems, asynchronous VLSI
designs, low-voltage low power VLSI circuits, and audio signal processing.
Dr. Chong was the Chair of IEEE Circuits and Systems (CAS) Society,
Singapore Chapter, in 2017 and 2018. He has served on the organizing
committees of several conferences, including ASP-DAC 2014, DSP-2015
and DSP-2018. He has been a member of the IEEE CAS Society VLSI Systems
and Applications Technical Committee since 2009. He is an IEEE senior
member.

Bah-Hwee Gwee (S’93-M’97-SM’03)
received the B.Eng. degree in Electrical and
Electronic Engineering from University of
Aberdeen, U.K., in 1990. He received the
M.Eng. and Ph.D. degrees from Nanyang
Technological University (NTU), Singapore,
in 1992 and 1998 respectively.
He was an Assistant Professor in School of
EEE, NTU from 1999 to 2005 and has been an
Associate Professor since 2005. He was
Assistant Chair (Students) from 2010 to 2014
and he is currently the Assistant Chair
(Outreach) of School of EEE. He was the Principal Investigator (PI) of a
number of research projects amounting to more than US$12M. He has
published more than 100 technical papers and 7 patents (3 granted in the USA), and
founded a start-up company in 2005.
He was the Chairman of IEEE-Singapore Circuits and Systems Chapter in
2005, 2006, 2013 and 2016. He is currently the Chair of the IEEE Circuits and
Systems Society DSP TC. He was the TPC Chair for ISIC-2011, 2014 and
2016 and the General Co-Chair for IEEE DSP 2018. He has served as an Associate
Editor of a number of journals, including IEEE TCAS-I (2012-2013), IEEE
TCAS-II (2010-2011, 2018-ongoing) and Journal of Circuits, Systems and
Signal Processing (2007-2012). He was an IEEE Distinguished Lecturer for the
Circuits and Systems Society in 2009/10 and in 2017/18. He was awarded the
Singapore Defense Technology Prize in 2016.
