High Efficiency Counter Mode Security Architecture via Prediction and Precomputation
Weidong Shi Hsien-Hsin (Sean) Lee Mrinmoy Ghosh Chenghuai Lu Alexandra Boldyreva School of Electrical and Computer Engineering Georgia Institute of Technology
Content
Motivation Related Work Counter/Decryption Pad Prediction Profile Prediction Failures 2Level Prediction Context Based Prediction Conclusions
Why Encrypt System Memory?
Protect sensitive data stored in the RAM (many simple devices
can bypass OS memory protection and directly access physical memory)
Digital Right Management (industry witness of gradual addition of
encryption to each platform component, encrypted PCI-E, encrypted disk, encrypted flash memory, then toward encrypted RAM)
Anti-reverse engineer (majority software licenses require users not
to do reverse engineer, count on the users not breaking the promise)
Military (customer of encrypted FPGA chips, lots of embedded military
software)
Program randomization (intrusion prevention, CCS 2003)
3
Different Solutions
Crypto Engine Flash
Micro Controller
Create a little secure world, limited application scenarios (code signing, BIOS signature verification)
Processor Core Cache Crypto Engine
SoC. Memory is on-chip. Apply to limited platforms such as small embedded systems (cell phones)
Configurable system RAM encryption. More usage models.
4
Related Work
Use dedicated cache (sequence number cache) to reduce
latency overhead of memory decryption (Micro 2003)
Prefetch based memory pre-decryption (WASSA 2004)
Prediction based memory decryption (this paper)
Fully exploit pre-computation capability enabled by counter mode
encryption. Use wasted idle crypto engine pipeline stages for prediction and pre-computation. Less area overhead than caching and less memory pressure than prefetch based pre-decryption.
Counter Mode - Encryption
Processor Core
Cache Line Cache Line ... Cache Line Cache Line
Counter+2 Counter+1 Counter Key
VAddr
Counter+2 Counter+1 Counter
Vaddr+2
Crypto Engine
Key
AES Block Cipher
AES Block Cipher
Encryption pad
Encryption pad
16B Cache Line 16B Cache Line
Counter+2 Counter+1 Counter Encrypted 16B Encrypted 16B
Each memory line has its own counter. Each time memory line is updated, increment the counter.
Counter Mode -Decryption
Counter+2 VAddr Key Counter+2 Vaddr+2
Processor Core
Key
Cache Line Cache Line ... Cache Line Cache Line
Crypto Engine
AES Block Cipher
AES Block Cipher
Encryption pad
Encryption pad
16B Cache Line 16B Cache Line
Counter has to be
fetched for memory line missing L2.
Encrypted 16B
Encrypted 16B
Counter Prediction
Counters exhibit both spacial and temporal coherence. To exploit spacial coherence, memory blocks from the same page
start counting from the same initial value (page root counter)
Page Base Addr 0x0000ff00 ... ... Page Root Counter (64 bits) 0xabcddcba12344321 ... ...
counter
Memory line Memory line ... Memory line Memory line 0xabcddcba123443f1 0xabcddcba12344e0a ... 0xabcddcba12344325 0xabcddcba12344321
frequently updated data infrequently updated data
static data
8
Use Free Idle Pipeline Stages for Prediction
Time Line AES Pipeline
Pipeline Idle
Retrieving Counter Value and Encrypted Line
Generate Decryption Pad
Memory Pipeline
decrypted line
Unrolled and pipelined AES decryption logic often stays idle from tens
to hundreds of cycles when data is missing L2.
Use Free Idle Pipeline Stages for Prediction
Use the idle pipeline stages to generate decryption pads based on
predicted counter values (a small window of look ahead counter values based on page root counter number) E(K,G4) correct guess
Time Line AES Pipeline
First Speculated Decryption Pad
Memory Pipeline
Retrieving Counter Value and Encrypted Line
decrypted line
10
Handle Frequent Updates
Page Base Addr 0x0000ff00 ... ... Page Root Counter (64 bits) 0xabcddcba12344321 ... ... Prediction History Vector (16bits) ... ... ...
TLB Shift Register
0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0
Counter Value Prediction Logic
1
If total(miss)>threshold, reset the corresponding Page Root Counter to a new number
Prediction Miss/Prediction Hit (miss =1, hit = 0)
Window based dynamic tracking of prediction rate for each page. For frequently updated memory blocks, according to prediction history
vector, reset root counter number. All future write-backs will count from the new number.
11
Experiment Setup
Parameters L1 I/D Cache Value DM, 8KB
L2 Cache
Memory Bus CPU Clock AES Latency (256-bit) Prediction Depth
4way, unified, 256KB/1M
200MHz, 8B wide 1GHz Total 64 pipeline stages, 1ns each 5
Prediction History Window 16 Bits
Simplescalar 3.0 SPEC2000 INT/FP, benchmarks with high L2 misses. Prediction hit rate study (8 billion instructions) IPC performance (400 million on representative window)
12
Prediction Rate
1.2 1 0.8 0.6 0.4 0.2 0
lu gr id pa rs er sw im tw ol f vo rt ex p ar t bz ip 2 cf gc c vp up r w is Av e er ag e p m gz i ap p m Am m
Prediction Hit Rate (256K L2)
1.2 1 0.8 0.6 0.4 0.2 0
lu
Prediction Hit Rate (1M L2)
gr id pa rs er sw im tw ol f vo rt ex
ar t bz ip 2
gc c
cf
128K_Counter_#_Cache
512K_Counter_#_Cache
Pred
128K_Counter_#_Cache
512K_Counter_#_Cache
Prediction hit rate under 8 billion instructions No counter number cache when using prediction Prediction depth = 5 Average prediction hit rate, about 82-83%
13
vp up r w is Av e er ag e
Pred
ap p
gz i
Am
IPC
1.2 1 0.8 0.6 0.4 0.2 0
m p Ap pl u ar t gr id Pa rs er Sw im Tw ol Vo f rt ex Vp up r w is Av e er ag e p ip 2 Gc c cf Gz i M Bz Am M
Normalized IPC (256K L2)
1.2 1 0.8 0.6 0.4 0.2 0
m p Ap pl u ar t
Normalized IPC (1M L2)
gr id Pa rs er Sw im Tw ol Vo f rt ex
Counter_Cache_4K
Counter_Cache_128K
Counter_Cache_512K
Pred
Counter_Cache_4K
Counter_Cache_128K
Counter_Cache_512K
IPC normalized with the scenario without decryption. In general, outperform 128K counter cache On average, in par with 512K counter cache
14
Vp up r w is Av e er ag e
Pred
ip 2
Gc c
p Gz i
Am
Bz
cf
Prediction Miss
Reasons of prediction misses Prediction depth is too small. Reset of page root counter number. Memory lines whose counter
values based on the old page root counter cannot be predicted correctly using the new page root counter.
Solutions (details in the next few slides) Two-level prediction (divide prediction depth into sub ranges,
increase effective prediction depth without adding more predictions) Page root counter history memorization (predict using both the current page root counter and the previous root counter, only having marginal improvement) Context based prediction (exploit temporal coherence of accessing memory locations with coherent update frequency)
15
Two-level Prediction
Counter Number In Natural Order
10 01 00 11
Prediction Window Prediction Window Prediction Window Prediction Window
Divide prediction window into ranges (power of 2) With 2bits per line, effectively quadruple the prediction depth. Overhead is about 2KB on chip memory for 64-entry TLB.
16
Context Based Prediction
Counter Number In Natural Order
Prediction Window
Store the previous lines counter number depth value in a global
register. Generate new predictions based on Page Root Counter and the value in Context Register. Can be combined with regular and 2-level predictions. Feed all the predictions into the decryption pipeline.
17
Why Does It Work?
Memory Page (128 lines)
Memory line Memory line ... Memory line Memory line
{ while (1) { for all lines of the page write to the line; for all lines of the page read the line; } }
Regular Prediction (prediction depth=4) Prediction miss of memory read (%) 20% (for each line, every 5 reads, 1 miss)
Context Based Prediction 0.1% (for every 128*5 reads, 1 miss)
18
Prediction Rate
1.2 1 0.8 0.6 0.4 0.2 0
lu gr id pa rs er sw im tw ol f vo rt ex p ar t bz ip 2 cf gc c vp up r w is Av e er ag e p m gz i ap p m Am m
Prediction Hit Rate (256K L2)
1.2 1 0.8 0.6 0.4 0.2 0
lu
Prediction Hit Rate (1M L2)
gr id pa rs er sw im tw ol f vo rt ex
ar t
gc c
cf
Regular_Pred
Two-level_Pred
Context + Regular_Pred
Regular_Pred
Two-level_Pred
Context + Regular_Pred
8 billion instruction window Two-level prediction about 93% prediction hit Context based + regular prediction almost 99% prediction hit
19
vp up r w is Av e er ag e
bz ip
ap p
gz i
Am
IPC
1.2 1 0.8 0.6 0.4 0.2 0
m p Ap pl u ar t gr id Pa rs er Sw im Tw ol Vo f rt ex Vp up r w is Av e er ag e p ip 2 Gc c cf Gz i M Bz Am M
Normalized IPC (256K L2)
1.2 1 0.8 0.6 0.4 0.2 0
m p Ap pl u ar t
Normalized IPC (1M L2)
gr id Pa rs er Sw im Tw ol Vo f rt ex
Regular_Pred
2level_Pred
Context + Regular_Pred
Regular_Pred
2level_Pred
Context + Regular_Pred
IPC normalized to scenario of no decryption 1-3% loss of performance using best prediction
Vp up r w is Av e er ag e
ip 2
Gc c
p Gz i
Am
Bz
cf
20
Conclusions
Counter value prediction allows pre-computing of pads speculatively
without counter value caching.
Spacial and temporal coherence of memory update frequency enables
effective counter value prediction.
Use idle cycles of pipelined decryption engine Counter prediction achieves better performance than some of the large
cache settings.
Complementary with caching technique
21
Questions
22