Scalable Parallel Flash Firmware For Many-Core Architectures
block addresses and physical page numbers is simultaneously performed by many threads at the trans-apply stage. As each stage can have a different number of threads, contention between the threads for shared hardware resources and structures, such as the mapping table, metadata, and memory management structures, can arise. Integrating many cores in the scalable flash firmware design also introduces data consistency, coherence, and hazard issues. We analyze the new challenges arising from concurrency, and address them by applying concurrency-aware optimization techniques to each stage, such as parallel queue processing, cache bypassing, and background work for time-consuming SSD-internal tasks.

We evaluate a real system with our hardware platform that implements DeepFlash and internally emulates low-level flash media in a timing-accurate manner. Our evaluation results show that DeepFlash successfully provides more than 1MIOPS with a dozen simple low-power cores for all reads and writes with sequential and random access patterns. In addition, DeepFlash reaches 4.5 GB/s (above 1MIOPS), on average, under the execution of diverse real server workloads. The main contributions of this work are summarized below:

• Many-to-many threading firmware. We identify scalability and parallelism opportunities for high-performance flash firmware. Our many-to-many threading model allows future manycore-based SSDs to dynamically shift their computing power based on different workload demands without any hardware modification. DeepFlash splits all functions of the existing layered firmware architecture into three stages, each with one or more thread groups. Different thread groups can communicate with each other over an on-chip interconnection network within the target SSD.

• Parallel NVMe queue management. While employing many NVMe queues allows the SSD to handle many I/O requests through PCIe communication, it is hard to coordinate simultaneous queue accesses from many cores. DeepFlash dynamically allocates the cores to process NVMe queues rather than statically assigning one core per queue. Thus, a single queue is serviced by multiple cores, and a single core can service multiple queues, which can deliver full bandwidth for both balanced and unbalanced NVMe I/O workloads. We show that this parallel NVMe queue processing exceeds the performance of the static core-per-queue allocation by 6x, on average, when only a few queues are in use. DeepFlash also balances core utilization over computing resources.

• Efficient I/O processing. We increase the parallel scalability of the many-to-many threading model by employing non-blocking communication mechanisms. We also apply simple but effective lock and address randomization methods, which can distribute incoming I/O requests across multiple address translators and flash packages. The proposed method minimizes the number of hardware cores needed to achieve 1MIOPS.

Putting it all together, DeepFlash improves bandwidth by 3.4× while significantly reducing CPU requirements, compared to conventional firmware. Our DeepFlash requires only a dozen lightweight in-order cores to deliver 1MIOPS.

2 Background

2.1 High Performance NVMe SSDs

Figure 1: Overall architecture of an NVMe SSD.

Baseline. Figure 1 shows an overview of a high-performance SSD architecture that Marvell recently published [43]. The host connects to the underlying SSD through four Gen 3.0 PCIe lanes (4 GB/s) and a PCIe controller. The SSD architecture employs three embedded processors, each employing two cores [27], which are connected to an internal DRAM controller via a processor interconnect. The SSD employs several special-purpose processing elements, including a low-density parity-check (LDPC) sequencer, a data transfer (DMA) engine, and scratch-pad memory for metadata management. All these multi-core processors, controllers, and components are connected to a flash complex that connects to eight channels, each connecting to eight packages, via the flash physical layer (PHY). We select this multicore architecture description as our reference and extend it, since it is the only documented NVMe storage architecture that employs multiple cores at this juncture, but other commercially available SSDs also employ a similar multi-core firmware controller [38, 50, 59].

Future architecture. The performance offered by these devices is by far below 1MIOPS. For higher bandwidth, a future device can extend its storage and processor complexes with more flash packages and cores, respectively, which are highlighted in red in the figure. The bandwidth of each flash package is in practice tens of MB/s, and thus it requires employing more flash chips/channels, thereby increasing I/O parallelism. This flash-side extension raises several architectural issues. First, the firmware will make frequent SSD-internal memory accesses that stress the processor complex. Even though the PCIe core, channel, and other memory control logic may be implemented in hardware, the metadata grows with the extension, and its access frequency gets higher to achieve 1MIOPS. In addition, DRAM accesses for I/O buffering can become a critical bottleneck when hiding flash's long latency. Simply making cores faster may not be sufficient, because the processors will suffer from frequent stalls due to less locality and contention at memory. This, in turn, lowers each core's bandwidth, which should be addressed with higher parallelism on the computation side. We will explain the current architecture and show why it is not scalable in Section 3.
Figure 3: Performance with varying flash packages and cores. (a) Flash scaling: SSD latency breakdown (IO fetch, IO parse, IO cache, DMA, Addr. trans, Flash) and performance (K IOPS) over the number of flash packages. (b) Core scaling: MIOPS of Expected vs. Naive over the number of cores.

Figure 4: Many-to-many threading firmware model (queue-gather, trans-apply, and flash-scatter stages composed of NVMQ, ILOCK, CACHE, TRANS, LOG, BGC, and FCMD threads).
for only exploring the limit of scalability, rather than as a suggestion for an actual SSD controller.

Flash scaling. The bandwidth of a low-level flash package is several orders of magnitude lower than the PCIe bandwidth. Thus, SSD vendors integrate many flash packages over multiple channels, which can serve the I/O requests managed by NVMe in parallel. Figure 3a shows the relationship between bandwidth and the execution latency breakdown for varying numbers of flash packages. In this evaluation, we emulate an SSD by creating a layered firmware instance on a single MIC core, in which two threads are initialized to process the tasks of the HIL and FTL, respectively. We also assign 16 MIC cores (one core per flash channel) to manage the flash interface subsystems. We evaluate the performance of the configured SSD emulation platform by testing 4KB sequential writes. For the breakdown analysis, we decompose the total latency into i) NVMe management (I/O parse and I/O fetch), ii) I/O cache, iii) address translation (including flash scheduling), iv) NVMe data transfers (DMA), and v) flash operations (Flash). One can observe from the figure that the SSD performance saturates at 170K IOPS with 64 flash packages, connected over 16 channels. Specifically, the flash operations are the main contributor to the total execution time in cases where our SSD employs tens of flash packages (73% of the total latency). However, as the number of flash packages increases (more than 32), the layered firmware operations on a core become the performance bottleneck. NVMe management and address translation account for 41% and 29% of the total time, while flash consumes only 12% of the total cycles.

There are two reasons that flash firmware turns into the performance bottleneck with many underlying flash devices. First, NVMe queues can supply many I/O requests to take advantage of the SSD's internal parallelism, but a single-core SSD controller is insufficient to fetch all the requests. Second, it is faster to parallelize I/O accesses across many flash chips than to perform address translation on only one core. These new challenges make it difficult to fully leverage the internal parallelism with the conventional layered firmware model.

Core scaling. To take flash firmware off the critical path of scalable I/O processing, one can increase computing power by executing many firmware instances. This approach can allocate a core per NVMe SQ/CQ and initiate one layered firmware instance on each core. However, we observe that this naive approach cannot successfully address the burdens brought by flash firmware. To be precise, we evaluate IOPS with a varying number of cores, ranging from 1 to 32. Figure 3b compares the performance of the aforementioned naive manycore approach (e.g., Naive) with a system that assumes perfect parallel scalability (e.g., Expected). Expected's performance is calculated by multiplying the number of cores with the IOPS of Naive built on a single-core SSD. One can observe from this figure that Naive can only achieve 813K IOPS even with 32 cores, which exhibits 82.6% lower performance compared to Expected. This is because contention and consistency management for the memory spaces of the internal DRAM (cf. Section 5.3) introduce significant synchronization overheads. In addition, the FTL must serialize the I/O requests to avoid hazards while processing many queues in parallel. Since all these issues are not considered by the layered firmware model, it should be re-designed with core scaling in mind.

The goal of our new firmware is to fully parallelize multiple NVMe processing datapaths in a highly scalable manner while minimizing the usage of SSD-internal resources. DeepFlash requires only 12 in-order cores to achieve 1M or more IOPS.

4 Many-to-Many Threading Firmware

Conventional FTL designs are unable to fully convert the computing power brought by a manycore processor into storage performance, as they put all FTL tasks into a single large block of the software stack. In this section, we analyze the functions of traditional FTLs and decompose them into seven different function groups: 1) NVMe queue handling (NVMQ), 2) data cache (CACHE), 3) address translation (TRANS), 4) index lock (ILOCK), 5) logging utility (LOG), 6) background garbage collection utility (BGC), and 7) flash command and transaction scheduling (FCMD). We then reconstruct the key function groups from the ground up, keeping concurrency in mind, and deploy our reworked firmware modules across multiple cores in a scalable manner.

4.1 Overview

Figure 4 shows DeepFlash's many-to-many threading firmware model. The firmware is a set of modules (i.e., threads) in a request-processing network that is mapped to a set of processors. Each thread can have a firmware operation, and the task can be scaled by instantiating it into multiple parallel threads, referred to as stages. Based on different data processing flows and tasks, we group the stages into queue-gather, trans-apply, and flash-scatter modules. The queue-gather stage mainly parses NVMe requests and collects them into the SSD-internal DRAM, whereas the trans-apply stage mainly buffers the data and translates addresses. The flash-scatter stage spreads the requests across many underlying flash packages and manages background SSD-internal tasks in parallel. This new firmware enables scalable and flexible computing, and highly parallel I/O execution.
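To make the stage decomposition concrete, the sketch below models the seven function groups as thread roles in a request-processing network. It is only an illustration of the threading model: the per-stage thread counts, the message layout, and the queue plumbing are assumptions, not DeepFlash's actual firmware interfaces.

```c
#include <pthread.h>
#include <stdio.h>

/* The seven function groups that Section 4 decomposes the FTL into. */
typedef enum { NVMQ, CACHE, TRANS, ILOCK, LOG, BGC, FCMD, NSTAGES } stage_t;

/* How many parallel threads each stage gets; shifting computing power
 * between stages only means editing this table (counts are examples). */
static const int nthreads[NSTAGES] = {
    [NVMQ] = 6, [CACHE] = 2, [TRANS] = 4, [ILOCK] = 1,
    [LOG] = 1,  [BGC] = 1,   [FCMD] = 8
};

static const char *names[NSTAGES] =
    { "NVMQ", "CACHE", "TRANS", "ILOCK", "LOG", "BGC", "FCMD" };

typedef struct { stage_t role; int id; } worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    /* A real worker would pop per-request messages from its input queue
     * on the on-chip interconnect and run the stage-specific handler.  */
    printf("%s-%d up\n", names[w->role], w->id);
    return NULL;
}

int main(void)
{
    pthread_t tids[64];
    worker_t  ws[64];
    int n = 0;
    for (int s = NVMQ; s < NSTAGES; s++)
        for (int i = 0; i < nthreads[s]; i++, n++) {
            ws[n] = (worker_t){ (stage_t)s, i };
            pthread_create(&tids[n], NULL, worker_main, &ws[n]);
        }
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```

Scaling a stage in this model only changes how many workers of that role are spawned; the network topology between stages stays the same.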
Figure 5: Firmware architecture.

Figure 6: Challenges of NVMQ allocation (SQ:NVMQ). (a) Data contention (1:N). (b) Unbalanced task. (c) I/O hazard.

All threads are maximally independent, and I/O requests are always processed from left to right in the thread network, which reduces the hardware contention and consistency problems imposed by managing various memory spaces. For example, two independent I/O requests are processed by two different network paths (highlighted in Figure 4 by red and blue lines, respectively). Consequently, DeepFlash can simultaneously service as many incoming I/O requests as the network paths it can create. In contrast to the other threads, background threads are asynchronous with the incoming I/O requests and host-side services. Therefore, they create their own network paths (dashed lines), which perform SSD-internal tasks in the background. Since each stage can process a different part of an I/O request, DeepFlash can process multiple requests in a pipelined manner. Our firmware model can also be simply extended by adding more threads based on the performance demands of the target system.

Figure 5 illustrates how our many-to-many threading model can be applied to, and operates in, the many-core based SSD architecture of DeepFlash. While the procedure of I/O services is managed by many threads in the different data processing paths, the threads can be allocated to any core in the network, in a parallel and scalable manner.

4.2 Queue-gather Stage

NVMe queue management. For high performance, NVMe supports up to 64K queues, each with up to 64K entries. As shown in Figure 6a, once a host initiates an NVMe command to an SQ and writes the corresponding doorbell, the firmware fetches the command from the SQ and decodes a non-contiguous set of host physical memory pages by referring to a kernel list structure [2], called a physical region page (PRP) [23]. Since the length of the data in a request can vary, its data can be delivered by multiple data frames, each of which is usually 4KB. While all command information can be retrieved from the device-level registers and the SQ, the contents of such data frames exist across non-contiguous host-side DRAM (for a single I/O request). The firmware parses the PRP and begins DMA for the multiple data frames of each request. Once all the I/O services associated with those data frames complete, the firmware notifies the host of completion through the target CQ. We refer to all the tasks related to this NVMe command and queue management as NVMQ.
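For illustration, the sketch below shows how a queue-gather thread might expand one command's PRP entries into per-4KB DMA transfers. The command layout and the helpers host_read_qword() and dma_from_host() are assumptions for this example, and the PRP handling is simplified (page-aligned PRP1, no chained PRP lists), not the exact parser DeepFlash implements.

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_SZ 4096u   /* data frame size assumed to be the 4KB NVMe page */

/* Hypothetical helpers backed by the DMA engine / host interface. */
extern uint64_t host_read_qword(uint64_t host_addr);            /* read one PRP list entry */
extern void     dma_from_host(uint64_t host_addr, void *dst, size_t len);

/* Pull each 4KB data frame of one write command into SSD-internal DRAM.
 * Simplified: PRP1 is assumed page aligned, and PRP2 is treated as a PRP
 * list whenever more than two frames are needed; a real parser must also
 * honor offsets and list chaining as defined by the NVMe specification. */
static void gather_frames(uint64_t prp1, uint64_t prp2,
                          uint32_t nframes, uint8_t *dram)
{
    for (uint32_t i = 0; i < nframes; i++) {
        uint64_t src;
        if (i == 0)
            src = prp1;
        else if (nframes == 2)
            src = prp2;                                    /* second data page  */
        else
            src = host_read_qword(prp2 + 8ull * (i - 1));  /* entry in PRP list */
        dma_from_host(src, dram + (uint64_t)i * FRAME_SZ, FRAME_SZ);
    }
}
```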
A challenge of employing many cores for parallel queue processing is that multiple NVMQ cores may simultaneously fetch the same set of NVMe commands from a single queue. This in turn accesses the host memory by referring to the same set of PRPs, which makes the behavior of parallel queue accesses undefined and non-deterministic (Figure 6a). To address this challenge, one can make each core handle only one set of SQ/CQ, so that there is no contention caused by simultaneous queue processing or PRP accesses (Figure 6b). In this "static" queue allocation, each NVMQ core fetches a request from a different queue, based on the doorbell's queue index, and brings the corresponding data from the host system memory to SSD-internal memory. However, this static approach requires that the host balance requests across queues to maximize the resource utilization of NVMQ threads. In addition, it is difficult to scale to a large number of queues. DeepFlash addresses these challenges by introducing dynamic I/O serialization, which allows multiple NVMQ threads to access each SQ/CQ in parallel while avoiding consistency violations. Details of NVMQ will be explained in Section 5.1.

I/O mutual exclusion. Even though the NVMe specification does not regulate the processing order of NVMe commands in the range from the entry that the head pointer indicates to the entry that the tail pointer refers to [3], users may expect that the SSD processes the requests in the order that they were submitted. However, in our DeepFlash, many threads can simultaneously process I/O requests in any order of accesses. This can make the order of I/O processing differ from the order that the NVMe queues (and users) expected, which may in turn introduce an I/O hazard or a consistency issue. For example, Figure 6c shows a potential problem brought by parallel I/O processing. In this figure, there are two different I/O requests from the same NVMe SQ, request-1 (a write) and request-2 (a read), which create two different paths but target the same PPA. Since these two requests are processed by different NVMQ threads, request-2 can be served from the target slightly earlier than request-1. Request-1 will then be stalled, and request-2 will be served with stale data. During this phase, it is also possible that any thread can invalidate the data while transferring or buffering them out of order.
While serializing the I/O request processing with a strong ordering can guarantee data consistency, it significantly hurts SSD performance. One potential solution is to introduce a locking system that provides a lock per page. However, per-page lock operations within an SSD can be one of the most expensive mechanisms due to the various I/O lengths and the large storage capacity of the SSD. Instead, we partition the physical flash address space into many shards, whose access granularity is greater than a page, and assign an index-based lock to each shard. We implement the index lock as a red-black tree and make this locking system a dedicated thread (ILOCK). This tree helps ILOCK quickly identify which lock to use, and reduces the overheads of lock acquisition and release. Nevertheless, since NVMQ threads may access a few ILOCK threads, it can also become a point of resource contention. DeepFlash optimizes ILOCK by redistributing the requests based on lock ownership (cf. Section 5.2). Note that there is no consistency issue if the I/O requests target different LBAs. In addition, as most OSes manage access control to prevent different cores from accessing the same files [19, 41, 52], I/O requests from different NVMe queues (mapping to different cores) access different LBAs, which also does not introduce a consistency issue. Therefore, DeepFlash solves the I/O hazard by guaranteeing the ordering of I/O requests that are issued to the same queue and access the same LBAs, while processing all other I/O requests out of order.
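A minimal sketch of the shard-granular index lock is shown below. It keys locks by a 256KB shard of the address space, matching the granularity evaluated in Section 6.3, but it replaces the paper's red-black tree and dedicated ILOCK thread with a flat table of mutexes purely to keep the example short; the table size and owner field are illustrative assumptions.

```c
#include <pthread.h>
#include <stdint.h>

#define SHARD_SHIFT  6          /* 64 x 4KB pages per shard = 256KB lock range */
#define LOCK_BUCKETS 1024       /* illustrative table size; the paper keeps     */
                                /* lock nodes in a red-black tree instead       */
typedef struct {
    pthread_mutex_t mtx;
    int             owner;      /* id of the NVMQ thread holding the shard      */
} shard_lock_t;

static shard_lock_t locks[LOCK_BUCKETS];

static void ilock_init(void)
{
    for (int i = 0; i < LOCK_BUCKETS; i++) {
        pthread_mutex_init(&locks[i].mtx, NULL);
        locks[i].owner = -1;
    }
}

/* Map an LBA (counted in 4KB pages) to its shard lock: requests to the
 * same 256KB range serialize, while different shards proceed in parallel. */
static shard_lock_t *ilock_lookup(uint64_t lba)
{
    return &locks[(lba >> SHARD_SHIFT) % LOCK_BUCKETS];
}

static void ilock_acquire(uint64_t lba, int nvmq_id)
{
    shard_lock_t *l = ilock_lookup(lba);
    pthread_mutex_lock(&l->mtx);
    l->owner = nvmq_id;
}

static void ilock_release(uint64_t lba)
{
    shard_lock_t *l = ilock_lookup(lba);
    l->owner = -1;
    pthread_mutex_unlock(&l->mtx);
}
```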
4.3 Trans-apply Stage

Data caching and buffering. To appropriately handle NVMe's parallel queues and achieve more than 1MIOPS, it is important to utilize the internal DRAM buffer efficiently. Specifically, even though modern SSDs enjoy massive internal parallelism stemming from tens or hundreds of flash packages, the latency of each chip is orders of magnitude longer than DRAM [22, 45, 46], which can stall NVMQ's I/O processing. DeepFlash, therefore, incorporates CACHE threads that incarnate the SSD-internal memory as a burst buffer by mapping LBAs to DRAM addresses rather than flash ones. The data buffered by CACHE can be drained by striping requests across many flash packages with high parallelism.

Figure 7: Challenge analysis in CACHE and TRANS. (a) Main procedure of CACHE. (b) Shards (TRANS).

As shown in Figure 7a, each CACHE thread has its own mapping table to record the memory locations of the buffered requests. CACHE threads are configured with a traditional direct-map cache to reduce the burden of table lookup and cache replacement. In this design, as each CACHE thread has a different memory region to manage, NVMQ simply calculates the index of the target memory region by a modulo over the request's LBA, and forwards the incoming request to the target CACHE. However, since all NVMQ threads potentially communicate with a CACHE thread for every I/O request, this can introduce extra latency imposed by passing messages among threads. In addition, to minimize the number of cores that DeepFlash uses, we need to fully utilize the allocated cores and dedicate them to each firmware operation while minimizing the communication overhead. To this end, we put a cache tag inquiry method in NVMQ and make CACHE threads fully handle cache hits and evictions. With the tag inquiry method, NVMQ can create a bypass path, which removes the communication overheads (cf. Section 5.3).
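The sketch below illustrates how an NVMQ thread could pick the responsible CACHE thread with a modulo over the LBA and use a shared tag array as the "tag inquiry" so that misses can bypass CACHE entirely. The two-thread split, slot count, and tag layout are assumptions made for this example rather than DeepFlash's exact data structures.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CACHE 2              /* number of CACHE threads (assumed)          */
#define SLOTS     (1u << 16)     /* direct-map entries per CACHE thread        */

/* One direct-map tag entry per DRAM slot; NVMQ only reads the tags (the
 * inquiry), while the owning CACHE thread updates them on fills/evicts.  */
typedef struct {
    uint64_t lpn;                /* logical page number cached in this slot    */
    bool     valid;
} tag_t;

static tag_t tags[NUM_CACHE][SLOTS];

/* Which CACHE thread owns this LPN (static partitioning by modulo). */
static inline int cache_home(uint64_t lpn) { return (int)(lpn % NUM_CACHE); }

/* Direct-map slot inside that CACHE thread's DRAM region. */
static inline uint32_t cache_slot(uint64_t lpn)
{
    return (uint32_t)((lpn / NUM_CACHE) % SLOTS);
}

/* Tag inquiry performed by NVMQ: if the target page is not buffered, the
 * request can bypass CACHE and go straight to TRANS, skipping the
 * inter-thread message round trip.                                       */
static bool cache_hit(uint64_t lpn)
{
    tag_t *t = &tags[cache_home(lpn)][cache_slot(lpn)];
    return t->valid && t->lpn == lpn;
}
```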
Parallel address translation. The FTL manages physical blocks and is aware of flash-specific behavior such as erase-before-write and the asymmetric erase and read/write operation units (block vs. page). We decouple FTL address translation from system management activities such as garbage collection or logging (e.g., journaling) and allocate that management to multiple threads. The threads that perform this simplified address translation are referred to as TRANS. To translate addresses in parallel, both the LBA space and the PPA space need to be partitioned and allocated to the TRANS threads.

As shown in Figure 7b, a simple solution is to split the single LBA space into m address chunks, where m is the number of TRANS threads, and map the addresses by wrapping around upon reaching m. To take advantage of channel-level parallelism, it can also separate the single PPA space into k shards, where k is the number of underlying channels, and map the shards to each TRANS with arithmetic modulo k. While this address partitioning can make all TRANS threads operate in parallel without interference, unbalanced I/O accesses can activate only a few TRANS threads or channels. This can introduce poor resource utilization and many resource conflicts, and stall a request service on the fly. Thus, we randomize the addresses when partitioning the LBA space with simple XOR operators. This scrambles the LBA and statically assigns all incoming I/O requests across different TRANS threads in an evenly distributed manner. We also allocate all the physical blocks of the PPA space to each TRANS in a round-robin fashion. This block-interleaved virtualization allows us to split the PPA space with finer granularity.
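A short sketch of this partitioning is shown below; the XOR fold widths and the thread/channel counts are illustrative (any cheap invertible mix works), while the modulo assignment and block-interleaving follow the description above.

```c
#include <stdint.h>

#define NUM_TRANS    4      /* number of TRANS threads (assumed)               */
#define NUM_CHANNELS 16     /* flash channels, matching the emulated backbone  */

/* Scramble the LPN with cheap XOR folds so that sequential or skewed LBA
 * patterns still spread evenly over TRANS threads. The shift amounts are
 * illustrative; each step is invertible, so no two LPNs collide.         */
static inline uint64_t xor_scramble(uint64_t lpn)
{
    lpn ^= lpn >> 7;
    lpn ^= lpn >> 13;
    return lpn;
}

/* Static assignment of an incoming request to an address translator. */
static inline int trans_index(uint64_t lpn)
{
    return (int)(xor_scramble(lpn) % NUM_TRANS);
}

/* Block-interleaved PPA partitioning: physical blocks are handed to TRANS
 * threads round-robin, and shards map to channels modulo k.              */
static inline int block_owner(uint64_t pbn)    { return (int)(pbn % NUM_TRANS); }
static inline int shard_channel(uint64_t shard){ return (int)(shard % NUM_CHANNELS); }
```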
4.4 Flash-scatter Stage

Background task scheduling. The datapath for garbage collection (GC) can be another critical path to achieving high bandwidth, as it stalls many I/O services while reclaiming flash block(s). In this work, GCs can be performed in parallel by allocating separate core(s), referred to as BGC. BGC records the block numbers that have no more entries to write while TRANS threads process incoming I/O requests. BGC then merges the blocks and updates the mapping table of the corresponding TRANS behind the I/O processing. Since a thread in TRANS can process address translations during BGC's block reclaims, this could introduce a consistency issue on mapping table updates. To avoid conflicts with TRANS threads, BGC reclaims blocks and updates the mapping table in the background when there is no activity in NVMQ and the TRANS threads have completed their translation tasks. If the system experiences a heavy load and clean blocks are running out, our approach performs on-demand GC. To avoid data consistency issues, we only block the execution of the TRANS thread that is responsible for the address translation of the flash block being reclaimed.

Journalling. SSD firmware requires journalling by periodically dumping the local metadata of TRANS threads (e.g., the mapping table) from DRAM to a designated flash location. In addition, it needs to keep track of the changes that are not yet dumped. However, managing consistency and coherency for persistent data can introduce a burden to TRANS. Our DeepFlash separates the journalling from TRANS and assigns it to a LOG thread. Specifically, TRANS writes the LPN-to-PPN mapping information of an FTL page table entry (PTE) to the out-of-band (OoB) area of the target flash page [64] in each flash program operation (along with the per-page data). In the meantime, LOG periodically reads all metadata in DRAM, stores it to flash, and builds a checkpoint in the background. For each checkpoint, LOG records a version, a commit, and a page pointer indicating the physical location of the flash page where TRANS starts writing to. At boot time, LOG checks sanity by examining the commit. If the latest version is stale, LOG loads a previous version and reconstructs the mapping information by combining the checkpointed table and the PTEs that TRANS wrote since the previous checkpoint.
and PTEs that TRANS wrote since the previous checkpoint. NVMQ threads can simultaneously parse the fetched NVMe
Parallel flash accesses. At the end of the DeepFlash net- queue entries. This allows all NVMQ threads to participate in
work, the firmware threads need to i) compose flash transac- processing the NVMe queue entries from the same queue or
tions respecting flash interface timing and ii) schedule them multiple queues. Specifically, DIOS allocates a storage-side
across different flash resources over the flash physical layer SQ buffer (per SQ) in a shared memory space (visible to all
(PHY). These activities are managed by separate cores, re- NVMQ threads) when the host initializes NVMe SSD. If the
ferred to as FCMD. As shown in Figure 8, each thread in host writes the tail index to the doorbell, a NVMQ thread
FCMD parses the PPA translated by TRANS (or generated fetches multiple NVMe queue entries and copies them (not
by BGC/LOG) into the target channel, package, chip and actual data) to the SQ buffer. All NVMQ threads then process
plane numbers. The threads then check the target resources’ the NVMe commands existing in the SQ buffer in parallel.
availability and compose flash transactions by following the The batch copy is performed per 64 entries or till the tail for
underlying flash interface protocol. Typically, memory tim- SQ and CQ points a same position. Similarly, DIOS creates
USENIX Association 18th USENIX Conference on File and Storage Technologies 127
a CQ buffer (per CQ) in the shared memory. NVMQ threads update the CQ buffer, instead of the actual CQ, out of order, and flush the NVMe completion messages from the CQ buffer to the CQ in batches. This allows multiple threads to update an NVMe queue in parallel without modifying the NVMe protocol or the host-side storage stack. Another technical challenge for processing a queue in parallel is that the head and tail pointers of the SQ and CQ buffers are also shared resources, which require protection against simultaneous accesses. DeepFlash offers DIOS's head (D-head) and tail (D-tail) pointers, and allows NVMQ threads to access the SQ and CQ through those pointers, respectively. Since the D-head and D-tail pointers are managed by the gcc atomic built-in function __sync_fetch_and_add [21], and the core allocation is performed by all NVMQ threads in parallel, the host memory can be accessed simultaneously, but at different locations.
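A minimal sketch of this shared-buffer protocol is shown below. The publish path uses __sync_fetch_and_add on D-tail as described above, while the claim path is written with a compare-and-swap loop to keep the example race-free; the entry size, batching, and CQ-side handling are simplified assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define QDEPTH 65536u                   /* up to 64K entries per NVMe queue   */

struct nvme_cmd { uint8_t raw[64]; };   /* one 64B submission queue entry     */

/* Storage-side SQ buffer shared by all NVMQ threads. */
struct dios_sq {
    struct nvme_cmd   buf[QDEPTH];
    volatile uint32_t d_tail;           /* next free slot, bumped by fetchers */
    volatile uint32_t d_head;           /* next entry to parse, claimed below */
};

/* Fetcher side: an NVMQ thread copies a batch of entries (not data) from
 * the host SQ into the shared buffer after a doorbell write.              */
static void dios_publish(struct dios_sq *q, const struct nvme_cmd *e, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        uint32_t slot = __sync_fetch_and_add(&q->d_tail, 1) % QDEPTH;
        q->buf[slot] = e[i];
    }
}

/* Worker side: every NVMQ thread claims entries by advancing D-head, so the
 * same queue is parsed by many cores without a lock.                       */
static const struct nvme_cmd *dios_claim(struct dios_sq *q)
{
    for (;;) {
        uint32_t head = q->d_head, tail = q->d_tail;
        if (head == tail)
            return NULL;                            /* buffer drained          */
        if (__sync_bool_compare_and_swap(&q->d_head, head, head + 1))
            return &q->buf[head % QDEPTH];          /* this thread owns entry  */
    }
}
```

The CQ-side buffer can be maintained analogously, with completions batched back to the real CQ as the text describes.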
Figure 10: Optimization details. (a) The main procedure of ILOCK. (b) An example of CACHE bypassing.

This, in turn, can free the NVMQ thread from waiting for the lock acquisition, which increases the parallelism of DIOS.

and generates the target TRANS index, which takes less than 20 ns. The randomization allows the queue-gather stage to issue requests to TRANS while addressing load imbalance.

6 Evaluation

Implementation platform. We set up an accurate SSD emulation platform by respecting the real NVMe protocol, the timing constraints of the flash backbone, and the functionality of a flexible firmware. Specifically, we emulate a manycore-based SSD firmware by using a MIC 5120D accelerator that employs 60 lightweight in-order cores (4 hardware threads per core) [28]. The MIC cores operate at 1GHz and are implemented by applying low-power techniques such as a short in-order pipeline. We emulate the flash backbone by modelling various flash latencies, different levels of parallelism (i.e., channel/way/flash) and the request conflicts for flash resources. Our flash backbone consists of 16 channels, each connecting 16 QDP flash packages [69]; we observed that the performance of both read and write operations on the backbone itself is not the bottleneck to achieving more than 1 MIOPS. The NVMe interface on the accelerator is also fully emulated by wrapping Intel's symmetric communications interface (SCIF) with an NVMe emulation driver and controller that we implemented. The host employs a Xeon 16-core processor and 256 GB DRAM, running Linux kernel 2.6.32 [62]. It should be noted that this work uses the MIC to explore the scalability limits of the design; the resulting software can run with fewer cores if they are more powerful, and the design choice can then be about what is most economic and power efficient, rather than whether the firmware can be scalable.

Configurations. DeepFlash is the emulated SSD platform including all the proposed designs of this paper. Compared to DeepFlash, BaseDeepFlash does not apply the optimization techniques (described in Section 5). We also evaluate the performance of a real Intel customer-grade SSD (750SSD) [25] and a high-performance NVMe SSD (4600SSD) [26] for a better comparison. We additionally emulate another SSD platform (ManyLayered), which is an approach that scales up the layered firmware on many cores. Specifically, ManyLayered statically splits the SSD hardware resources into multiple subsets, each containing the resources of one flash channel and running a layered firmware independently. For each layered firmware instance, ManyLayered assigns a pair of threads: one is used for managing flash transactions, and the other is assigned to run the HIL and FTL. All these emulation platforms use "12 cores" by default. Lastly, we also test different flash technologies such as SLC, MLC, and TLC, whose latency characteristics are extracted from [44], [45] and [46], respectively. By default, the MLC flash array in the pristine state is used for our evaluations. The details of the SSD platform are in Table 1.

Workloads. In addition to microbenchmarks (reads and writes with sequential and random patterns), we test diverse server workloads, collected from Microsoft Production Server (MPS) [35], FIU SRCMap [63], Enterprise, and FIU IODedup [40]. Each workload exhibits various request sizes, ranging from 4KB to tens of KB, which are listed in Table 1. Since all the workload traces were collected from narrow-queue SATA hard disks, replaying the traces with the original timestamps cannot fully utilize the deep NVMe queues, which in turn conceals the real performance of the SSD [29]. To this end, our trace replaying approach allocates 16 worker threads on the host to keep issuing I/O requests, so that the NVMe queues are not depleted by the SSD platforms.

6.1 Performance Analysis

Microbenchmarks. Figure 11 compares the throughput of the five SSD platforms with I/O sizes varying from 4KB to 32KB. Overall, ManyLayered outperforms 750SSD and 4600SSD by 1.5× and 45%, on average, respectively. This is because ManyLayered can partially take advantage of manycore computing and parallelize I/O processing across multiple queues and channels through its static resource partitioning. BaseDeepFlash exhibits poor performance in cases where the request size is smaller than 24KB with random patterns. This is because threads in NVMQ/ILOCK keep tight inter-thread communication to appropriately control consistency over locks. However, for large requests (32KB), BaseDeepFlash exhibits good performance close to ManyLayered, as the multiple pages in large requests can be merged to acquire one range lock, which reduces the communication (compared to smaller request sizes), and thus it achieves higher bandwidth.

We observe that ManyLayered and BaseDeepFlash suffer a significant performance degradation in random reads and random writes (cf. Figures 11b and 11d). DeepFlash, in contrast, provides more than 1MIOPS for all types of I/O requests: 4.8 GB/s and 4.5 GB/s bandwidth for reads and writes, respectively. While those many-core approaches suffer from many core/flash-level conflicts (ManyLayered) and lock/sync issues (BaseDeepFlash) on the imbalanced random workloads, DeepFlash scrambles the LBA space and evenly distributes all the random I/O requests to different TRANS threads with low overhead. In addition, it applies the cache bypass and lock forwarding techniques to mitigate the long stalls imposed by lock inquiries and inter-thread communication. This enables more threads to serve I/O requests in parallel.

As shown in Figure 12, DeepFlash can mostly keep 6.3 cores active, running 25 threads, to process I/O services in parallel, which is better than BaseDeepFlash by 127% and 63% for reads and writes, respectively. Note that, for the random writes, the bandwidth of DeepFlash is sustained (4.2 GB/s) by activating only 4.5 cores (18 threads). This is because, although many cores contend to acquire ILOCK, which makes more cores stay idle, the burst buffer successfully overcomes the long write latency of the flash.

Figure 12e shows the active core decomposition of DeepFlash. As shown in the figure, reads require 23% more
Table 1: H/W configurations and important characteristics of the workloads that we tested.

SSD platform / firmware:
  Host: Xeon 16-core processor / 256GB DDR4 memory
  Controller: Xeon Phi, 12 cores by default
  FTL/buffer: hybrid, n:m = 1:8, 1 GB / 512 MB
  Flash: 16 channels / 16 packages per channel / 1k blocks per die
  Array capacity: 512GB (SLC), 1TB (MLC), 1.5TB (TLC)
  SLC: R: 25us, W: 300us, E: 2ms, Max: 1.4 MIOPS
  MLC: R: 53us, W: 0.9ms, E: 2.3ms, Max: 1.3 MIOPS
  TLC: R: 78us, W: 2.6ms, E: 2.3ms, Max: 1.1 MIOPS

Workloads (read ratio / avg. length in KB / randomness):
  Microsoft Production Server: 24HR (0.06 / 7.5 / 0.3), 24HRS (0.13 / 12.1 / 0.4937), BS (0.11 / 26.3 / 0.87), CFS (0.82 / 8.6 / 0.94), DADS (0.87 / 27.6 / 0.99), DAP (0.57 / 63.4 / 0.38), DDR (0.9 / 12.2 / 0.313)
  FIU IODedup: cheetah (0.99 / 4 / 0.12), homes (0 / 4 / 0.14), webonline (0 / 4 / 0.14)
  FIU SRCMap: ikki (0 / 4 / 0.39), online (0 / 4 / 0.17), topgun (0 / 4 / 0.14), webmail (0 / 4 / 0.21), casa (0 / 4 / 0.65), webresearch (0 / 4 / 0.11), webusers (0 / 4 / 0.14)
  Enterprise: madmax (0.002 / 4.005 / 0.08), Exchange (0.24 / 9.2 / 0.84)
Figure 11: Performance comparison of 750SSD, 4600SSD, ManyLayered, BaseDeepFlash and DeepFlash (throughput in GB/s over I/O request sizes from 4KB to 32KB). (a) Sequential reads. (b) Random reads. (c) Sequential writes. (d) Random writes.

Figure 12: Dynamics of active cores for parallel I/O processing. (a) Sequential reads. (b) Random reads. (c) Sequential writes. (d) Random writes. (e) Core decomposition.

lower than 56% due to the address patterns of DDR. However,
Figure 14: Resource requirement analysis (minimum required cores and core power for SLC/MLC/TLC and for the IO-1G, OoO-1.2G, and OoO-2.4G cores; NVMQ, CACHE, TRANS, ILOCK, LOG, and BGC thread counts).

Figure 15: Performance on different queue allocations. (a) NVMQ performance. (b) IOPS per NVMQ thread.

Figure 16: (a) ILOCK impact (Page-lock, ILOCK-1MB, ILOCK-base, ILOCK-forwd). (b) CACHE IOPS.

Figure 17: Background task optimizations. (a) LOG/BGC. (b) BGC overhead (Pristine, FGC, FLOG+FGC).

While this power consumption is higher than that of existing SSDs (20∼30W [30, 58, 73]), power-efficient manycores [68] can be used to reduce the power of our prototype. When we break down the energy consumed by each stage, FCMD, TRANS and NVMQ consume 42%, 21%, and 26% of the total energy, respectively, as the number of threads increases. This is because, while CACHE, LOG, ILOCK, and BGC require more computing power, most cores should be assigned to handle a large flash complex, many queues, and frequent address translation for better scalability.

Different CPUs. Figure 14c compares the minimum number of cores that DeepFlash requires to achieve 1MIOPS for both reads and writes. We evaluate different CPU technologies: i) OoO-1.2G, ii) OoO-2.4G and iii) IO-1G. While IO-1G uses the default in-order-pipeline 1GHz core that our emulation platform employs, OoO-1.2G and OoO-2.4G employ an Intel Xeon CPU, an out-of-order execution processor [24], with 1.2 and 2.4GHz CPU frequency, respectively. One can observe from the figure that the dozen cores that DeepFlash uses can be reduced to five high-frequency cores (cf. OoO-2.4G). However, due to the complicated core logic (e.g., the reorder buffer), OoO-1.2G and OoO-2.4G consume 93% and 110% more power than IO-1G to achieve the same level of IOPS.

6.3 Performance Analysis of Optimization

In this analysis, we examine different design choices of the components in DeepFlash and evaluate their performance impact on our proposed SSD platform. The following experiments use the configuration of DeepFlash by default.

NVMQ. Figures 15a and 15b compare NVMQ's IOPS and per-thread IOPS, delivered by a non-optimized queue allocation (i.e., Static) and our DIOS (i.e., Dynamic), respectively. Dynamic achieves the bandwidth goal irrespective of the number of NVMe queues that the host manages, whereas Static requires more than 16 NVMe queues to achieve 1MIOPS (cf. Figure 15a). This implies that the host also requires more cores, since NVMe allocates a queue per host CPU core [10]. Furthermore, the per-thread IOPS of Dynamic (with 16 queues) is better than Static by 6.9% (cf. Figure 15b). This is because Dynamic can fully utilize all NVMQ threads when the loads of different queues are unbalanced; the NVMQ performance variation of Dynamic (between min and max) is only 12%, whereas that of Static is 48%.

ILOCK. Figure 16a compares the different locking systems. Page-lock is a page-granular lock, while ILOCK-base is ILOCK without ownership forwarding. ILOCK-forwd is the one that DeepFlash employs. While ILOCK-base and ILOCK-forwd use the same lock granularity (256KB), ILOCK-1MB employs 1MB for its lock range but has no forwarding. Page-lock can activate 82% more NVMQ threads than ILOCK-1MB (Figure 16a). However, due to the overheads imposed by frequent lock node operations and RB-tree management, the average lock inquiry latency of Page-lock is as high as 10 us, which is 11× longer than that of ILOCK-forwd. In contrast, ILOCK-forwd can activate a similar number of NVMQ threads as Page-lock, and exhibits a 0.93 us average lock inquiry latency.

CACHE. Figure 16b illustrates CACHE performance with the number of threads varying from 0 to 4. "2-Bypass" employs the bypass technique (with only 2 threads). Overall, the read performance (even with no cache) is close to 1MIOPS, thanks to the massive parallelism in the back-end stages. However, write performance with no cache is only around 0.65 MIOPS, on average. By enabling a single CACHE thread to buffer data in SSD-internal DRAM rather than the underlying flash media, write bandwidth increases by 62%, compared to the no-cache system. But a single CACHE thread reduces read bandwidth by 25%, on average, due to the communication overheads (between CACHE and NVMQ) for each I/O service. Even with more CACHE threads, the performance gains diminish due to the communication overhead. In contrast, DeepFlash's 2-Bypass can be ideal, as it requires fewer threads to achieve 1MIOPS.

Background activities. Figure 17a shows how DeepFlash coordinates NVMQ, LOG and BGC threads to avoid contention on flash resources and maximize SSD performance.
As shown in the figure, when NVMQ actively parses and fetches data (between 0.04 and 0.2 s), LOG stops draining the data from internal DRAM to flash, since TRANS needs to access the corresponding meta information in response to NVMQ's queue processing. Similarly, BGC also suspends block reclaiming, since the data migration associated with the reclaim may cause flash-level contention, thereby interfering with NVMQ's activities. As DeepFlash can minimize the impact from LOG and BGC, the I/O access bandwidth stays above 4 GB/s. Once NVMQ is idle, LOG and BGC resume their work.

STEADY-STATE performance. Figure 17b shows the impact of on-demand garbage collection (FGC) and journalling (FLOG) on the performance of DeepFlash. The results are compared to the ideal performance of DeepFlash (Pristine), which has no GC and LOG activities. Compared to Pristine, the performance of FGC degrades by 5.4%, while FLOG+FGC decreases the throughput by 8.8%, on average. The reason why the performance loss is negligible is that on-demand GC only blocks the single TRANS thread that manages the reclaimed flash block, while the remaining TRANS threads keep serving I/O requests. In the meantime, LOG works in parallel with TRANS, but consumes some FCMD capacity to dump data.

7 Related Work and Discussion

OS optimizations. To achieve higher IOPS, host-level optimizations on multicore systems [8, 36, 75] have been studied. Bjørling et al. change the Linux block layer in the OS and achieve 1MIOPS on high NUMA-factor processor systems [8]. Zheng et al. redesign the buffer cache of file systems and reduce overhead and lock contention on a 32-core NUMA machine to achieve 1MIOPS [75]. All these systems exploit heavy manycore processors on the host and buffer data atop SSDs to achieve higher bandwidth.

Industry trend. To the best of our knowledge, while there are no manycore SSD studies in the literature, industry has already begun to explore manycore-based SSDs. Even though the actual devices are not published on the publicly available market, there are several devices that partially target 1MIOPS. For example, FADU is reported to offer around 1MIOPS (only for sequential reads with prefetching) and 539K IOPS (for writes) [20]; Samsung PM1725 offers 1MIOPS (for reads) and 120K IOPS (for writes). Unfortunately, there is no information regarding these industry SSD prototypes and devices in terms of hardware and software architectures. We believe that future architectures require brand-new flash firmware for scalable I/O processing to reach 1MIOPS.

Host-side FTL. LightNVM [9], including the CNEX solution [1], aims to achieve high performance (∼1MIOPS) by moving the FTL to the host and optimizing the user-level and host-side software stack. But this performance is achieved by evaluating only specific operations (like reads or sequential accesses). In contrast, DeepFlash reconstructs the device-level software/hardware with an in-depth analysis and offers 1MIOPS for all microbenchmarks (read, write, sequential and random) with varying I/O sizes. In addition, our solution is orthogonal to (and still necessary for) host-side optimizations.

Emulation. There is unfortunately no open hardware platform employing multiple cores and many flash packages. For example, OpenSSD has two cores [59], and Dell/EMC's Open-channel SSD (only open to a small and verified community) also employs 4∼8 NXP cores on a few flash chips [17]. Although this is an emulation study, we respected all real NVMe/ONFi protocols and the timing constraints for the SSD and flash, and the functionality and performance of the flexible firmware are demonstrated on a real lightweight many-core system.

Scale-out vs. scale-up options. A set of prior work proposes to architect the SSD as a RAID0-like scale-out option. For example, Amfeltec introduces an M.2-to-PCIe carrier card, which can include four M.2 NVMe SSDs, as a RAID0-like scale-up solution [5]. However, this solution only offers 340K IOPS due to its limited computing power. Recently, CircuitBlvd overcame this limitation by putting eight carrier cards into a storage box [6]. Unfortunately, this scale-out option also requires two extra E5-2690v2 CPUs (3.6GHz, 20 cores) with seven PCIe switches, which consumes more than 450W. In addition, these scale-out solutions suffer from serving small-sized requests with a random access pattern (less than 2GB/s) owing to frequent interrupt handling and I/O request coordination mechanisms. In contrast, DeepFlash, as an SSD scale-up solution, can achieve promising random-access performance by eliminating the overhead imposed by such a RAID0 design. In addition, compared to the scale-out options, DeepFlash employs fewer CPU cores to execute only the SSD firmware, which in turn reduces the power consumption.

8 Conclusion

In this work, we designed scalable flash firmware, inspired by parallel data analysis systems, which can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components within a single device. Our emulation prototype on a manycore-integrated accelerator reveals that it processes more than 1MIOPS while successfully hiding the long latency imposed by the internal flash media.

9 Acknowledgement

The authors thank Keith Smith for shepherding their paper. This research is mainly supported by NRF 2016R1C182015312, MemRay grant (G01190170) and KAIST start-up package (G01190015). J. Zhang and M. Kwon equally contribute to the work. Myoungsoo Jung is the corresponding author.
References

[1] CNEX Labs. https://www.cnexlabs.com.
[2] Microsoft SGL Description. https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/using-scatter-gather-dma.
[3] NVM Express. http://nvmexpress.org/wp-content/uploads/NVM-Express-1_3a-20171024_ratified.pdf.
[4] Ultra-low Latency with Samsung Z-NAND SSD. http://www.samsung.com/us/labs/pdfs/collateral/Samsung_Z-NAND_Technology_Brief_v5.pdf, 2017.
[5] Squid carrier board family: PCI Express Gen 3 carrier board for 4 M.2 PCIe SSD modules. https://amfeltec.com/pci-express-gen-3-carrier-board-for-m-2-ssd/, 2018.
[6] Cinabro platform v1. https://www.circuitblvd.com/post/cinabro-platform-v1, 2019.
[7] Jasmin Ajanovic. PCI Express 3.0 overview. In Proceedings of Hot Chips: A Symposium on High Performance Chips, 2009.
[8] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. Linux block IO: introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference, page 22. ACM, 2013.
[9] Matias Bjørling, Javier González, and Philippe Bonnet. LightNVM: The Linux Open-Channel SSD Subsystem. In FAST, pages 359–374, 2017.
[10] Keith Busch. Linux NVMe driver. https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2013/20130812_PreConfD_Busch.pdf, 2013.
[11] Adrian M Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel, Jiahua He, Arun Jagatheesan, Rajesh K Gupta, Allan Snavely, and Steven Swanson. Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11. IEEE, 2010.
[12] Adrian M Caulfield, Laura M Grupp, and Steven Swanson. Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications. ACM Sigplan Notices, 44(3):217–228, 2009.
[13] Wonil Choi, Myoungsoo Jung, Mahmut Kandemir, and Chita Das. Parallelizing garbage collection with I/O to improve flash resource utilization. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pages 243–254, 2018.
[14] Wonil Choi, Jie Zhang, Shuwen Gao, Jaesoo Lee, Myoungsoo Jung, and Mahmut Kandemir. An in-depth study of next generation interface for emerging non-volatile memories. In Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2016 5th, pages 1–6. IEEE, 2016.
[15] cnet. Samsung 850 Pro SSD review. https://www.cnet.com/products/samsung-ssd-850-pro/, 2015.
[16] Danny Cobb and Amber Huffman. NVM Express and the PCI Express SSD revolution. In Intel Developer Forum. Santa Clara, CA, USA: Intel, 2012.
[17] Jae Do. SoftFlash: Programmable storage in future data centers. https://www.snia.org/sites/default/files/SDC/2017/presentations/Storage_Architecture/Do_Jae_Young_SoftFlash_Programmable_Storage_in_Future_Data_Centers.pdf, 2017.
[18] Alejandro Duran and Michael Klemm. The Intel many integrated core architecture. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 365–366. IEEE, 2012.
[19] FreeBSD. FreeBSD manual pages: flock. https://www.freebsd.org/cgi/man.cgi?query=flock&sektion=2, 2011.
[20] Anthony Garreffa. FADU unveils world's fastest SSD, capable of 5GB/sec. http://tiny.cc/eyzdcz, 2016.
[21] Arthur Griffith. GCC: the complete reference. McGraw-Hill, Inc., 2002.
[22] Laura M Grupp, John D Davis, and Steven Swanson. The bleak future of NAND flash memory. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, pages 2–2. USENIX Association, 2012.
[23] Amber Huffman. NVM Express, revision 1.0c. Intel Corporation, 2012.
[24] Intel. Intel Xeon Processor E5 2620 v3. http://tiny.cc/a1zdcz, 2014.
[25] Intel. Intel SSD 750 series. http://tiny.cc/qyzdcz, 2015.
[26] Intel. Intel SSD DC P4600 Series. http://tiny.cc/dzzdcz, 2018.
[27] Xabier Iturbe, Balaji Venu, Emre Ozer, and Shidhartha Das. A triple core lock-step (TCLS) ARM Cortex-R5 processor for safety-critical and ultra-reliable applications. In Dependable Systems and Networks Workshop, 2016 46th Annual IEEE/IFIP International Conference on, pages 246–249. IEEE, 2016.
[28] James Jeffers and James Reinders. Intel Xeon Phi coprocessor high-performance programming. Newnes, 2013.
[29] Jaeyong Jeong, Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim. Lifetime improvement of NAND flash-based storage systems using dynamic program and erase scaling. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), pages 61–74, 2014.
[30] Myoungsoo Jung. Exploring design challenges in getting solid state drives closer to CPU. IEEE Transactions on Computers, 65(4):1103–1115, 2016.
[31] Myoungsoo Jung, Wonil Choi, Shekhar Srikantaiah, Joonhyuk Yoo, and Mahmut T Kandemir. HIOS: A host interface I/O scheduler for solid state disks. ACM SIGARCH Computer Architecture News, 42(3):289–300, 2014.
[32] Myoungsoo Jung and Mahmut Kandemir. Revisiting widely held SSD expectations and rethinking system-level implications. In ACM SIGMETRICS Performance Evaluation Review, volume 41, pages 203–216. ACM, 2013.
[33] Myoungsoo Jung and Mahmut T Kandemir. Sprinkler: Maximizing resource utilization in many-chip solid state disks. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 524–535. IEEE, 2014.
[34] Myoungsoo Jung, Ellis H Wilson III, and Mahmut Kandemir. Physically addressed queueing (PAQ): improving parallelism in solid state disks. In ACM SIGARCH Computer Architecture News, volume 40, pages 404–415. IEEE Computer Society, 2012.
[35] Swaroop Kavalanekar, Bruce Worthington, Qi Zhang, and Vishal Sharda. Characterization of storage workload traces from production Windows servers. In IISWC, 2008.
[36] Byungseok Kim, Jaeho Kim, and Sam H Noh. Managing array of SSDs when the storage device is no longer the performance bottleneck. In 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17), 2017.
[37] Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu. Revisiting storage for smartphones. ACM Transactions on Storage (TOS), 8(4):14, 2012.
[38] Nathan Kirsch. Phison E12 high-performance SSD controller. http://tiny.cc/91zdcz, 2018.
[39] Sungjoon Koh, Junhyeok Jang, Changrim Lee, Miryeong Kwon, Jie Zhang, and Myoungsoo Jung. Faster than flash: An in-depth study of system challenges for emerging ultra-low latency SSDs. arXiv preprint arXiv:1912.06998, 2019.
[40] Ricardo Koller et al. I/O deduplication: Utilizing content similarity to improve I/O performance. TOS, 2010.
[41] Linux. Mandatory file locking for the Linux operating system. https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt, 2007.
[42] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. ACM Transactions on Storage (TOS), 13(1):5, 2017.
[43] Marvell. Marvell 88SS1093 flash memory controller. https://www.marvell.com/storage/assets/Marvell-88SS1093-0307-2017.pdf, 2017.
[44] Micron. MT29F2G08AABWP/MT29F2G16AABWP NAND flash datasheet. 2004.
[45] Micron. MT29F256G08CJAAA/MT29F256G08CJAAB NAND flash datasheet. 2008.
[46] Micron. MT29F1HT08EMCBBJ4-37:B/MT29F1HT08EMHBBJ4-3R:B NAND flash datasheet. 2016.
[47] Yongseok Oh, Eunjae Lee, Choulseung Hyun, Jongmoo Choi, Donghee Lee, and Sam H Noh. Enabling cost-effective flash based caching with an array of commodity SSDs. In Proceedings of the 16th Annual Middleware Conference, pages 63–74. ACM, 2015.
[48] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: software-defined flash for web-scale internet storage systems. ACM SIGPLAN Notices, 49(4):471–484, 2014.
[49] Seon-yeong Park, Euiseong Seo, Ji-Yong Shin, Seungryoul Maeng, and Joonwon Lee. Exploiting internal parallelism of flash-based SSDs. IEEE Computer Architecture Letters, 9(1):9–12, 2010.
[50] Chris Ramseyer. Seagate SandForce SF3500 client SSD controller detailed. http://tiny.cc/f2zdcz, 2015.
[51] Tim Schiesser. Correction: PCIe 4.0 won't support up to 300 watts of slot power. http://tiny.cc/52zdcz, 2017.
[52] Windows SDK. LockFileEx function. https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex, 2018.
[53] Hynix Semiconductor et al. Open NAND flash interface specification. Technical Report ONFI, 2006.
[54] Narges Shahidi, Mahmut T Kandemir, Mohammad Arjomand, Chita R Das, Myoungsoo Jung, and Anand Sivasubramaniam. Exploring the potentials of parallel garbage collection in SSDs for enterprise storage systems. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 561–572. IEEE, 2016.
[55] Yakun Sophia Shao and David Brooks. Energy characterization and instruction-level energy model of Intel's Xeon Phi processor. In International Symposium on Low Power Electronics and Design (ISLPED), pages 389–394. IEEE, 2013.
[56] Mustafa M Shihab, Jie Zhang, Myoungsoo Jung, and Mahmut Kandemir. ReveNAND: A fast-drift-aware resilient 3D NAND flash design. ACM Transactions on Architecture and Code Optimization (TACO), 15(2):1–26, 2018.
[57] Ji-Yong Shin, Zeng-Lin Xia, Ning-Yi Xu, Rui Gao, Xiong-Fei Cai, Seungryoul Maeng, and Feng-Hsiung Hsu. FTL design exploration in reconfigurable high-performance SSD for server applications. In Proceedings of the 23rd International Conference on Supercomputing, pages 338–349. ACM, 2009.
[58] S Shin and D Shin. Power analysis for flash memory SSD. Workshop for Operating System Support for Non-Volatile RAM (NVRAMOS 2010 Spring), Jeju, Korea, April 2010.
[59] Yong Ho Song, Sanghyuk Jung, Sang-Won Lee, and Jin-Soo Kim. Cosmos OpenSSD: A PCIe-based open source SSD platform. Proc. Flash Memory Summit, 2014.
[60] Wei Tan, Liana Fong, and Yanbin Liu. Effectiveness assessment of solid-state drive used in big data services. In Web Services (ICWS), 2014 IEEE International Conference on, pages 393–400. IEEE, 2014.
[61] Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata Ghose, and Onur Mutlu. MQSim: A framework for enabling realistic studies of modern multi-queue SSD devices. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 49–66, 2018.
[62] Linus Torvalds. Linux kernel repo. https://github.com/torvalds/linux, 2017.
[63] Akshat Verma, Ricardo Koller, Luis Useche, and Raju Rangaswami. SRCMap: Energy proportional storage using dynamic consolidation. In FAST, volume 10, pages 267–280, 2010.
[64] Shunzhuo Wang, Fei Wu, Zhonghai Lu, You Zhou, Qin Xiong, Meng Zhang, and Changsheng Xie. Lifetime adaptive ECC in NAND flash page management. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pages 1253–1556. IEEE, 2017.
[65] Qingsong Wei, Bozhao Gong, Suraj Pathak, Bharadwaj Veeravalli, LingFang Zeng, and Kanzo Okada. WAFTL: A workload adaptive flash translation layer with data partition. In Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th Symposium on, pages 1–12. IEEE, 2011.
[66] Zev Weiss, Sriram Subramanian, Swaminathan Sundararaman, Nisha Talagala, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. ANViL: Advanced virtualization for modern non-volatile memory devices. In FAST, pages 111–118, 2015.
[67] Matt Welsh, David Culler, and Eric Brewer. SEDA: an architecture for well-conditioned, scalable internet services. In ACM SIGOPS Operating Systems Review, volume 35, pages 230–243. ACM, 2001.
[68] Norbert Werner, Guillermo Payá-Vayá, and Holger Blume. Case study: Using the Xtensa LX4 configurable processor for hearing aid applications. Proceedings of ICT.OPEN, 2013.
[69] ONFI Workgroup. Open NAND flash interface specification revision 3.0. ONFI Workgroup, Published Mar, 15:288, 2011.
[70] Guanying Wu and Xubin He. Delta-FTL: improving SSD lifetime via exploiting content locality. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 253–266. ACM, 2012.
[71] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference, page 6. ACM, 2015.
[72] Jie Zhang, Gieseo Park, Mustafa M Shihab, David Donofrio, John Shalf, and Myoungsoo Jung. OpenNVM: An open-sourced FPGA-based NVM controller for low level memory characterization. In 2015 33rd IEEE International Conference on Computer Design (ICCD), pages 666–673. IEEE, 2015.
[73] Jie Zhang, Mustafa Shihab, and Myoungsoo Jung. Power, energy, and thermal considerations in SSD-based I/O acceleration. In HotStorage, 2014.
[74] Yiying Zhang, Gokul Soundararajan, Mark W Storer, Lakshmi N Bairavasundaram, Sethuraman Subbiah, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. Warming up storage-level caches with Bonfire. In FAST, pages 59–72, 2013.
[75] Da Zheng, Randal Burns, and Alexander S Szalay. Toward millions of file system IOPS on low-cost, commodity hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 69. ACM, 2013.
[76] You Zhou, Fei Wu, Ping Huang, Xubin He, Changsheng Xie, and Jian Zhou. An efficient page-level FTL to optimize address translation in flash memory. In Proceedings of the Tenth European Conference on Computer Systems, page 12. ACM, 2015.