Scalable Parallel Flash Firmware For Many-Core Architectures

The paper presents DeepFlash, a scalable parallel flash firmware designed for many-core architectures, enabling the processing of over one million I/O requests per second (1MIOPS) while minimizing CPU demands. It utilizes a many-to-many threading model to optimize the performance of SSDs by concurrently executing firmware components across multiple cores, addressing challenges such as garbage collection and data consistency. The evaluation demonstrates that DeepFlash achieves significant improvements in bandwidth and efficiency compared to conventional firmware designs.


Scalable Parallel Flash Firmware for Many-core Architectures


Jie Zhang and Miryeong Kwon, KAIST;
Michael Swift, University of Wisconsin-Madison; Myoungsoo Jung, KAIST
https://www.usenix.org/conference/fast20/presentation/zhang-jie

This paper is included in the Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST '20), February 25–27, 2020, Santa Clara, CA, USA. ISBN 978-1-939133-12-0.

Scalable Parallel Flash Firmware for Many-core Architectures

Jie Zhang¹, Miryeong Kwon¹, Michael Swift², Myoungsoo Jung¹
Computer Architecture and Memory Systems Laboratory,
Korea Advanced Institute of Science and Technology (KAIST)¹, University of Wisconsin at Madison²
http://camelab.org

Abstract

NVMe is designed to unshackle flash from a traditional storage bus by allowing hosts to employ many threads to achieve higher bandwidth. While NVMe enables users to fully exploit all levels of parallelism offered by modern SSDs, current firmware designs are not scalable and have difficulty in handling a large number of I/O requests in parallel due to their limited computation power and many hardware contentions.

We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests in a second (1MIOPS) while hiding the long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device. To show its extreme parallel scalability, we implement DeepFlash on a many-core prototype processor that employs dozens of lightweight cores, analyze new challenges from parallel I/O processing, and address the challenges by applying concurrency-aware optimizations. Our comprehensive evaluation reveals that DeepFlash can serve around 4.5 GB/s, while minimizing the CPU demand on microbenchmarks and real server workloads.

1 Introduction

Solid State Disks (SSDs) are extensively used as caches, databases, and boot drives in diverse computing domains [37, 42, 47, 60, 74]. The organizations of modern SSDs and the flash packages therein have undergone significant technology shifts [11, 32, 39, 56, 72]. In the meantime, new storage interfaces have been proposed to reduce the overheads of the host storage stack, thereby improving storage-level bandwidth. Specifically, NVM Express (NVMe) is designed to unshackle flash from a traditional storage interface and enable users to take full advantage of all levels of SSD internal parallelism [13, 14, 54, 71]. For example, it provides streamlined commands and up to 64K deep queues, each with up to 64K entries. There is massive parallelism in the backend, where requests are sent to tens or hundreds of flash packages. This enables assigning queues to different applications; multiple deep NVMe queues allow the host to employ many threads, thereby maximizing storage utilization.

An SSD should handle many concurrent requests with its massive internal parallelism [12, 31, 33, 34, 61]. However, it is difficult for a single storage device to manage the tremendous number of I/O requests arriving in parallel over many NVMe queues. Since highly parallel I/O services require simultaneously performing many SSD internal tasks, such as address translation, multi-queue processing, and flash scheduling, the SSD needs multiple cores and a parallel implementation for higher throughput. In addition, as the tasks inside the SSD increase, the SSD must address several scalability challenges brought by garbage collection, memory/storage contention and data consistency management when processing I/O requests in parallel. These new challenges can introduce high computation loads, making it hard to satisfy the performance demands of diverse data-centric systems. Thus, high-performance SSDs require not only a powerful CPU and controller but also efficient flash firmware.

We propose DeepFlash, a manycore-based NVMe SSD platform that can process more than one million I/O requests within a second (1MIOPS) while minimizing the requirements of internal resources. To this end, we design a new flash firmware model, which can extract the maximum performance of hundreds of flash packages by concurrently executing firmware components atop a manycore processor. The layered flash firmware in many SSD technologies handles the internal datapath from PCIe to the physical flash interfaces as a single heavy task [66, 76]. In contrast, DeepFlash employs a many-to-many threading model, which multiplexes any number of threads onto any number of cores in firmware. Specifically, we analyze the key functions of the layered flash firmware and decompose them into multiple modules, each of which is scaled independently to run across many cores. Based on this analysis, this work classifies the modules into a queue-gather stage, a trans-apply stage, and a flash-scatter stage, inspired by a parallel data analysis system [67]. Multiple threads on the queue-gather stage handle NVMe queues, while each thread on the flash-scatter stage handles many flash devices on a channel bus. The address translation between logical block addresses and physical page numbers is simultaneously performed by many threads at the trans-apply stage.

As each stage can have different numbers of threads, contention between the threads can arise for shared hardware resources and structures, such as the mapping table, metadata and memory management structures. Integrating many cores in the scalable flash firmware design also introduces data consistency, coherence and hazard issues. We analyze the new challenges arising from concurrency, and address them by applying concurrency-aware optimization techniques to each stage, such as parallel queue processing, cache bypassing and background work for time-consuming SSD internal tasks.

We evaluate a real system with our hardware platform that implements DeepFlash and internally emulates low-level flash media in a timing-accurate manner. Our evaluation results show that DeepFlash successfully provides more than 1MIOPS with a dozen simple low-power cores for all reads and writes with sequential and random access patterns. In addition, DeepFlash reaches 4.5 GB/s (above 1MIOPS), on average, under the execution of diverse real server workloads. The main contributions of this work are summarized as below:

• Many-to-many threading firmware. We identify scalability and parallelism opportunities for high-performance flash firmware. Our many-to-many threading model allows future manycore-based SSDs to dynamically shift their computing power based on different workload demands without any hardware modification. DeepFlash splits all functions from the existing layered firmware architecture into three stages, each with one or more thread groups. Different thread groups can communicate with each other over an on-chip interconnection network within the target SSD.

• Parallel NVMe queue management. While employing many NVMe queues allows the SSD to handle many I/O requests through PCIe communication, it is hard to coordinate simultaneous queue accesses from many cores. DeepFlash dynamically allocates the cores to process NVMe queues rather than statically assigning one core per queue. Thus, a single queue is serviced by multiple cores, and a single core can service multiple queues, which can deliver full bandwidth for both balanced and unbalanced NVMe I/O workloads. We show that this parallel NVMe queue processing exceeds the performance of the static core-per-queue allocation by 6×, on average, when only a few queues are in use. DeepFlash also balances core utilization over computing resources.

• Efficient I/O processing. We increase the parallel scalability of the many-to-many threading model by employing non-blocking communication mechanisms. We also apply simple but effective lock and address randomization methods, which can distribute incoming I/O requests across multiple address translators and flash packages. The proposed methods minimize the number of hardware cores needed to achieve 1MIOPS.

Putting it all together, DeepFlash improves bandwidth by 3.4× while significantly reducing CPU requirements, compared to conventional firmware. Our DeepFlash requires only a dozen lightweight in-order cores to deliver 1MIOPS.

2 Background

2.1 High Performance NVMe SSDs

Figure 1: Overall architecture of an NVMe SSD.

Baseline. Figure 1 shows an overview of a high-performance SSD architecture that Marvell recently published [43]. The host connects to the underlying SSD through four Gen 3.0 PCIe lanes (4 GB/s) and a PCIe controller. The SSD architecture employs three embedded processors, each employing two cores [27], which are connected to an internal DRAM controller via a processor interconnect. The SSD employs several special-purpose processing elements, including a low-density parity-check (LDPC) sequencer, a data transfer (DMA) engine, and scratch-pad memory for metadata management. All these multi-core processors, controllers, and components are connected to a flash complex that connects to eight channels, each connecting to eight packages, via a flash physical layer (PHY). We select this multicore architecture description as our reference and extend it, since it is the only documented NVMe storage architecture that employs multiple cores at this juncture, but other commercially available SSDs also employ a similar multi-core firmware controller [38, 50, 59].

Future architecture. The performance offered by these devices is by far below 1MIOPS. For higher bandwidth, a future device can extend its storage and processor complexes with more flash packages and cores, respectively, which are highlighted in red in the figure. The bandwidth of each flash package is in practice tens of MB/s, and thus it requires employing more flashes/channels, thereby increasing I/O parallelism. This flash-side extension raises several architectural issues. First, the firmware will make frequent SSD-internal memory accesses that stress the processor complex. Even though the PCIe core, channel and other memory control logic may be implemented, metadata information increases for the extension, and its access frequency gets higher to achieve 1MIOPS. In addition, DRAM accesses for I/O buffering can become a critical bottleneck when hiding flash's long latency. Simply making cores faster may not be sufficient, because the processors will suffer from frequent stalls due to less locality and contention at memory. This, in turn, makes each core's bandwidth lower, which should be addressed with higher parallelism on the computation parts. We will explain the current architecture and show why it is non-scalable in Section 3.

Figure 2: Datapath from PCIe to Flash and overview of flash firmware. (a) NVMe SSD datapath. (b) Flash firmware.

Datapath from PCIe to flash. To understand the source of the scalability problems, one must be aware of the internal datapath of NVMe SSDs and the details of its management. Figure 2a illustrates the internal datapath between PCIe and NV-DDR [7, 53], which is managed by the NVMe [16] and ONFi [69] protocols, respectively. NVMe employs multiple device-side doorbell registers, which are designed to minimize handshaking overheads. Thus, to issue an I/O request, applications submit an NVMe command to a submission queue (SQ) (1) and notify the SSD of the request arrival by writing to the doorbell corresponding to the queue (2). After fetching a host request from the queue (3), the flash firmware, known as the flash translation layer (FTL), parses the I/O operation, metadata, and data location of the target command (4). The FTL then translates the host's logical block address (LBA) into a physical page address (PPA) (5). In the meantime, the FTL also orchestrates data transfers. Once the address translation is completed, the FTL moves the data, based on the I/O timing constraints defined by ONFi (6). A completion queue (CQ) is always paired with an SQ in the NVMe protocol, and the FTL writes a result to the CQ and updates the tail doorbell corresponding to the host request. The FTL notifies the host of the queue completion (7) by generating a message-signaled interrupt (MSI) (8). The host can then finish the I/O process (9) and acknowledge the MSI by writing the head doorbell associated with the original request (10).
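
To make the doorbell-driven submission flow concrete, the minimal host-side C sketch below mimics steps (1) and (2): place a command in the SQ, then ring the SQ tail doorbell so the device can fetch and process it (steps (3) through (8)). The structures (nvme_cmd, nvme_sq) are simplified, hypothetical layouts for illustration only, not the actual NVMe data structures or any driver code from the paper.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical, simplified NVMe structures for illustration only. */
    struct nvme_cmd {
        uint8_t  opcode;      /* e.g., read or write                        */
        uint16_t cid;         /* command identifier, echoed in the CQ entry */
        uint64_t prp1, prp2;  /* host-physical data pointers (PRP scheme)   */
        uint64_t slba;        /* starting LBA                               */
        uint16_t nlb;         /* number of logical blocks                   */
    };

    struct nvme_sq {
        struct nvme_cmd   *entries;       /* host-memory submission queue   */
        uint16_t           tail, depth;
        volatile uint32_t *tail_doorbell; /* device BAR register for this SQ */
    };

    /* Step (1): place the command in the SQ; step (2): ring the doorbell so
     * the SSD fetches it (step (3)) and processes it (steps (4)-(8)).       */
    static void nvme_submit(struct nvme_sq *sq, const struct nvme_cmd *cmd)
    {
        memcpy(&sq->entries[sq->tail], cmd, sizeof(*cmd));
        sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);
        __sync_synchronize();            /* command visible before doorbell  */
        *sq->tail_doorbell = sq->tail;   /* MMIO write notifies the SSD      */
    }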

2.2 Software Support

Flash firmware. Figure 2b shows the processes of the FTL, which performs steps (3)∼(8). The FTL manages NVMe queues/requests and responds to host requests by processing the corresponding doorbell. The FTL then performs address translations and manages memory transactions for the flash media. While prior studies [34, 48, 49] distinguish host command controls and flash transaction management as the host interface layer (HIL) and the flash interface layer (FIL), respectively, in practice both modules are implemented as a layered firmware calling through functions of event-based code with a single thread [57, 65, 70]. The performance of the layered firmware is not on the critical path, as flash latency is several orders of magnitude longer than the processing latency of one I/O command. However, SSDs require a large number of flash packages and queues to handle more than a thousand requests per msec. When increasing the number of underlying flash packages, the FTL requires powerful computation not only to spread I/O requests across flash packages but also to process I/O commands in parallel. We observe that compute latency keeps increasing due to the non-scalable firmware and takes 93.6% of the total I/O processing time in the worst case.

Memory spaces. While the FTL manages the logical block space and physical flash space, it also handles the SSD's internal memory space and accesses to the host system memory space (cf. Figure 2b). SSDs manage internal memory for caching incoming I/O requests and the corresponding data. Similarly, the FTL uses the internal memory for metadata and NVMe queue management (e.g., SQs/CQs). In addition, the FTL requires access to the host system memory space to transfer the actual data contents over PCIe. Unfortunately, a layered firmware design engages in accesses to memory without any constraint or protection mechanism, which can make the data inconsistent and incoherent under simultaneous accesses. However, computing resources with more parallelism must increase to achieve more than 1MIOPS, and many I/O requests need to be processed simultaneously. Thus, all shared memory spaces of a manycore SSD platform require appropriate concurrency control and resource protection, similar to virtual memory.

3 Challenges to Exceeding 1MIOPS

To understand the main challenges in scaling SSD firmware, we extend the baseline SSD architecture in a highly scalable environment: Intel many-integrated cores (MIC) [18]. We select this processor platform because its architecture uses a simple in-order and low-frequency core model, but provides a high core count to study parallelism and scalability. The platform internally emulates low-level flash modules with hardware-validated software¹, so that the flash complex can be extended by adding more emulated channels and flash resources: the number of flash packages (quad-die package, QDP) varies from 2 to 512. Note that MIC is a prototyping platform used only for exploring the limits of scalability, rather than as a suggestion for an actual SSD controller.

¹This emulation framework is validated by comparing it with a Samsung Z-SSD prototype [4], a multi-stream 983 DCT (Proof of Concept), an 850 Pro [15] and an Intel NVMe 750 [25]. The software will be publicly available.

Figure 3: Performance with varying flash packages and cores. (a) Flash scaling. (b) Core scaling.

Flash scaling. The bandwidth of a low-level flash package is several orders of magnitude lower than the PCIe bandwidth. Thus, SSD vendors integrate many flash packages over multiple channels, which can serve the I/O requests managed by NVMe in parallel. Figure 3a shows the relationship of bandwidth and execution latency breakdown with various numbers of flash packages. In this evaluation, we emulate an SSD by creating a layered firmware instance in a single MIC core, in which two threads are initialized to process the tasks of the HIL and FTL, respectively. We also assign 16 MIC cores (one core per flash channel) to manage the flash interface subsystems. We evaluate the performance of the configured SSD emulation platform by testing 4KB sequential writes. For the breakdown analysis, we decompose the total latency into i) NVMe management (I/O parse and I/O fetch), ii) I/O cache, iii) address translation (including flash scheduling), iv) NVMe data transfers (DMA) and v) flash operations (Flash). One can observe from the figure that the SSD performance saturates at 170K IOPS with 64 flash packages, connected over 16 channels. Specifically, the flash operations are the main contributor to the total execution time in cases where our SSD employs tens of flash packages (73% of the total latency). However, as the number of flash packages increases (more than 32), the layered firmware operations on a core become the performance bottleneck. NVMe management and address translation account for 41% and 29% of the total time, while flash consumes only 12% of the total cycles.

There are two reasons that flash firmware turns into the performance bottleneck with many underlying flash devices. First, NVMe queues can supply many I/O requests to take advantage of the SSD internal parallelism, but a single-core SSD controller is insufficient to fetch all the requests. Second, it is faster to parallelize I/O accesses across many flash chips than to perform address translation on only one core. These new challenges make it difficult to fully leverage the internal parallelism with the conventional layered firmware model.

Core scaling. To take flash firmware off the critical path in scalable I/O processing, one can increase computing power with the execution of many firmware instances. This approach can allocate a core per NVMe SQ/CQ and initiate one layered firmware instance in each core. However, we observe that this naive approach cannot successfully address the burdens brought by flash firmware. To be precise, we evaluate IOPS with a varying number of cores, ranging from 1 to 32. Figure 3b compares the performance of the aforementioned naive many-core approach (Naive) with a system that expects perfect parallel scalability (Expected). Expected's performance is calculated by multiplying the number of cores with the IOPS of Naive built on a single-core SSD. One can observe from this figure that Naive can only achieve 813K IOPS even with 32 cores, which exhibits 82.6% lower performance compared to Expected. This is because contention and consistency management for the memory spaces of the internal DRAM (cf. Section 5.3) introduce significant synchronization overheads. In addition, the FTL must serialize the I/O requests to avoid hazards while processing many queues in parallel. Since all these issues are not considered by the layered firmware model, it should be re-designed with core scaling in mind. The goal of our new firmware is to fully parallelize multiple NVMe processing datapaths in a highly scalable manner while minimizing the usage of SSD internal resources. DeepFlash requires only 12 in-order cores to achieve 1M or more IOPS.

4 Many-to-Many Threading Firmware

Conventional FTL designs are unable to fully convert the computing power brought by a manycore processor into storage performance, as they put all FTL tasks into a single large block of the software stack. In this section, we analyze the functions of traditional FTLs and decompose them into seven different function groups: 1) NVMe queue handling (NVMQ), 2) data cache (CACHE), 3) address translation (TRANS), 4) index lock (ILOCK), 5) logging utility (LOG), 6) background garbage collection utility (BGC), and 7) flash command and transaction scheduling (FCMD). We then reconstruct the key function groups from the ground up, keeping concurrency in mind, and deploy our reworked firmware modules across multiple cores in a scalable manner.

4.1 Overview

Figure 4: Many-to-many threading firmware model.

Figure 4 shows our DeepFlash's many-to-many threading firmware model. The firmware is a set of modules (i.e., threads) in a request-processing network that is mapped to a set of processors. Each thread can have a firmware operation, and a task can be scaled by instantiating it into multiple parallel threads, referred to as a stage. Based on different data processing flows and tasks, we group the stages into queue-gather, trans-apply, and flash-scatter modules. The queue-gather stage mainly parses NVMe requests and collects them to the SSD-internal DRAM, whereas the trans-apply stage mainly buffers the data and translates addresses. The flash-scatter stage spreads the requests across many underlying flash packages and manages background SSD-internal tasks in parallel. This new firmware enables scalable and flexible computing, and highly parallel I/O execution.
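
A minimal sketch of how such a request-processing network could be organized follows; the stage identifiers mirror the seven function groups above, while the per-stage thread counts, the worker structure and stage_main are hypothetical placeholders rather than the paper's implementation.

    #include <pthread.h>
    #include <stdlib.h>

    /* The seven decomposed firmware functions, grouped into three stages. */
    enum stage { NVMQ, CACHE, TRANS, ILOCK, LOG, BGC, FCMD, NUM_STAGES };

    struct worker {
        enum stage role;   /* which firmware function this thread runs */
        int        id;     /* instance index within the stage          */
        pthread_t  tid;
    };

    /* Placeholder stage body: a real thread would pop requests from its
     * inbox, process them, and forward them to the next stage's inbox.  */
    static void *stage_main(void *arg)
    {
        struct worker *w = arg;
        (void)w;
        return NULL;
    }

    /* Instantiate a configurable number of threads per stage; the OS or
     * runtime multiplexes them onto any number of cores (many-to-many). */
    static struct worker *spawn_stage(enum stage role, int nthreads)
    {
        struct worker *w = calloc((size_t)nthreads, sizeof(*w));
        for (int i = 0; i < nthreads; i++) {
            w[i].role = role;
            w[i].id   = i;
            pthread_create(&w[i].tid, NULL, stage_main, &w[i]);
        }
        return w;
    }

Under this sketch, shifting computing power toward a hotter stage is just a matter of passing a larger thread count to spawn_stage for that stage, with no hardware changes.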

All threads are maximally independent, and I/O requests are always processed from left to right in the thread network, which reduces the hardware contention and consistency problems imposed by managing the various memory spaces. For example, two independent I/O requests are processed by two different network paths (highlighted in Figure 4 by red and blue lines, respectively). Consequently, DeepFlash can simultaneously service as many incoming I/O requests as the network paths it can create. In contrast to the other threads, background threads are asynchronous with the incoming I/O requests or host-side services. Therefore, they create their own network paths (dashed lines), which perform SSD internal tasks in the background. Since each stage can process a different part of an I/O request, DeepFlash can process multiple requests in a pipelined manner. Our firmware model can also be simply extended by adding more threads based on the performance demands of the target system.

Figure 5: Firmware architecture.

Figure 5 illustrates how our many-to-many threading model can be applied to and operate in the many-core based SSD architecture of DeepFlash. While the procedure of I/O services is managed by many threads in the different data processing paths, the threads can be allocated to any core in the network, in a parallel and scalable manner.

4.2 Queue-gather Stage

Figure 6: Challenges of NVMQ allocation (SQ:NVMQ). (a) Data contention (1:N). (b) Unbalanced task. (c) I/O hazard.

NVMe queue management. For high performance, NVMe supports up to 64K queues, each with up to 64K entries. As shown in Figure 6a, once a host initiates an NVMe command to an SQ and writes the corresponding doorbell, the firmware fetches the command from the SQ and decodes a non-contiguous set of host physical memory pages by referring to a kernel list structure [2], called a physical region page (PRP) [23]. Since the length of the data in a request can vary, its data can be delivered by multiple data frames, each of which is usually 4KB. While all command information can be retrieved from the device-level registers and the SQ, the contents of such data frames exist across non-contiguous host-side DRAM (for a single I/O request). The firmware parses the PRP and begins DMA for the multiple data frames of each request. Once all the I/O services associated with those data frames complete, the firmware notifies the host of the completion through the target CQ. We refer to all the tasks related to this NVMe command and queue management as NVMQ.
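
The PRP decoding described above can be sketched as follows, assuming 4KB data frames and the simplified two-entry PRP scheme (PRP1, plus either a second page or a pointer to a PRP list); the function name and its arguments are hypothetical, and corner cases such as an offset within the first page are ignored.

    #include <stdint.h>
    #include <stddef.h>

    #define FRAME_SIZE 4096u   /* one data frame per host page */

    /* Enumerate the host-physical pages (data frames) of one I/O request.
     * prp1/prp2 come from the fetched command; prp_list points to the
     * SSD-side copy of the PRP list when more than two frames are needed.
     * Returns the number of frame addresses written into 'frames'.        */
    static size_t prp_to_frames(uint64_t prp1, uint64_t prp2,
                                const uint64_t *prp_list,
                                size_t nframes, uint64_t *frames)
    {
        if (nframes == 0)
            return 0;
        frames[0] = prp1;                    /* first frame                 */
        if (nframes == 1)
            return 1;
        if (nframes == 2) {                  /* second frame sits in PRP2   */
            frames[1] = prp2;
            return 2;
        }
        for (size_t i = 1; i < nframes; i++) /* PRP2 named a PRP list       */
            frames[i] = prp_list[i - 1];
        return nframes;
    }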

A challenge in employing many cores for parallel queue processing is that multiple NVMQ cores may simultaneously fetch the same set of NVMe commands from a single queue. This in turn accesses the host memory by referring to the same set of PRPs, which makes the behavior of parallel queue accesses undefined and non-deterministic (Figure 6a). To address this challenge, one can make each core handle only one set of SQ/CQ, so that there is no contention caused by simultaneous queue processing or PRP accesses (Figure 6b). In this "static" queue allocation, each NVMQ core fetches a request from a different queue, based on the doorbell's queue index, and brings the corresponding data from the host system memory to the SSD internal memory. However, this static approach requires that the host balance requests across queues to maximize the resource utilization of NVMQ threads. In addition, it is difficult to scale to a large number of queues. DeepFlash addresses these challenges by introducing dynamic I/O serialization, which allows multiple NVMQ threads to access each SQ/CQ in parallel while avoiding consistency violations. Details of NVMQ will be explained in Section 5.1.

I/O mutual exclusion. Even though the NVMe specification does not regulate the processing order of NVMe commands in the range from where the head pointer indicates to the entry that the tail pointer refers to [3], users may expect that the SSD processes the requests in the order in which they were submitted. However, in our DeepFlash, many threads can simultaneously process I/O requests in any order of access. This can make the order of I/O processing different from the order that the NVMe queues (and users) expected, which may in turn introduce an I/O hazard or a consistency issue. For example, Figure 6c shows a potential problem brought by parallel I/O processing. In this figure, there are two different I/O requests from the same NVMe SQ, request-1 (a write) and request-2 (a read), which create two different paths but target the same PPA. Since these two requests are processed by different NVMQ threads, request-2 can be served from the target slightly earlier than request-1. Request-1 will then be stalled, and request-2 will be served with stale data. During this phase, it is also possible that any thread can invalidate the data while transferring or buffering them out of order.

While serializing the I/O request processing with a strong ordering can guarantee data consistency, it significantly hurts SSD performance. One potential solution is introducing a locking system which provides a lock per page. However, per-page lock operations within an SSD can be one of the most expensive mechanisms due to the various I/O lengths and the large storage capacity of the SSD. Instead, we partition the physical flash address space into many shards, whose access granularity is greater than a page, and assign an index-based lock to each shard. We implement the index lock as a red-black tree and make this locking system a dedicated thread (ILOCK). The tree helps ILOCK quickly identify which lock to use, and reduces the overheads of lock acquisition and release. Nevertheless, since many NVMQ threads may access a few ILOCK threads, this can also become a source of resource contention. DeepFlash optimizes ILOCK by redistributing the requests based on lock ownership (cf. Section 5.2). Note that there is no consistency issue if the I/O requests target different LBAs. In addition, as most OSes manage access control to prevent different cores from accessing the same files [19, 41, 52], I/O requests from different NVMe queues (mapping to different cores) access different LBAs, which also does not introduce a consistency issue. Therefore, DeepFlash can solve the I/O hazard by guaranteeing the ordering of I/O requests which are issued to the same queue and access the same LBAs, while processing other I/O requests out of order.
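
As a small illustration of the shard-granularity index lock, the sketch below maps an LBA onto the shard identifier that keys the ILOCK tree; the shard size and the request descriptor are hypothetical values chosen for the example, not constants taken from the paper.

    #include <stdint.h>
    #include <stdbool.h>

    #define SHARD_PAGES 256u   /* assumed: one lock covers a 256-page LBA range */

    struct lock_req {
        uint32_t sq_index;  /* lets ILOCK infer submission order on conflicts   */
        uint32_t nvmq_id;   /* requesting NVMQ thread (owner if it acquires)    */
        uint64_t lba;
        uint32_t npages;
        bool     acquire;   /* true = acquire, false = release                  */
    };

    /* All pages of a shard share one index lock, so the tree is keyed by
     * shard id rather than by page; ordering only has to be enforced for
     * requests from the same SQ that touch the same shard.                 */
    static inline uint64_t shard_of(uint64_t lba)
    {
        return lba / SHARD_PAGES;
    }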

4.3 Trans-apply Stage

Data caching and buffering. To appropriately handle NVMe's parallel queues and achieve more than 1MIOPS, it is important to utilize the internal DRAM buffer efficiently. Specifically, even though modern SSDs enjoy the massive internal parallelism stemming from tens or hundreds of flash packages, the latency of each chip is orders of magnitude longer than DRAM [22, 45, 46], which can stall NVMQ's I/O processing. DeepFlash, therefore, incorporates CACHE threads that incarnate the SSD internal memory as a burst buffer by mapping LBAs to DRAM addresses rather than flash ones. The data buffered by CACHE can be drained by striping requests across many flash packages with high parallelism.

Figure 7: Challenge analysis in CACHE and TRANS. (a) Main procedure of CACHE. (b) Shards (TRANS).

As shown in Figure 7a, each CACHE thread has its own mapping table to record the memory locations of the buffered requests. CACHE threads are configured as a traditional direct-map cache to reduce the burden of table lookup and cache replacement. In this design, as each CACHE thread has a different memory region to manage, NVMQ simply calculates the index of the target memory region by taking the modulo of the request's LBA, and forwards the incoming requests to the target CACHE. However, since all NVMQ threads may communicate with a CACHE thread for every I/O request, this can introduce extra latency imposed by passing messages among threads. In addition, to minimize the number of cores that DeepFlash uses, we need to fully utilize the allocated cores and dedicate them to each firmware operation while minimizing the communication overhead. To this end, we put a cache tag inquiry method in NVMQ and make CACHE threads fully handle cache hits and evictions. With the tag inquiry method, NVMQ can create a bypass path, which can remove the communication overheads (cf. Section 5.3).

Parallel address translation. The FTL manages physical blocks and is aware of flash-specific behavior such as erase-before-write and the asymmetric erase and read/write operation units (block vs. page). We decouple FTL address translation from system management activities such as garbage collection or logging (e.g., journalling) and allocate the management to multiple threads. The threads that perform this simplified address translation are referred to as TRANS. To translate addresses in parallel, we need to partition both the LBA space and the PPA space and allocate them to each TRANS thread.

As shown in Figure 7b, a simple solution is to split a single LBA space into m address chunks, where m is the number of TRANS threads, and map the addresses by wrapping around upon reaching m. To take advantage of channel-level parallelism, it can also separate a single PPA space into k shards, where k is the number of underlying channels, and map the shards to each TRANS with arithmetic modulo k. While this address partitioning can make all TRANS threads operate in parallel without interference, unbalanced I/O accesses can activate only a few TRANS threads or channels. This can introduce poor resource utilization and many resource conflicts, and stall a request service on the fly. Thus, we randomize the addresses when partitioning the LBA space with simple XOR operators. This scrambles the LBA and statically assigns all incoming I/O requests across different TRANS threads in an evenly distributed manner. We also allocate all the physical blocks of the PPA space to each TRANS in a round-robin fashion. This block-interleaved virtualization allows us to split the PPA space with finer granularity.
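
A minimal sketch of this XOR-based scrambling is shown below: the LBA is cut into bit groups as wide as the TRANS index and the groups are XOR-folded into the final index. The helper is hypothetical and assumes a power-of-two number of TRANS threads; any such configuration folds the same way.

    #include <stdint.h>

    /* Fold an LBA into a TRANS index by XOR-ing equal-width bit groups.
     * num_trans must be a power of two so the index is a simple mask.   */
    static inline unsigned trans_index(uint64_t lba, unsigned num_trans)
    {
        if (num_trans <= 1)
            return 0;
        unsigned bits = 0;
        for (unsigned n = num_trans; n > 1; n >>= 1)
            bits++;                       /* log2(num_trans)             */
        uint64_t mask = num_trans - 1;
        uint64_t idx  = 0;
        while (lba) {                     /* XOR every 'bits'-wide group */
            idx ^= lba & mask;
            lba >>= bits;
        }
        return (unsigned)idx;
    }

Because the fold is only a handful of shift and XOR operations, it fits the sub-20 ns budget that Section 5.3 reports for the randomizing function, and the same helper can be shared by NVMQ and CACHE threads so that both sides agree on the owning TRANS thread.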

4.4 Flash-scatter Stage

Background task scheduling. The datapath for garbage collection (GC) can be another critical path to achieving high bandwidth, as GC stalls many I/O services while reclaiming flash block(s). In this work, GCs can be performed in parallel by allocating separate core(s), referred to as BGC. BGC(s) record the block numbers that have no more entries to write while TRANS threads process incoming I/O requests. BGC then merges the blocks and updates the mapping table of the corresponding TRANS behind the I/O processing. Since a thread in TRANS can process address translations during BGC's block reclaims, this could introduce a consistency issue on mapping table updates. To avoid conflicts with TRANS threads, BGC reclaims blocks and updates the mapping table in the background when there is no activity in NVMQ and the TRANS threads have completed their translation tasks. If the system experiences a heavy load and clean blocks are running out, our approach performs on-demand GC. To avoid data consistency issues, we only block the execution of the TRANS thread which is responsible for the address translation of the flash block being reclaimed.

Journalling. SSD firmware requires journalling, periodically dumping the local metadata of TRANS threads (e.g., the mapping table) from DRAM to a designated flash. In addition, it needs to keep track of the changes which have not been dumped yet. However, managing consistency and coherency for persistent data can introduce a burden to TRANS. Our DeepFlash separates the journalling from TRANS and assigns it to a LOG thread. Specifically, TRANS writes the LPN-to-PPN mapping information of an FTL page table entry (PTE) to the out-of-band (OoB) area of the target flash page [64] in each flash program operation (along with the per-page data). In the meantime, LOG periodically reads all metadata in DRAM, stores them to flash, and builds a checkpoint in the background. For each checkpoint, LOG records a version, a commit and a page pointer indicating the physical location of the flash page where TRANS starts writing to. At boot time, LOG checks sanity by examining the commit. If the latest version is stale, LOG loads a previous version and reconstructs the mapping information by combining the checkpointed table and the PTEs that TRANS wrote since the previous checkpoint.
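
The checkpoint metadata kept by LOG can be pictured with the following hypothetical record layout; the field names, the commit marker and the sanity check are illustrative of the version/commit/pointer scheme described above, not the on-flash format used by the prototype.

    #include <stdint.h>
    #include <stdbool.h>

    #define COMMIT_MAGIC 0x1deaf1a5ULL   /* hypothetical commit marker */

    /* One checkpoint written by LOG in the background. 'table_ppn' points to
     * the first flash page of the dumped mapping table; 'journal_start_ppn'
     * is where TRANS began writing pages (whose OoB areas carry LPN-to-PPN
     * entries) after this checkpoint was taken.                              */
    struct log_checkpoint {
        uint64_t version;            /* monotonically increasing version      */
        uint64_t table_ppn;
        uint64_t journal_start_ppn;
        uint64_t commit;             /* written last; checkpoint is valid only
                                        if it matches COMMIT_MAGIC            */
    };

    /* Boot-time sanity check: if the latest checkpoint is stale, LOG falls
     * back to the previous one and replays the per-page OoB entries written
     * since it to rebuild the mapping table.                                 */
    static bool checkpoint_valid(const struct log_checkpoint *cp)
    {
        return cp->commit == COMMIT_MAGIC;
    }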

Figure 8: The main procedure of FCMD cores.

Parallel flash accesses. At the end of the DeepFlash network, the firmware threads need to i) compose flash transactions respecting the flash interface timing and ii) schedule them across different flash resources over the flash physical layer (PHY). These activities are managed by separate cores, referred to as FCMD. As shown in Figure 8, each thread in FCMD parses the PPA translated by TRANS (or generated by BGC/LOG) into the target channel, package, chip and plane numbers. The threads then check the target resources' availability and compose flash transactions by following the underlying flash interface protocol. Typically, the memory timings within a flash transaction can be classified into pre-dma, mem-op and post-dma. While pre-dma includes the operation command, address, and data transfer (for writes), post-dma is composed of a completion command and another data transfer (for reads). The memory operations of the underlying flash are called mem-op in this example. FCMD(s) then scatter the composed transactions over multiple flash resources. During this time, all transaction activities are scheduled in an interleaved way, so that the utilization of channel and flash resources is maximized. The completion order of multiple I/O requests processed by this transaction scheduling can spontaneously be out of order.

In our design, each FCMD thread is statically mapped to one or more channels, and the number of channels that will be assigned to an FCMD thread is determined based on SSD vendor demands (and/or computing power).
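
The transaction composition done by FCMD can be sketched as below. The bit-sliced PPA layout, the phase enum and the transaction record are assumptions made for the example (the channel and package widths simply match the 16-channel, 16-package evaluation platform); the point is that one scheduled transaction bundles the pre-dma, mem-op and post-dma phases for a specific channel/package/chip/plane so that phases of different transactions can be interleaved on the bus.

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed bit-sliced PPA layout for the sketch. */
    #define CH_BITS    4u
    #define PKG_BITS   4u
    #define CHIP_BITS  2u
    #define PLANE_BITS 1u
    #define PAGE_BITS  8u

    struct flash_addr {
        uint8_t  channel, package, chip, plane;
        uint32_t block, page;
    };

    /* ONFi-style phases of one transaction; FCMD interleaves the phases of
     * different transactions to keep every channel and package busy.        */
    enum phase { PRE_DMA, MEM_OP, POST_DMA };

    struct flash_xact {
        struct flash_addr addr;
        bool              is_write;
        enum phase        next;   /* next phase to issue when the bus is free */
    };

    /* Split a PPA produced by TRANS (or BGC/LOG) into its physical fields. */
    static struct flash_addr decode_ppa(uint64_t ppa)
    {
        struct flash_addr a;
        a.channel = (uint8_t)(ppa & ((1u << CH_BITS) - 1));     ppa >>= CH_BITS;
        a.package = (uint8_t)(ppa & ((1u << PKG_BITS) - 1));    ppa >>= PKG_BITS;
        a.chip    = (uint8_t)(ppa & ((1u << CHIP_BITS) - 1));   ppa >>= CHIP_BITS;
        a.plane   = (uint8_t)(ppa & ((1u << PLANE_BITS) - 1));  ppa >>= PLANE_BITS;
        a.page    = (uint32_t)(ppa & ((1u << PAGE_BITS) - 1));  ppa >>= PAGE_BITS;
        a.block   = (uint32_t)ppa;
        return a;
    }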

5 Optimizing DeepFlash

While the baseline DeepFlash architecture distributes functionality with many-to-many threading, there are scalability issues. In this section, we explain the details of the thread optimizations that increase parallel scalability and allow faster, more parallel implementations.

5.1 Parallel Processing for NVMe Queue

Figure 9: Dynamic I/O serialization (DIOS).

To address the challenges of the static queue allocation approach, we introduce dynamic I/O serialization (DIOS), which allows a variable ratio of queues to cores. DIOS decouples the fetching and parsing processes of NVMe queue entries. As shown in Figure 9, once an NVMQ thread fetches a batch of NVMe commands from an NVMe queue, other NVMQ threads can simultaneously parse the fetched NVMe queue entries. This allows all NVMQ threads to participate in processing the NVMe queue entries from the same queue or from multiple queues. Specifically, DIOS allocates a storage-side SQ buffer (per SQ) in a shared memory space (visible to all NVMQ threads) when the host initializes the NVMe SSD. When the host writes the tail index to the doorbell, an NVMQ thread fetches multiple NVMe queue entries and copies them (not the actual data) to the SQ buffer. All NVMQ threads then process the NVMe commands existing in the SQ buffer in parallel. The batch copy is performed per 64 entries, or until the tail of the SQ and CQ points to the same position. Similarly, DIOS creates a CQ buffer (per CQ) in the shared memory. NVMQ threads update the CQ buffer instead of the actual CQ, out of order, and flush the NVMe completion messages from the CQ buffer to the CQ in batch. This allows multiple threads to update an NVMe queue in parallel without a modification of the NVMe protocol or the host-side storage stack. Another technical challenge in processing a queue in parallel is that the head and tail pointers of the SQ and CQ buffers are also shared resources, which require protection against simultaneous access. DeepFlash offers DIOS's head (D-head) and tail (D-tail) pointers, and allows NVMQ threads to access the SQ and CQ through those pointers, respectively. Since the D-head and D-tail pointers are managed by the gcc atomic built-in function __sync_fetch_and_add [21], and the core allocation is performed by all NVMQ threads in parallel, the host memory can be simultaneously accessed, but at different locations.

5.2 Index Lock Optimization

Figure 10: Optimization details. (a) The main procedure of ILOCK. (b) The example of CACHE bypassing.

When multiple NVMQ threads contend to acquire or release the same lock due to the same target address range, two technical issues can arise: i) lock contention and ii) low resource utilization of NVMQ. As shown in Figure 10a, an ILOCK thread sees all incoming lock requests (per page, by LBA) through its message queue. This queue sorts the messages based on SQ indices, and each message maintains a thread request structure that includes an SQ index, NVMQ ID, LBA, and lock request information (e.g., acquire and release). Since the order of a queue's lock requests is non-deterministic, in a case of contention on acquisition, ILOCK must perform the I/O services by respecting the order of requests in the corresponding SQ. Thus, the ILOCK thread infers the SQ order by referring to the SQ index in the message queue if the target LBA of the lock request has a conflict. It then checks the red-black (RB) tree, whose LBA-indexed node contains the lock number and the owner ID that already acquired the corresponding address. If there is no node in the lock RB tree, the ILOCK thread allocates a node with the request's NVMQ ID. When ILOCK receives a release request, it directly removes the target node without an SQ inference process. If the target address is already held by another NVMQ thread, the lock requester can be stalled until the corresponding I/O service is completed. Since low-level flash latency takes hundreds of microseconds to a few milliseconds, the stalled NVMQ can hurt overall performance. In our design, ILOCK returns the owner ID for all lock acquisition requests rather than simply returning the acquisition result (e.g., false or fail). The NVMQ thread receives the ID of the owning NVMQ thread, and can forward the request there to be processed rather than communicating with ILOCK again. Alternatively, the NVMQ thread can perform other tasks, such as issuing the I/O service to TRANS or CACHE. The insight behind this forwarding is that if another NVMQ thread owns the corresponding lock of a request, forwarding the request to the owner stops any further communication with ILOCK. This, in turn, can free the NVMQ thread from waiting for the lock acquisition, which increases the parallelism of DIOS.
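
The acquire path with owner forwarding can be sketched as follows; a flat table stands in for the LBA-indexed red-black tree, and the structure and function names are hypothetical, but the control flow mirrors the description above: the caller either becomes the owner or learns which NVMQ thread to forward the request to.

    #include <stdint.h>

    #define MAX_LOCKS 1024u

    /* A flat table stands in for ILOCK's LBA-indexed red-black tree in this
     * sketch; each node records which NVMQ thread owns the shard lock.      */
    struct lock_node  { uint64_t shard; uint32_t owner; };
    struct lock_table { struct lock_node nodes[MAX_LOCKS]; unsigned n; };

    /* Acquire with owner forwarding. ILOCK is a single dedicated thread
     * draining its message queue, so no extra synchronization is needed
     * here. The return value is always an owner ID: if it equals nvmq_id,
     * the caller acquired the lock; otherwise the caller forwards the I/O
     * request to the returned NVMQ thread instead of stalling on ILOCK.     */
    static uint32_t ilock_acquire(struct lock_table *t, uint64_t shard,
                                  uint32_t nvmq_id)
    {
        for (unsigned i = 0; i < t->n; i++)
            if (t->nodes[i].shard == shard)
                return t->nodes[i].owner;     /* already held: forward      */
        if (t->n < MAX_LOCKS) {               /* free: caller becomes owner */
            t->nodes[t->n].shard = shard;
            t->nodes[t->n].owner = nvmq_id;
            t->n++;
        }                                     /* (overflow ignored here)    */
        return nvmq_id;
    }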

5.3 Non-blocking Cache

To get CACHE off the critical path, we add a direct path between NVMQ and TRANS threads and make NVMQ threads access CACHE threads only if there is data in CACHE. We allocate direct-map table(s) in a shared memory space to accommodate the cache metadata, so that NVMQ threads can look up the cache metadata on their own and send I/O requests to CACHE only if there is a hit. However, this simple approach may introduce inconsistency between the cache metadata of the direct-map table and the target data of CACHE. When a write evicts a dirty page from the burst buffer, the metadata of the evicted page is removed from the direct-map table immediately. However, the target data of the evicted page may still stay in the burst buffer, due to the long latency of a flash write. Therefore, when a dirty page is in the process of eviction, read requests that target the same page may access stale data from the flash. To coordinate the direct-map table and CACHE correctly, we add an "evicted LPN" field to each map table entry, which records the page number being evicted (cf. Figure 10b). In the example of the figure, we assume the burst buffer is a direct-mapped cache with 3 entries. Req 1 (the write at 0x03) evicts the dirty page at LPN 0x00. Thus, NVMQ records the LPN of Req 1 in the cached LPN field of the direct-map table and moves the address of the evicted page to its evicted LPN field. Later, as the LPN of Req 2 (the read at 0x00) matches the evicted LPN field, Req 2 is served by CACHE instead of accessing the flash. If CACHE is busy evicting the dirty page at LPN 0x00, Req 3 (the write at 0x06) would have to be stalled. To address this, we make Req 3 directly bypass CACHE. Once the eviction successfully completes, CACHE clears the evicted LPN field (4).
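
The direct-map tag check that NVMQ performs on its own can be sketched with the entry layout below. The field names are hypothetical and the routine is simplified to the routing decision (allocation on write misses is omitted), but the outcomes correspond to the Figure 10b example: hit on a resident page, hit on a page still being evicted, or a bypass straight to TRANS.

    #include <stdint.h>

    /* One entry of the shared direct-map table (metadata only; the data
     * itself lives in the CACHE-managed burst buffer).                    */
    struct map_entry {
        uint64_t cached_lpn;   /* LPN currently held by this cache slot      */
        uint64_t evicted_lpn;  /* LPN whose flush to flash is still in flight */
    };

    enum cache_route { ROUTE_CACHE, ROUTE_BYPASS };

    /* NVMQ-side routing: send the request to CACHE only when the slot holds
     * (or is still evicting) the target LPN; otherwise bypass CACHE and go
     * directly to TRANS, avoiding a stall behind a busy eviction.           */
    static enum cache_route route_request(const struct map_entry *table,
                                          uint64_t nslots, uint64_t lpn)
    {
        const struct map_entry *e = &table[lpn % nslots];  /* direct-mapped  */
        if (e->cached_lpn == lpn || e->evicted_lpn == lpn)
            return ROUTE_CACHE;
        return ROUTE_BYPASS;
    }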


To make this non-blocking cache more efficient, we add a simple randomizing function to retrieve the target TRANS index for NVMQ and CACHE threads, which can evenly distribute their requests in a static manner. This function performs an XOR operation per bit over all the bit groups and generates the target TRANS index, which takes less than 20 ns. The randomization allows the queue-gather stage to issue requests to TRANS while addressing load imbalance.

6 Evaluation

Implementation platform. We set up an accurate SSD emulation platform by respecting the real NVMe protocol, the timing constraints of the flash backbone and the functionality of a flexible firmware. Specifically, we emulate manycore-based SSD firmware by using a MIC 5120D accelerator that employs 60 lightweight in-order cores (4 hardware threads per core) [28]. The MIC cores operate at 1GHz and are implemented by applying low-power techniques such as a short in-order pipeline. We emulate the flash backbone by modelling various flash latencies, different levels of parallelism (i.e., channel/way/flash) and the request conflicts for flash resources. Our flash backbone consists of 16 channels, each connecting 16 QDP flash packages [69]; we observed that the performance of both read and write operations on the backbone itself is not the bottleneck to achieving more than 1MIOPS. The NVMe interface on the accelerator is also fully emulated by wrapping Intel's symmetric communications interface (SCIF) with an NVMe emulation driver and controller that we implemented. The host employs a Xeon 16-core processor and 256 GB DRAM, running Linux kernel 2.6.32 [62]. It should be noted that this work uses MIC to explore the scalability limits of the design; the resulting software can run with fewer cores if they are more powerful, but the design can now be about what is most economic and power efficient, rather than whether the firmware can be scalable.

Configurations. DeepFlash is the emulated SSD platform including all the proposed designs of this paper. Compared to DeepFlash, BaseDeepFlash does not apply the optimization techniques described in Section 5. We also evaluate the performance of a real Intel customer-grade SSD (750SSD) [25] and a high-performance NVMe SSD (4600SSD) [26] for a better comparison. In addition, we emulate another SSD platform (ManyLayered), which is an approach that scales up the layered firmware on many cores. Specifically, ManyLayered statically splits the SSD hardware resources into multiple subsets, each containing the resources of one flash channel and running a layered firmware instance independently. For each layered firmware instance, ManyLayered assigns a pair of threads: one is used for managing flash transactions, and another is assigned to run the HIL and FTL. All these emulation platforms use 12 cores by default. Lastly, we also test different flash technologies such as SLC, MLC and TLC, whose latency characteristics are extracted from [44], [45] and [46], respectively. By default, the MLC flash array in the pristine state is used for our evaluations. The details of the SSD platform are listed in Table 1.

Workloads. In addition to microbenchmarks (reads and writes with sequential and random patterns), we test diverse server workloads, collected from Microsoft Production Server (MPS) [35], FIU SRCMap [63], Enterprise, and FIU IODedup [40]. Each workload exhibits various request sizes, ranging from 4KB to tens of KB, which are listed in Table 1. Since all the workload traces were collected from narrow-queue SATA hard disks, replaying the traces with the original timestamps cannot fully utilize the deep NVMe queues, which in turn conceals the real performance of the SSD [29]. To this end, our trace replaying approach allocates 16 worker threads in the host to keep issuing I/O requests, so that the NVMe queues are not depleted by the SSD platforms.
Host Workloadsets Microsoft,Production Server FIU IODedup
CPU/mem Xeon 16-core processor/256GB, DDR4 Workloads 24HR 24HRS BS CFS DADS DAP DDR cheetah homes webonline
Storage platform/firmware Read Ratio 0.06 0.13 0.11 0.82 0.87 0.57 0.9 0.99 0 0
Controller Xeon-phi, 12 cores by default Avg length (KB) 7.5 12.1 26.3 8.6 27.6 63.4 12.2 4 4 4
FTL/buffer hybrid, n:m=1:8, 1 GB/512 MB Randomness 0.3 0.4937 0.87 0.94 0.99 0.38 0.313 0.12 0.14 0.14
Flash 16 channels/16 pkgs per channel/1k blocks per die Workloadsets FIU SRCMap Enterprise
array 512GB(SLC),1TB(MLC),1.5TB(TLC) Workloads ikki online topgun webmail casa webresearch webusers madmax Exchange
SLC R: 25us, W: 300us, E: 2ms, Max: 1.4 MIOPS Read Ratio 0 0 0 0 0 0 0 0.002 0.24
MLC R: 53us, W: 0.9ms, E: 2.3ms, Max: 1.3 MIOPS Avg length (KB) 4 4 4 4 4 4 4 4.005 9.2
TLC R: 78us, W: 2.6ms, E: 2.3ms, Max: 1.1 MIOPS Randomness 0.39 0.17 0.14 0.21 0.65 0.11 0.14 0.08 0.84
Table 1: H/W configurations and Important workload characteristics of the workloads that we tested.
750SSD 4600SSD DeepFlash 750SSD 4600SSD DeepFlash 750SSD 4600SSD DeepFlash 750SSD 4600SSD DeepFlash
BaseDeepFlash ManyLayered BaseDeepFlash ManyLayered BaseDeepFlash ManyLayered BaseDeepFlash ManyLayered
Throughput (GB/s)

Throughput (GB/s)

Throughput (GB/s)

Throughput (GB/s)
5 5 5 5
4 4 4 4
3 3 3 3
2 2 2 2
1 1 1 1
0 0 0 0
4 8 12 16 20 24 28 32 4 8 12 16 20 24 28 32 4 8 12 16 20 24 28 32 4 8 12 16 20 24 28 32
IO request size (KB) IO request size (KB) IO request size (KB) IO request size (KB)
(a) Sequential reads. (b) Random reads. (c) Sequential writes. (d) Random writes.
Figure 11: Performance comparison.
Figure 13: Overall throughput analysis. Throughput (GB/s) of 750SSD, 4600SSD, ManyLayered, BaseDeepFlash, and DeepFlash for each server workload, grouped into the MPS, SRCMap, Enterprise, and IODedup sets.

Figure 12e shows the active core decomposition of DeepFlash. As shown in the figure, reads require 23% more flash activities than writes, since writes are accommodated in internal DRAM. In addition, NVMQ requires 1.5% more compute resources to process write requests than to process read requests, which can make write bandwidth slightly worse than read bandwidth. Note that background activities such as garbage collection and logging are not invoked during this evaluation, as we configured the emulation platform as a pristine SSD.

Server workload traces. Figure 13 illustrates the throughput of the server workloads. As shown in the figure, BaseDeepFlash exhibits 1.6, 2.7, 1.1, and 2.9 GB/s, on average, for the MPS, SRCMap, Enterprise, and IODedup workload sets, respectively, and DeepFlash improves on BaseDeepFlash by 260%, 64%, 299%, and 35%, respectively. BaseDeepFlash exhibits a performance degradation compared to ManyLayered under MPS, because MPS generates more lock contention owing to more small, random accesses than the other workloads (cf. Table 1). Interestingly, while DeepFlash outperforms the other SSD platforms in most workloads, its performance is not as good under the DDR workload (only slightly better than BaseDeepFlash). This is because FCMD utilization stays below 56% due to the address patterns of DDR. However, since all NVMQ threads parse and fetch incoming requests in parallel, DeepFlash still provides 3 GB/s even for this workload, which is 42% better than ManyLayered.

6.2 CPU, Energy and Power Analyses

CPU usage and different flashes. Figures 14a and 14b show a sensitivity analysis for bandwidth and power/energy, respectively. In this evaluation, we collect the performance results of all four microbenchmarks while employing a varying number of cores (2~19) and different flash technologies (SLC/MLC/TLC). The overall SSD bandwidth starts to saturate at 12 cores (48 hardware threads) in most cases. Since TLC flash exhibits longer latency than SLC/MLC flash, the TLC-based SSD requires more cores to reduce the firmware latency enough to reach 1MIOPS. When we increase the number of threads further, the performance gains diminish due to the overhead of exchanging many messages among thread groups. Finally, with 19 cores, SLC, MLC, and TLC achieve the maximum bandwidths that the underlying flash aggregately exposes: 5.3, 4.8, and 4.8 GB/s, respectively.

Power/Energy. Figure 14b also shows the energy breakdown of each SSD stage and the total core power. The power and energy are estimated with an instruction-level energy/power model of Xeon Phi [55]. As shown in the figure, DeepFlash with 12 cores consumes 29 W, which satisfies the power delivery capability of PCIe [51]. Note that while this power consumption is higher than that of existing SSDs (20~30 W [30, 58, 73]), power-efficient manycores [68] can be used to reduce the power of our prototype. When we break down the energy consumed by each stage, FCMD, TRANS, and NVMQ consume 42%, 21%, and 26% of the total energy, respectively, as the number of threads increases. This is because, while CACHE, LOG, ILOCK, and BGC also require computing power, most cores should be assigned to handle the large flash complex, the many queues, and frequent address translation for better scalability.
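Treating these energy shares as rough power shares, a back-of-envelope split of the reported 29 W total gives a feel for where the core budget goes. The snippet below is only an illustration of that arithmetic, not part of the Xeon Phi model [55].

/* Back-of-envelope split of the reported 29 W core power across stages,
 * using the energy shares quoted above (42% FCMD, 21% TRANS, 26% NVMQ,
 * remainder for CACHE/LOG/ILOCK/BGC). Illustrative only. */
#include <stdio.h>

int main(void)
{
    const double total_w = 29.0;
    const char  *stage[] = { "FCMD", "TRANS", "NVMQ", "others" };
    const double share[] = { 0.42,   0.21,    0.26,   0.11 };

    for (int i = 0; i < 4; i++)
        printf("%-6s ~%.1f W\n", stage[i], total_w * share[i]);
    return 0;
}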
Figure 12: Dynamics of active cores for parallel I/O processing. Panels (a)-(d) plot the number of active cores (up to 12) over 100 ms for sequential reads, random reads, sequential writes, and random writes, comparing BaseDeepFlash and DeepFlash; panel (e) decomposes DeepFlash's active cores into the Flash, FCMD, TRANS, CACHE, ILOCK, and NVMQ stages for each access pattern.
Figure 14: Resource requirement analysis. (a) Bandwidth (GB/s) with 2 to 19 cores for SLC, MLC, and TLC flash. (b) Energy breakdown (%) across the NVMQ, CACHE, ILOCK, TRANS, FCMD, LOG, and BGC stages, together with the total core power (W). (c) Minimum number of cores required to reach 1MIOPS for sequential/random reads and writes on OoO-1.2G, OoO-2.4G, and IO-1G cores.

Figure 15: Performance on different queue allocations. (a) NVMQ IOPS (MIOPS) of Static and Dynamic with 1 to 16 submission queues. (b) IOPS per NVMQ thread (K) for Static and Dynamic.

Figure 16: ILOCK and CACHE optimizations. (a) ILOCK impact: number of active NVMQ threads over time for Page-lock, ILOCK-base, ILOCK-1MB, and ILOCK-forwd. (b) CACHE IOPS with 0 to 4 CACHE threads, including the 2-Bypass configuration.

Figure 17: Background task optimizations. (a) LOG/BGC coordination: bandwidth (GB/s) over time as the NVMQ, LOG, and BGC threads interleave. (b) BGC overhead: average bandwidth (GB/s) of Pristine, FGC, and FLOG+FGC under the 24HR, 24HRS, BS, casa, ikki, madmax, and online workloads.
Different CPUs. Figure 14c compares the minimum number of cores that DeepFlash requires to achieve 1MIOPS for both reads and writes. We evaluate different CPU technologies: i) OoO-1.2G, ii) OoO-2.4G, and iii) IO-1G. While IO-1G uses the default in-order 1 GHz cores that our emulation platform employs, OoO-1.2G and OoO-2.4G employ an Intel Xeon CPU, an out-of-order execution processor [24], at 1.2 and 2.4 GHz, respectively. One can observe from the figure that the dozen cores that DeepFlash uses can be reduced to five high-frequency cores (cf. OoO-2.4G). However, due to the complicated core logic (e.g., reorder buffers), OoO-1.2G and OoO-2.4G consume 93% and 110% more power than IO-1G to achieve the same level of IOPS.

6.3 Performance Analysis of Optimization

In this analysis, we examine different design choices for the components of DeepFlash and evaluate their performance impact on our proposed SSD platform. The following experiments use the default configuration of DeepFlash.

NVMQ. Figures 15a and 15b compare the NVMQ IOPS and per-thread IOPS delivered by a non-optimized queue allocation (Static) and our DIOS (Dynamic), respectively. Dynamic achieves the bandwidth goal irrespective of the number of NVMe queues that the host manages, whereas Static requires more than 16 NVMe queues to achieve 1MIOPS (cf. Figure 15a). This implies that the host also requires more cores, since NVMe allocates a queue per host CPU core [10]. Furthermore, the per-thread IOPS of Dynamic (with 16 queues) is better than that of Static by 6.9% (cf. Figure 15b). This is because Dynamic can fully utilize all NVMQ threads when the loads of different queues are unbalanced; the NVMQ performance variation of Dynamic (between min and max) is only 12%, whereas that of Static is 48%.
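DIOS itself is described earlier in the paper; the toy sketch below only contrasts the two allocation policies compared here. All names and the queue-scan structure are hypothetical, but it shows why a static binding can leave NVMQ threads idle while a deep queue waits, whereas a dynamic, pull-from-any-queue policy keeps every thread busy.

/* Toy contrast between a static queue-to-thread binding and a dynamic
 * (work-pulling) allocation, loosely mirroring the Static vs. Dynamic
 * comparison above. Everything here is illustrative, not DeepFlash code. */
#include <stdio.h>

#define NR_SQ   16          /* NVMe submission queues managed by the host */
#define NR_NVMQ 6           /* NVMQ threads inside the SSD                */

static int depth[NR_SQ];    /* outstanding commands per queue             */

/* Static: thread t only drains the queues statically assigned to it,
 * so a deep queue cannot borrow an otherwise idle thread. */
static int drain_static(int t)
{
    int done = 0;
    for (int q = t; q < NR_SQ; q += NR_NVMQ) {
        done += depth[q];
        depth[q] = 0;
    }
    return done;
}

/* Dynamic: any thread may pull from any non-empty queue, keeping all
 * NVMQ threads busy even when queue loads are unbalanced. */
static int drain_dynamic(void)
{
    for (int q = 0; q < NR_SQ; q++)
        if (depth[q]) {
            int d = depth[q];
            depth[q] = 0;
            return d;
        }
    return 0;
}

int main(void)
{
    depth[0] = 100;                                             /* one hot queue */
    printf("static, thread 1 : %d cmds\n", drain_static(1));    /* prints 0      */
    depth[0] = 100;
    printf("dynamic, any thread: %d cmds\n", drain_dynamic());  /* prints 100    */
    return 0;
}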
ILOCK. Figure 16a compares different locking systems. Page-lock is a page-granular lock, while ILOCK-base is ILOCK without ownership forwarding. ILOCK-forwd is the scheme that DeepFlash employs. While ILOCK-base and ILOCK-forwd use the same lock granularity (256 KB), ILOCK-1MB employs a 1 MB lock range but has no forwarding. Page-lock can activate 82% more NVMQ threads than ILOCK-1MB (Figure 16a). However, due to the overheads imposed by frequent lock-node operations and RB-tree management, the average lock-inquiry latency of Page-lock is as high as 10 us, which is 11x longer than that of ILOCK-forwd. In contrast, ILOCK-forwd activates a similar number of NVMQ threads as Page-lock while exhibiting a 0.93 us average lock-inquiry latency.
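The full ILOCK design is presented earlier in the paper; as a rough sketch of the two ideas evaluated here, coarse 256 KB lock ranges and ownership hand-over on release, consider the fragment below. The table layout, field names, and the simplified single-waiter forwarding are assumptions for illustration, and the toy is not thread-safe.

/* Rough sketch of range-granular locking in the spirit of ILOCK: one
 * lock node covers a 256 KB range of the logical space instead of a
 * single page, so far fewer lock operations are needed per request.
 * The hand-over on release is a simplified stand-in for ownership
 * forwarding; thread ids are > 0 and 0 means "free". Not DeepFlash code. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SZ        4096u
#define LOCK_RANGE_SZ  (256u * 1024u)      /* 256 KB per lock node */
#define NR_RANGES      4096u               /* toy table size       */

struct range_lock {
    int owner;         /* owning thread id, 0 if free               */
    int next_waiter;   /* thread to forward ownership to, 0 if none */
};

static struct range_lock locks[NR_RANGES];

static inline uint32_t range_of(uint64_t lpn)
{
    return (uint32_t)(((lpn * PAGE_SZ) / LOCK_RANGE_SZ) % NR_RANGES);
}

/* Try to take the lock covering logical page 'lpn'. If it is busy,
 * register as the next waiter so the current owner can hand the lock
 * over directly when it releases (the "forwarding" idea). */
static bool ilock_try_acquire(uint64_t lpn, int tid)
{
    struct range_lock *l = &locks[range_of(lpn)];
    if (l->owner == 0) {
        l->owner = tid;
        return true;
    }
    l->next_waiter = tid;
    return false;
}

static void ilock_release(uint64_t lpn)
{
    struct range_lock *l = &locks[range_of(lpn)];
    l->owner = l->next_waiter;   /* forward to the waiter, or free (0) */
    l->next_waiter = 0;
}

int main(void)
{
    uint64_t lpn_a = 10, lpn_b = 20;      /* both fall into the first 256 KB range */
    printf("T1 acquires: %d\n", ilock_try_acquire(lpn_a, 1));   /* 1 */
    printf("T2 acquires: %d\n", ilock_try_acquire(lpn_b, 2));   /* 0: T2 waits    */
    ilock_release(lpn_a);                 /* ownership forwarded to T2             */
    printf("owner now: %d\n", locks[range_of(lpn_a)].owner);    /* 2 */
    return 0;
}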
CACHE. Figure 16b illustrates CACHE performance as the number of CACHE threads varies from 0 to 4. "2-Bypass" employs the bypass technique with only two threads. Overall, read performance (even with no cache) is close to 1MIOPS, thanks to the massive parallelism of the back-end stages. However, write performance with no cache is only around 0.65 MIOPS, on average. Enabling a single CACHE thread to buffer data in the SSD's internal DRAM rather than the underlying flash media increases write bandwidth by 62% compared to the no-cache system, but a single CACHE thread also reduces read bandwidth by 25%, on average, due to the communication overhead between CACHE and NVMQ on each I/O service. Even with more CACHE threads, the performance gains diminish because of this communication overhead. In contrast, DeepFlash's 2-Bypass configuration is close to ideal, as it requires fewer threads to achieve 1MIOPS.
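One plausible reading of these results is reflected in the hypothetical dispatch sketch below: writes are absorbed into internal DRAM by a CACHE thread, while reads are forwarded directly to the back-end stages so that they do not pay the CACHE-to-NVMQ messaging cost. This is an illustration of the bypass idea rather than DeepFlash's actual policy code.

/* Hypothetical dispatch sketch motivated by the results above. The stage
 * functions are stubs standing in for the CACHE and TRANS thread groups. */
#include <stdio.h>

enum req_op { REQ_READ, REQ_WRITE };
struct io_req { enum req_op op; unsigned long lba; };

static void cache_enqueue(struct io_req *r) { printf("CACHE buffers write lba=%lu\n", r->lba); }
static void trans_enqueue(struct io_req *r) { printf("TRANS handles read lba=%lu\n", r->lba); }

/* Writes are absorbed into internal DRAM by CACHE; reads bypass CACHE
 * and go straight to the back end to avoid CACHE<->NVMQ messaging. */
static void nvmq_dispatch(struct io_req *r)
{
    if (r->op == REQ_WRITE)
        cache_enqueue(r);
    else
        trans_enqueue(r);
}

int main(void)
{
    struct io_req rd = { REQ_READ, 42 }, wr = { REQ_WRITE, 7 };
    nvmq_dispatch(&rd);
    nvmq_dispatch(&wr);
    return 0;
}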

Background activities. Figure 17a shows how DeepFlash coordinates the NVMQ, LOG, and BGC threads to avoid contention on flash resources and maximize SSD performance. As shown in the figure, when NVMQ actively parses and fetches data (between 0.04 and 0.2 s), LOG stops draining data from internal DRAM to flash, since TRANS needs to access the corresponding metadata while responding to NVMQ's queue processing. Similarly, BGC suspends block reclaiming, since the data migration associated with a reclaim may cause flash-level contention and thereby interfere with NVMQ's activities. As DeepFlash minimizes the impact of LOG and BGC, the I/O bandwidth stays above 4 GB/s. Once NVMQ becomes idle, LOG and BGC reactivate their work.
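A minimal, single-threaded sketch of this yield-to-foreground coordination is shown below; the busy flag, counters, and tick structure are assumptions for illustration and are much simpler than the actual thread interaction in DeepFlash.

/* Minimal sketch of yielding background work (LOG draining, BGC block
 * reclaims) while foreground NVMQ processing is active, in the spirit
 * of Figure 17a. Names and structure are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

static bool nvmq_busy;            /* set while NVMQ parses/fetches requests */
static int  log_pending = 3;      /* buffered writes waiting to be drained  */
static int  bgc_pending = 2;      /* flash blocks waiting to be reclaimed   */

/* Background threads only make progress when the foreground is idle. */
static void background_tick(void)
{
    if (nvmq_busy) {
        printf("foreground active: LOG/BGC suspended\n");
        return;
    }
    if (log_pending)      { log_pending--; printf("LOG drains one segment\n"); }
    else if (bgc_pending) { bgc_pending--; printf("BGC reclaims one block\n"); }
}

int main(void)
{
    nvmq_busy = true;             /* e.g., the 0.04-0.2 s window in Fig. 17a */
    background_tick();
    background_tick();

    nvmq_busy = false;            /* NVMQ goes idle: background reactivates  */
    for (int i = 0; i < 5; i++)
        background_tick();
    return 0;
}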
Steady-state performance. Figure 17b shows the impact of on-demand garbage collection (FGC) and journaling (FLOG) on the performance of DeepFlash. The results are compared to the ideal performance of DeepFlash (Pristine), which has no GC or LOG activity. Compared to Pristine, FGC degrades performance by 5.4%, while FLOG+FGC decreases throughput by 8.8%, on average. The performance loss is negligible because on-demand GC blocks only the single TRANS thread that manages the reclaimed flash block, while the remaining TRANS threads keep serving I/O requests. In the meantime, LOG works in parallel with TRANS but consumes FCMD bandwidth to dump its data.
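The low overhead follows from this partitioned design: a reclaim stalls only the TRANS thread that owns the victim block. The toy below illustrates that scoping with a hypothetical modulo partitioning; it is not DeepFlash's actual block-to-thread mapping.

/* Toy illustration of why on-demand GC costs little here: the flash
 * address space is partitioned across TRANS threads, so reclaiming one
 * block only stalls the partition (thread) that owns it. */
#include <stdbool.h>
#include <stdio.h>

#define NR_TRANS 6

static bool gc_active[NR_TRANS];   /* per-partition "blocked by GC" flag */

static int  owner_of_block(unsigned block) { return (int)(block % NR_TRANS); }
static void gc_reclaim(unsigned block)     { gc_active[owner_of_block(block)] = true; }
static void gc_done(unsigned block)        { gc_active[owner_of_block(block)] = false; }

/* An I/O is delayed only if it maps to the partition under reclaim. */
static bool trans_can_serve(unsigned block)
{
    return !gc_active[owner_of_block(block)];
}

int main(void)
{
    gc_reclaim(7);                                              /* partition 1 busy */
    printf("block 8  served now? %d\n", trans_can_serve(8));    /* 1: other thread  */
    printf("block 13 served now? %d\n", trans_can_serve(13));   /* 0: same owner    */
    gc_done(7);
    return 0;
}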
7 Related Work and Discussion

OS optimizations. To achieve higher IOPS, host-level optimizations for multicore systems [8, 36, 75] have been studied. Bjørling et al. change the Linux block layer and achieve 1MIOPS on high NUMA-factor processor systems [8]. Zheng et al. redesign the file-system buffer cache to remove overhead and lock contention, achieving 1MIOPS on a 32-core NUMA machine [75]. All these systems exploit heavyweight manycore processors on the host and buffer data atop SSDs to achieve higher bandwidth.

Industry trend. To the best of our knowledge, while there are no manycore SSD studies in the literature, industry has already begun to explore manycore-based SSDs. Even though the actual devices are not publicly available on the market, several devices partially target 1MIOPS. For example, FADU is reported to offer around 1MIOPS (only for sequential reads with prefetching) and 539K IOPS for writes [20]; Samsung PM1725 offers 1MIOPS for reads and 120K IOPS for writes. Unfortunately, no information is available about the hardware and software architectures of these industry prototypes and devices. We believe that future architectures require brand-new flash firmware for scalable I/O processing to reach 1MIOPS.

Host-side FTL. LightNVM [9], including the CNEX solution [1], aims to achieve high performance (~1MIOPS) by moving the FTL to the host and optimizing the user-level and host-side software stack. However, this performance is achieved by evaluating only specific operations (such as reads or sequential accesses). In contrast, DeepFlash reconstructs the device-level software/hardware with an in-depth analysis and offers 1MIOPS for all microbenchmarks (read, write, sequential, and random) with varying I/O sizes. In addition, our solution is orthogonal to (and still necessary for) host-side optimizations.

Emulation. There is unfortunately no open hardware platform that employs multiple cores and many flash packages. For example, OpenSSD has two cores [59], and Dell/EMC's Open-channel SSD (available only to a small, verified community) also employs 4~8 NXP cores on a few flash packages [17]. Although this is an emulation study, we respected all real NVMe/ONFi protocols and timing constraints for the SSD and flash, and the functionality and performance of the flexible firmware are demonstrated on a real lightweight many-core system.

Scale-out vs. scale-up options. A set of prior work proposes to architect the SSD as a RAID0-like scale-out option. For example, Amfeltec introduces an M.2-to-PCIe carrier card, which can host four M.2 NVMe SSDs as a RAID0-like solution [5]. However, this solution only offers 340K IOPS due to its limited computing power. Recently, CircuitBlvd overcame this limitation by putting eight carrier cards into a storage box [6]. Unfortunately, this scale-out option also requires two extra E5-2690v2 CPUs (3.6 GHz, 20 cores) with seven PCIe switches, which consumes more than 450 W. In addition, these scale-out solutions suffer when serving small requests with random access patterns (less than 2 GB/s) owing to frequent interrupt handling and I/O request coordination. In contrast, DeepFlash, as an SSD scale-up solution, achieves promising random-access performance by eliminating the overhead imposed by such a RAID0 design. Moreover, compared to the scale-out options, DeepFlash employs fewer CPU cores to execute only SSD firmware, which in turn reduces power consumption.

8 Conclusion

In this work, we designed scalable flash firmware, inspired by parallel data analysis systems, which can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components within a single device. Our emulation prototype on a manycore-integrated accelerator reveals that it simultaneously processes more than 1MIOPS, while successfully hiding the long latency imposed by internal flash media.

9 Acknowledgement

The authors thank Keith Smith for shepherding their paper. This research is mainly supported by NRF 2016R1C182015312, MemRay grant (G01190170) and KAIST start-up package (G01190015). J. Zhang and M. Kwon equally contribute to the work. Myoungsoo Jung is the corresponding author.
References

[1] CNEX Labs. https://www.cnexlabs.com.
[2] Microsoft SGL Description. https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/using-scatter-gather-dma.
[3] NVM Express. http://nvmexpress.org/wp-content/uploads/NVM-Express-1_3a-20171024_ratified.pdf.
[4] Ultra-low Latency with Samsung Z-NAND SSD. http://www.samsung.com/us/labs/pdfs/collateral/Samsung_Z-NAND_Technology_Brief_v5.pdf, 2017.
[5] Squid carrier board family: PCI Express Gen 3 carrier board for 4 M.2 PCIe SSD modules. https://amfeltec.com/pci-express-gen-3-carrier-board-for-m-2-ssd/, 2018.
[6] Cinabro platform v1. https://www.circuitblvd.com/post/cinabro-platform-v1, 2019.
[7] Jasmin Ajanovic. PCI Express 3.0 overview. In Proceedings of Hot Chips: A Symposium on High Performance Chips, 2009.
[8] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. Linux block IO: Introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference, page 22. ACM, 2013.
[9] Matias Bjørling, Javier González, and Philippe Bonnet. LightNVM: The Linux Open-Channel SSD subsystem. In FAST, pages 359-374, 2017.
[10] Keith Busch. Linux NVMe driver. https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2013/20130812_PreConfD_Busch.pdf, 2013.
[11] Adrian M. Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel, Jiahua He, Arun Jagatheesan, Rajesh K. Gupta, Allan Snavely, and Steven Swanson. Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1-11. IEEE, 2010.
[12] Adrian M. Caulfield, Laura M. Grupp, and Steven Swanson. Gordon: Using flash memory to build fast, power-efficient clusters for data-intensive applications. ACM SIGPLAN Notices, 44(3):217-228, 2009.
[13] Wonil Choi, Myoungsoo Jung, Mahmut Kandemir, and Chita Das. Parallelizing garbage collection with I/O to improve flash resource utilization. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pages 243-254, 2018.
[14] Wonil Choi, Jie Zhang, Shuwen Gao, Jaesoo Lee, Myoungsoo Jung, and Mahmut Kandemir. An in-depth study of next generation interface for emerging non-volatile memories. In Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2016 5th, pages 1-6. IEEE, 2016.
[15] CNET. Samsung 850 Pro SSD review. https://www.cnet.com/products/samsung-ssd-850-pro/, 2015.
[16] Danny Cobb and Amber Huffman. NVM Express and the PCI Express SSD revolution. In Intel Developer Forum. Santa Clara, CA, USA: Intel, 2012.
[17] Jae Do. SoftFlash: Programmable storage in future data centers. https://www.snia.org/sites/default/files/SDC/2017/presentations/Storage_Architecture/Do_Jae_Young_SoftFlash_Programmable_Storage_in_Future_Data_Centers.pdf, 2017.
[18] Alejandro Duran and Michael Klemm. The Intel many integrated core architecture. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 365-366. IEEE, 2012.
[19] FreeBSD. FreeBSD manual pages: flock. https://www.freebsd.org/cgi/man.cgi?query=flock&sektion=2, 2011.
[20] Anthony Garreffa. FADU unveils world's fastest SSD, capable of 5GB/sec. http://tiny.cc/eyzdcz, 2016.
[21] Arthur Griffith. GCC: The Complete Reference. McGraw-Hill, Inc., 2002.
[22] Laura M. Grupp, John D. Davis, and Steven Swanson. The bleak future of NAND flash memory. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, pages 2-2. USENIX Association, 2012.
[23] Amber Huffman. NVM Express, revision 1.0c. Intel Corporation, 2012.
[24] Intel. Intel Xeon Processor E5-2620 v3. http://tiny.cc/a1zdcz, 2014.
[25] Intel. Intel SSD 750 series. http://tiny.cc/qyzdcz, 2015.
[26] Intel. Intel SSD DC P4600 series. http://tiny.cc/dzzdcz, 2018.
[27] Xabier Iturbe, Balaji Venu, Emre Ozer, and Shidhartha Das. A triple core lock-step (TCLS) ARM Cortex-R5 processor for safety-critical and ultra-reliable applications. In Dependable Systems and Networks Workshop, 2016 46th Annual IEEE/IFIP International Conference on, pages 246-249. IEEE, 2016.
[28] James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High-Performance Programming. Newnes, 2013.
[29] Jaeyong Jeong, Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim. Lifetime improvement of NAND flash-based storage systems using dynamic program and erase scaling. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), pages 61-74, 2014.
[30] Myoungsoo Jung. Exploring design challenges in getting solid state drives closer to CPU. IEEE Transactions on Computers, 65(4):1103-1115, 2016.
[31] Myoungsoo Jung, Wonil Choi, Shekhar Srikantaiah, Joonhyuk Yoo, and Mahmut T. Kandemir. HIOS: A host interface I/O scheduler for solid state disks. ACM SIGARCH Computer Architecture News, 42(3):289-300, 2014.
[32] Myoungsoo Jung and Mahmut Kandemir. Revisiting widely held SSD expectations and rethinking system-level implications. In ACM SIGMETRICS Performance Evaluation Review, volume 41, pages 203-216. ACM, 2013.
[33] Myoungsoo Jung and Mahmut T. Kandemir. Sprinkler: Maximizing resource utilization in many-chip solid state disks. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 524-535. IEEE, 2014.
[34] Myoungsoo Jung, Ellis H. Wilson III, and Mahmut Kandemir. Physically addressed queueing (PAQ): Improving parallelism in solid state disks. In ACM SIGARCH Computer Architecture News, volume 40, pages 404-415. IEEE Computer Society, 2012.
[35] Swaroop Kavalanekar, Bruce Worthington, Qi Zhang, and Vishal Sharda. Characterization of storage workload traces from production Windows servers. In IISWC, 2008.
[36] Byungseok Kim, Jaeho Kim, and Sam H. Noh. Managing array of SSDs when the storage device is no longer the performance bottleneck. In 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17), 2017.
[37] Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu. Revisiting storage for smartphones. ACM Transactions on Storage (TOS), 8(4):14, 2012.
[38] Nathan Kirsch. Phison E12 high-performance SSD controller. http://tiny.cc/91zdcz, 2018.
[39] Sungjoon Koh, Junhyeok Jang, Changrim Lee, Miryeong Kwon, Jie Zhang, and Myoungsoo Jung. Faster than flash: An in-depth study of system challenges for emerging ultra-low latency SSDs. arXiv preprint arXiv:1912.06998, 2019.
[40] Ricardo Koller et al. I/O deduplication: Utilizing content similarity to improve I/O performance. TOS, 2010.
[41] Linux. Mandatory file locking for the Linux operating system. https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt, 2007.
[42] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. ACM Transactions on Storage (TOS), 13(1):5, 2017.
[43] Marvell. Marvell 88SS1093 flash memory controller. https://www.marvell.com/storage/assets/Marvell-88SS1093-0307-2017.pdf, 2017.
[44] Micron. MT29F2G08AABWP/MT29F2G16AABWP NAND flash datasheet. 2004.
[45] Micron. MT29F256G08CJAAA/MT29F256G08CJAAB NAND flash datasheet. 2008.
[46] Micron. MT29F1HT08EMCBBJ4-37:B/MT29F1HT08EMHBBJ4-3R:B NAND flash datasheet. 2016.
[47] Yongseok Oh, Eunjae Lee, Choulseung Hyun, Jongmoo Choi, Donghee Lee, and Sam H. Noh. Enabling cost-effective flash based caching with an array of commodity SSDs. In Proceedings of the 16th Annual Middleware Conference, pages 63-74. ACM, 2015.
[48] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: Software-defined flash for web-scale internet storage systems. ACM SIGPLAN Notices, 49(4):471-484, 2014.
[49] Seon-yeong Park, Euiseong Seo, Ji-Yong Shin, Seungryoul Maeng, and Joonwon Lee. Exploiting internal parallelism of flash-based SSDs. IEEE Computer Architecture Letters, 9(1):9-12, 2010.
[50] Chris Ramseyer. Seagate SandForce SF3500 client SSD controller detailed. http://tiny.cc/f2zdcz, 2015.
[51] Tim Schiesser. Correction: PCIe 4.0 won't support up to 300 watts of slot power. http://tiny.cc/52zdcz, 2017.
[52] Windows SDK. LockFileEx function. https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex, 2018.
[53] Hynix Semiconductor et al. Open NAND flash interface specification. Technical Report ONFI, 2006.
[54] Narges Shahidi, Mahmut T. Kandemir, Mohammad Arjomand, Chita R. Das, Myoungsoo Jung, and Anand Sivasubramaniam. Exploring the potentials of parallel garbage collection in SSDs for enterprise storage systems. In SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 561-572. IEEE, 2016.
[55] Yakun Sophia Shao and David Brooks. Energy characterization and instruction-level energy model of Intel's Xeon Phi processor. In International Symposium on Low Power Electronics and Design (ISLPED), pages 389-394. IEEE, 2013.
[56] Mustafa M. Shihab, Jie Zhang, Myoungsoo Jung, and Mahmut Kandemir. ReveNAND: A fast-drift-aware resilient 3D NAND flash design. ACM Transactions on Architecture and Code Optimization (TACO), 15(2):1-26, 2018.
[57] Ji-Yong Shin, Zeng-Lin Xia, Ning-Yi Xu, Rui Gao, Xiong-Fei Cai, Seungryoul Maeng, and Feng-Hsiung Hsu. FTL design exploration in reconfigurable high-performance SSD for server applications. In Proceedings of the 23rd International Conference on Supercomputing, pages 338-349. ACM, 2009.
[58] S. Shin and D. Shin. Power analysis for flash memory SSD. Workshop for Operating System Support for Non-Volatile RAM (NVRAMOS 2010 Spring), Jeju, Korea, April 2010.
[59] Yong Ho Song, Sanghyuk Jung, Sang-Won Lee, and Jin-Soo Kim. Cosmos OpenSSD: A PCIe-based open source SSD platform. Proc. Flash Memory Summit, 2014.
[60] Wei Tan, Liana Fong, and Yanbin Liu. Effectiveness assessment of solid-state drive used in big data services. In Web Services (ICWS), 2014 IEEE International Conference on, pages 393-400. IEEE, 2014.
[61] Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata Ghose, and Onur Mutlu. MQSim: A framework for enabling realistic studies of modern multi-queue SSD devices. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 49-66, 2018.
[62] Linus Torvalds. Linux kernel repo. https://github.com/torvalds/linux, 2017.
[63] Akshat Verma, Ricardo Koller, Luis Useche, and Raju Rangaswami. SRCMap: Energy proportional storage using dynamic consolidation. In FAST, volume 10, pages 267-280, 2010.
[64] Shunzhuo Wang, Fei Wu, Zhonghai Lu, You Zhou, Qin Xiong, Meng Zhang, and Changsheng Xie. Lifetime adaptive ECC in NAND flash page management. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pages 1253-1556. IEEE, 2017.
[65] Qingsong Wei, Bozhao Gong, Suraj Pathak, Bharadwaj Veeravalli, LingFang Zeng, and Kanzo Okada. WAFTL: A workload adaptive flash translation layer with data partition. In Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th Symposium on, pages 1-12. IEEE, 2011.
[66] Zev Weiss, Sriram Subramanian, Swaminathan Sundararaman, Nisha Talagala, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. ANViL: Advanced virtualization for modern non-volatile memory devices. In FAST, pages 111-118, 2015.
[67] Matt Welsh, David Culler, and Eric Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In ACM SIGOPS Operating Systems Review, volume 35, pages 230-243. ACM, 2001.
[68] Norbert Werner, Guillermo Payá-Vayá, and Holger Blume. Case study: Using the Xtensa LX4 configurable processor for hearing aid applications. Proceedings of the ICT.OPEN, 2013.
[69] ONFI Workgroup. Open NAND flash interface specification revision 3.0. ONFI Workgroup, 2011.
[70] Guanying Wu and Xubin He. Delta-FTL: Improving SSD lifetime via exploiting content locality. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 253-266. ACM, 2012.
[71] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference, page 6. ACM, 2015.
[72] Jie Zhang, Gieseo Park, Mustafa M. Shihab, David Donofrio, John Shalf, and Myoungsoo Jung. OpenNVM: An open-sourced FPGA-based NVM controller for low level memory characterization. In 2015 33rd IEEE International Conference on Computer Design (ICCD), pages 666-673. IEEE, 2015.
[73] Jie Zhang, Mustafa Shihab, and Myoungsoo Jung. Power, energy, and thermal considerations in SSD-based I/O acceleration. In HotStorage, 2014.
[74] Yiying Zhang, Gokul Soundararajan, Mark W. Storer, Lakshmi N. Bairavasundaram, Sethuraman Subbiah, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Warming up storage-level caches with Bonfire. In FAST, pages 59-72, 2013.
[75] Da Zheng, Randal Burns, and Alexander S. Szalay. Toward millions of file system IOPS on low-cost, commodity hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 69. ACM, 2013.
[76] You Zhou, Fei Wu, Ping Huang, Xubin He, Changsheng Xie, and Jian Zhou. An efficient page-level FTL to optimize address translation in flash memory. In Proceedings of the Tenth European Conference on Computer Systems, page 12. ACM, 2015.
