Storage Abstractions for SSDs: The Past, Present, and Future
Authors’ Contact Information: Xiangqun Zhang, Syracuse University, Syracuse, New York, United States; e-mail: xzhang84@syr.edu; Janki
Bhimani, Florida International University, Miami, Florida, United States; e-mail: jbhimani@fiu.edu; Shuyi Pei, Samsung Semiconductor Inc,
San Jose, California, United States; e-mail: shuyi.pei@samsung.com; Eunji Lee, Soongsil University, Seoul, Korea (the Republic of); e-mail:
ejlee@ssu.ac.kr; Sungjin Lee, DGIST, Daegu, Korea (the Republic of); e-mail: sungjin.lee@dgist.ac.kr; Yoon Jae Seong, FADU Inc., Seoul, Korea
(the Republic of); e-mail: yjseong@fadutec.com; Eui Jin Kim, FADU Inc., Seoul, Korea (the Republic of); e-mail: euijin.kim@fadutec.com;
Changho Choi, Samsung Semiconductor Inc, San Jose, California, United States; e-mail: changho.c@samsung.com; Eyee Hyun Nam, FADU
Inc., Seoul, Korea (the Republic of); e-mail: ehnam@fadutec.com; Jongmoo Choi, Dankook University, Yongin, Gyeonggi, Korea (the Republic
of); e-mail: choijm@dankook.ac.kr; Bryan S. Kim, Syracuse University, Syracuse, New York, United States; e-mail: bkim01@syr.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2024 Copyright held by the owner/author(s).
ACM 1553-3093/2024/12-ART
https://doi.org/10.1145/3708992
1 Introduction
Modern computers use abstractions for interoperability between the host and its peripherals[25]. Storage, as
the basis of the memory hierarchy, also requires an abstraction layer for host-storage communications. This
abstraction layer allows the host to use storage devices without regard to their underlying implementation. Over
the past decades, the fundamental design of the storage abstraction has remained relatively consistent: the device
exposes an address range, and the host system issues read/write requests within that range. Subsequently, the
storage device executes the requests and returns the results to the host. This simple interface between the
storage device and the host system is sufficient for data storage purposes. It aged well from the floppy disk to the
hard drive and was finally passed down to the SSD.
However, new developments and findings in the field of storage devices demand enhancements to the existing
storage abstraction. SSDs are fundamentally different from magnetic media like floppies and hard disk drives.
First, by using NAND flash, SSDs achieve high throughput by eliminating moving parts and exploiting internal
parallelism. However, NAND flash does not support in-place updates, which means that garbage collection is
required. Although the existing storage abstraction works on SSDs, it does not provide a method to reduce garbage
collection overhead through host-SSD coordination. Second, the increasing speed of SSDs has shifted the latency
source in the storage stack. The storage device no longer dominates the latency; instead, the host also contributes
about half of the total latency[170]. The internal bandwidth of the SSD is also larger than the external bandwidth
between the host and the SSD[136], which means that data transfer between the host and the SSD poses a
major limitation on utilizing SSDs to their full potential. Last but not least, the recent emergence of Compute
Express Link (CXL) allows researchers to reimagine integrating SSDs into the memory hierarchy by directly
using SSDs as part of the main memory. However, the main memory is byte-addressable, while the traditional
storage abstraction sees secondary storage in units of sectors[25]. Together, these three points warrant new
enhancements to the existing storage abstraction.
Thankfully, both academia and industry are well aware of the limitations of the traditional storage abstraction.
Many new features, enhancements, and host-device co-designs have been created to address the shortcomings of
the existing storage abstraction. In this paper, we survey the SSD device abstraction and its enhancements, exploring
their backgrounds and the designs that enable SSDs to better communicate with the host. We will also examine their
relationships, applications, standardization efforts, and fluctuations in popularity over time. We categorize these
abstraction enhancements into four categories:
(1) extending the block abstraction with host-SSD hints/directives. This includes TRIM, Multi-stream, and Flexible
Data Placement (FDP);
(2) enhancing host-level control over SSDs. This includes Open-channel (OC) SSD and Zoned Namespaces
(ZNS) SSD;
(3) offloading host-level management to SSDs. This includes Key-value (KV) SSD and computational storage;
(4) making SSDs byte-addressable, along with their use in CXL,
which we present along with their standardization history in Figure 1. Lastly, we will conclude our paper with
our view on the possible future directions for extensions to storage abstractions.
Our main contributions in this survey include:
(1) A comprehensive overview of enhancements to the traditional storage abstraction.
(2) Categorization of different enhancements based on their characteristics.
(3) Visualization of the relationship between different enhancements and their related work.
(4) Identification of directions and challenges for current and future enhancements to storage abstractions.
The rest of this survey is organized as follows: Section 2 provides a background on SSDs and storage abstractions.
Section 3 through Section 6 discuss storage abstraction enhancements, including those that provide extra
hints/directives to the SSD (§3), move SSD responsibilities to the host (§4) and vice versa (§5), and make SSDs
byte-addressable (§6). Section 7 is our view on the future development of SSD storage abstractions, and Section 8
concludes this survey.
Fig. 1. The genealogy tree. Papers published in the same year are placed on the same row. Different colors indicate different
categories (gray indicates no specific category); ➉ indicates the root of a given category. A detailed version showing every
paper as an independent node can be found in the appendix as Figure 12.
Listing 1. The evolution of the bio layer, from the initial version in Kernel v2.5.1 in 2001 (left) to Kernel v5.15 in 2021
(right) with discard, Multi-stream, and Zoned Namespaces support. More operations and abstraction enhancements, as
defined by BIO_* and REQ_OP_*, have been supported over the years.
Storage devices come in different shapes and sizes, ranging from legacy tapes and floppy disks to contemporary
hard disk drives and solid state drives, each characterized by its distinct underlying mechanisms. Despite this
diversity, most mass storage devices expose their storage capacity to the operating system and support two
primary operations: read and write. The operating system should be able to execute read and write requests
within the specified address space regardless of the actual type and implementation of the device. Linux addresses
this requirement through the implementation of the bio layer. The bio layer serves as an interface that bridges
the gap between the operating system and various types of mass storage devices. By providing a consistent
interface, irrespective of the device's type or implementation, the bio layer ensures interoperability[42]. This
enables host applications to issue I/O requests independent of the storage device's implementation details. Despite
the overarching purpose of the bio layer, it is crucial for the operating system to acknowledge the inherent
diversity among storage devices in terms of their physical attributes and internal mechanisms. Consequently, the
bio layer must exhibit adaptability to cater to the unique specifications of different devices.
To take advantage of all the features that a storage device provides, the bio layer has been extended with
additional operations and fields tailored to numerous features of different storage devices. Listing 1 shows the
evolution of the bio layer over the years, with the initial version of the bio layer released in 2001 on the left and
a later version released in 2021 on the right. The initial bio layer was designed with two operations only: read
and write. However, to pursue improved I/O performance for a variety of devices, the bio layer has undergone
many enhancements. Notable additions include the discard operation (also known as TRIM)[143], introduced to
the bio layer to provide an optional hint for improved I/O performance[26]. Other enhancements, including
Multi-stream and Zoned Namespaces, were also added to the bio layer[9, 27]. The bio layer also underwent a
major redesign toward a multi-queue design named blk-mq, which scales with the number of host CPU cores and
addresses the performance bottleneck at the bio layer caused by the higher performance of storage devices in
the era of SSDs[37]. We present an overview of storage enhancement changes to the Linux bio layer
using yellow boxes in Figure 1. In summary, the bio layer has functioned as the fundamental interface for all mass
storage devices in the Linux kernel since v2.5.1, offering both a universal and flexible framework that adapts to the
general and distinct characteristics of different storage devices, enabling enhancements to storage abstractions.
[Figure: Internal organization of an SSD, showing the controller, four channels, eight flash chips, and superblocks formed from blocks spanning all chips.]
In summary, SSDs dominate the current storage market thanks to the performance they achieve by leveraging multiple
NAND flash chips. However, NAND flash is also a double-edged sword: because it does not support in-place updates,
data relocation and block erases are required during the garbage collection process. To overcome
the performance loss due to garbage collection, the host and the SSD should communicate and coordinate to
reduce the overhead caused by garbage collection.
3 Host-SSD Hints/Directives
In this section, we focus on the hints and directives provided to the SSD by the host. We define a hint as an
optional data entry field that the SSD can utilize or ignore. The SSD can use the given hints for potentially
better performance, but it can also safely ignore them if it does not know how to make use of them. For example,
an SSD that supports the TRIM hint can sometimes ignore it if the given range is too small[147, 150]. On the
other hand, we define a directive as a command that the SSD should follow. The SSD is expected to perform the
operation by utilizing the information sent by the host, as every existing feature built upon directives is designed
to follow the orders given by the directives[19, 86, 120].¹ To summarize, hints are more optional than directives:
a device can safely ignore hints but is expected to follow directives to the maximum extent.
¹Developers may sometimes refer to the directive values to be passed (e.g., Multi-stream stream IDs) as "write hints"[105, 110]. We will also
call these values "write hints" or "lifetime hints" in this paper to avoid any confusion.
3.1 Discard/TRIM
First proposed in the paper by Sivathanu et al.[143], the design of the TRIM operation is simple but effective: the
host system tells the SSD, using the data set management command[17, 121], which logical addresses now
contain invalid data due to file deletion. This essentially reduces the number of valid pages relocated inside the SSD
during the garbage collection process. Otherwise, the SSD will not know that a logical address range contains deleted
data and will relocate the corresponding physical pages during garbage collection, causing extra GC overhead[1].
Since the TRIM command indicates the invalidity of the data, it is also used to securely erase data by physically
removing the data from the SSD immediately after the data is invalidated[55].
Unlike most other categories in this survey, there are very limited design choices and extensions for TRIM; one
example, though, is choosing the frequency at which TRIM requests are sent to limit the number of I/O requests[32].
Only a few prior works focused on TRIM policies and their implications[73, 83, 93, 104]. Nevertheless, the effectiveness
and (relative) simplicity of TRIM[86] encouraged most modern operating systems and storage protocols to adopt it.
This includes Microsoft Windows since Windows 7[122], Linux since kernel version 2.6.28-rc1[26],
macOS since Lion[144], SATA since 2007[1], and NVMe since version 1.0[3].
Despite the simplicity of TRIM, it took years to improve TRIM policies in the Linux kernel. The most intuitive
time to issue TRIM requests is right after the filesystem learns that a location has been freed up due to deletion or
overwrite. This is called online TRIM (also known as synchronous TRIM[50, 147]). However, early versions of the
SATA protocol do not support queued TRIM requests. To issue a TRIM request, the operating system has to ensure
that the current I/O queue is empty. This can cause significant performance degradation since TRIM is blocked by,
and can also block, other I/O requests[45, 50, 150]. To minimize the performance impact caused by TRIM, the
operating system needs to perform TRIM when there are no outstanding I/O requests in the I/O queue. One approach
is to keep track of all discardable segments on the host side and send a batch of TRIM requests when there
are no outstanding I/O requests (e.g., once a week, when the system is idle). This approach is called batched
TRIM[52, 92, 114]. Due to its simplicity, it is widely supported by different filesystems like ext3/4[51] and F2FS[92].
However, batched TRIM is not real-time, as the user is responsible for choosing an adequate frequency (e.g.,
once per week). If the SSD is near full, TRIM helps SSD performance since it frees more internal space for data
relocation after garbage collection, which reduces the possibility of triggering GC after every write[62, 73].
If the user is unaware that the next batched TRIM is scheduled far in the future while the SSD is near full due to
excessive use, then the efficiency of the SSD can suffer due to excessive GCs. With the introduction of
queueable TRIM in SATA 3.1[139] and later in NVMe[17], TRIM requests can be placed in queues without
blocking or being blocked by other I/O requests. This means online TRIM is finally becoming practical, and the
operating system can issue online TRIM requests to SSDs without blocking other I/O requests.
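To make the batched approach concrete, the sketch below shows how a host-side maintenance task might trigger a filesystem-level batched TRIM on Linux. It uses the FITRIM ioctl (the same interface used by the fstrim utility); the mount point path and the minimum extent length are only illustrative.

#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   // FITRIM, struct fstrim_range

int main() {
    // Open the mount point of the filesystem to be trimmed (illustrative path).
    int fd = open("/mnt/data", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // Ask the filesystem to discard all free extents of at least 1 MiB.
    struct fstrim_range range {};
    range.start  = 0;                // start of the byte range to consider
    range.len    = UINT64_MAX;       // cover the whole filesystem
    range.minlen = 1 << 20;          // skip free extents smaller than 1 MiB

    if (ioctl(fd, FITRIM, &range) < 0) {
        perror("FITRIM");            // e.g., filesystem or device lacks discard support
    } else {
        // On success, range.len is updated to the number of bytes trimmed.
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    }
    close(fd);
    return 0;
}

Running such a call periodically (e.g., from a weekly timer, as fstrim-based setups commonly do) implements batched TRIM without placing discard requests on the hot I/O path.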
Although online TRIM provides real-time information to the SSD about the freed space, it can severely
impact SSD performance since handling TRIM requests takes a long time[62, 84]. To mitigate this issue, some
Linux filesystems now support asynchronous TRIM, so that TRIM requests are issued to SSDs only after there
are enough ranges of discardable space[145]. Unlike batched TRIM, asynchronous TRIM is more flexible because it
is designed to send TRIM requests more frequently[146]. It also prevents excessive, fragmented TRIM requests that
can sometimes be ignored by SSDs[145] or cause SSD performance degradation[147]. With its balance between
TRIM frequency and efficiency, asynchronous TRIM has recently become the default TRIM choice for filesystems
like Btrfs[147], marking the most recent improvement to TRIM.
3.2 Multi-stream
3.2.1 From Manual to Automatic: Multi-stream[86] is a directive that allows hosts to inform the SSD of the physical
placement preference of different data. It has been observed that data with similar characteristics may be
invalidated at the same time; by grouping data with similar characteristics into the same superblock, data in the
same superblock tends to be bimodal, i.e., all valid or all invalid. If all picked garbage collection victims
contain mostly or entirely invalid data, the write amplification can be kept low. Data tagged with different
stream IDs based on host-calculated write hints[9, 110] will be written to different superblocks, as shown in
Figure 3. To summarize, the host calculates write hints based on the data characteristics and uses these write
hints as stream ID directives when writing the data to the SSD[13]. When the superblock associated with a
stream ID is full, another free superblock will be assigned to that stream ID to ensure different superblocks will
not mix data with different stream IDs. Multi-stream thus provides the bridge for the host to communicate to
the SSD the data characteristics known by the host. A series of works have spun off from the original
Multi-stream paper, providing different enhancements, and Multi-stream was ratified into NVMe 1.3[10] and
supported by the Linux kernel since v4.13-rc1[9]. Figure 4 shows the relationship between different Multi-stream-related
papers. Table 1 summarizes the different Multi-stream-related papers along with some of their most important
characteristics.
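As a concrete illustration of how a host application passes such a lifetime hint down the stack, the sketch below uses the Linux fcntl interface introduced alongside Multi-stream support (F_SET_RW_HINT with the RWH_WRITE_LIFE_* values). The file path is illustrative, and whether the hint ultimately reaches the device as a stream ID depends on the kernel version and device support.

// Requires Linux >= 4.13 and a libc that exposes F_SET_RW_HINT / RWH_WRITE_LIFE_*.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    // A write-ahead log is short-lived compared to, e.g., bottom-level SSTables.
    int fd = open("/var/lib/mydb/wal.log", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    // Tell the kernel (and, if supported, the SSD) that data written through
    // this file descriptor is expected to have a short lifetime.
    uint64_t hint = RWH_WRITE_LIFE_SHORT;
    if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
        perror("F_SET_RW_HINT");    // older kernels or unsupported devices

    const char entry[] = "log entry\n";
    write(fd, entry, sizeof(entry) - 1);
    close(fd);
    return 0;
}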
One critical shortcoming of the original Multi-stream paper is that it requires manual assignment of the stream
IDs. If an application would like to exploit the benefit provided by Multi-stream, it has to be rewritten so that
a stream ID can be assigned to each write request. For example, an update to RocksDB was necessary to fully
exploit the benefits provided by Multi-stream[110]. The requirement of manually assigning stream IDs limited
the adoption of Multi-stream. To mitigate this problem, Yang et al. proposed AutoStream[158] for automatic
assignment of stream IDs. It is the first paper to enable automatic assignment of stream IDs, paving the way for
subsequent papers in this field. AutoStream is implemented at the device driver level since some applications
may bypass the block I/O layer, and the SSD may have limited computational resources for stream ID calculation.
AutoStream provides two different algorithms: multi-queue (MQ) and sequentiality, frequency, and recency (SFR).
For both methods, the address space is split into chunks of a given size to reduce the overhead caused by
AutoStream. The MQ method has multiple queues corresponding to different hotness levels on a logarithmic scale. All
chunks initially have a hotness of 1, and will eventually be promoted to queues of higher hotness if they receive
enough accesses in a certain period. After each promotion, the head chunk of each queue will be checked to see
if it should be demoted. If a certain number of accesses were made to other chunks but not to the queue head
chunk, the head chunk will be demoted to the lower queue.
The SFR method is based on sequentiality, frequency, and recency. If a write request starts from the end
address of the previous write request, it will use the same stream ID as the previous write request. In
other cases, the recency weight for the chunk is calculated as 2^((time since previous access)/(decay period)), where the
decay period is a user-controlled variable, and the new access count of the chunk is obtained by scaling the old
count with the inverse of the recency weight. The stream ID is then calculated using the logarithm of the new
access count. FIOS[34] later extended this method using PCC to calculate the correlations between different chunks.
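A minimal sketch of the SFR bookkeeping described above, written against the reconstruction given here (a recency weight of 2^(Δt/decay period) whose inverse decays the per-chunk access count). The chunk size, decay period, and the final mapping from count to stream ID are illustrative assumptions, not AutoStream's actual parameters or code.

#include <cmath>
#include <cstdint>
#include <unordered_map>

struct ChunkState {
    uint64_t last_end_lba = 0;      // end LBA of the previous write to this chunk
    double   last_access_time = 0;  // timestamp of the previous write
    double   access_count = 0;      // decayed access count (hotness)
    int      stream_id = 0;         // last assigned stream ID
};

// Hypothetical, user-tunable parameters.
constexpr uint64_t kChunkSize   = 1ull << 20;  // 1 MiB chunks
constexpr double   kDecayPeriod = 60.0;        // seconds

int AssignStreamSFR(std::unordered_map<uint64_t, ChunkState>& chunks,
                    uint64_t lba, uint64_t len, double now) {
    ChunkState& c = chunks[lba / kChunkSize];

    // Sequentiality: a write continuing the previous one keeps the same stream.
    if (lba == c.last_end_lba) {
        c.last_end_lba = lba + len;
        c.last_access_time = now;
        return c.stream_id;
    }

    // Recency: the weight 2^(dt / decay_period) grows with staleness, so its
    // inverse halves the count for every decay period that has elapsed. The
    // new access is then counted; the paper's exact update may differ.
    double dt = now - c.last_access_time;
    double recency_weight = std::pow(2.0, dt / kDecayPeriod);
    c.access_count = c.access_count * (1.0 / recency_weight) + 1.0;

    // Frequency: map the logarithm of the decayed count to a stream ID.
    c.stream_id = static_cast<int>(std::log2(c.access_count + 1.0));
    c.last_end_lba = lba + len;
    c.last_access_time = now;
    return c.stream_id;
}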
3.2.2 Finding the Layer in Charge: AutoStream rekindled the research interest in Multi-stream in academia
after three years without any published paper on the topic. It also marks the shift from manual stream ID
assignment to automatic assignment, which reduces the development overhead for application developers. Several approaches
were created in different layers to automate the stream ID assignment process; possible layers to implement
stream assignment include the filesystem[134], the device driver[158], the runtime[96], and the SSD itself[153].
FStream[134] provides automatic stream ID assignment by separating different filesystem metadata. For
example, ext4 has journal, inode, and other miscellaneous information. FStream separates those metadata
into different stream IDs, with the ability to assign distinct streams to files with specific names or extensions.
FileStream[168] inherited the design choice of working at the file level. Unlike most of the work in this category,
which attaches a stream ID to each single write request, FileStream calculates the stream ID based on the file. It
attaches related information to the file inode, which is stored in the VFS layer. Files with the same parent path
and file extension are considered the same type of files. The FileStream mapper aims to reduce the mixing of
different file types and the lifetime differences of different files, and the remapper groups files with similar
characteristics using the K-means++ algorithm into a limited number of stream IDs based on the mapper results.
vStream[164] is the only work in this category after the original Multi-stream paper that requires manual stream ID
assignment. However, it proposed a new concept called virtual streams so that writes from different
sources can have their own streams. Multi-stream SSDs only allow a limited number of concurrent streams,
which is not sufficient for a large number of tenants. Different Multi-stream SSDs may provide different numbers
of streams, which requires developer attention if the stream IDs are assigned by the application. The problem
intensifies when different applications use the same stream ID for different purposes, which renders streams
useless. By providing a large number of virtual streams (i.e., 2^16 − 1 in vStream), the problem can be mitigated
since there are enough (virtual) streams for different applications and different purposes. A remapper in the SSD
maps the virtual streams onto a limited number of physical streams using the K-means algorithm, grouping
virtual streams with similar characteristics into the same physical streams. This marks the first of several works in
this category that use K-means to cluster entities (e.g., files, streams) into a limited number of streams.
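The sketch below illustrates the remapping idea with a simple one-dimensional k-means over a per-virtual-stream feature (here, an estimated mean data lifetime). vStream's actual remapper, feature set, and update policy are more involved, so treat this purely as an illustration of collapsing many virtual streams into a few physical streams.

#include <algorithm>
#include <cmath>
#include <vector>

// Map each virtual stream, described by one feature (e.g., estimated mean data
// lifetime), to one of k physical streams using a basic 1-D k-means.
std::vector<int> RemapStreams(const std::vector<double>& lifetime, int k, int iters = 20) {
    const int n = static_cast<int>(lifetime.size());
    std::vector<double> centroid(k);
    std::vector<int> assign(n, 0);

    // Initialize centroids by spreading them over the observed value range.
    double lo = lifetime[0], hi = lifetime[0];
    for (double v : lifetime) { lo = std::min(lo, v); hi = std::max(hi, v); }
    for (int j = 0; j < k; ++j) centroid[j] = lo + (hi - lo) * j / std::max(1, k - 1);

    for (int it = 0; it < iters; ++it) {
        // Assignment step: each virtual stream joins the nearest centroid.
        for (int i = 0; i < n; ++i) {
            double best = std::abs(lifetime[i] - centroid[0]); assign[i] = 0;
            for (int j = 1; j < k; ++j) {
                double d = std::abs(lifetime[i] - centroid[j]);
                if (d < best) { best = d; assign[i] = j; }
            }
        }
        // Update step: move each centroid to the mean of its members.
        std::vector<double> sum(k, 0.0); std::vector<int> cnt(k, 0);
        for (int i = 0; i < n; ++i) { sum[assign[i]] += lifetime[i]; ++cnt[assign[i]]; }
        for (int j = 0; j < k; ++j) if (cnt[j]) centroid[j] = sum[j] / cnt[j];
    }
    return assign;  // assign[v] = physical stream for virtual stream v
}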
Both WARCIP[159] and DStream[111] later proposed having a variable number of clusters when using K-means.
WARCIP separates the address space into chunks, similar to AutoStream. It clusters chunks with similar lifetimes
into the same cluster using K-means, where each cluster corresponds to a stream. However, the number of clusters
can change depending on demand. If a cluster becomes too busy, i.e., too many requests have been put into a
single cluster, the cluster will be split into two. In contrast, WARCIP may merge two clusters if a cluster
does not receive enough write requests in a period. The SSD will also provide feedback to the host if the host
falsely clusters long-lived data into clusters intended for short-lived data. Together, these features ensure that
each cluster minimizes the write interval of the different write requests in each stream for a better WAF. DStream,
on the other hand, shows that the SSD's internal metadata writes may increase if there are too many streams. It
counts the number of updates for each logical page, which is used as the measure of hotness; when using
K-means to group pages into clusters, it may combine the two closest clusters into one if incoming
data has a farther distance than the distance between the two clusters, and vice versa.
PCStream[95, 96] uses program context (PC) instead of block addresses or file information when assigning
stream IDs. The program context is defined as the call stack when a write request is issued. By knowing the PC,
one can identify which series of function calls ultimately caused the write request, which tells the origin of the
data. For applications written in Java, the PC lies in the JVM, which the authors have modified to support PCStream
for Java applications. Since data from the same origin can be considered to have the same characteristics, it makes
sense to put data from the same origin into the same stream. PCStream then uses K-means to cluster multiple
different PCs with similar characteristics into the same stream, since the number of stream IDs is limited.
DTR-FTL[153] is a scheme implemented in the FTL layer, which means that the host does not directly send
stream IDs to the SSD. Its two components, lifetime-rating addressing and a time-aware garbage collector, work
inside the SSD and decide which erase unit a write request should be assigned to. The lifetime-rating addressing
strategy assigns hot data to superblocks with higher Program/Erase (P/E) cycles, since blocks with higher P/E
cycles have a shorter data retention time, while cold data will stay on the SSD for a longer time. On the other hand,
the time-aware garbage collector interpolates the expected number of valid pages left in a superblock. Let t_half
denote the average data invalidation time of a given superblock: after t_half, half of the pages in the superblock
have become invalid, leaving the other half valid. The algorithm chooses the block with the smallest number of
expected valid pages calculated by interpolation. During the GC process, the relocated valid pages will be placed into a
superblock whose retention time matches the expected remaining lifetime of those pages.
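A minimal sketch of this victim-selection idea, assuming straight-line interpolation from the observed average invalidation time; the paper's exact estimator may differ.

#include <algorithm>
#include <vector>

struct Superblock {
    int    total_pages;
    int    valid_pages;   // currently valid pages
    double age;           // time since the superblock was written
    double t_half;        // observed average data invalidation time: after
                          // t_half, about half of the pages are invalid
};

// Linearly interpolate how many pages are expected to still be valid after
// an additional `lookahead` amount of time.
double ExpectedValidPages(const Superblock& sb, double lookahead) {
    double t = sb.age + lookahead;
    double expected_fraction = std::max(0.0, 1.0 - 0.5 * t / sb.t_half);
    return expected_fraction * sb.total_pages;
}

// Pick the superblock with the fewest expected valid pages as the GC victim.
int PickVictim(const std::vector<Superblock>& sbs, double lookahead) {
    int victim = 0;
    for (size_t i = 1; i < sbs.size(); ++i)
        if (ExpectedValidPages(sbs[i], lookahead) < ExpectedValidPages(sbs[victim], lookahead))
            victim = static_cast<int>(i);
    return victim;
}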
3.2.3 Machine Learning Model Approaches: With the increasing popularity of machine learning models, some
approaches leverage machine learning for stream ID assignment. StoneNeedle[161] started applying complex
machine learning algorithms to assign stream IDs. It extracts useful workload features to calculate hotness, which
is later processed using PCC to determine the correlation between hotness and features. Long short-term memory
(LSTM) is then used to learn the characteristics for stream ID prediction. A similar approach was later adopted
by ML-DT[46]. One major difference is that the training process is offline, which means that it cannot adapt to
different workloads on the fly. This is because the training process for machine learning models is too heavy
to be placed in the SSD, since the SSD has limited computational power. However, ML-DT utilizes multiple machine
learning algorithms, including temporal convolutional networks (TCN), LSTM, support vector machines (SVM),
and random forests (RF). The authors concluded that TCN is the best model, with the best accuracy and the lowest
resource requirement in the SSD. It is also worth noting that this work is heavily influenced by Multi-stream and
its related work (which can be seen from the Evaluation section of the paper), but it does not directly use the
Multi-stream interface between the host and the SSD. Rather, the trained model is placed inside the SSD and
used to assign different superblocks internally.
3.2.4 The Fall of Multi-stream: Despite the promising results of the prior works above, the Linux kernel removed
Multi-stream support from the bio layer and the default NVMe driver in 2022 due to the low adoption rate[69]. To
quote the commit message, "No vendor ever really shipped working support for this, and they are not interested
in supporting it... No known applications use these functions." One possible reason is the emergence of Zoned
Namespaces SSDs, which we will discuss in §4.2. However, the removal of Multi-stream and write hints from the
bio layer received backlash from some vendors, including Samsung[105], Micron[130], and Western Digital[140].
They mentioned that the write hints in the bio layer were used by UFS[53], a common standard for
host/flash communication in smartphones[72]. The write amplification can be significantly reduced when
using F2FS with write hints. Despite the support from the vendors for keeping write hints and Multi-stream in
the Linux kernel for UFS, the kernel maintainers still decided to remove the relevant code, citing the lack of
a Multi-stream interface implementation for UFS in the Linux kernel to support the vendors' claims[28]. The
support for Multi-stream and write hints was ultimately removed in Linux kernel v5.18-rc1[69].
Although Multi-stream eventually faded into history, it left several legacies. The authors of FileStream mentioned
that they would like to apply their scheme to Zoned Namespaces SSDs.
// RocksDB
// db/flush_job.cc
Status FlushJob::WriteLevel0Table() {
  // ...
  auto write_hint = cfd_->CalculateSSTWriteHint(0);
  // ...
}

// db/compaction/compaction_job.cc
void CompactionJob::Prepare() {
  // ...
  write_hint_ = cfd->CalculateSSTWriteHint(c->output_level());
  // ...
}

Status CompactionJob::OpenCompactionOutputFile(SubcompactionState* sub_compact,
                                               CompactionOutputs& outputs) {
  // ...
  writable_file->SetWriteLifeTimeHint(write_hint_);
  // ...
}

// env/io_posix.cc
void PosixWritableFile::SetWriteLifeTimeHint(Env::WriteLifeTimeHint hint) {
  // ...
  if (fcntl(fd_, F_SET_RW_HINT, &hint) == 0) {
    write_hint_ = hint;
  }
  // ...
}

// ZenFS for ZNS Eval
// fs/io_zenfs.cc
IOStatus ZoneFile::SetWriteLifeTimeHint(Env::WriteLifeTimeHint lifetime) {
  lifetime_ = lifetime;
  return IOStatus::OK();
}

void ZonedWritableFile::SetWriteLifeTimeHint(Env::WriteLifeTimeHint hint) {
  zoneFile_->SetWriteLifeTimeHint(hint);
}

IOStatus ZoneFile::Append(void* data, int data_size, int valid_size) {
  // ...
  active_zone_ = zbd_->AllocateZone(lifetime_);
  // ...
}
Listing 2. Code from RocksDB (left) and ZenFS (right). The RocksDB code for calculating write_hint_ (not shown) and
using this value as the desired stream ID (shown here) was added for Multi-stream support[110]. ZenFS uses the same algorithm
and I/O path for calculating and passing write_hint_ to the device[70].
The ZNS Evaluation paper[36] used the same algorithm and I/O path for assigning stream IDs to ZNS zones
in RocksDB[70, 110]. Listing 2 shows the relevant code. Using the method SetWriteLifeTimeHint, RocksDB
sets the write hint for a given file depending on its type (e.g., SSTables of different levels, WAL, etc.), which is
calculated using the CalculateSSTWriteHint method. When writing a file, the write hint will be passed, using
the fcntl operation F_SET_RW_HINT (added to the Linux kernel with Multi-stream[9]), as the stream ID when
the underlying SSD supports Multi-stream. The F_SET_RW_HINT operation is one of the few parts that survived
the removal of the Multi-stream code since it is used in ZenFS and ultimately ZNS SSDs; the same write hint
field is used to determine the zone to write to, which requires the F_SET_RW_HINT operation. This shows that
there are some possibilities to reapply schemes in the Multi-stream category to ZNS SSDs, and it is possible to
see some of the schemes built for Multi-stream SSDs appear in ZNS.
3.3 Flexible Data Placement (FDP)
[Figure 5: An FDP-enabled SSD, showing reclaim units organized into reclaim groups across the controller's channels and flash chips and referenced through reclaim unit handles.]
Traditional SSDs perform garbage collection at the superblock level, where a superblock is a collection of blocks from all
chips. Traditional SSDs do not expose any details about this organization, and the garbage collection unit is also
nonconfigurable by the host. FDP provides the ability for hosts to configure the garbage collection unit based
on the capability of the SSD. An FDP-enabled SSD provides the host with a list of possible FDP configurations,
which describes the possible granularities of the garbage collection unit supported by the SSD, as shown in Figure 5.
The granularity is defined by the reclaim unit (RU), which can be as small as a single erase block on a die or as
big as a superblock. The reclaim units are then organized into reclaim groups (RGs); each can contain as few as
one reclaim unit or as many as all reclaim units[137]. The SSD provides a list of reclaim unit handles (RUHs);
each points to a reclaim unit in every single reclaim group. After choosing a configuration, the host can choose
to place data in different garbage collection units by providing the desired RUH and RG, ensuring that data with
different RUHs and RGs will not be mixed, similar to Multi-stream. When the RU pointed to by an RUH is full, the
SSD automatically assigns another RU in the same RG to that RUH. Furthermore, the host can also allow or
disallow data isolation after garbage collection within a reclaim group, i.e., whether data from the same reclaim group
but from different reclaim units can be mixed together after garbage collection[120]. These features make FDP more flexible
than Multi-stream and ZNS when choosing data placement locations.
Additionally, traditional SSDs do not provide any garbage collection-related statistics or feedback to the host.
The same is true for Multi-stream SSDs: although the host can choose to place data by attaching a stream ID, the
host does not know the effectiveness of such information; in other words, the host cannot tell whether providing
the given stream ID helps in reducing write amplification. FDP addresses this issue by providing statistics and
events related to garbage collection. The host can now query the exact number of bytes written by the host or
by the SSD, which is enough for the host to calculate the exact write amplification factor. FDP also supports event
logging; some notable examples include[120]:
• If the reclaim unit has changed for a write frontier (e.g., due to garbage collection);
• If the reclaim unit is underutilized (i.e., not written to full within a time period);
• If the reclaim unit was not written to full when the host changed its write frontier to another reclaim unit.
These features allow the host to learn about the internal state of the SSD with respect to garbage collection, and
the host can then adapt accordingly to achieve lower write amplification.
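FDP carries this placement information on each write as a placement identifier (a reclaim group plus a reclaim unit handle). The sketch below gives a conceptual, host-side view of that model; the structures and the FdpWrite entry point are illustrative and do not reproduce the actual NVMe command encoding defined by the FDP specification.

#include <cstdint>
#include <vector>

// Configuration reported by an FDP-enabled SSD (simplified).
struct FdpConfig {
    uint16_t num_reclaim_groups;        // reclaim groups (RGs) available
    uint16_t num_reclaim_unit_handles;  // reclaim unit handles (RUHs) available
    uint64_t reclaim_unit_bytes;        // size of one reclaim unit
};

// The placement identifier attached to a write: which RG to use and which
// write frontier (RUH) within it.
struct PlacementId {
    uint16_t reclaim_group;
    uint16_t reclaim_unit_handle;
};

// Hypothetical device entry point: issue a write tagged with a placement
// identifier (a real host would encode this into the NVMe write command).
bool FdpWrite(uint32_t nsid, uint64_t lba, const std::vector<uint8_t>& data,
              const PlacementId& pid);

// Example policy: give each tenant its own reclaim group for isolation, and
// separate short-lived from long-lived data with different handles.
PlacementId ChoosePlacement(const FdpConfig& cfg, int tenant_id, bool short_lived) {
    PlacementId pid;
    pid.reclaim_group = static_cast<uint16_t>(tenant_id % cfg.num_reclaim_groups);
    uint16_t handle = short_lived ? 0 : 1;  // assumes at least two RUHs exist
    pid.reclaim_unit_handle = static_cast<uint16_t>(handle % cfg.num_reclaim_unit_handles);
    return pid;
}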
Table 2. A summary of prior works that can be potentially improved with FDP. The genealogy tree is not provided, as papers
in each category show a linear, chronological relation.
Although FDP has been recently ratified and accepted by NVMe, it will still take time for FDP to be applied and
researched, since the new NVMe version with FDP support is yet to be released. However, we identify two paper
categories closely related to FDP. The first category relates to the FDP features for reducing WAF, and the
second category is SSD performance isolation. A summary of the papers can be found in Table 2.
3.3.1 WAF-reducing FDP Features: One feature provided by FDP is the ability to give feedback to the host
regarding data placement and its effectiveness, so the host can dynamically change how data should be placed
physically. WARCIP[159], which we discussed in the Multi-stream section (§3.2), provides a similar feedback
mechanism from the SSD to the host; the host dynamically merges and splits streams according to the utilization
of each stream provided by the SSD feedback mechanism.
The second paper is PLAN[167], which utilizes some concepts in FDP, including dynamic garbage collection
unit granularities, and shows that SSDs whose superblocks exploit all chips do not achieve the best performance
when performing random writes. The performance can be improved by dynamically assigning erase units of
different sizes, adjusting the number of chips used based on write characteristics. PLAN shows the feasibility
and effectiveness potential of FDP if RUs (i.e., erase units) of different sizes are used. However, PLAN is designed
entirely within the SSD firmware, which means that it can only infer the necessary information from request
heuristics. With FDP, the host can directly control the organization of reclaim units using the information that
the host has, which is more comprehensive than what the SSD sees. This leads to a lower write amplification
factor and better SSD performance due to more informed decisions.
3.3.2 Performance Isolation: SSDs are expected to have multiple tenants. In a single computer, several
programs may run concurrently and issue I/O requests; in a large data center, an SSD may be shared by multiple
tenants, such as containers, virtual machines, and users. It is important to ensure that the I/O requests of one
tenant do not disturb other tenants; otherwise, the performance of other tenants may be affected[100]. Since FDP
provides an interface for choosing RUs that can potentially sit on top of different channels and chips for data
placement, using FDP to achieve performance isolation is feasible. The host can take over the task of identifying
different tenants and provide better performance isolation with the FDP interface. Below are five papers
we believe are most related to FDP, with the potential to be improved using FDP.
The first paper that may potentially be improved with FDP is OPS-Iso[94], which shows that whole-SSD GC
causes disturbance between different tenants because one tenant could trigger GC and affect the I/O performance
of others. As a mitigation, the paper separates the GC activities of different tenants so that the GC triggered by a
tenant will not affect others, since the GC activity is limited to the chips used by that tenant. Another paper from
2015, VSSD[47], provides different virtual SSDs (vSSDs) to different users and uses a fair scheduler to serve them.
A later work, FlashBlox[71], provides vSSDs at different granularities (i.e., channel-isolated,
chip-isolated, etc.), which is very similar to FDP. CostPI[113] from ICPP 2019 provides more isolation for other
parts of the SSD internals, including the mapping table cache and the data cache, in addition to the chip isolation used in its
prior works. Lastly, DC-Store[100] provides performance and resource isolation for containers. It implements
NVMe sets on a real SSD and statically assigns internal SSD resources to segregate different tenants physically.
These prior works could be implemented with FDP support to provide better performance isolation, with tenant
identification on the host side and garbage collection segregation in the SSD. To conclude, we hope that more
research will be done after FDP is integrated into the NVMe standard.
[Figure: A ZNS SSD exposing zones to the host, with each zone mapped to blocks across the controller's channels and flash chips.]
[Figure 7: The genealogy tree of ZNS-related work.]
The host has to explicitly choose a zone to write to. The host must also manage the garbage
collection of ZNS SSDs by choosing a victim zone to perform GC on. In summary, ZNS SSDs in some sense expose
their superblocks as zones to the host, and the host has to choose a zone to write to when issuing a write request
and a zone to erase when the number of available zones falls below a certain threshold. In the rest of the section,
we will first focus on the efforts to support ZNS on the host side, then discuss the research works shown in the
genealogy tree (Figure 7). A summary of the research work related to ZNS can be found in Table 3.
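To give a feel for these host-side responsibilities, the sketch below queries a zoned block device for its zones and resets one full zone using the Linux zoned block device ioctls (linux/blkzoned.h). The device path and the choice of victim are illustrative, and a real host would relocate any valid data before resetting a zone.

#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>   // BLKREPORTZONE, BLKRESETZONE, struct blk_zone

int main() {
    int fd = open("/dev/nvme0n2", O_RDWR);   // illustrative ZNS namespace
    if (fd < 0) { perror("open"); return 1; }

    // Report the first 16 zones (start, length, and write pointer in 512 B sectors).
    const unsigned int nr = 16;
    size_t bytes = sizeof(struct blk_zone_report) + nr * sizeof(struct blk_zone);
    auto* report = static_cast<struct blk_zone_report*>(calloc(1, bytes));
    report->sector = 0;
    report->nr_zones = nr;
    if (ioctl(fd, BLKREPORTZONE, report) < 0) { perror("BLKREPORTZONE"); return 1; }

    for (unsigned int i = 0; i < report->nr_zones; i++) {
        const struct blk_zone& z = report->zones[i];
        printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n",
               i, (unsigned long long)z.start, (unsigned long long)z.len,
               (unsigned long long)z.wp, (unsigned)z.cond);
    }

    // Reset (erase) the first zone so it can be rewritten from its start.
    struct blk_zone_range range = { report->zones[0].start, report->zones[0].len };
    if (ioctl(fd, BLKRESETZONE, &range) < 0) perror("BLKRESETZONE");

    free(report);
    close(fd);
    return 0;
}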
4.2.2 Bringing Host Support to ZNS SSDs: Compared to Open-channel SSDs, ZNS SSDs require fewer
responsibilities to be moved to the host[117]. This makes the development of ZNS-aware applications easier.
The existing code for Multi-stream SSDs can be used directly by applying the stream ID as the zone number. The
existing code in the kernel can also be recycled similarly. Listing 2 shows an example of recycling the Multi-stream
stream ID assignment algorithm and its datapath for ZNS. The modification in RocksDB was intended for
Multi-stream, but was directly used by ZenFS, the storage backend extension that gives RocksDB ZNS SSD support[36].
ZenFS is backed by a simple filesystem named ZoneFS, which exposes each zone as a file[59]; each file stores its
corresponding zone information in its inode[56]. The desired zone number is selected by RocksDB
based on the type of data (e.g., SSTable level) by reusing the stream ID calculation logic from Multi-stream.
RocksDB then passes the zone number to ZenFS as its middle layer, which ultimately uses ZoneFS as the backing
filesystem. By writing to different files, each representing a zone, RocksDB can effectively separate data with
different characteristics into different GC units.
However, ZoneFS has its limitations: the files representing different zones can only be written sequentially[59],
requiring developers to write additional code to make their applications ZNS-compatible. Some traditional
filesystems, including F2FS and Btrfs, now support ZNS SSDs as a backend so that applications can run on ZNS
SSDs without modification. Both filesystems can adapt to ZNS SSDs relatively easily due to their underlying
design concepts. F2FS is a log-structured filesystem, which means that all write requests are placed sequentially; in
other words, F2FS intrinsically aligns with the design concept of ZNS SSDs. However, it still requires conventional
zones (i.e., with random write support) or other conventional devices to place filesystem metadata due to its
design[58].
On the other hand, Btrfs is designed around Copy-on-Write (CoW), placing the newer version of data in a
new location instead of performing an in-place update, which also aligns with the design concept of ZNS SSDs.
Since the only Btrfs metadata that requires a fixed location is the filesystem superblock (not to be confused with
the superblocks in SSDs described in previous sections), two zones are reserved for filesystem superblocks.
The latest filesystem superblock is appended after the previous version. When one zone is full, the filesystem
superblock will be written to the other zone, and the previous zone will be reset and reused when the new zone is
full. The latest filesystem superblock can always be easily retrieved by checking the last write pointer location,
since it is always appended after the previous filesystem superblock[57].
In summary, ZNS SSDs can be supported in two ways: making the application ZNS-aware or using a filesystem with ZNS
support. The first approach provides more flexibility for applications to control data placement, but this can be
time-consuming for developers and eventually hinder the adoption of ZNS SSDs. To mitigate the issue, filesystems
can be designed to be ZNS-compatible so that applications can run on ZNS SSDs smoothly without being aware
of the underlying SSD type.
4.2.3 Improving Performance on KV Databases: Early research on ZNS focused on combining key-value databases
with ZNS SSDs. The policy for writing key-value database files is the central topic of these works. ZoneKV[115]
tackles a problem inherited from Multi-stream. The write hint brought in by the Linux kernel and RocksDB has
only four levels of temperature, namely SHORT, MEDIUM, LONG, and EXTREME[9, 110]. ZoneKV sets the lifetime to 1
for L0 and L1, 2 for L2, and 3 for L3 SSTables, which is the same as RocksDB[110]. However, RocksDB sets the
lifetime of all SSTables at L4 and beyond to 4 due to this limitation, while ZoneKV sets the lifetime of Li SSTables
to i, bypassing the limit of 4.
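The difference can be made concrete with a small sketch that follows the description above: the stock mapping collapses every level at or beyond L4 into the fourth lifetime value, whereas a ZoneKV-style policy keeps one lifetime value per level. The helpers below are illustrative, not code from RocksDB or ZoneKV.

#include <algorithm>

// RocksDB-style mapping as described above: L0/L1 -> 1, L2 -> 2, L3 -> 3,
// and all SSTables at L4 or deeper collapse into lifetime 4.
int RocksDBStyleLifetime(int level) {
    if (level <= 1) return 1;
    return std::min(level, 4);
}

// ZoneKV-style mapping: an SSTable at level i keeps lifetime i, so deep levels
// with very different lifetimes are not forced into the same group of zones.
int ZoneKVStyleLifetime(int level) {
    if (level <= 1) return 1;
    return level;
}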
CAZA[102] from HotStorage 2022 improves the algorithm brought in by Multi-stream and used by the ZNS
evaluation paper[36]: instead of allocating zones by compaction level only, RocksDB can leverage
SSTable information to make better zone choices. By the definition of SSTable compaction, when a compaction
happens, several SSTables with overlapping key ranges are invalidated together and compacted into a single,
new SSTable. This indicates that SSTables with overlapping key ranges should be written to the same zone. If an SSTable
has no key overlap, a new zone is allocated for the SSTable. However, as of February 2024, the improvement
brought by CAZA to the CalculateSSTWriteHint function is not reflected in the RocksDB codebase[23]. LL-compaction[81],
also from HotStorage 2022, provides another zone selection algorithm for KV databases. It
focuses on keeping zones dedicated to each compaction level, but the SSTables are split into fine-grained key
ranges so that key ranges that will be compacted soon are not mixed with other SSTables in the same zone.
However, it does not work on some databases like RocksDB because they employ a priority-driven SSTable
selection algorithm with extra factors, including age and the number of deleted items, when choosing SSTables to
compact. LifetimeKV[112] also notices the possibility of having short-lived SSTables after compaction, similar to
LL-compaction. It proposes two mitigations: first, a newly generated SSTable should not have an overlapping key
range with upper-level SSTables, so when an upper-level SSTable is chosen for compaction, the aforementioned
new SSTable will not be selected for compaction, increasing its potential lifetime to match other SSTables
at the same level; second, an SSTable at a level will be prioritized for compaction if it has stayed too
long at that level, which reduces the possibility of having multiple SSTables with widely varying lifetimes at the same
level.
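A simplified sketch of the key-overlap idea behind CAZA, under the assumption that the host tracks, per zone, the key ranges of the SSTables it has written there; CAZA's actual allocator also considers compaction levels and zone lifetimes.

#include <string>
#include <vector>

struct KeyRange { std::string smallest, largest; };

struct ZoneInfo {
    int zone_id;
    std::vector<KeyRange> sstables;   // key ranges of SSTables stored in this zone
};

static bool Overlaps(const KeyRange& a, const KeyRange& b) {
    return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Prefer a zone that already holds SSTables overlapping the new table's key
// range, since those tables are likely to be invalidated together by a future
// compaction; otherwise ask for a fresh zone (represented here as -1).
int PickZoneForSSTable(const std::vector<ZoneInfo>& zones, const KeyRange& new_table) {
    int best_zone = -1;
    size_t best_overlaps = 0;
    for (const ZoneInfo& z : zones) {
        size_t overlaps = 0;
        for (const KeyRange& r : z.sstables)
            if (Overlaps(r, new_table)) overlaps++;
        if (overlaps > best_overlaps) { best_overlaps = overlaps; best_zone = z.zone_id; }
    }
    return best_zone;   // -1 means "allocate a new zone"
}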
WALTZ[103] tackles the KV database performance problem by looking at it from another angle. Instead
of mainly improving the method of placing SSTables like the aforementioned works, WALTZ identifies the use
of the zone report command as a source of performance degradation when a zone is full. When a write to a zone
fails (e.g., the zone is almost full), a zone report request is sent to check the current write pointer location and
thus the remaining free space in the zone. This causes extra latency due to the communication overhead
between the host and the SSD. Instead of querying the SSD for write pointer information, WALTZ utilizes other
zones to write the data so that the long latency caused by the zone report command can be eliminated.
4.2.4 Exploiting ZNS Internal Parallelism: Don't Forget ZNS Parallelism[30] from HotStorage 2022 brings up
another topic for ZNS SSDs: internal parallelism. Modern SSDs rely on internal parallelism for their blazing speed,
but this information is not shared with the host system for traditional SSDs. Interestingly, ZNS SSDs also do not
share this information with the host system, even though ZNS SSDs are more transparent than traditional SSDs.
If a host writes to more than one zone concurrently, there is a chance that the host is unknowingly writing to the
same chip, causing performance degradation. On the other hand, some ZNS SSDs expose large zones (e.g., 2.18GB),
while others expose small zones (e.g., 96MB, utilizing a single flash chip). The SSD may assign a higher degree
of parallelism to larger zones, while small zones may not be assigned a high degree of parallelism. It is
possible to write to multiple small zones simultaneously, but again, without knowing the zone-to-chip mapping,
the host could unknowingly write to two zones mapped to the same chips. The paper proposed an algorithm to
identify zones that are mapped to the same chips: by checking whether writing to every combination of two zones causes
performance degradation, the algorithm is able to identify these conflict groups (CGs) of zones. The system I/O
scheduler is then modified to schedule I/O requests so that data destined for the same CG will not be written at the
same time, for maximum performance. A later work, eZNS[123], extends the idea by automatically assigning
zones that do not conflict with each other into logical zones (v-zones), which contain a variable number of zones.
The number of zones in a v-zone can shrink or expand depending on the workload requirements to maximize the
performance of each v-zone.
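A sketch of the probing idea: measure the write throughput of every zone pair, and place two zones in the same conflict group when writing to them concurrently is noticeably slower than writing to either alone. The measurement routines below are hypothetical stubs, and the threshold is an assumption; the paper's actual methodology differs in detail.

#include <vector>

// Hypothetical probes standing in for real measurements: a real implementation
// would time concurrent writes to the zones and return throughput in MiB/s.
static double MeasureAloneThroughput(int /*zone*/) { return 1000.0; }
static double MeasurePairThroughput(int /*zone_a*/, int /*zone_b*/) { return 1000.0; }

// Union-find grouping: two zones whose concurrent throughput collapses below
// `threshold` times the standalone throughput are assumed to share flash chips.
std::vector<int> FindConflictGroups(int num_zones, double threshold = 0.6) {
    std::vector<int> group(num_zones);
    for (int i = 0; i < num_zones; ++i) group[i] = i;
    auto find = [&](int x) {
        while (group[x] != x) x = group[x] = group[group[x]];
        return x;
    };
    for (int a = 0; a < num_zones; ++a) {
        double alone = MeasureAloneThroughput(a);
        for (int b = a + 1; b < num_zones; ++b) {
            if (MeasurePairThroughput(a, b) < threshold * alone)  // interference
                group[find(a)] = find(b);                         // merge groups
        }
    }
    for (int i = 0; i < num_zones; ++i) group[i] = find(i);
    return group;  // group[z] identifies the conflict group of zone z
}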
Small Zone RocksDB[75] also focuses on exploiting parallelism on small-zone ZNS SSDs. However, instead of
implementing its solution in the I/O scheduler, it is a work focusing only on RocksDB and ZenFS. For ZNS SSDs
with small zones, the size of an SSTable is much larger than the zone size. Instead of writing SSTables sequentially
zone by zone, the paper proposes striping them sequentially across multiple different zones to exploit parallelism;
however, each zone will still be used to store SSTables of a single compaction level.
4.2.5 Improving ZNS Internals: Other works focus on more than bringing ZNS support to key-value
databases. ZNS+[66] identified some potential problems with the vanilla ZNS SSD and host-SSD protocol. When
garbage collection is performed, any valid data in the victim zone should be relocated, similar to the garbage
collection process in a traditional SSD. However, in a traditional SSD, garbage collection is triggered internally in
the SSD. The relocation traffic is not observable by the host, but it does not require host attention either. A ZNS SSD,
on the other hand, requires the host to manage the garbage collection process. The host has to relocate the valid
data from the victim zone, which incurs extra overhead by reading the valid data in the victim zone to the host
and then writing it back to the SSD. This process causes data to move not once but twice between the
host and the SSD.
To remove the external data movement between the host and the SSD during the garbage collection process,
ZNS+ proposed an extension to the ZNS protocol so that the data relocation process becomes internal to the SSD.
The simple NVMe copyback function is extended into a command named zone_compaction, which accepts noncontiguous address
ranges for valid data relocation. This is designed for F2FS with threaded logging, where there can be direct overwrites to
dirty segments, supported by adding TL_opened to the ZNS+ protocol. The SSD should also expose its
internal mapping information to the host: the host then knows which chunk (the smallest unit of copyback) of
data is stored on which physical chip. When performing copyback, the host should relocate data from one chip to
the same chip to prevent inter-chip traffic. However, to the best of our knowledge, this proposed extension to
ZNS is not a part of the NVMe ZNS protocol as of February 2024.
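Since the extension was never ratified, there is no standard API for it; the fragment below only sketches, with hypothetical types and names, the shape of the interface the paper describes — a compaction request that hands the SSD a set of noncontiguous valid extents to copy internally into a destination zone.

#include <cstdint>
#include <vector>

// Hypothetical representation of the ZNS+ proposal: instead of reading valid
// data to the host and writing it back, the host asks the SSD to copy a set of
// noncontiguous extents from a victim zone into a destination zone internally.
struct CopyExtent {
    uint64_t src_lba;     // start of a valid-data extent in the victim zone
    uint32_t num_blocks;  // length of the extent
};

struct ZoneCompactionRequest {
    uint32_t victim_zone;
    uint32_t dest_zone;
    std::vector<CopyExtent> extents;   // noncontiguous valid data to relocate
};

// Hypothetical driver entry point; a real implementation would translate this
// into the protocol-specific zone_compaction command proposed by ZNS+.
int SubmitZoneCompaction(int device_fd, const ZoneCompactionRequest& req);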
Another work, Blocks2Rocks[117], also proposed changes to the ZNS interface. With the trend of increasing
SSD page sizes (e.g., commonly around 16KB as of the writing of this survey), updating a small unit of data,
e.g., 4KB, causes overhead due to read-modify-write. The SSD has to invalidate the old physical page and then
write the page to a new location. Blocks2Rocks proposed an extension to support small "rocks", which can be as
small as 16 bytes, on ZNS SSDs. The NVMe command set specification states that a block size smaller than 512 bytes is not
supported[21], which shows the necessity for protocol modifications. The author proposed saving small rocks
in in-SSD NVRAM, which requires about the size of one page per active ZNS zone. This space is used to buffer
the rocks written to the SSD, which will be transferred to the flash chips for consistency when the buffer is full.
However, similar to ZNS+, this work was not ratified into the NVMe protocol.
4.2.6 Bringing ZNS to New Use Cases: Although the designs of ZNS and KV databases are well matched to each
other, and the earliest evaluations of ZNS-related papers focus on KV databases, there are efforts to utilize ZNS
SSDs for other kinds of workloads. ZNSwap[31, 32] uses ZNS SSDs for Linux swap space. When a memory page is
swapped out, it will be written to the SSD. The paper provides different policies for choosing a zone to write
swapped-out pages to. After choosing a zone, the location of the page will be written to the page table entry,
which will also be updated if the page is relocated during future GC processes. DockerZNS[67] leverages ZNS
SSDs for Docker images, which may require different QoS and segregation, and which may suffer from noisy neighbors.
It provides a number of zones for each Docker image and stripes the image across several zones based on its QoS
requirement, which is done by knowing the performance a single small zone can provide. The paper also first
identifies the conflict groups of zones[30] so that the zones used for the same image will not use the same chips,
ensuring maximum utilization of internal parallelism.
There are also works on improving current filesystems for ZNS. FS Journal in ZNS[48] works on separating
filesystem metadata and user data, similar to FStream but on ZNS SSDs. Persimmon[132] changes F2FS so that
it is more ZNS-native by making the metadata append-only and improving the checkpoint logic. As we
discussed earlier in the section, the original F2FS design expects metadata and checkpoints to be in a known
address range, which means that updating metadata and checkpointing lead to in-place overwrites, causing
incompatibility with ZNS design goals. Persimmon assigns dedicated zones to frequently updated metadata for
easy cleaning, which eliminates in-place updates of filesystem metadata. It also writes checkpoints at the end of
each zone for easier garbage collection, since checkpoints will not spill into other zones.
Beyond filesystems, RAID is also a fundamental service for applications. RAIZN[97] provides RAID support for
ZNS SSDs and exposes an SSD interface compatible with ZNS-aware applications and filesystems. ZapRAID[152],
on the other hand, exposes a block-level volume with random read/write support. In summary, these papers
strive for improvements at more fundamental layers so that applications running on top of them perform better,
and we hope future works can show the benefit of using ZNS SSDs for general use cases in different environments,
including servers and even smartphones.
[Figure 8: A computational storage SSD — the host sends function requests (e.g., func1(req1, LBA_Range1)) to the device, which processes the data near the flash chips and returns the responses.]
5.1.1 Prelude: The concept of computational storage predates the era of SSDs. Active Disks[135] was published
in 2001, marking one of the first instances of a computational storage system. The paper identifies that HDDs have
processors and RAM just like a normal computer, which means that they can also process data like standalone
computers. By processing data inside multiple HDDs before sending it to the host, the performance of Active
Disks scales better than using a single host to process data from all HDDs. However, because of the limited
computational power of HDDs at the time, the performance of a computational storage system with only one
HDD is limited; the system only works better when multiple HDDs work together. Thankfully, modern
SSDs have better computational power than HDDs, as one can expect based on Moore's Law. SSDs like
the Samsung PM1725 have dual-core processors with a frequency of 750 MHz[80]. A single SSD can also easily
surpass the throughput of the HDD array presented in Active Disks by exploiting its internal parallelism, which
means computational storage can be achieved with a single SSD.
The structure of a typical computational storage-enabled SSD is shown in Figure 8. The SSD allows a series of
functions (also known as tasklets) to run on the SSD, and the data to be returned to the host is processed first
before being sent back. Not only is less data transferred to the host, but the host can also immediately use the
processed data without processing it again. Some works use an extra FPGA instead of the SSD's processors for
computing. However, reprogramming the SSD firmware or the FPGA is usually challenging. Most commercially
available SSDs are black boxes whose firmware cannot be reprogrammed, and attaching an FPGA to an SSD also
requires nontrivial effort. Therefore, many works aim to make coding for computational storage easier and more
general without the need to reprogram the firmware or use an FPGA. The relationship between different prior
works can be found in Figure 9, and the summary of the related papers can be found in Table 4.
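To make the offload model concrete, the toy C sketch below mimics the data flow in Figure 8 under stated assumptions: csd_read, the tasklet type, and the in-memory "flash" array are hypothetical stand-ins rather than any real device API, and the point is only that the filtered result, not the raw data, crosses the host interface.

/* Minimal sketch of the offload model in Figure 8 (hypothetical API, not a
 * real vendor interface): the host binds a small function to a read over an
 * LBA range, and the "device" applies it before returning data, so only the
 * filtered result crosses the host-device boundary. */
#include <stdio.h>
#include <stdint.h>

#define LBA_SIZE 16                 /* toy logical block size */
static uint8_t flash[8][LBA_SIZE];  /* stand-in for the SSD's NAND capacity */

/* A "tasklet": returns how many bytes of `in` should be sent to the host. */
typedef size_t (*tasklet_fn)(const uint8_t *in, size_t len, uint8_t *out);

/* Keep only bytes greater than a threshold -- a trivial filtering tasklet. */
static size_t filter_above_64(const uint8_t *in, size_t len, uint8_t *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (in[i] > 64) out[n++] = in[i];
    return n;
}

/* Hypothetical "device-side" read: run the tasklet over each LBA in range. */
static size_t csd_read(uint32_t lba, uint32_t count, tasklet_fn fn, uint8_t *out) {
    size_t total = 0;
    for (uint32_t i = 0; i < count; i++)
        total += fn(flash[lba + i], LBA_SIZE, out + total);
    return total;
}

int main(void) {
    for (int b = 0; b < 8; b++)                   /* fill the toy device */
        for (int i = 0; i < LBA_SIZE; i++)
            flash[b][i] = (uint8_t)(b * 17 + i * 7);

    uint8_t result[8 * LBA_SIZE];
    size_t n = csd_read(0, 8, filter_above_64, result);
    /* Only n bytes (the filtered records) are "transferred", not 8*LBA_SIZE. */
    printf("host received %zu of %d bytes\n", n, 8 * LBA_SIZE);
    return 0;
}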
5.1.2 SmartSSD-based Approaches: SmartSSD[88] is the first work to allow an SSD to be easily programmed for
different computational tasks. Users write C code on the host, cross-compile it to an ARM-architecture binary,
and finally embed the program (named a tasklet) into the SSD firmware. The tasklet is then executed to
perform the given task, after which the host polls for results. This requires modifying the SATA protocol
with additional vendor-specific commands. The target application scenario for the original SmartSSD is
Hadoop MapReduce-style workloads, but it also enables arbitrary tasks to run on the SSD for
near-data processing. QuerySmartSSD[60] leverages the ability to create tasklets to perform query processing,
showing better performance and lower energy consumption compared to a traditional SSD. YourSQL[80] is also
based on SmartSSD and provides early filtering of SQL query results, which drastically reduces the number of
I/Os. However, SmartSSD has a limitation: the tasklet loading process is not dynamic, meaning that a tasklet can
only be loaded into the SSD offline. Biscuit[65] further enhances the original SmartSSD with
dynamic task loading and unloading, which means that users can dynamically load their tasks on the SSD instead
of coupling the task code into the SSD firmware. It also supports the C++11 standard and libraries with a few
exceptions, providing more flexibility for developers to write and deploy their tasklets.
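The sketch below illustrates the difference between offline embedding and Biscuit-style dynamic loading; csd_load, csd_run, and csd_unload are invented names, and a real device would receive tasklet binaries over vendor-specific commands rather than plain C function pointers.

/* Sketch of what Biscuit adds over the original SmartSSD: tasklets can be
 * loaded and unloaded at runtime instead of being compiled into the firmware.
 * All names below are hypothetical. */
#include <stdio.h>

typedef int (*tasklet_fn)(const char *arg);

#define MAX_TASKLETS 4
static struct { const char *name; tasklet_fn fn; } slots[MAX_TASKLETS];

static int csd_load(const char *name, tasklet_fn fn) {   /* "upload" a tasklet */
    for (int i = 0; i < MAX_TASKLETS; i++)
        if (!slots[i].fn) { slots[i].name = name; slots[i].fn = fn; return i; }
    return -1;                                            /* device slots full */
}
static int csd_run(int id, const char *arg) { return slots[id].fn(arg); }
static void csd_unload(int id) { slots[id].name = NULL; slots[id].fn = NULL; }

static int word_count(const char *arg) {                  /* example tasklet */
    int words = 0, in_word = 0;
    for (; *arg; arg++) {
        if (*arg != ' ' && !in_word) { words++; in_word = 1; }
        else if (*arg == ' ') in_word = 0;
    }
    return words;
}

int main(void) {
    int id = csd_load("wc", word_count);       /* loaded while the SSD is online */
    printf("tasklet %d -> %d words\n", id, csd_run(id, "near data processing"));
    csd_unload(id);                            /* freed without a firmware update */
    return 0;
}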
5.1.3 FPGA-Assisted Platforms: INSIDER[136] builds a general computational storage platform using an FPGA to
accelerate in-storage computing tasks. It provides a POSIX-like file I/O API to the user. Users can register tasks
and bind them to real files; when such a real file is read or written, the user can read the corresponding
virtual file containing the results processed by the bound task. Users can also choose to program in C++ or Verilog,
depending on their needs.
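The minimal C sketch below captures the virtual-file idea; vbind and vread are invented for illustration and are not INSIDER's actual API, and the "in-storage" task here simply runs on the host.

/* Sketch of the "virtual file" idea: a task is bound to a real file, and
 * reading the corresponding virtual file returns data already processed by
 * that task. Names and semantics are illustrative only. */
#include <stdio.h>
#include <ctype.h>

typedef size_t (*task_fn)(const char *in, size_t len, char *out);

static struct { const char *real; task_fn task; } binding; /* one binding for brevity */

static void vbind(const char *real_path, task_fn task) {
    binding.real = real_path;
    binding.task = task;
}

/* "Read" the virtual file: fetch the real file, run the bound task on it. */
static size_t vread(char *out, size_t cap) {
    char buf[256];
    FILE *f = fopen(binding.real, "r");
    if (!f) return 0;
    size_t len = fread(buf, 1, sizeof buf, f);
    fclose(f);
    if (len > cap) len = cap;                /* toy example: clamp to caller's buffer */
    return binding.task(buf, len, out);
}

static size_t to_upper(const char *in, size_t len, char *out) {
    for (size_t i = 0; i < len; i++) out[i] = (char)toupper((unsigned char)in[i]);
    return len;
}

int main(void) {
    FILE *f = fopen("real.txt", "w");        /* the real file on "the drive" */
    if (!f) return 1;
    fputs("insider virtual file", f);
    fclose(f);

    vbind("real.txt", to_upper);             /* register the in-storage task */
    char out[256];
    size_t n = vread(out, sizeof out);       /* host sees processed bytes only */
    printf("%.*s\n", (int)n, out);
    return 0;
}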
So far, all the works discussed above assume the host is a single physical machine. However, modern cloud
infrastructures may use virtual machines to provide services to different clients. The Hardware-based Virtual-
ization Mechanism for Computational Storage Devices[99] provides the ability to share a single computational
storage system through the standard single-root I/O virtualization (SR-IOV) layer. It also features an architecture
that decouples the SSD and the FPGA, which allows scaling the number of SSDs in the same system.
5.1.4 Special Use Cases: Some works focus on a specific use case rather than developing a general platform.
PolarDB Meets Computational Storage[43] focuses on bringing distributed databases with SQL support
to computational storage. To achieve this, the storage engine, the distributed filesystem PolarFS, and the
computational storage driver have to be modified. The storage engine passes additional information, including
the data offset, table schema, and scan conditions of the SQL query, to PolarFS. Based on this information,
PolarFS fetches data from different drives, and the computational storage driver optimizes the scan conditions
and splits them into smaller subtasks before passing them to the drives for better resource utilization. The
underlying block structure is also changed to simplify the FPGA implementation of data scans.
RecSSD[154] focuses on integrating recommender systems with computational storage. However, its operators
are embedded in the SSD FTL, which means that it lacks the ability to dynamically load/unload different recommender
models. Another work, GLIST[108] (short for Graph Learning In-STorage), targets graph learning, which is
used for recommender systems but also for other use cases. To overcome this issue, GLIST provides a
set of APIs to directly operate on graph components (e.g., edges, vertices) and can directly analyze graphs
using pre-trained models, which can be dynamically loaded by the host. This addresses the limitation of RecSSD
and allows the final result to be sent to the host without the need for large data movements outside the SSD.
GenStore[119] brings computational SSDs to genome sequence analysis. It also filters out unwanted data to
reduce the amount of data transferred from SSDs to the host. GenStore has two modes: accelerator mode and
regular mode. The accelerator mode allows the SSD to perform in-storage processing, while the regular mode
allows the SSD to be used as a regular SSD.
KV-CSD[128] offers a unique combination of two worlds: computational storage on ZNS SSDs. By using a
Linux-based SoC and implementing the key-value store in the SoC, KV-CSD has an architecture similar to KV-SSDs,
where a host-level application uses the device as a key-value database.
5.1.5 Energy Concerns: Many papers in the field of computational storage discuss their energy efficiency
compared to traditional computer systems[60, 80, 88]. Active Flash[39, 148] focuses on modeling the energy
consumption of computational SSDs. The authors also created a prototype using OpenSSD with common
scientific data processing functions in the SSD, including max, mean, standard deviation, and linear regression.
5.1.6 Toward a Standardized Computational SSD Design: We have observed a dozen prior works that focus
on computational storage with SSDs. However, there was no industry standardization effort that allows users to
create tasklets without regard to the underlying storage device. Thankfully, the NVMe standard is now looking
into the possibility of using eBPF for computational storage[68]. This can be traced back to XRP[170, 171], which
shows the ability to use eBPF in NVMe drivers. The paper shows that, given the increasing speed of storage
devices, the host layers above the SSD now account for almost 50% of the total overhead, of which the
filesystem accounts for 80%. The layer closest to the SSD, the NVMe driver, only accounts
for about 2% of the total overhead. This means that moving the data processing to the NVMe driver can
reduce latency significantly, by close to 50%. The BPF function, which is saved in a pointer in the bio request, is
invoked when an NVMe request completes. Although XRP itself should not be categorized as computational
storage, it shows close relationships with two later works, LambdaIO[163] and Delilah[68], both of which
use BPF functions for in-storage processing.
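The toy C sketch below illustrates the resubmission idea with structures invented purely for illustration: a small hook attached to a request runs when the request completes and either returns the value or names the next block to read, without climbing back up the filesystem layers. Real eBPF programs are verified and executed in the kernel, not passed around as plain function pointers.

/* Toy illustration of a completion hook that walks a pointer-based on-disk
 * index from the lowest layer of the stack. Everything here is hypothetical. */
#include <stdio.h>

struct node { int key; int value; int next_lba; };       /* toy on-disk index */
static struct node disk[4] = {
    {10, 100, 1}, {20, 200, 2}, {30, 300, 3}, {40, 400, -1},
};

struct request { int lba; int target_key; int done; int result; };

/* The "BPF" hook: returns the next LBA to read, or -1 when finished. */
static int lookup_hook(struct request *rq, const struct node *data) {
    if (data->key == rq->target_key) { rq->done = 1; rq->result = data->value; return -1; }
    return data->next_lba;                   /* resubmit from the lowest layer */
}

static void submit(struct request *rq) {
    int lba = rq->lba;
    while (lba >= 0)                         /* driver-level resubmission loop */
        lba = lookup_hook(rq, &disk[lba]);
}

int main(void) {
    struct request rq = { .lba = 0, .target_key = 30, .done = 0, .result = 0 };
    submit(&rq);
    printf("found=%d value=%d\n", rq.done, rq.result);   /* found=1 value=300 */
    return 0;
}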
LambdaIO[163] argues that standard eBPF limits a general in-storage computing architecture because
eBPF requires a static verifier: eBPF does not support pointer accesses or dynamic-length loops, which are common
but error-prone features. The authors present λ-IO and sBPF, where the s stands for storage. λ-IO provides a
set of APIs by extending common file APIs, allowing users to provide functions to be performed in the SSD, while
sBPF lifts the restriction on pointer accesses and dynamic-length loops in standard eBPF. The combination of the
two provides a platform for developers to build in-storage computing functions using familiar frameworks.
Delilah[68] is a similar work but has more limitations, e.g., no verification of the provided eBPF programs. Lastly,
the recent NVMe technical proposal TP4091 on computational storage was created for a unified computational
storage stack[68], and Samsung created a new generation of SmartSSD based on TP4091[124]. We sincerely hope
that the interface for computational storage on SSDs can be fully standardized in the near future.
Fig. 10. The structure of a key-value SSD: the host issues key-value requests such as PUT(key1, value1) and GET(key1), and the SSD's FTL maintains the mapping from each key to its physical location (channel, die, block, and page) across the device's channels and chips.
Key-value SSDs provide a key-value interface instead of the traditional block interface. The host communicates
directly with the SSD through a key-value interface, similar to how applications use key-value databases.
The SSD internally functions as a key-value database, and the FTL is in charge of managing the mapping
from keys to values, as shown in Figure 10[89]. In some sense, a KV-SSD can be considered a special kind of
computational storage, since the SSD is designed to handle only one task[166]. Figure 11 shows the relationships
between KV-SSD-related works, and Table 5 summarizes them.
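The following minimal sketch shows the host's view of such a device; kv_put and kv_get are hypothetical wrappers (a real system would use the NVMe key-value command set or a vendor library), and the "FTL table" is an in-memory stand-in for the device's key-to-page mapping.

/* Minimal sketch of the host's view of a KV-SSD (Figure 10): the application
 * issues PUT/GET on keys, and the device's FTL -- not a host-side database --
 * resolves each key. All names here are illustrative. */
#include <stdio.h>
#include <string.h>

#define SLOTS 16
static struct { char key[32]; char val[64]; int used; } ftl[SLOTS]; /* toy FTL table */

static int kv_put(const char *key, const char *val) {
    int slot = -1;
    for (int i = 0; i < SLOTS; i++) {
        if (ftl[i].used && strcmp(ftl[i].key, key) == 0) { slot = i; break; } /* update */
        if (!ftl[i].used && slot < 0) slot = i;                               /* first free */
    }
    if (slot < 0) return -1;                                                  /* device full */
    snprintf(ftl[slot].key, sizeof ftl[slot].key, "%s", key);
    snprintf(ftl[slot].val, sizeof ftl[slot].val, "%s", val);
    ftl[slot].used = 1;
    return 0;
}

static const char *kv_get(const char *key) {
    for (int i = 0; i < SLOTS; i++)
        if (ftl[i].used && strcmp(ftl[i].key, key) == 0) return ftl[i].val;
    return NULL;
}

int main(void) {
    kv_put("key1", "value1");                /* no filesystem, no host KV store */
    kv_put("key2", "value2");
    printf("GET(key1) -> %s\n", kv_get("key1"));
    return 0;
}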
5.2.1 Prelude: Before the standardized KV-SSD[89], several papers focused on creating SSDs with
better database support at the hardware level. One of the earliest attempts is X-FTL[87], which exposes a
standard SSD interface but with extended SATA and filesystem operations. A transaction ID can be attached
to read/write requests for consistency, and two new operations, commit() and abort(), take the
transaction ID as an argument. The FTL is also extended with an extra mapping table that records which pages
belong to which transaction, working like a list of SQLite rollback journals; the page mappings from the extra
table are reflected in the normal mapping table when the transaction is committed, or discarded when
the transaction is aborted.
Fig. 11. The relationships between KV-SSD-related works: X-FTL (SIGMOD ’13), LOCS (EuroSys ’14), KV-SSD (SYSTOR ’19), the SNIA Key Value Storage API v1.0 (2019) and v1.1 (2020), PinK (ATC ’20), KVMD (FAST ’20), StripeFinder (HotStorage ’20), ISKEVA (LCTES ’22), and Dotori (VLDB ’23).
However, X-FTL still requires a modified version of SQLite running on top of the host
to communicate with the underlying modified filesystem and device.
Although X-FTL improves SSDs for better SQLite support, later works focus on key-value databases.
This is because key-value databases have responsibilities that overlap with the SSD FTL: their namespace management
maps a key to a value, much like the FTL mapping table maps a logical address to a physical address;
meanwhile, key-value databases use a log-like writing mechanism, which is similar
to how SSDs perform data writes and updates. It is a natural move to remove this extra layer of overhead for
better performance[160].
KAML[79] later extended similar mechanisms to key-value databases. KAML is short for key-
addressable, multi-log SSD, which summarizes its characteristics well. A library, libkaml, provides
APIs similar to what traditional key-value databases usually provide, including support for read,
update, commit, and abort. Different namespaces are also supported, which can be treated as tables
or files as needed. libkaml then communicates with the device driver and the device to find the
data page(s) holding the value a key is associated with. A key is 64 bits in size, but the size of the value can vary. This
approach removes the requirement of running a database on the host, as applications can directly use the SSD as
a key-value database.
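The C sketch below illustrates the shadow-mapping idea that X-FTL and KAML share, using invented names and in-memory tables: writes issued under a transaction land in a shadow table and are merged into the live mapping only on commit, or dropped on abort.

/* Illustrative sketch of a transactional FTL; neither X-FTL nor KAML exposes
 * exactly this C API. */
#include <stdio.h>
#include <string.h>

#define KEYS 8
static int live[KEYS];                 /* live key -> value mapping            */
static int shadow[KEYS];               /* staged values for the open txn       */
static int staged[KEYS];               /* 1 if key was written in the open txn */

static void txn_write(int key, int value) { shadow[key] = value; staged[key] = 1; }

static int txn_read(int key) {         /* a txn sees its own staged writes */
    return staged[key] ? shadow[key] : live[key];
}

static void txn_commit(void) {         /* fold shadow entries into the live map */
    for (int k = 0; k < KEYS; k++)
        if (staged[k]) { live[k] = shadow[k]; staged[k] = 0; }
}

static void txn_abort(void) { memset(staged, 0, sizeof staged); }

int main(void) {
    txn_write(3, 42);
    printf("inside txn  : key 3 = %d\n", txn_read(3));   /* 42 (staged)        */
    txn_abort();
    printf("after abort : key 3 = %d\n", live[3]);       /* 0, nothing applied */

    txn_write(3, 7);
    txn_commit();
    printf("after commit: key 3 = %d\n", live[3]);       /* 7, now durable     */
    return 0;
}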
Table 5. Summary of KV-SSD-related works.
Name | Description
X-FTL [87] | Extends the SATA protocol and filesystem for better SQLite transactional support on SSDs.
KAML [79] | Provides a new, key-value-based interface for the SSD; applications can directly use the SSD as a key-value database.
LOCS [151] | Uses an Open-channel SSD for LevelDB. The paper provides policies to improve performance when performing I/O requests.
FlashKV [165] | Allows a leaner I/O stack for LevelDB by using an Open-channel SSD with a custom user library and driver.
KV-SSD [89] | Provides a leaner I/O stack than KAML by exposing the key-value APIs directly from the device. The first vendor attempt to create a key-value SSD.
PinK [74] | Solves the long tail latency caused by bloom filters due to their probabilistic nature. Instead of using bloom filters, PinK pins the top several levels in DRAM for faster access.
KVMD [129] | Provides the ability to use several KV-SSDs together for better performance and reliability, similar to RAID for traditional block-interface SSDs.
StripeFinder [116] | Reduces the overhead caused by the erasure coding approach in KVMD. However, the paper claims that it is better to use replication instead of erasure coding for workloads with many small objects.
KV-SiPC [33] | Extends the use case of KV-SSDs to OpenMP applications by translating OpenMP API calls into KV-SSD operation calls. It also dynamically changes the number of parallel compute threads and parallel data access threads based on CPU and SSD utilization.
KEVIN [98] | Uses the KV-SSD as the foundation for a general filesystem. Translates file I/O-related system calls into KV-SSD operations and extends the KV-SSD interface for better transaction support.
GPUKV [85] | Allows direct peer-to-peer access between the GPU and the KV-SSD when the workload runs on the GPU instead of the CPU. Reduces I/O overhead due to data movement from the KV-SSD to the GPU via user space.
KVRAID [133] | Stores small key-value objects by packing them together to reduce the overhead caused by the excessive number of keys associated with small key-value objects.
ISKEVA [169] | Extracts video features in the FTL and saves the information to the KV-SSD, with extra support for filtering data within the SSD to reduce the data transferred to the host when performing a query.
Dotori [61] | Provides extra KV-SSD features, including transactions, versioning, snapshots, and range queries; the paper also proposes OAK-tree, a new way of organizing B+-trees tailored for KV-SSDs.
Two papers, LOCS[151] and FlashKV[165], couple key-value databases with Open-channel SSDs for better
performance. Both are based on LevelDB and provide custom user libraries to support the key-value database
running on the host. The libraries cooperate with the Open-channel driver in the kernel to save the
actual data on the SSD. The main difference between the two is that LOCS utilizes file-level parallelism, whereas
FlashKV leverages channel-level parallelism, which shows better read performance when only a single SSTable
is accessed[165]. KAML, LOCS, and FlashKV together build the foundation for the first vendor attempt at
creating a KV-SSD[89], which provides an even leaner I/O stack than KAML; unlike LOCS and FlashKV,
users do not need to write their own management policies and FTL.
5.2.2 Internal Performance Improvements: The creation of the KV-SSD allows the SSD to be used directly as a
key-value database. However, there is still room for improvement. PinK[74] identifies the probabilistic nature
of bloom filters and improves performance by pinning the top levels of LSM-trees. Since bloom filters are
probabilistic, they improve the average latency but not the tail latency. Reconstructing the bloom
filters also incurs high CPU overhead, which is less of a problem on a computer with fast CPUs but a
more serious one on KV-SSDs. Instead of using bloom filters, PinK pins the highest several levels of the
LSM-tree in the KV-SSD DRAM. This is feasible because the lookup overhead is bounded by O(h + 1), where h
is the height of the LSM-tree, which is itself bounded; the top levels of the LSM-tree, which keep the hottest keys,
are also relatively small, as argued by the authors. The tail latency can be improved with these two optimizations
tailored for KV-SSDs.
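The following back-of-the-envelope C sketch (with invented data structures) shows why pinning bounds the lookup cost: only the unpinned levels cost a flash read, so a GET performs at most LEVELS - PINNED flash reads and there are no bloom-filter misses to retry.

/* Toy model of a bounded LSM lookup with the top levels pinned in DRAM. */
#include <stdio.h>

#define LEVELS 5
#define PINNED 2                       /* top levels held in device DRAM */

/* Toy "levels": level i holds keys that are multiples of (i + 2). */
static int level_has(int level, int key) { return key % (level + 2) == 0; }

static int flash_reads;                /* the quantity we want to bound */

static int get(int key) {
    for (int lvl = 0; lvl < LEVELS; lvl++) {
        if (lvl >= PINNED) flash_reads++;      /* only unpinned levels hit flash */
        if (level_has(lvl, key)) return lvl;   /* found at this level */
    }
    return -1;
}

int main(void) {
    int lvl = get(9);                  /* found in pinned level 1: zero flash reads */
    printf("key 9  found at level %d with %d flash reads\n", lvl, flash_reads);
    flash_reads = 0;
    lvl = get(25);                     /* found at level 3: two flash reads, and the */
                                       /* worst case is still LEVELS - PINNED reads  */
    printf("key 25 found at level %d with %d flash reads\n", lvl, flash_reads);
    return 0;
}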
5.2.3 Other KV-SSD Usage Scenarios: Although KV-SSDs are intuitively meant for key-value stores, KEVIN[98] takes a
step further by leveraging the KV-SSD for a more general use case: it builds a filesystem to be used by any type of
application, hence the name key-value indexed solid-state drive. By extending the KV-SSD interface with transaction
support, the KV-SSD can be used to achieve consistency at the hardware level. Common system calls for
file/directory manipulation, e.g., mkdir(), creat(), unlink(), and readdir(), are translated into KV-SSD operations
including GET(), SET(), DELETE(), and ITERATE(), so that existing applications do not need to change their
code.
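The sketch below shows the flavor of this translation; the key naming and value contents are made up for illustration and do not reflect KEVIN's actual on-device schema, but the device commands match the operations named above.

/* Illustrative mapping of file system calls onto KV-SSD commands. The device
 * commands here just print what would be issued. */
#include <stdio.h>

static void dev_set(const char *key, const char *val) { printf("SET(%s, %s)\n", key, val); }
static void dev_delete(const char *key)               { printf("DELETE(%s)\n", key); }
static void dev_iterate(const char *prefix)           { printf("ITERATE(%s*)\n", prefix); }

/* The file-system layer: unchanged POSIX-looking entry points. */
static int my_mkdir(const char *path)   { dev_set(path, "<dir meta>");  return 0; }
static int my_creat(const char *path)   { dev_set(path, "<file meta>"); return 0; }
static int my_unlink(const char *path)  { dev_delete(path);             return 0; }
static int my_readdir(const char *path) { dev_iterate(path);            return 0; }

int main(void) {
    my_mkdir("/home/user");            /* application code keeps calling mkdir() etc. */
    my_creat("/home/user/a.txt");
    my_readdir("/home/user");
    my_unlink("/home/user/a.txt");
    return 0;
}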
While some papers try to generalize the use cases of KV-SSDs, others focus on bringing KV-SSDs to other
specific use cases. One such use case is OpenMP-based programs with high concurrency. High concurrency is
not uncommon for key-value databases: for example, Facebook reports billions of GET()
requests over a period of 14 days, which translates to at least 800 queries per second[44]. Therefore, a key-value
system should be able to handle a vast number of requests concurrently. KV-SiPC[33] provides an approach
for OpenMP workloads with multiple program threads. Existing applications do not need to modify their
code to migrate from traditional block-interface SSDs to KV-SSDs. KV-SiPC changes the
OpenMP internals so that upper-level applications use the key-value APIs, and it also adapts the number of
parallel compute threads and parallel data access threads based on CPU and SSD utilization.
The existence of the KV-SSD reduces the number of layers in the I/O stack between the host and the SSD. However, most papers
assume that the workload is running on the CPU. If the workload runs on another device (e.g., a GPU), there is
another I/O stack to transfer data to the target device. To further reduce the number of layers in the I/O stack,
GPUKV[85] allows direct communication between the GPU and the storage. Instead of bringing
data from the KV-SSD to the host OS and then from the host OS to the GPU, GPUKV creates a direct data path from
the KV-SSD to the GPU using the PCIe peer-to-peer feature, without going through user space. This approach
removes the heavy user-space I/O stack for GPU workloads.
Another work, ISKEVA[169], uses a KV-SSD as an engine for video metadata. Videos may have metadata
associated with the video file itself, and extra features may be added to the file (e.g., whether an object appears in a video).
A feature extractor is integrated into the SSD FTL, and the extracted features are saved in the KV-SSD. A
KV-SSD with ISKEVA supports extra query flags for data filtering so that only filtered results are returned to
the host, eliminating the need for the host to perform feature extraction and result filtering.
5.2.4 Feature Improvements: Although the creation of the KV-SSD led to the first KV-SSD standard[12, 74, 89],
interface improvements have been proposed to add more features. KVMD[129] allows the creation of an array of KV-SSD
devices, similar to RAID for traditional block-interface SSDs. However, since data is stored as key-value mappings
in KV-SSDs, it is replicated or erasure-coded based on the key of the value. The value of a key can be mapped
to several devices for better reliability and performance. However, using erasure coding for
reliability in KVMD causes significant storage overhead. This is because the erasure coding must be done over
the key-value namespace, and value sizes are usually only several times greater than key sizes (e.g.,
smaller than 6:1 and sometimes about 1:1 as reported by Facebook[44]), causing a lot of extra overhead for
the data stored. StripeFinder[116] aims to improve spatial efficiency when using erasure coding. By sharing as
much metadata as possible across different keys, it achieves a lower spatial overhead than KVMD at a value-to-key ratio
of around 12:1. However, the author concludes that even with StripeFinder, the overhead of using erasure
coding remains significant for realistic workloads, and it is better to just use replication in this case. KVRAID[133]
further addresses this issue. Since small and large data objects achieve similar IOPS, but the metadata overhead
associated with smaller data objects is relatively greater than for larger ones, KVRAID packs several small logical data
objects from the host into a large physical data object to mitigate the metadata overhead. The data is written
to the device when a sufficient number of objects have accumulated in SSD DRAM; however, to bound the I/O
latency, there is also a timeout mechanism that writes the currently accumulated data to the device
regardless of the number of objects accumulated. This approach efficiently stores small objects (i.e., 128 to 4096 bytes)
on KV-SSD arrays.
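A minimal sketch of this packing policy is shown below, with the thresholds and names invented for illustration: small objects are flushed as one packed write either when enough of them have accumulated or when a timeout expires, whichever comes first.

/* Toy count-or-timeout packing policy in the spirit of KVRAID. */
#include <stdio.h>

#define PACK_COUNT 4                 /* flush after this many small objects  */
#define TIMEOUT_TICKS 10             /* ...or after this much time           */

static int pending, last_flush_tick;

static void flush(int tick, const char *why) {
    if (pending == 0) return;
    printf("tick %2d: write 1 packed object holding %d small objects (%s)\n",
           tick, pending, why);
    pending = 0;
    last_flush_tick = tick;
}

static void put_small_object(int tick) {
    pending++;
    if (pending >= PACK_COUNT) flush(tick, "count threshold");
}

static void timer_tick(int tick) {
    if (tick - last_flush_tick >= TIMEOUT_TICKS) flush(tick, "timeout");
}

int main(void) {
    int tick = 0;
    for (; tick < 6; tick++)  put_small_object(tick);  /* burst: packed by count  */
    for (; tick < 30; tick++) timer_tick(tick);        /* idle: leftovers flushed */
                                                       /* by the timeout instead  */
    return 0;
}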
Dotori[61] adds more features to the KV-SSD, including transactions, versioning, snapshots, and range
queries. It also provides better indexing support for KV-SSDs through its proposed OAK-tree, a B+-tree tailored
for KV-SSDs. The authors also call for standardizing these features by implementing them in the SSD (instead
of on the host, as Dotori does) and extending the KV-SSD interface. Features like transactions are already used by some
prior works (e.g., KEVIN[98]), showing their usefulness for extended KV-SSD use cases.
7 The Future
Table 6. Comparison between several related SSD interface enhancements for data placement[64, 131].
So far, we have discussed several enhancements to storage abstractions, including TRIM, Multi-stream, FDP,
Open-channel SSD, ZNS, computational storage, KV-SSD, and byte-addressable SSD. TRIM is widely adopted by
modern SSDs. Multi-stream and Open-channel SSD are now obsolete in practice. ZNS, computational storage,
and KV-SSD are established research areas. FDP and byte-addressable SSD with CXL are on the rise. In this
section, we would like to provide our thoughts on the following questions:
Exploring the potential for storage abstraction extensions to enter consumer markets can also drive increased
demand. Zoned storage, for instance, has been successfully integrated into the UFS standard, which is pre-
dominantly used in smartphones. FDP and CXL can adopt similar strategies to target consumer markets. For
FDP, gaining filesystem support is a prerequisite to ensure seamless integration without requiring user- or
application-side configuration. An FDP-enabled SSD with filesystem support should exhibit a similar or better write
amplification factor (WAF) and SSD longevity compared to FStream[134], a worthy pursuit given the declining
trend of maximum SSD program-erase cycles[91, 118]. CXL may also find applications in personal computers or
even smartphones, especially as interest grows in on-device machine learning model training[76, 141, 156]. The
limited resources of personal computers and smartphones present a significant opportunity for CXL to overcome
the memory wall on these devices.
With backing from both creators and users, FDP and CXL hold significant potential from their inception.
However, numerous questions remain before their widespread adoption can be realized. Addressing these
questions can provide valuable insights for other researchers in the field. For instance, conducting feasibility
studies can identify the best use cases for these new enhancements[82, 162]. Another approach is to analyze
similar past enhancements and apply those insights to the current context. Comparing related enhancements can
also help determine if the new enhancement offers greater benefits. By thoroughly studying feasibility, benefits,
and demand, we hope that FDP and CXL will achieve widespread industry adoption in the near future.
7.4.1 Learning from Universal Flash Storage: UFS has been a critical interface for mobile and embedded storage
solutions. Mobile and embedded devices are more energy-aware than traditional computers, which means UFS
has to be energy-efficient. While focusing on performance improvements, new UFS standards also aim for lower
power consumption. For example, UFS 4.0 has a power efficiency that is reportedly 46% better than UFS 3.1 while
doubling the read performance[138]. Although our paper mostly focuses on the storage interface for traditional
computers (especially NVMe), we are seeing more interactions between UFS and NVMe where one learns from
the other (e.g., integrating Multi-stream and ZNS support on UFS). As UFS is mostly used for mobile devices, it
has to be power-smart; storage protocols for traditional computers may also learn from UFS in the future in light
of growing concerns about energy efficiency.
7.4.2 Integrating Other Enhancements with Compute Express Link: CXL breaks the memory wall by utilizing
SSDs as main memory. Although CXL is now a popular topic, there are still many questions to answer. SSDs
have internal activities like garbage collection and wear leveling, which occupy internal bandwidth and CPU
resources[78, 167]. Without communication between the CXL components and the SSD, CXL performance
may be severely impacted by these SSD-internal tasks. SSDs are already inherently slower than DRAM, and
researchers should do everything they can to make SSDs faster when they are used as main memory. A possible
direction is to introduce storage abstraction extensions so that CXL components can coordinate with SSDs
to prevent these internal tasks when possible. A similar design has been done for SSD RAID systems[109], and
we believe such a design would be strongly beneficial for CXL performance. CXL also supports different
interleave granularities and interleave ways when placing data within an address range[20]. Researching the
impact of different interleave granularities and ways is also needed, since they can result in tradeoffs similar to
those seen in the design of cache systems.
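As a purely hypothetical illustration of such coordination (no such NVMe or CXL command exists today), the sketch below lets the host hint that a latency-critical window is active, and the device defers garbage collection unless free space becomes critical.

/* Hypothetical host-device coordination: defer background GC during
 * latency-critical CXL memory accesses unless space is running out. */
#include <stdio.h>

static int free_blocks = 20;
static int gc_deferred;                     /* set by the (hypothetical) host hint */

static void hint_latency_critical(int on) { gc_deferred = on; }

static void maybe_run_gc(void) {
    int critical = free_blocks < 5;         /* device must reclaim space anyway */
    if (gc_deferred && !critical) {
        printf("GC deferred (free=%d)\n", free_blocks);
        return;
    }
    printf("GC runs (free=%d)\n", free_blocks);
    free_blocks += 3;                       /* pretend some blocks were reclaimed */
}

int main(void) {
    hint_latency_critical(1);               /* memory-mode accesses are in flight */
    for (int i = 0; i < 3; i++) { free_blocks -= 6; maybe_run_gc(); }
    hint_latency_critical(0);               /* window over: GC may proceed */
    maybe_run_gc();
    return 0;
}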
Another possible research direction is to directly integrate SmartSSDs with CXL. The CXL Consortium defined
three different types of CXL devices; CXL Type 2 (T2) devices are accelerator-like devices
with their own memory that hosts can access[40]. This is coincidentally similar to SmartSSDs, which can
perform computations on the data they store. The host could then offload time-insensitive tasks to
a CXL-enabled SmartSSD when needed.
Flexible Data Placement and Zoned Namespace storage may also help CXL. We have discussed the advantages
brought by ZNSwap, which uses ZNS SSDs as memory swap space[31, 32]. By placing data carefully, FDP and
ZNS devices should perform better than traditional SSDs. Performance isolation between different tenants may
also be achieved with FDP. Considering a CXL memory pool shared by multiple tenants, separating the data of
different tenants will be beneficial for better garbage collection efficiency and performance isolation.
This could be a huge task because CXL systems can be complex, and there are also different
design choices (e.g., segregation by program context[96], process, or (virtual) machine). Still, this approach should
improve performance and reduce disturbance from noisy neighbors.
8 Conclusion
In this survey, we discussed the shortcomings of the traditional storage abstraction and explored several enhance-
ments that address them. We grouped them into four different categories, each with a different philosophy.
For each enhancement, we discussed its history and relationships with other enhancements from various per-
spectives, including source code interpretation and design concept comparison, along with the ecosystem and
research efforts made by both industry and academia. Finally, we identified the future of existing and emerging
enhancements by reflecting on partially failed attempts and proposing possible new research directions. We hope
this paper lays a cornerstone for exploring the current landscape and inspires future research on enhancements
to the SSD storage abstraction.
References
[1] 2007. Notification of Deleted Data Proposal for ATA8-ACS2. https://t13.org/system/files/Documents/2007/e07154r0-Notification%20for%20Deleted%20Data%20Proposal%20for%20ATA-ACS2_2.doc
[2] 2009. Serial ATA: Meeting Storage Needs Today and Tomorrow. https://web.archive.org/web/20120417133358/http://www.serialata.
org/documents/SATA-Rev-30-Presentation.pdf. (Archived on 04/17/2012, Accessed on 01/21/2024).
[3] 2011. NVMe 1.0. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_0e.pdf
[4] 2012. NVMe 1.1. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_1b-1.pdf
[5] 2014. NVMe 1.2. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_2a.pdf
[6] 2015. lightnvm: Support for Open-Channel SSDs. https://github.com/torvalds/linux/commit/
cd9e9808d18fe7107c306f6e71c8be7230ee42b4
[7] 2015. OpenChannelSSD. http://lightnvm.io
[8] 2017. f2fs: apply write hints to select the type of segments for buffered write. https://github.com/torvalds/linux/commit/a02cd4229e298aadbe8f5cf286edee8058d87116
[9] 2017. Merge branch ’for-4.13/block’ of git://git.kernel.dk/linux-block. https://github.com/torvalds/linux/commit/
c6b1e36c8fa04a6680c44fe0321d0370400e90b6
[10] 2017. NVMe 1.3. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3a-20171024_ratified.pdf
[11] 2019. Compute Express Link. https://docs.wixstatic.com/ugd/0c1418_d9878707bbb7427786b70c3c91d5fbd1.pdf
[12] 2019. Key Value Storage API Specification Version 1.0. https://www.snia.org/sites/default/files/technical-work/kvsapi/release/SNIA-Key-Value-Storage-API-v1.0.pdf
[13] 2019. NVMe 1.4. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf
[14] 2020. Compute Express Link. https://www.computeexpresslink.org/_files/ugd/0c1418_14c5283e7f3e40f9b2955c7d0f60bebe.pdf
[15] 2020. Key Value Storage API Specification Version 1.1. https://www.snia.org/sites/default/files/technical-work/kvsapi/release/SNIA-Key-Value-Storage-API-v1.1.pdf
[16] 2021. NVM Express Key-Value Command Set Specification 1.0. https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Value-Command-Set-Specification-1.0-2021.06.02-Ratified-1.pdf. https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Value-Command-Set-Specification-1.0-2021.06.02-Ratified-1.pdf (Accessed on 01/22/2024).
[40] Kurtis Bowman. 2023. Compute Express Link (CXL) Device Ecosystem and Usage Models. https://computeexpresslink.org/wp-
content/uploads/2023/12/CXL_FMS-Panel-2023_FINAL.pdf. (Accessed on 07/27/2024).
[41] Judy Brock, Bill Martin, Javier Gonzalez, Klaus B. Jensen, Fred Knight, Yoni Shternhell, Matias Bjørling, and Paul Suhler. 2022. TP4076a Zoned Random Write Area 2022.01.19 Ratified. https://nvmexpress.org/wp-content/uploads/NVM-Express-2.0-Ratified-TPs_20230111.zip. (Accessed on 02/27/2024).
[42] Neil Brown. 2017. A block layer introduction part 1: the bio layer [LWN.net]. https://lwn.net/Articles/736534/. (Accessed on 01/20/2024).
[43] Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, Zhenjun Liu, Feng Zhu, and Tong Zhang. 2020. POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. In 18th USENIX Conference on File and Storage Technologies (FAST ’20). USENIX Association, Santa Clara, CA, 29–41. https://www.usenix.org/conference/fast20/presentation/cao-wei
[44] Zhichao Cao, Siying Dong, Sagar Vemuri, and David H.C. Du. 2020. Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook. In 18th USENIX Conference on File and Storage Technologies (FAST ’20). USENIX Association, Santa Clara, CA, 209–223. https://www.usenix.org/conference/fast20/presentation/cao-zhichao
[45] Marc Carino. 2013. [PATCH v3 0/3] Introduce new SATA queued commands - Marc C. https://lore.kernel.org/all/1376023752-3105-1-
git-send-email-marc.ceeeee@gmail.com/. (Accessed on 06/22/2024).
[46] Chandranil Chakraborttii and Heiner Litz. 2021. Reducing Write Amplification in Flash by Death-Time Prediction of Logical Block Addresses. In Proceedings of the 14th ACM International Conference on Systems and Storage (Haifa, Israel) (SYSTOR ’21). Association for Computing Machinery, New York, NY, USA, Article 11, 12 pages. https://doi.org/10.1145/3456727.3463784
[47] Da-Wei Chang, Hsin-Hung Chen, and Wei-Jian Su. 2015. VSSD: Performance Isolation in a Solid-State Drive. ACM Trans. Des. Autom.
Electron. Syst. 20, 4, Article 51 (09 2015), 33 pages. https://doi.org/10.1145/2755560
[48] Young-in Choi and Sungyong Ahn. 2022. Separating the File System Journal to Reduce Write Amplification of Garbage Collection on ZNS SSDs. Journal of Multimedia Information System 9, 4 (2022), 261–268. https://doi.org/10.33851/JMIS.2022.9.4.261
[49] CXL Consortium. 2023. Our Members - Compute Express Link. https://computeexpresslink.org/our-members/. (Accessed on
07/02/2024).
[50] Jonathan Corbet. 2010. The best way to throw blocks away [LWN.net]. https://lwn.net/Articles/417809/. (Accessed on 06/24/2024).
[51] Lukas Czerner. 2010. [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support - Lukas Czerner. https://lore.kernel.org/linux-ext4/1285342559-
16424-1-git-send-email-lczerner@redhat.com/. (Accessed on 06/22/2024).
[52] Lukas Czerner. 2010. [PATCH 1/3] Add ioctl FITRIM. - Lukas Czerner. https://lore.kernel.org/linux-ext4/1281094276-11377-2-git-send-
email-lczerner@redhat.com/. (Accessed on 06/22/2024).
[53] Emily Desjardins. 2011. JEDEC Announces Publication of Universal Flash Storage (UFS) Standard | JEDEC. https://www.jedec.org/news/pressreleases/jedec-announces-publication-universal-flash-storage-ufs-standard. https://www.jedec.org/news/pressreleases/jedec-announces-publication-universal-flash-storage-ufs-standard (Accessed on 02/05/2024).
[54] Peter Desnoyers. 2014. Analytic Models of SSD Write Performance. ACM Trans. Storage 10, 2, Article 8 (03 2014), 25 pages. https:
//doi.org/10.1145/2577384
[55] Sarah M. Diesburg and An-I Andy Wang. 2010. A survey of confidential data storage and deletion methods. ACM Comput. Surv. 43, 1, Article 2 (12 2010), 37 pages. https://doi.org/10.1145/1824795.1824797
[56] Western Digital. 2023. linux/fs/zonefs/zonefs.h at f2661062f16b2de5d7b6a5c42a9a5c96326b8454 · torvalds/linux. https://github.com/
torvalds/linux/blob/f2661062f16b2de5d7b6a5c42a9a5c96326b8454/fs/zonefs/zonefs.h#L126. (Accessed on 06/24/2024).
[57] Western Digital. 2024. btrfs | Zoned Storage. https://zonedstorage.io/docs/filesystems/btrfs. (Accessed on 06/24/2024).
[58] Western Digital. 2024. f2fs | Zoned Storage. https://zonedstorage.io/docs/filesystems/f2fs. (Accessed on 06/24/2024).
[59] Western Digital. 2024. ZoneFS | Zoned Storage. https://zonedstorage.io/docs/filesystems/zonefs. (Accessed on 06/24/2024).
[60] Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, and David J. DeWitt. 2013. Query processing on
smart SSDs: opportunities and challenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of
Data (New York, New York, USA) (SIGMOD ’13). Association for Computing Machinery, New York, NY, USA, 1221ś1230. https:
//doi.org/10.1145/2463676.2465295
[61] Carl Duffy, Jaehoon Shim, Sang-Hoon Kim, and Jin-Soo Kim. 2023. Dotori: A Key-Value SSD Based KV Store. Proc. VLDB Endow. 16, 6 (02 2023), 1560–1572. https://doi.org/10.14778/3583140.3583167
[62] Jake Edge. 2019. Issues around discard [LWN.net]. https://lwn.net/Articles/787272/. (Accessed on 06/23/2024).
[63] Jake Edge. 2023. Zoned storage and filesystems [LWN.net]. https://lwn.net/Articles/932748/. (Accessed on 06/24/2024).
[64] Javier González. 2023. FDP and ZNS for NAND Data Placement: Landscape, Trade-Offs, and Direction. Flash Memory Summit 2023 https://www.flashmemorysummit.com/English/Conference/Program_at_a_Glance_Tue.html. https://www.linkedin.com/posts/javigon_nand-data-placement-landscape-trade-offs-activity-7096569850801606656-wQR- (Accessed on 02/27/2024).
[65] Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon,
Sangyeun Cho, Jaeheon Jeong, and Duckhyun Chang. 2016. Biscuit: a framework for near-data processing of big data workloads. In
Proceedings of the 43rd International Symposium on Computer Architecture (Seoul, Republic of Korea) (ISCA ’16). IEEE Press, 153ś165.
https://doi.org/10.1109/ISCA.2016.23
[66] Kyuhwa Han, Hyunho Gwak, Dongkun Shin, and Jooyoung Hwang. 2021. ZNS+: Advanced Zoned Namespace Interface for Supporting
In-Storage Zone Compaction. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’21). USENIX
Association, 147ś162. https://www.usenix.org/conference/osdi21/presentation/han
[67] Yejin Han, Myunghoon Oh, Jaedong Lee, Seehwan Yoo, Bryan S. Kim, and Jongmoo Choi. 2023. Achieving Performance Isolation in
Docker Environments with ZNS SSDs. In 2023 IEEE 12th Non-Volatile Memory Systems and Applications Symposium (NVMSA ’23). 25ś31.
https://doi.org/10.1109/NVMSA58981.2023.00016
[68] Niclas Hedam, Morten Tychsen Clausen, Philippe Bonnet, Sangjin Lee, and Ken Friis Larsen. 2023. Delilah: eBPF-offload on Computational Storage. In Proceedings of the 19th International Workshop on Data Management on New Hardware (DaMoN ’23). 70–76. https://dl.acm.org/doi/abs/10.1145/3592980.3595319
[69] Christoph Hellwig. 2022. Merge tag ‘for-5.18/write-streams-2022-03-18’ of git://git.kernel.dk/. . . . https://github.
com/torvalds/linux/commit/561593a048d7d6915889706f4b503a65435c033a. https://github.com/torvalds/linux/commit/
561593a048d7d6915889706f4b503a65435c033a [Accessed 08-Dec-2023].
[70] Hans Holmberg. 2021. Initial commmit for ZenFS - westerndigitalcorporation/zenfs@7f8e885. https://github.com/
westerndigitalcorporation/zenfs/commit/7f8e885d670205cfdc91a8b2b34ca5b492f42d43. (Accessed on 01/29/2024).
[71] Jian Huang, Anirudh Badam, Laura Caulfield, Suman Nath, Sudipta Sengupta, Bikash Sharma, and Moinuddin K. Qureshi. 2017. FlashBlox: Achieving Both Performance Isolation and Uniform Lifetime for Virtualized SSDs. In 15th USENIX Conference on File and Storage Technologies (FAST ’17). USENIX Association, Santa Clara, CA, 375–390. https://www.usenix.org/conference/fast17/technical-sessions/presentation/huang
[72] Joo-Young Hwang, Seokhwan Kim, Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, Sangyeun Cho, and Youjip Won.
2024. ZMS: Zone Abstraction for Mobile Flash Storage. In 2024 USENIX Annual Technical Conference (ATC ’24). USENIX Association,
Santa Clara, CA, 173ś189. https://www.usenix.org/conference/atc24/presentation/hwang
[73] Choulseung Hyun, Jongmoo Choi, Donghee Lee, and Sam H Noh. 2011. To TRIM or not to TRIM: Judicious triming for solid state drives.
In Poster presentation in the 23rd ACM Symposium on Operating Systems Principles (SIGOPS ’11). https://sigops.org/s/conferences/sosp/
2011/posters/summaries/sosp11-inal16.pdf
[74] Junsu Im, Jinwook Bae, Chanwoo Chung, Arvind, and Sungjin Lee. 2020. PinK: High-speed In-storage Key-value Store with Bounded
Tails. In 2020 USENIX Annual Technical Conference (ATC ’20). USENIX Association, 173ś187. https://www.usenix.org/conference/
atc20/presentation/im
[75] Minwoo Im, Kyungsu Kang, and Heonyoung Yeom. 2022. Accelerating RocksDB for Small-Zone ZNS SSDs by Parallel I/O Mechanism. In
Proceedings of the 23rd International Middleware Conference Industrial Track (Quebec, Quebec City, Canada) (Middleware ’22). Association
for Computing Machinery, New York, NY, USA, 15ś21. https://doi.org/10.1145/3564695.3564774
[76] Apple Inc. 2024. Introducing Apple’s On-Device and Server Foundation Models - Apple Machine Learning Research. https://
machinelearning.apple.com/research/introducing-apple-foundation-models. (Accessed on 06/25/2024).
[77] JEDEC. 2023. Zoned Storage for UFS | JEDEC. https://www.jedec.org/standards-documents/docs/jesd220-5. (Accessed on 06/24/2024).
[78] Ziyang Jiao, Janki Bhimani, and Bryan S. Kim. 2022. Wear leveling in SSDs considered harmful. In Proceedings of the 14th ACM Workshop
on Hot Topics in Storage and File Systems (Virtual Event) (HotStorage ’22). Association for Computing Machinery, New York, NY, USA,
72ś78. https://doi.org/10.1145/3538643.3539750
[79] Yanqin Jin, Hung-Wei Tseng, Yannis Papakonstantinou, and Steven Swanson. 2017. KAML: A Flexible, High-Performance Key-Value
SSD. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA ’17). 373ś384. https://doi.org/10.1109/
HPCA.2017.15
[80] Insoon Jo, Duck-Ho Bae, Andre S. Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel D. G. Lee, and Jaeheon Jeong. 2016. YourSQL: a
high-performance database system leveraging in-storage computing. Proc. VLDB Endow. 9, 12 (08 2016), 924ś935. https://doi.org/10.
14778/2994509.2994512
[81] Jeeyoon Jung and Dongkun Shin. 2022. Lifetime-Leveling LSM-Tree Compaction for ZNS SSD. In Proceedings of the 14th ACM Workshop
on Hot Topics in Storage and File Systems (Virtual Event) (HotStorage ’22). Association for Computing Machinery, New York, NY, USA,
100ś105. https://doi.org/10.1145/3538643.3539741
[82] Myoungsoo Jung. 2022. Hello Bytes, Bye Blocks: PCIe Storage Meets Compute Express Link for Memory Expansion (CXL-SSD).
In Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems (Virtual Event) (HotStorage ’22). Association for
Computing Machinery, New York, NY, USA, 45ś51. https://doi.org/10.1145/3538643.3539745
[83] Myoungsoo Jung and Mahmut Kandemir. 2013. Revisiting widely held SSD expectations and rethinking system-level implications.
SIGMETRICS Perform. Eval. Rev. 41, 1 (06 2013), 203ś216. https://doi.org/10.1145/2494232.2465548
[84] Myoungsoo Jung and Mahmut Kandemir. 2013. Revisiting widely held SSD expectations and rethinking system-level implications. In
Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems (Pittsburgh, PA, USA)
(SIGMETRICS ’13). Association for Computing Machinery, New York, NY, USA, 203ś216. https://doi.org/10.1145/2465529.2465548
[85] Min-Gyo Jung, Chang-Gyu Lee, Donggyu Park, Sungyong Park, Jungki Noh, Woosuk Chung, Kyoung Park, and Youngjae Kim. 2021.
GPUKV: An Integrated Framework with KVSSD and GPU through P2P Communication Support. In Proceedings of the 36th Annual
ACM Symposium on Applied Computing (Virtual Event, Republic of Korea) (SAC ’21). Association for Computing Machinery, New York,
NY, USA, 1156ś1164. https://doi.org/10.1145/3412841.3441990
[86] Jeong-Uk Kang, Jeeseok Hyun, Hyunjoo Maeng, and Sangyeun Cho. 2014. The Multi-streamed Solid-State Drive. In 6th USENIX
Workshop on Hot Topics in Storage and File Systems (HotStorage ’14). USENIX Association, Philadelphia, PA. https://www.usenix.org/
conference/hotstorage14/workshop-program/presentation/kang
[87] Woon-Hak Kang, Sang-Won Lee, Bongki Moon, Gi-Hwan Oh, and Changwoo Min. 2013. X-FTL: Transactional FTL for SQLite Databases.
In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (SIGMOD ’13).
Association for Computing Machinery, New York, NY, USA, 97ś108. https://doi.org/10.1145/2463676.2465326
[88] Yangwook Kang, Yang-suk Kee, Ethan L. Miller, and Chanik Park. 2013. Enabling cost-effective data processing with smart SSD. In 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST ’13). 1–12. https://doi.org/10.1109/MSST.2013.6558444
[89] Yangwook Kang, Rekha Pitchumani, Pratik Mishra, Yang-suk Kee, Francisco Londono, Sangyoon Oh, Jongyeol Lee, and Daniel D. G.
Lee. 2019. Towards Building a High-Performance, Scale-in Key-Value Storage System. In Proceedings of the 12th ACM International
Conference on Systems and Storage (Haifa, Israel) (SYSTOR ’19). Association for Computing Machinery, New York, NY, USA, 144ś154.
https://doi.org/10.1145/3319647.3325831
[90] Yunji Kang and Dongkun Shin. 2021. mStream: stream management for mobile file system using Android file contexts. In Proceedings of the 36th Annual ACM Symposium on Applied Computing (Virtual Event, Republic of Korea) (SAC ’21). Association for Computing Machinery, New York, NY, USA, 1203–1208. https://doi.org/10.1145/3412841.3442115
[91] Bryan S. Kim, Jongmoo Choi, and Sang Lyul Min. 2019. Design Tradeoffs for SSD Reliability. In 17th USENIX Conference on File and Storage Technologies (FAST ’19). USENIX Association, Boston, MA, 281–294. https://www.usenix.org/conference/fast19/presentation/kim-bryan
[92] Jaegeuk Kim. 2015. [PATCH 5/5] f2fs: introduce a batched trim - Jaegeuk Kim. https://lore.kernel.org/all/1422401503-4769-5-git-send-
email-jaegeuk@kernel.org/. (Accessed on 06/23/2024).
[93] Joohyun Kim, Haesung Kim, Seongjin Lee, and Youjip Won. 2010. FTL design for TRIM command. In The Fifth International Workshop
on Software Support for Portable Storage (IWSSPS ’10). 7ś12.
[94] Jaeho Kim, Donghee Lee, and Sam H. Noh. 2015. Towards SLO Complying SSDs Through OPS Isolation. In 13th USENIX Conference on
File and Storage Technologies (FAST ’15). USENIX Association, Santa Clara, CA, 183ś189. https://www.usenix.org/conference/fast15/
technical-sessions/presentation/kim_jaeho
[95] Taejin Kim, Sangwook Shane Hahn, Sungjin Lee, Jooyoung Hwang, Jongyoul Lee, and Jihong Kim. 2018. PCStream: Automatic Stream
Allocation Using Program Contexts. In 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage ’18). USENIX
Association, Boston, MA. https://www.usenix.org/conference/hotstorage18/presentation/kim-taejin
[96] Taejin Kim, Duwon Hong, Sangwook Shane Hahn, Myoungjun Chun, Sungjin Lee, Jooyoung Hwang, Jongyoul Lee, and Jihong Kim. 2019.
Fully Automatic Stream Management for Multi-Streamed SSDs Using Program Contexts. In 17th USENIX Conference on File and Storage
Technologies (FAST ’19). USENIX Association, Boston, MA, 295ś308. https://www.usenix.org/conference/fast19/presentation/kim-taejin
[97] Thomas Kim, Jekyeom Jeon, Nikhil Arora, Huaicheng Li, Michael Kaminsky, David G. Andersen, Gregory R. Ganger, George Amvrosiadis, and Matias Bjørling. 2023. RAIZN: Redundant Array of Independent Zoned Namespaces. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 660–673. https://doi.org/10.1145/3575693.3575746
[98] Jinhyung Koo, Junsu Im, Jooyoung Song, Juhyung Park, Eunji Lee, Bryan S. Kim, and Sungjin Lee. 2021. Modernizing File System through
In-Storage Indexing. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’21). USENIX Association,
75ś92. https://www.usenix.org/conference/osdi21/presentation/koo
[99] Dongup Kwon, Dongryeong Kim, Junehyuk Boo, Wonsik Lee, and Jangwoo Kim. 2021. A Fast and Flexible Hardware-based Virtualization
Mechanism for Computational Storage Devices. In 2021 USENIX Annual Technical Conference (ATC ’21). USENIX Association, 729ś743.
https://www.usenix.org/conference/atc21/presentation/kwon
[100] Miryeong Kwon, Donghyun Gouk, Changrim Lee, Byounggeun Kim, Jooyoung Hwang, and Myoungsoo Jung. 2020. DC-Store:
Eliminating Noisy Neighbor Containers using Deterministic I/O Performance and Resource Isolation. In 18th USENIX Conference on
File and Storage Technologies (FAST ’20). USENIX Association, Santa Clara, CA, 183ś191. https://www.usenix.org/conference/fast20/
presentation/kwon
[101] Miryeong Kwon, Sangwon Lee, and Myoungsoo Jung. 2023. Cache in Hand: Expander-Driven CXL Prefetcher for Next Generation
CXL-SSD. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems (Boston, MA, USA) (HotStorage ’23).
Association for Computing Machinery, New York, NY, USA, 24ś30. https://doi.org/10.1145/3599691.3603406
[102] Hee-Rock Lee, Chang-Gyu Lee, Seungjin Lee, and Youngjae Kim. 2022. Compaction-Aware Zone Allocation for LSM Based Key-Value
Store on ZNS SSDs. In Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems (Virtual Event) (HotStorage ’22).
Association for Computing Machinery, New York, NY, USA, 93ś99. https://doi.org/10.1145/3538643.3539743
[103] Jongsung Lee, Donguk Kim, and Jae W. Lee. 2023. WALTZ: Leveraging Zone Append to Tighten the Tail Latency of LSM Tree on ZNS
SSD. Proc. VLDB Endow. 16, 11 (07 2023), 2884ś2896. https://doi.org/10.14778/3611479.3611495
[104] Kitae Lee, Dong Hyun Kang, Daeho Jeong, and Young Ik Eom. 2018. Lazy TRIM: Optimizing the journaling overhead caused
by TRIM commands on Ext4 ile system. In 2018 IEEE International Conference on Consumer Electronics (ICCE ’18). 1ś3. https:
//doi.org/10.1109/ICCE.2018.8326258
[105] Manjong Lee. 2022. Re: [PATCH 2/2] block: remove the per-bio/request write hint. - Manjong Lee. https://lore.kernel.org/all/
20220309133119.6915-1-mj0123.lee@samsung.com/. https://lore.kernel.org/all/20220309133119.6915-1-mj0123.lee@samsung.com/
[Accessed 20-Nov-2022].
[106] Sungjin Lee, Ming Liu, Sangwoo Jun, Shuotao Xu, Jihong Kim, and Arvind. 2016. Application-Managed Flash. In 14th USENIX Conference
on File and Storage Technologies (FAST ’16). USENIX Association, Santa Clara, CA, 339ś353. https://www.usenix.org/conference/fast16/
technical-sessions/presentation/lee
[107] Young-Sik Lee, Sang-Hoon Kim, Jin-Soo Kim, Jaesoo Lee, Chanik Park, and Seungryoul Maeng. 2013. OSSD: A case for object-based
solid state drives. In 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST ’13). 1ś13. https://doi.org/10.1109/
MSST.2013.6558448
[108] Cangyuan Li, Ying Wang, Cheng Liu, Shengwen Liang, Huawei Li, and Xiaowei Li. 2021. GLIST: Towards In-Storage Graph Learning.
In 2021 USENIX Annual Technical Conference (ATC ’21). USENIX Association, 225ś238. https://www.usenix.org/conference/atc21/
presentation/li-cangyuan
[109] Huaicheng Li, Martin L. Putra, Ronald Shi, Xing Lin, Gregory R. Ganger, and Haryadi S. Gunawi. 2021. IODA: A Host/Device
Co-Design for Strong Predictability Contract on Modern Flash Storage. In Proceedings of the ACM SIGOPS 28th Symposium on
Operating Systems Principles (Virtual Event, Germany) (SOSP ’21). Association for Computing Machinery, New York, NY, USA, 263ś279.
https://doi.org/10.1145/3477132.3483573
[110] Shaohua Li. 2017. Stream - facebook/rocksdb@eefd75a. https://github.com/facebook/rocksdb/
commit/eefd75a228fc2c50174c0a306918c73ded22ace7. https://github.com/facebook/rocksdb/commit/
eefd75a228fc2c50174c0a306918c73ded22ace7 (Accessed on 01/29/2024).
[111] Sangwoo Lim and Dongkun Shin. 2019. DStream: Dynamic Memory Resizing for Multi-Streamed SSDs. In 2019 34th International
Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC ’19). 1ś4. https://doi.org/10.1109/ITC-CSCC.2019.
8793432
[112] Biyong Liu, Yuan Xia, Xueliang Wei, and Wei Tong. 2023. LifetimeKV: Narrowing the Lifetime Gap of SSTs in LSMT-based KV Stores
for ZNS SSDs. In 2023 IEEE 41st International Conference on Computer Design (ICCD ’23). 300ś307. https://doi.org/10.1109/ICCD58817.
2023.00053
[113] Jiahao Liu, Fang Wang, and Dan Feng. 2019. CostPI: Cost-Effective Performance Isolation for Shared NVMe SSDs. In Proceedings of the 48th International Conference on Parallel Processing (Kyoto, Japan) (ICPP ’19). Association for Computing Machinery, New York, NY, USA, Article 25, 10 pages. https://doi.org/10.1145/3337821.3337879
[114] Canonical Ltd. 2019. Ubuntu Manpage: fstrim - discard unused blocks on a mounted filesystem. https://manpages.ubuntu.com/manpages/xenial/en/man8/fstrim.8.html. (Accessed on 06/22/2024).
[115] Mingchen Lu, Peiquan Jin, Xiaoliang Wang, Yongping Luo, and Kuankuan Guo. 2023. ZoneKV: A Space-Efficient Key-Value Store for ZNS SSDs. In 2023 60th ACM/IEEE Design Automation Conference (DAC ’23). 1–6. https://doi.org/10.1109/DAC56929.2023.10247926
[116] Umesh Maheshwari. 2020. StripeFinder: Erasure Coding of Small Objects Over Key-Value Storage Devices (An Uphill Battle). In 12th
USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage ’20). USENIX Association. https://www.usenix.org/conference/
hotstorage20/presentation/maheshwari
[117] Umesh Maheshwari. 2021. From Blocks to Rocks: A Natural Extension of Zoned Namespaces. In Proceedings of the 13th ACM Workshop
on Hot Topics in Storage and File Systems (Virtual, USA) (HotStorage ’21). Association for Computing Machinery, New York, NY, USA,
21ś27. https://doi.org/10.1145/3465332.3470870
[118] Stathis Maneas, Kaveh Mahdaviani, Tim Emami, and Bianca Schroeder. 2020. A Study of SSD Reliability in Large Scale Enterprise
Storage Deployments. In 18th USENIX Conference on File and Storage Technologies (FAST ’20). USENIX Association, Santa Clara, CA,
137ś149. https://www.usenix.org/conference/fast20/presentation/maneas
[119] Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu
Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu. 2022. GenStore: a
high-performance in-storage processing system for genome sequence analysis. In Proceedings of the 27th ACM International Conference
on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’22). Association for
Computing Machinery, New York, NY, USA, 635ś654. https://doi.org/10.1145/3503222.3507702
[120] Bill Martin, Judy Brock, Dan Helmick, Robert Moss, Mike Allison, Benjamin Lim, Jiwon Chang, Ross Stenfort, Young Ahn, Wei Zhang, Sumit Gupta, Amber Huffman, Chris Sabol, Paul Suhler, Mahinder Saluja, John Geldman, Mark Carlson, Walt Hubis, Dan Hubbard, Steven Wells, Santosh Kumar, Andrés Baez, Kwok Kong, Erich Haratsch, Anu Murthy, Hyunmo Kang, Yeong-Jae Woo, Mike James, and Yoni Shternhell. 2022. TP4146 Flexible Data Placement 2022.11.30 Ratified. https://nvmexpress.org/wp-content/uploads/NVM-Express-
Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 31094–31116. https://proceedings.mlr.press/v202/sheng23a.html
[142] Yoni Shternhell and Matias Bjørling. 2022. TP4093a Zone Relative Data Lifetime Hint 2022.08.14 Ratified. https://nvmexpress.org/wp-content/uploads/NVM-Express-2.0-Ratified-TPs_20230111.zip. (Accessed on 02/27/2024).
[143] Muthian Sivathanu, Lakshmi N Bairavasundaram, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. 2004. Life or Death at
Block-Level. In OSDI (OSDI ’04, Vol. 4). 26ś26. https://www.usenix.org/legacy/event/osdi04/tech/full_papers/sivathanu/sivathanu.pdf
[144] Eric Slivka. 2011. Mac OS X Lion Roundup: Recovery Partitions, TRIM Support, Core 2 Duo Minimum, Focus on Security - MacRu-
mors. https://www.macrumors.com/2011/02/25/mac-os-x-lion-roundup-recovery-partitions-trim-support-core-2-duo-minimum-
focus-on-security/. https://www.macrumors.com/2011/02/25/mac-os-x-lion-roundup-recovery-partitions-trim-support-core-2-duo-
minimum-focus-on-security/ (Accessed on 01/28/2024).
[145] David Sterba. 2020. Merge tag ’for-5.6-tag’ of git://git.kernel.org/pub/scm/linux/kernel/... · torvalds/linux@81a046b. https://github.
com/torvalds/linux/commit/81a046b18b331ed6192e6fd9f6d12a1f18058cf. (Accessed on 06/23/2024).
[146] David Sterba. 2022. Re: Using async discard by default with SSDs? - David Sterba. https://lore.kernel.org/linux-btrfs/20220726213628.
GO13489@twin.jikos.cz/. (Accessed on 06/23/2024).
[147] David Sterba. 2023. Trim/discard – BTRFS documentation. https://btrfs.readthedocs.io/en/latest/Trim.html. (Accessed on 06/23/2024).
[148] Devesh Tiwari, Simona Boboila, Sudharshan Vazhkudai, Youngjae Kim, Xiaosong Ma, Peter Desnoyers, and Yan Solihin. 2013. Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines. In 11th USENIX Conference on File and Storage Technologies (FAST ’13). USENIX Association, San Jose, CA, 119–132. https://www.usenix.org/conference/fast13/technical-sessions/presentation/tiwari
[149] Linus Torvalds. 2021. Merge tag ’cxl-for-5.12’ of git://git.kernel.org/pub/scm/linux/kernel. . . . https://github.com/
torvalds/linux/commit/825d1508750c0cad13e5da564d47a6d59c7612d6. https://github.com/torvalds/linux/commit/
825d1508750c0cad13e5da564d47a6d59c7612d6 [Accessed 08-Dec-2023].
[150] Theodore Ts’o. 2016. Solved - SSD Trim Maintenance | The FreeBSD Forums. https://forums.freebsd.org/threads/ssd-trim-maintenance.
56951/#post-328912. (Accessed on 06/22/2024).
[151] Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. An eicient design and
implementation of LSM-tree based key-value store on open-channel SSD. In Proceedings of the Ninth European Conference on Computer
Systems (Amsterdam, The Netherlands) (EuroSys ’14). Association for Computing Machinery, New York, NY, USA, Article 16, 14 pages.
https://doi.org/10.1145/2592798.2592804
[152] Qiuping Wang and Patrick P. C. Lee. 2023. ZapRAID: Toward High-Performance RAID for ZNS SSDs via Zone Append. In Proceedings
of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems (Seoul, Republic of Korea) (APSys ’23). Association for Computing Machinery,
New York, NY, USA, 24–29. https://doi.org/10.1145/3609510.3609810
[153] Wei-Lin Wang, Tseng-Yi Chen, Yuan-Hao Chang, Hsin-Wen Wei, and Wei-Kuan Shih. 2020. How to Cut Out Expired Data with Nearly
Zero Overhead for Solid-State Drives. In 2020 57th ACM/IEEE Design Automation Conference (DAC ’20). 1–6. https://doi.org/10.1109/
DAC18072.2020.9218610
[154] Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David Brooks, and Gu-Yeon Wei. 2021. RecSSD: near
data processing for solid state drive based recommendation inference. In Proceedings of the 26th ACM International Conference on
Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS ’21). Association for Computing
Machinery, New York, NY, USA, 717–729. https://doi.org/10.1145/3445814.3446763
[155] SerialATA Workgroup. 2001. Serial ATA: High Speed Serialized AT Attachment Revision 1.0. https://www.seagate.com/support/disc/
manuals/sata/sata_im.pdf. (Accessed on 01/21/2024).
[156] Zheng Xu and Yanxiang Zhang. 2024. Advances in private training for production on-device language models. https://research.google/
blog/advances-in-private-training-for-production-on-device-language-models/. (Accessed on 06/25/2024).
[157] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi.
2017. Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs. ACM Trans. Storage 13, 3, Article
22 (10 2017), 26 pages. https://doi.org/10.1145/3121133
[158] Jingpei Yang, Rajinikanth Pandurangan, Changho Choi, and Vijay Balakrishnan. 2017. AutoStream: Automatic Stream Management
for Multi-Streamed SSDs. In Proceedings of the 10th ACM International Systems and Storage Conference (Haifa, Israel) (SYSTOR ’17).
Association for Computing Machinery, New York, NY, USA, Article 3, 11 pages. https://doi.org/10.1145/3078468.3078469
[159] Jing Yang, Shuyi Pei, and Qing Yang. 2019. WARCIP: Write Amplification Reduction by Clustering I/O Pages. In Proceedings of the 12th
ACM International Conference on Systems and Storage (Haifa, Israel) (SYSTOR ’19). Association for Computing Machinery, New York,
NY, USA, 155–166. https://doi.org/10.1145/3319647.3325840
[160] Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman. 2014. Don’t Stack Your Log On My Log. In
2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW ’14). USENIX Association, Broomfield, CO.
https://www.usenix.org/conference/inflow14/workshop-program/presentation/yang
[161] Pan Yang, Ni Xue, Yuqi Zhang, Yangxu Zhou, Li Sun, Wenwen Chen, Zhonggang Chen, Wei Xia, Junke Li, and Kihyoun Kwon. 2019.
Reducing Garbage Collection Overhead in SSD Based on Workload Prediction. In 11th USENIX Workshop on Hot Topics in Storage and
File Systems (HotStorage ’19). USENIX Association, Renton, WA. https://www.usenix.org/conference/hotstorage19/presentation/yang
[162] Shao-Peng Yang, Minjae Kim, Sanghyun Nam, Juhyung Park, Jin-yong Choi, Eyee Hyun Nam, Eunji Lee, Sungjin Lee, and Bryan S.
Kim. 2023. Overcoming the Memory Wall with CXL-Enabled SSDs. In 2023 USENIX Annual Technical Conference (ATC ’23). USENIX
Association, Boston, MA, 601–617. https://www.usenix.org/conference/atc23/presentation/yang-shao-peng
[163] Zhe Yang, Youyou Lu, Xiaojian Liao, Youmin Chen, Junru Li, Siyu He, and Jiwu Shu. 2023. λ-IO: A Unified IO Stack for Computational
Storage. In 21st USENIX Conference on File and Storage Technologies (FAST ’23). USENIX Association, Santa Clara, CA, 347–362.
https://www.usenix.org/conference/fast23/presentation/yang-zhe
[164] Hwanjin Yong, Kisik Jeong, Joonwon Lee, and Jin-Soo Kim. 2018. vStream: Virtual Stream Management for Multi-streamed SSDs.
In 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage ’18). USENIX Association, Boston, MA. https:
//www.usenix.org/conference/hotstorage18/presentation/yong
[165] Jiacheng Zhang, Youyou Lu, Jiwu Shu, and Xiongjun Qin. 2017. FlashKV: Accelerating KV Performance with Open-Channel SSDs.
ACM Trans. Embed. Comput. Syst. 16, 5s, Article 139 (09 2017), 19 pages. https://doi.org/10.1145/3126545
[166] Jian Zhang, Yujie Ren, and Sudarsun Kannan. 2022. FusionFS: Fusing I/O Operations using CISCOps in Firmware File Systems.
In 20th USENIX Conference on File and Storage Technologies (FAST ’22). USENIX Association, Santa Clara, CA, 297–312. https:
//www.usenix.org/conference/fast22/presentation/zhang-jian
[167] Xiangqun Zhang, Shuyi Pei, Jongmoo Choi, and Bryan S. Kim. 2023. Excessive SSD-Internal Parallelism Considered Harmful. In
Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems (Boston, MA, USA) (HotStorage ’23). Association for
Computing Machinery, New York, NY, USA, 65–72. https://doi.org/10.1145/3599691.3603412
[168] Yuqi Zhang, Ni Xue, and Yangxu Zhou. 2021. Automatic I/O Stream Management Based on File Characteristics. In Proceedings of the
13th ACM Workshop on Hot Topics in Storage and File Systems (Virtual, USA) (HotStorage ’21). Association for Computing Machinery,
New York, NY, USA, 14–20. https://doi.org/10.1145/3465332.3470879
[169] Yi Zheng, Joshua Fixelle, Nagadastagiri Challapalle, Pingyi Huo, Zhaoyan Shen, Zili Shao, Mircea Stan, and Vijaykrishnan Narayanan.
2022. ISKEVA: in-SSD key-value database engine for video analytics applications. In Proceedings of the 23rd ACM SIGPLAN/SIGBED
International Conference on Languages, Compilers, and Tools for Embedded Systems (San Diego, CA, USA) (LCTES 2022). Association for
Computing Machinery, New York, NY, USA, 50–60. https://doi.org/10.1145/3519941.3535068
[170] Yuhong Zhong, Haoyu Li, Yu Jian Wu, Ioannis Zarkadas, Jeffrey Tao, Evan Mesterhazy, Michael Makris, Junfeng Yang, Amy Tai, Ryan
Stutsman, and Asaf Cidon. 2022. XRP: In-Kernel Storage Functions with eBPF. In 16th USENIX Symposium on Operating Systems Design
and Implementation (OSDI ’22). USENIX Association, Carlsbad, CA, 375–393. https://www.usenix.org/conference/osdi22/presentation/
zhong
[171] Yuhong Zhong, Hongyi Wang, Yu Jian Wu, Asaf Cidon, Ryan Stutsman, Amy Tai, and Junfeng Yang. 2021. BPF for storage: an
exokernel-inspired approach. In Proceedings of the Workshop on Hot Topics in Operating Systems (Ann Arbor, Michigan) (HotOS ’21).
Association for Computing Machinery, New York, NY, USA, 128–135. https://doi.org/10.1145/3458336.3465290
[Figure 12 here: a timeline genealogy spanning 2001–2023 that charts interface milestones (SATA with TRIM, NVMe 1.0 through 2.0, Multi-stream, ZNS, KV-SSD, the SNIA KV-Storage API v1.0/v1.1, and CXL 1.0 through 3.1) alongside the surveyed papers, each labeled with its venue and year.]
Fig. 12. Complete genealogy tree of the surveyed papers (Figure 1).