
Critical Data Backup with Hybrid Flash-Based Consumer Devices

Published: 15 December 2023

Abstract

Hybrid flash-based storage constructed with high-density, low-cost flash memory has become increasingly popular in consumer devices over the last decade due to its low cost. However, its poor reliability is a major concern. To protect critical data and guarantee user experience, several methods have been proposed to improve the reliability of consumer devices with non-hybrid flash storage. With the widespread use of hybrid storage, however, these methods cause severe problems, including significant performance and endurance degradation, because they ignore the different characteristics of the flash memories in hybrid storage, e.g., performance, endurance, and access granularity. To address these problems, a critical data backup (CDB) design is proposed to ensure critical data reliability at a low cost. The basic idea is to accumulate two copies of critical data in the fast memory first to make full use of its performance and endurance. One copy is then migrated to the slow memory in stripes to avoid the write amplification caused by the different access granularities between them. By respecting the different characteristics of the flash memories in hybrid storage, CDB achieves encouraging performance and endurance improvements compared with the state-of-the-art. Furthermore, to avoid the performance and lifetime degradation caused by backup data occupying too much of the fast memory, CDB Pro is designed with two advanced schemes. The first uses the pseudo-single-level-cell (pSLC) technique to turn part of the slow memory into high-performance space; with this extra space, data can be fully updated before being evicted to the slow memory, generating more invalid data and thus reducing eviction costs. The second categorizes data into three types according to their life cycles; by putting data of the same type in the same block, eviction efficiency is improved. Both schemes improve device performance and lifetime on top of CDB. Experiments are conducted to evaluate the efficiency of CDB and CDB Pro. The results show that, compared with the state-of-the-art, CDB ensures critical data reliability with less device performance and lifetime loss, whereas CDB Pro diminishes the loss further.

1 Introduction

During the last few decades, consumer devices such as personal computers and smartphones have been widely developed and are commonly equipped with large-capacity flash-based storage. With the development of flash memory, the density of the flash memory widely used in consumer devices has increased from one bit per cell (SLC) to four bits per cell (QLC) [23] and will increase even further in the future. However, this density increase reduces the reliability of flash memory [5, 28, 29, 41]. Therefore, the critical data on consumer devices, such as file system metadata and storage metadata, should be well protected to avoid system crashes or device failures, especially because ordinary users can hardly recover a failed consumer device on their own.
Existing works on enhancing data reliability mainly focus on data-redundancy schemes, such as redundant arrays of independent disks (RAID) [6, 38, 43]. The standard RAID levels include RAID-0 to RAID-6, which are designed for different hardware and software demands. Among them, RAID-1, RAID-4, RAID-5, and RAID-6 have been adopted in flash-based devices by previous works [25, 30, 32, 36, 37]. Parity-based RAID methods, including RAID-4, RAID-5, and RAID-6, calculate and store the parity of a stripe of data [25, 30, 36, 37]. RAID-4 and RAID-5 can recover one unit of data while RAID-6 can recover two units of data for each stripe. This leads to high computation costs and requires at least three independent units for RAID-4 and RAID-5 and four for RAID-6. Unfortunately, consumer devices cannot meet these software and hardware requirements since they are commonly equipped with a weak controller and a limited number of independent units to keep costs low. In contrast, RAID-1 improves reliability by storing duplicates of data in the storage. If one copy of the data is broken, it can be recovered from the other duplicate. Since RAID-1 does not need to calculate parity and requires only two independent units, it is suitable for consumer devices. MacFadden et al. [32] proposed SIRF-1, a RAID-1-like method for consumer devices that stores two copies of data in different channels to improve data reliability. However, since SIRF-1 backs up all data, it sacrifices half the capacity of the device, which is unacceptable to users. In addition, existing schemes are designed for devices with non-hybrid architecture. Currently, state-of-the-art consumer devices [14, 15, 16, 17, 19] prefer high-density flash-based hybrid storage, which is constructed with a small front-end fast flash and a large back-end slow flash. By combining them inside a single device, hybrid storage provides a promising cost-efficient storage solution with high capacity, high performance, and low cost. Intel 670P [16] and Intel Optane H10 [14] are two representative hybrid storage architectures. Simply extending existing mechanisms to improve reliability will result in several problems for consumer devices equipped with hybrid storage.
Our evaluations and analysis show that current mechanisms lead to performance and lifetime degradation. The reason is that traditional mechanisms are designed without considering the characteristics of the two kinds of storage, especially their performance and access granularity. Specifically, when hybrid storage serves a request with a backup tag, the data and its duplicate are written to the fast memory and slow memory, respectively. To avoid critical data loss caused by sudden power loss, real-time backup is required to write them into storage, and the request is not completed until both copies of the critical data have been written. Therefore, users suffer from the poor performance of the slow memory. To make things worse, if the data is smaller than the write granularity of the slow memory, its duplicate must be padded with invalid data in advance, which causes serious write amplification. Besides these problems, traditional RAID-1-like mechanisms back up all data, even though there is no need to protect non-critical data such as temporary files.
In this article, a critical data backup (CDB) scheme is proposed to ensure data reliability in hybrid flash-based consumer devices. To avoid the capacity cost, CDB only backs up the designated critical data, including system metadata to avoid system crashes and storage metadata to avoid device failures. During backup, the characteristics of the two kinds of memory in hybrid storage are fully considered to avoid performance and lifetime degradation issues. Specifically, CDB separates the backup into two processes: accumulation and migration. (1) During the accumulation process, two copies of the designated critical data will be written to the fast memory to provide real-time backup. Note that real-time backup is necessary since consumer devices are prone to suffer sudden power loss. In this way, the slow memory will not affect the write performance, and the write amplification in the slow memory, which is caused by small write requests, is avoided. (2) During the migration process, one copy of the accumulated critical data is migrated from the fast memory to the slow memory in batches. The migration granularity and timing have been designed carefully during this process. As a result, higher critical data reliability is ensured and the fast memory will be released without influencing other requests.
One problem is overlooked in the above design: one copy of each piece of critical data occupying the fast memory reduces its available space and causes frequent data eviction, which leads to performance and endurance degradation. To solve this problem, CDB Pro is designed. First, CDB Pro adopts the pSLC technique to make part of the slow memory high-performance. Using this space as part of the write cache increases the amount of invalid data generated by data updates in the write cache, which reduces the cost of recycling write cache space. Second, CDB Pro considers data types for efficient eviction. There are three types of data based on life cycles: critical data that should be reserved in the fast memory, critical data that should be migrated to the slow memory, and non-critical data. By storing the same type of data in a block, the eviction efficiency is improved further. In conclusion, the negative impact of backing up critical data is avoided by CDB Pro. The main contributions of this article are as follows.
This article exposes the problem of adopting existing data backup methods on hybrid storage, including performance and lifetime degradation.
CDB is proposed to back up designated critical data in consumer devices equipped with hybrid flash storage while the characteristics of hybrid storage are considered thoroughly. Specifically, three main processes, accumulation, migration, and recovery, have been designed completely.
CDB Pro is designed to solve problems caused by backing up critical data further. Two advanced schemes are integrated: pSLC region management and critical data-guided greedy eviction.
To the best of our knowledge, this is the first work to back up the designated critical data in hybrid storage. To verify the state-of-the-art methods and proposed schemes with hybrid storage, a high-density flash-based hybrid storage simulator, HSsim, is implemented. HSsim is developed based on SSDsim [12].
Experimental results show that CDB outperforms the state-of-the-art by 1.94\(\times\) on average in terms of write request latency and reduces the write amplification factor (WAF) by up to 40%. CDB Pro further improves the write performance and reduces the WAF by 53.5% and 35.25% on average, respectively, compared with CDB.
The rest of the article is organized as follows. Section 2 presents the background. In Section 3, the motivation of this work is presented. Sections 4 and 5 introduce the design of CDB and CDB Pro, respectively. Their implementation and overheads are discussed in Section 6. Section 7 presents the experiment. Related works are introduced in Section 8. Section 9 concludes the article.

2 Background

2.1 High-Density Flash-Based Hybrid Storage

Figure 1(a) shows the basic architecture of high-density flash-based hybrid storage. Hybrid storage is composed of two kinds of storage, an ultra-low latency memory (fast memory) and a high-density memory (slow memory). Inside each memory chip, there are multiple dies, which can provide parallelism [9, 12].
Fig. 1. Comparison among different backup schemes on hybrid storage.
There are two types of architectures for hybrid storage. The first is the architecture of Intel Optane H10 [14] and H20 [17], which places different kinds of storage in different channels. In general, fast memory has high performance, great endurance, high reliability, and low capacity, while slow memory has the opposite characteristics. In addition, the access granularity of the fast memory is commonly 4 KB, while that of the slow memory is 16 KB. Therefore, the fast memory provides high performance and the slow memory provides large capacity, making hybrid flash storage a cost-effective choice for consumer devices.
The other architecture uses the pseudo-single-level-cell (pSLC) technique on top of high-density flash [16, 19]. The pSLC technique [13] converts triple-level cell (TLC) or quad-level cell (QLC) NAND flash memory to SLC-like cells (1 bit per cell) to provide high performance and reliability. Both architectures are managed similarly. As shown in Figure 1(a), to fully utilize the potential of hybrid storage, the device uses the fast memory as a buffer for the slow memory [16, 19, 39]. When data are issued from the host system, they are written to the fast memory first (Steps 1 and 2). Once the fast memory is full, data are migrated to the slow memory. On the other hand, if data are accessed from the slow memory, they are migrated back to the fast memory (Step 3). To construct the hybrid architecture, this article uses XL flash [18] as the fast memory and QLC flash [23] as the slow memory. Their capacities are set by manufacturers with cost-effectiveness in mind. For example, the capacities of the fast memory and slow memory in Optane H20 [17] can be 32 GB and 512 GB or 32 GB and 1 TB. In this article, the capacities are set according to the size of the workloads to simulate different storage conditions (empty or full).

2.2 Data Backup on Flash Storage

Because high-density flash memory is prone to errors, backing up data in flash storage is crucial to protect against device failures or system breakdowns caused by critical data corruption. RAID, one of the most popular methods for backing up data, has been widely adopted by hard disk drives (HDDs) [6]. As flash memory, with its advantage of high performance, gradually replaces HDDs, RAID has been employed in flash-based storage and in flash- and disk-based hybrid storage. Apart from RAID organizations composed of multiple devices, RAID-1, RAID-4, RAID-5, and RAID-6 have also been introduced recently across channels or chips within flash storage to improve the data reliability of a single device [32, 35, 38]. In parity-based RAID methods (i.e., RAID-4, RAID-5, and RAID-6), several data chunks and one or two parity chunks comprise a stripe. Once any data chunk of a stripe is updated, the parity must be recalculated and updated, which introduces extra reads and writes. Much work on flash-based RAID focuses on reducing parity update overheads [25, 30, 36, 37]. In contrast, RAID-1 consists of data mirroring without parity or striping [32]. By storing two copies of data in storage, it can recover broken data from the surviving copy.
The differences between them are listed in Table 1. Although RAID-1 is less space-efficient than parity-based RAID methods, it is much simpler to implement without the cost of calculating and writing parity. Additionally, RAID-1 needs only two independent units to store data, while RAID-4, RAID-5, and RAID-6 need at least three or four independent units to calculate and store data parity. In addition, parity-based RAID methods require a more powerful controller than RAID-1 to compute parity. With the above considerations, RAID-1 is obviously more suitable for consumer devices, which are generally equipped with a weak controller and a limited number of independent units. Furthermore, the capacity of consumer devices has increased to 512 GB or even 1 TB, which is more than enough even when part of the storage is used to back up critical data. Therefore, this article focuses on RAID-1-like methods. The most related work is SIRF-1 [32], a storage-internal mirroring scheme similar to RAID-1. SIRF-1 partitions channels into two mirror groups, and duplicated write operations are executed in different channels in parallel. It enhances data reliability by providing data mirroring across storage channels.
                                      RAID-1    RAID-4    RAID-5    RAID-6
Space overheads                       High      Low       Low       Low
Computation overheads                 Low       High      High      High
Minimum number of independent units   2         3         3         4
Controller                            Weak      Strong    Strong    Strong
Table 1. Comparison of RAID-1, RAID-4, RAID-5, and RAID-6

3 Problem Statement

Adopting existing designs on consumer devices with hybrid storage causes performance and lifetime degradation. In this section, we discuss and analyze these problems thoroughly. SIRF-1 [32] is the most related work, proposing to back up data in consumer devices with RAID-1. It is designed across multiple channels inside flash storage without considering the hybrid architecture. To understand the problems of SIRF-1 on hybrid storage, we first extend it to hybrid storage following its principle, as shown in Figure 1(b). SIRF-1 writes two copies of data to the two storage channels when requests are issued from the host system. When the fast memory is full or data are read from the slow memory, the data are migrated between them. However, this method results in several problems.
First, SIRF-1 is designed to back up all data in flash storage. This is not practical for consumer devices because backing up all data is costly and unnecessary. There are a large number of log and temporary files in consumer devices, which are not critical. Sacrificing half of the space to protect a large amount of uncritical data is unacceptable. Furthermore, if these files are backed up, the back-end slow memory will suffer a heavy write load, which leads to performance degradation and lifetime reduction. Second, directly backing up data between the two types of memory in hybrid storage introduces performance and flash lifetime problems, which come from two aspects. First, the two types of memory have different access performance. Since the critical data should be well protected, they should be written to the underlying storage from the device cache immediately to avoid data loss caused by sudden power loss. Therefore, the request completion signal is sent to the host system only after both copies of the data have been written. Hence, the request latency depends on the performance of the slow storage. At the same time, subsequent requests will be stalled due to head-of-line blocking, causing performance degradation. Considering that the latency gap between XL and QLC is 20\(\times\) [22, 26], implementing SIRF-1 in hybrid storage with XL and QLC leads to considerable performance degradation. As shown in Figure 2, although one copy of the 4-KB critical data is written to XL within 75 \(\mu\)s, the completion signal of this request can only be sent to the host after 1.5 ms (❶). Second, the two types of memory have different access granularity. The write unit of fast memory such as XL is 4 KB, while that of slow memory such as QLC is a wordline, which is 4 pages of data with a page size of 16 KB [4, 8]. The difference is 16 times (\(=\frac{4\times 16KB}{4KB}\)). If a small 4-KB request is issued and backed up [20], in the worst case, the relevant duplicate must be padded to 64 KB to fit the granularity of QLC. This leads to 4 KB of data written to XL and 64 KB of data written to QLC, resulting in up to 17\(\times\) (\(=\frac{4KB+64KB}{4KB}\)) write amplification, as shown in Figure 2 (❷). Previous studies on consumer devices have shown that data tend to be written randomly in small sizes [24]. In this case, the WAF increases dramatically when the backup is implemented on hybrid storage.
Fig. 2. Problems of SIRF-1.
The above analysis motivates us to propose a new data backup design for hybrid storage. Specifically, it should achieve four goals: (1) provide data real-time backup; (2) allow backup of the designated data; (3) avoid performance loss; and (4) avoid lifetime loss.

4 Critical Data Backup (CDB)

4.1 Overview

To overcome the above issues, this article proposes a new data backup scheme for consumer devices, critical data backup (CDB). In contrast to existing schemes, CDB is designed for hybrid storage, as shown in Figure 1(c). The basic idea of CDB is to write two copies of critical data to the fast region at first to provide real-time backup (Steps 1 and 2) and then migrate one copy to the slow region (Step 3).
Figure 3 shows the organization of CDB with hybrid storage, where XL flash is used as the fast region and QLC flash is used as the slow region. CDB has three main processes, accumulation, migration, and recovery, implemented in the flash translation layer. (1) The first process is critical data accumulation. When a write request with a backup signal is received by the hybrid storage, two copies of the data are first written to two XL dies to provide real-time backup. Through accumulation, the negative influence of the poor performance and large access granularity of QLC is avoided. (2) The second process is critical data migration. Migration moves one copy of the backup data, which is accumulated in XL, into QLC in batches. The precious XL capacity is released by migration without influencing the service of other requests. More importantly, since the reliability characteristics of XL and QLC are different, CDB protects the two copies of the data well because their corruption times differ. Note that one copy of the critical data is locked in the XL region to avoid migrating it to the QLC region. (3) The third process is device recovery, which restores corrupted data in consumer devices. Data corruption is detected through failed reads or self-inspection, and the corrupted data are then recovered. With CDB, critical data can be backed up in consumer devices with hybrid storage to improve data reliability while performance and lifetime deterioration are effectively avoided. In the following, the three main processes are presented in detail.
Fig. 3. CDB architecture overview.

4.2 Critical Data Accumulation

To back up the critical data, the first step is to accumulate the backup data in the fast region. To realize the accumulation, several designs are proposed. First, the mapping table should be revised to record the location of backup data. Second, the mapping table entries of the critical data and its duplicate should be managed in the mapping table cache or XL during the accumulation process. Finally, during accumulation, the two copies of critical data should be placed in different dies and written in parallel for performance reasons. In the following, the critical data accumulation process is presented in detail.
Mapping table design: The traditional mapping table (TMT) records the mapping from the logical page number (LPN) to the physical page number (PPN). The TMT is extended with two tags, the critical tag (CT) and the backup tag (BT). The CT indicates that the corresponding page holds critical data. The BT indicates whether the corresponding page has been backed up. To record the physical location of the duplicates of backup data, another table, the backup mapping table (BMT), is added. The BMT maps the LPN to the PPN of the duplicate of the critical data.
Mapping table management: In the mapping table cache, the TMT and BMT are recorded and managed separately with a least recently used (LRU) scheme. When the mapping table cache is full, clean TMT entries of uncritical data and of critical data that have already been backed up are selected for eviction. The entries in the BMT are evicted only after the migration process. When mapping table entries are written to XL, the mapping entry of the critical data and the corresponding backup mapping entry of its duplicate are combined. Therefore, each mapping entry stored in XL is composed of the PPN of the critical data, the PPN of its duplicate, a CT, and a BT, with the LPN used as the index to locate the entry. Entries in the TMT and BMT are not persisted synchronously with the data for two reasons. First, synchronous persistence would lead to more page writes: each data update changes the corresponding mapping entries and would require a page write to persist them, and at least two pages would be required to persist the mapping entries for the accumulation and migration processes, respectively. Second, the mapping entries of the TMT and BMT can be recovered with existing solutions [47]. If a system crash happens, the SSD controller scans the physical pages. Each physical page records its corresponding LPN in its out-of-band (OOB) area. Based on this information, the SSD controller can rebuild the mapping between the LPN and the PPN.
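As a minimal illustration of these structures, the C sketch below shows one possible layout of the cached TMT and BMT entries and of the combined entry persisted to XL; the field names and widths are assumptions for clarity, not the actual on-device format.

```c
#include <stdint.h>

/* Cached entry of the traditional mapping table (TMT): LPN -> PPN,
 * extended with the critical tag (CT) and the backup tag (BT). */
typedef struct {
    uint32_t lpn;              /* logical page number (cache lookup key)      */
    uint32_t ppn;              /* physical page number of the data            */
    unsigned critical  : 1;    /* CT: this page holds critical data           */
    unsigned backed_up : 1;    /* BT: the duplicate has been migrated to QLC  */
} tmt_entry_t;

/* Cached entry of the backup mapping table (BMT): LPN -> PPN of the
 * duplicate written during accumulation; evicted only after migration. */
typedef struct {
    uint32_t lpn;
    uint32_t backup_ppn;
} bmt_entry_t;

/* Entry format persisted to XL: the TMT and BMT entries of the same LPN
 * are combined, and the LPN serves as the index, so it is not stored. */
typedef struct {
    uint32_t ppn;              /* PPN of the critical data                    */
    uint32_t backup_ppn;       /* PPN of its duplicate                        */
    unsigned critical  : 1;
    unsigned backed_up : 1;
} persisted_entry_t;
```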
Accumulation process: For critical data backup, the critical data and its duplicate should be placed in different parallel units of XL to improve performance [12]. When a request with a critical tag is issued, it is first transferred to the controller. Then, the request is partitioned into sub-requests, each tagged with a critical bit. At the same time, a backup sub-request is generated following each critical sub-request. When they are served, the critical data and its duplicate are written in parallel, which stores them in different dies of XL. Entries are added to both the TMT and BMT to record the locations of the critical data and the backup data. During the accumulation process, the BT of the corresponding mapping entry remains unset. Note that large critical requests should not be accumulated in XL, because the gap in performance and access granularity between XL and QLC is small for a large request. Therefore, an accumulation threshold \(T_A\) is set to avoid unnecessary accumulation in XL. If the backup request size is larger than \(T_A\), its copy is directly written to QLC rather than XL and the corresponding BT is set in the TMT. We have conducted experiments to show the impact of different \(T_A\) values.
Algorithm 1 shows the accumulation scheme, which includes two parts: the management of the mapping table and the accumulation of critical data. When a critical request is issued, it is first partitioned into sub-requests, each followed by a backup sub-request, and added to the queue. Then, entries are added to both the TMT and BMT to record the locations of the requests. Finally, the critical data are programmed to two different dies of XL, as sketched below.
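The following C sketch condenses this accumulation path (Algorithm 1); the helper functions, the sub-request interface, and the 64-KB threshold value are hypothetical placeholders rather than the paper's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define T_A (64u * 1024u)   /* accumulation threshold, assumed to be 64 KB */

/* Hypothetical helpers provided by the rest of the controller. */
void write_to_xl_die(int die, uint32_t lpn, const void *data, size_t len);
void write_to_qlc(uint32_t lpn, const void *data, size_t len);
void pick_two_xl_dies(int dies[2]);   /* two different dies for parallel writes */
void tmt_update(uint32_t lpn, bool critical, bool backed_up);
void bmt_update(uint32_t lpn);        /* backup PPN filled in by the allocator */

/* Serve one critical sub-request of `len` bytes starting at `lpn`. */
void accumulate_critical(uint32_t lpn, const void *data, size_t len)
{
    int dies[2];
    pick_two_xl_dies(dies);

    if (len > T_A) {
        /* Large request: the performance/granularity gap to QLC is small,
         * so only the original goes to XL and the duplicate is written to
         * QLC directly, with BT set in the TMT. */
        write_to_xl_die(dies[0], lpn, data, len);
        write_to_qlc(lpn, data, len);
        tmt_update(lpn, /*critical=*/true, /*backed_up=*/true);
        return;
    }

    /* Small request: write the data and its duplicate to two different
     * XL dies in parallel; BT stays unset until migration finishes. */
    write_to_xl_die(dies[0], lpn, data, len);   /* original copy */
    write_to_xl_die(dies[1], lpn, data, len);   /* backup copy   */
    tmt_update(lpn, true, false);
    bmt_update(lpn);
}
```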
Accumulation data limit: One important issue not yet discussed for the accumulation scheme is the amount of critical data maintained in XL. Since CDB is designed to protect data between XL and QLC, the critical data should be maintained separately in XL and QLC. If a large amount of critical data takes up most of the XL capacity, the performance of hybrid storage will be impacted. To solve this issue, the amount of critical data is restricted from two aspects: a maximal amount of critical data is set to limit the performance impact, and the data classified as critical are carefully selected. In this work, critical data include host-system-defined critical data and controller-based critical data. Host-system-defined critical data include the system kernel, file system metadata, and user-tagged critical data. Controller-based critical data include mapping tables and the metadata of garbage collection and wear leveling. All these data can be backed up for system reliability.

4.3 Critical Data Migration

Once the data are accumulated in XL, the second step of CDB is to migrate one copy of the critical data to QLC for backup. To realize the migration, there are several steps. First, the data accumulated in XL should be read based on the BMT. Second, the migration process should not conflict with host requests. Finally, when the migration is finished, the corresponding mapping table should be written back to the flash memory. In the following, the migration is presented in detail.
Migration trigger condition: There are three conditions. First, when the device is idle, migration should be activated to avoid conflicts between host requests and the migration process. Second, when the mapping table cache is full, migration should be activated to avoid over-occupying the mapping table cache. Third, when the amount of accumulated critical data in XL reaches a threshold \(T_M\), the migration process can be activated to copy these data to QLC to avoid occupying XL. With a larger \(T_M\), more data can be accumulated and migrated in batches, and data that are updated before migration no longer need to be migrated, which reduces the migration cost. However, if the capacity of XL is small, a large \(T_M\) induces performance degradation. In the experiments, \(T_M\) is set based on the migration granularity. The migration process is triggered when the third condition is met together with either the first or the second condition.
Migration data selection: When the migration process is triggered, some backup data are selected to be migrated. Since the BMT stored in the cache is managed with the LRU scheme, the entries at the tail of the BMT have not been used recently. These backup data are selected to avoid the cost of updating backup data already stored in QLC. After the migration process is finished, the relevant BMT entries are written to XL and released from the mapping cache. At the same time, the BT of the corresponding entries in the TMT is set.
Migration granularity: Because the write units of XL and QLC differ, several pages of data from XL are written into one page or wordline of QLC. Therefore, the migration granularity should be considered. For example, the page size of XL is 4 KB and the programming granularity of QLC is 4 QLC pages (one wordline), which equals 64 KB. Then, at least 16 pages of data should be read from XL for migration. In addition, QLC is written in stripes to utilize its maximum number of parallel units in most mobile storage [9, 10]. This requires the migration granularity to be equal to or larger than the stripe size of QLC. The stripe size can be computed as follows:
\begin{equation} QLC\_stripe\_size = \#\_of\_QLC\_channels \times \#\_of\_QLC\_chips\_per\_channel \times \#\_of\_QLC\_dies\_per\_chip \times \#\_of\_QLC\_planes\_per\_die \times QLC\_page\_size \tag{1} \end{equation}
The parameters in the above formula are the configuration of the QLC, which can be easily acquired. In addition, the migration granularity should be aligned with the physical page size of the QLC to avoid write amplification.
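As a concrete illustration, the short program below evaluates Equation (1) for an assumed QLC configuration (1 channel, 4 chips per channel, 2 dies per chip, 4 planes per die, 16-KB pages); the numbers are illustrative only and should be replaced by the real device parameters.

```c
#include <stdio.h>

/* Equation (1): QLC stripe size = channels x chips/channel x dies/chip
 * x planes/die x page size. The configuration below is illustrative only. */
int main(void)
{
    const unsigned channels       = 1;
    const unsigned chips_per_chan = 4;
    const unsigned dies_per_chip  = 2;
    const unsigned planes_per_die = 4;
    const unsigned page_size_kb   = 16;

    unsigned stripe_kb = channels * chips_per_chan * dies_per_chip
                       * planes_per_die * page_size_kb;

    /* Migration moves at least one stripe, i.e. stripe_kb / 4 XL pages of 4 KB. */
    printf("QLC stripe size: %u KB (= %u XL pages of 4 KB)\n",
           stripe_kb, stripe_kb / 4);
    return 0;
}
```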
Migration process: Algorithm 2 shows the migration process. Besides being triggered by the accumulation process, migration can also be triggered by a background detection thread, similar to background garbage collection. First, the three trigger conditions are checked: if the storage is idle or the mapping cache is full, and the amount of accumulated data is larger than the threshold \(T_M\), the migration process is activated. The backup data located at the tail of the BMT are selected and migrated to QLC in stripes. Finally, the entries of these data in the BMT are combined with the relevant entries in the TMT and written to XL. At the same time, the BT of the relevant entries in the TMT is set and these BMT entries are evicted to release mapping cache space.
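A compact sketch of this migration path is shown below; the state variables and helper functions are hypothetical names used only to make the control flow concrete.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical controller state and helpers; names are illustrative only. */
extern size_t accumulated_pages;   /* backup pages accumulated in XL          */
extern size_t t_m;                 /* migration threshold T_M (in pages)      */
extern size_t qlc_stripe_pages;    /* Equation (1) expressed in 4-KB pages    */

bool   device_is_idle(void);
bool   mapping_cache_is_full(void);
size_t bmt_take_from_tail(uint32_t lpns[], size_t max);    /* coldest entries */
void   qlc_stripe_write(const uint32_t lpns[], size_t n);
void   persist_entries_and_set_bt(const uint32_t lpns[], size_t n);

/* Outline of Algorithm 2: check trigger conditions, select the coldest
 * backup data from the BMT tail, and migrate them to QLC in stripes. */
void try_migration(void)
{
    /* Condition 3 (enough accumulated data) must hold together with
     * condition 1 (device idle) or condition 2 (mapping cache full). */
    if (accumulated_pages < t_m)
        return;
    if (!device_is_idle() && !mapping_cache_is_full())
        return;

    uint32_t batch[256];
    size_t want = qlc_stripe_pages < 256 ? qlc_stripe_pages : 256;
    size_t n = bmt_take_from_tail(batch, want);   /* LRU tail of the BMT */
    if (n == 0)
        return;

    qlc_stripe_write(batch, n);           /* one aligned stripe write to QLC  */
    persist_entries_and_set_bt(batch, n); /* combine TMT/BMT entries, write to
                                             XL, set BT, release BMT cache    */
    accumulated_pages -= n;
}
```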

4.4 Device Recovery

When data are corrupted by system crashes or device failures, recovery should be activated. The last step of CDB is to recover the critical data from the backup. There are two cases for critical data recovery. The first case happens when XL is corrupted. Since XL is frequently accessed by host systems, especially under the current design, it may be corrupted for many reasons, such as overload or media errors. The second case happens when QLC is corrupted. QLC may be corrupted due to its low reliability: as it wears, it may be worn out or develop errors. The recovery process is simple. First, recovery is activated when one copy of the data is found to be corrupted during access or self-inspection. Second, the corresponding mapping table entry is read. Note that if the corrupted data belong to the mapping table, the corresponding backup mapping table is recovered first. Then, the backup data are read and written to a free page. Finally, the relevant mapping table entry is updated. Note that during recovery, if the corrupted data are stored in XL, the recovery process finishes after one copy of the backup data is written to XL. Otherwise, to avoid write amplification in QLC, the backup data are accumulated in XL, similar to the accumulation process.

5 Design of CDB Pro

5.1 Problem Statement and Motivation

The above CDB scheme protects the critical data well and avoids performance and endurance loss during the backup process. However, although CDB restricts the amount of backup data from the two aspects illustrated in Section 4.2, XL still faces the problem that backup data occupy part of its space, which leads to performance and endurance loss. Since one copy of the critical data is reserved in XL, XL fills up easily and then frequently activates eviction operations to free space for subsequent writes, causing lots of data to be read from XL and written to QLC. For one thing, this process is time-consuming and may block subsequent requests, leading to performance degradation. For another, frequent eviction increases the amount of data written to QLC, which exacerbates write amplification. Additionally, if the selected block contains critical data that should be reserved, these data are rewritten to XL and more eviction operations are required to free enough space. For example, assuming that 40% of XL is occupied by critical data, each block erase can recycle only 60% of the block's capacity. Therefore, the number of eviction operations increases to 1.67\(\times\) (\(=\frac{1}{0.6}\)) the original, while the additional data written back to XL amount to 67% of the block capacity. This is especially the case if we want to allow users to protect their critical data.

5.2 Overall Architecture of CDB Pro

The above analysis leads us to rethink how to reduce the impact of eviction operations. First, we should supply additional write cache space to allow more data to be written. Obviously, equipping a larger XL flash is costly and unacceptable to users. Thanks to the pSLC technique, part of the QLC flash can be converted to pSLC flash at a low cost, and its performance is high enough for it to serve as a write cache. By constructing a large write cache from XL and pSLC, the negative impact of backing up critical data in XL is reduced. However, when eviction operations occur in XL, the critical data stored in the victim block still hinder recycling XL space. Therefore, we should also arrange the data stored in XL to improve eviction efficiency.
Keeping the above observations in mind, we propose CDB Pro to mitigate the effects of eviction operations. As shown in Figure 3, CDB Pro adds two modules—pSLC region management and critical data-guided greedy eviction. The function of pSLC region management is to: (i) dynamically adjust the capacity of the pSLC region, and (ii) schedule each write request to either XL or pSLC. Critical data-guided greedy eviction organizes the data according to their criticality.

5.3 PSLC Region Management

In this section, we introduce how pSLC region management works. Figure 4 illustrates the overall architecture of pSLC region management, which is composed of four modules: the critical data detector (Cd-Detector), the free space monitor (Fs-Monitor), the request scheduler (Rq-Scheduler), and the pSLC-Regulator. The Cd-Detector not only detects which requests are critical but also records the amount of critical data stored in XL. The Fs-Monitor collects information about the amount of free space in XL. Based on the information collected by these two modules, the Rq-Scheduler and the pSLC-Regulator schedule write requests to either XL or pSLC and dynamically adjust the size of the pSLC region, respectively.
Fig. 4. An overall architecture of pSLC region management.
Cd-Detector and Fs-Monitor: As described in Section 4, critical requests are sent to the device with a critical tag. Based on this tag, the Cd-Detector can recognize the critical data. It then identifies whether the request is an update request in order to track the amount of critical data in XL, which is easy to determine by loading the corresponding mapping entry. If the corresponding mapping entry is valid, the request is an update request, meaning this critical data is already stored in XL, and the recorded amount of critical data in XL remains the same. Otherwise, the recorded amount is increased by the request size.
Besides this, the free space information is collected by the Fs-Monitor. Since this information is already recorded by the current device controller to trigger garbage collection (GC), the Fs-Monitor can obtain it directly.
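The counting logic of the Cd-Detector could look like the following sketch, where the mapping-entry lookup and the page-granularity counter are assumed details.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Running amount of critical data resident in XL, in 4-KB pages.
 * tmt_entry_valid() stands in for loading the cached mapping entry. */
static size_t critical_pages_in_xl;

bool tmt_entry_valid(uint32_t lpn);   /* hypothetical TMT lookup */

/* Cd-Detector: invoked for every write request carrying the critical tag. */
void cd_detector_on_critical_write(uint32_t lpn, size_t pages)
{
    /* An update to critical data already present in XL leaves the count
     * unchanged; only a first-time write increases it by the request size. */
    if (!tmt_entry_valid(lpn))
        critical_pages_in_xl += pages;
}
```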
PSLC-Regulator: To make use of the pSLC, the first challenge is how to adjust the pSLC region size. There are two concerns. If the pSLC region is too large, the capacity of the QLC will be reduced. It may cause heavy GC in QLC and lower its performance and endurance. This also may result in insufficient device capacity to store user data. If the pSLC region is too small, the problem discussed above will not be solved.
To dynamically adjust the pSLC region size, the pSLC-Regulator makes use of the information recorded by the Cd-Detector. When a pSLC region is required, the pSLC-Regulator sets the pSLC region size to the amount of critical data stored in XL and records it. This is the minimum size that helps the device recover its original performance and lifetime. When the difference between the amount of critical data stored in XL and the pSLC region size exceeds the adjustment unit, the pSLC region size is changed by one adjustment unit. Each adjustment unit is \(Size_{pSLC\_block} \times N_{QLC\_plane}\), that is, one QLC block in every QLC plane is converted to a pSLC block, or vice versa. The pSLC-Regulator then updates its record. Since pSLC sacrifices part of the QLC capacity, the maximal pSLC region size follows the settings of real products [11, 16]. For example, 70 GB is the maximal pSLC region size for a 512-GB QLC flash when the utilization of the QLC flash (\(\frac{valid\ data}{QLC\ capacity}\)) is lower than 25%. The maximal pSLC region size then decreases linearly until the utilization reaches 85%, and it is 6 GB when the utilization is larger than 85%.
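For illustration, the program below reproduces such a product-style limit curve for a 512-GB QLC device (70 GB below 25% utilization, a linear decrease down to 6 GB at 85% utilization); the linear interpolation between the published endpoints is our assumption.

```c
#include <stdio.h>

/* Maximal pSLC region size (GB) for a 512-GB QLC device: 70 GB below 25%
 * utilization, a linear decrease between 25% and 85%, and 6 GB above 85%. */
static double max_pslc_gb(double qlc_utilization)
{
    if (qlc_utilization <= 0.25)
        return 70.0;
    if (qlc_utilization >= 0.85)
        return 6.0;
    /* Linear interpolation between (0.25, 70 GB) and (0.85, 6 GB). */
    return 70.0 - (qlc_utilization - 0.25) * (70.0 - 6.0) / (0.85 - 0.25);
}

int main(void)
{
    double u[] = { 0.10, 0.50, 0.90 };
    for (int i = 0; i < 3; i++)
        printf("utilization %.0f%% -> max pSLC %.1f GB\n",
               100.0 * u[i], max_pslc_gb(u[i]));
    return 0;
}
```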
By providing a pSLC region that matches the size of critical data in XL, the write cache capacity for normal data writing is restored. This brings back the frequency of eviction operations to the original level. For one thing, a large write cache allows more data written before processing eviction operations. For another thing, the stored data can be fully updated by more write requests and generate more invalid data to improve eviction efficiency. Furthermore, since the plane is the smallest parallel unit of flash, the setting of the adjustment unit guarantees the maximal number of parallel units in the pSLC region.
Rq-Scheduler: As shown in Figure 4, XL and pSLC are located in different channels, which means their data writes and GC are managed independently. Therefore, another challenge of constructing a large write cache from pSLC and XL is deciding the destination of write requests. To make full use of the characteristics of XL and pSLC, the Rq-Scheduler divides the state of the device into three stages. Based on the information recorded by the Cd-Detector and the Fs-Monitor, the size of critical data stored in XL (\(S_{critical}\)) and the size of XL free space (\(S_{xl\_free}\)) are compared with two thresholds, \(T_{critical}\) and \(T_{xl\_free}\), respectively. First, if \(S_{xl\_free}\) is larger than \(T_{xl\_free}\) or \(S_{critical}\) is smaller than \(T_{critical}\), the device is in Stage I. In this stage, either XL has enough space for subsequent write requests or the critical data stored in XL have only a slight influence on eviction operations. Therefore, all data are written to XL to make full use of its high performance, endurance, and reliability. Second, if \(S_{xl\_free}\) is smaller than \(T_{xl\_free}\) and \(S_{critical}\) is larger than \(T_{critical}\), the device is in Stage II. That is, critical data occupy too much XL space, which influences eviction operations significantly. In this case, pSLC should play its role. The size of the pSLC region is set to the size of the critical data in XL, as described for the pSLC-Regulator. Critical data are still written to XL to guarantee their reliability, while non-critical data are scheduled to the pSLC region to avoid frequent eviction. Third, if the free space of the pSLC region (\(S_{pslc\_free}\)) is lower than the predefined threshold that triggers GC, the device is in Stage III. In this case, the pSLC region is insufficient to serve subsequent write requests. If non-critical data were still written to the pSLC region without any limitation, frequent GC might occur in it. To avoid this problem, the GC efficiencies of XL and pSLC are compared. Specifically, the GC efficiency of a region is represented by the maximal number of invalid pages among all of its blocks. If the GC efficiency of XL is higher, the subsequent uncritical data are rescheduled to XL, and vice versa. This allows the write cache space to be recycled efficiently, which improves performance and endurance.
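The stage decision and the resulting scheduling of non-critical writes can be summarized by the sketch below; the threshold parameters and the GC-efficiency metric (maximal number of invalid pages in any block of a region) follow the description above, while the function and type names are assumptions.

```c
#include <stddef.h>

typedef enum { STAGE_I, STAGE_II, STAGE_III } device_stage_t;
typedef enum { DEST_XL, DEST_PSLC } dest_t;

/* Stage decision of the Rq-Scheduler. All sizes are in pages; the
 * threshold values are configuration parameters assumed by this sketch. */
device_stage_t rq_scheduler_stage(size_t s_critical, size_t s_xl_free,
                                  size_t s_pslc_free,
                                  size_t t_critical, size_t t_xl_free,
                                  size_t pslc_gc_threshold)
{
    if (s_xl_free > t_xl_free || s_critical < t_critical)
        return STAGE_I;            /* XL can absorb everything              */
    if (s_pslc_free < pslc_gc_threshold)
        return STAGE_III;          /* pSLC nearly full: compare GC cost     */
    return STAGE_II;               /* non-critical writes go to pSLC        */
}

/* Destination of a non-critical write; critical data always go to XL.
 * GC efficiency of a region is the maximal number of invalid pages
 * found in any of its blocks. */
dest_t schedule_non_critical(device_stage_t stage,
                             size_t xl_max_invalid_per_block,
                             size_t pslc_max_invalid_per_block)
{
    switch (stage) {
    case STAGE_I:  return DEST_XL;
    case STAGE_II: return DEST_PSLC;
    default:       /* STAGE_III: pick the region whose GC is cheaper */
        return xl_max_invalid_per_block >= pslc_max_invalid_per_block
                   ? DEST_XL : DEST_PSLC;
    }
}
```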
There are two thresholds, \(T_{xl\_free}\) and \(T_{critical}\), in the Rq-Scheduler. \(T_{xl\_free}\) is set to the eviction threshold; that is, XL is considered short of space once it requires the eviction process to recycle space. \(T_{critical}\) is set based on the proportion of critical data in all data, which is 40% of the XL capacity in this article. In a real deployment, manufacturers can set it based on an analysis of users' data.

5.4 Critical Data-Guided Greedy Eviction

Normally, the eviction efficiency of XL is highly correlated with the data access characteristics, e.g., hotness. As the amount of critical data in XL increases, critical data also plays an important role in eviction efficiency.
Data Classification: Specifically, there are three types of data in XL with different characteristics: (1) critical data that should be reserved in XL, such as file system metadata and storage metadata; (2) critical data that should be evicted to QLC, namely the duplicates of critical data, such as the duplicates of file system metadata; and (3) non-critical data, such as temporary files that have little impact on system reliability. For the critical data that should be reserved in XL, when the block containing them is selected for eviction, they must be rewritten to XL; therefore, the fewer such data in the victim block, the better. The critical data that should be evicted to QLC are generally migrated within a short time by the migration process introduced in Section 4.3, so by the time an eviction operation is processed, this kind of data is mostly invalid; a victim block filled with such data can thus be erased quickly to recycle space. Non-critical data has its own access characteristics, so its lifetime differs from both kinds of critical data. Obviously, these three kinds of data with different lifetimes should be stored in different blocks to improve eviction efficiency. Specifically, three write heads are introduced for each plane, pointing to three different blocks that store the three kinds of data, respectively, as sketched below.
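A minimal sketch of this per-plane write-head selection is given below; the enum and function names are illustrative assumptions.

```c
#include <stdbool.h>

/* The three data classes kept in separate XL blocks, each served by its
 * own per-plane write head (names are illustrative). */
typedef enum {
    HEAD_RESERVED_CRITICAL,  /* critical data that stays in XL           */
    HEAD_BACKUP_CRITICAL,    /* duplicates that will be migrated to QLC  */
    HEAD_NON_CRITICAL        /* ordinary data                            */
} write_head_t;

/* Pick the write head for an incoming page based on its tags. */
write_head_t select_write_head(bool is_critical, bool is_backup_copy)
{
    if (!is_critical)
        return HEAD_NON_CRITICAL;
    return is_backup_copy ? HEAD_BACKUP_CRITICAL : HEAD_RESERVED_CRITICAL;
}
```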
Critical Data-guided Greedy Eviction Process: Figure 5 illustrates an example of how critical data-guided greedy eviction groups different kinds of data. Traditionally, different kinds of data are interleaved, as shown in blocks 0 to 2. For example, if block 0 is selected as the victim block, the half of its pages holding reserved critical data are rewritten to another XL block. At the same time, the quarter of its pages holding non-critical data are evicted to a QLC block, and the remaining pages are invalid data that can be directly erased. Therefore, erasing block 0 frees only half the capacity of a block, so the eviction efficiency is low.
Fig. 5. An example of critical data-guided greedy eviction.
By adopting critical data-guided greedy eviction, different kinds of data are stored in different blocks. For example, blocks 0' to 2' are used to store the reserved critical data, the evicted critical data, and the non-critical data, respectively. When an eviction operation needs to be processed, block 2' is selected first since it has the largest number of invalid pages. Since all of its data have been migrated to QLC, block 2' can be directly erased. Additionally, if some pages in block 2' are still valid, the critical data migration of CDB is triggered. In this case, the eviction process is still fast because, based on the migration trigger condition in Section 4.3, the amount of data to migrate is smaller than \(T_M\); the maximal cost is several stripe writes to QLC, which is acceptable. If another eviction operation is required, block 1' is selected. The non-critical data in it are evicted to QLC, and the capacity of an XL block is then recycled. In this case, the migration process of CDB is not activated since it would delay the execution of the eviction process without improving eviction efficiency. It is worth mentioning that block 0' will not be selected for erasure until it has the most invalid pages among all blocks, because the data stored in it are reserved critical data that cannot be evicted; erasing block 0' with zero invalid data would recycle no space. It should also be noted that the eviction granularity of critical data-guided greedy eviction is the stripe size of QLC. In this way, all parallel units of the slow memory are utilized to improve performance, and since the granularity is aligned with the access units of the slow memory, write amplification is avoided. If the amount of valid data in the victim block is less than the stripe size of the slow memory, the valid data are stored temporarily in the storage DRAM and another victim block is selected. Part of the valid data in the second victim block is combined with the valid data in the first victim block to construct a stripe write to the slow memory. Then, the first victim block is erased to recycle space.
Influence on GC when the device utilization is low: The above process is triggered when the device utilization is high. When the device utilization is low, the valid data in the victim block are moved to another block in the same area. In contrast to the critical data-guided greedy eviction process, the amount of invalid data is the most important metric for selecting the victim block, as in traditional greedy GC [3, 44]. Therefore, the block with the most invalid data is selected regardless of what kind of data it stores. Due to data classification, data with similar lifetimes are stored in the same block, as in multi-stream SSD methods [2, 21], so GC efficiency can also be improved when device utilization is low. However, because there is already a lot of invalid data when the device utilization is low, the improvement is slight.

5.5 Combining with CDB

Based on the above design, CDB backs up the critical data, while CDB Pro manages the data. The components proposed in CDB Pro can be well combined with the CDB design. First, pSLC region management dynamically adjusts the size of the pSLC region and schedules write requests to either XL or pSLC. For the critical data, the backup process is still accumulation and migration, as designed in CDB. For the non-critical data, the scheduling destination depends on the device state, which has no impact on the original CDB design. Second, critical data-guided greedy eviction groups different kinds of data into different blocks, leveraging the data characteristics introduced by the CDB design for data classification. Therefore, CDB Pro collaborates well with the CDB design.

6 Implementation and Discussion

In this section, the implementation of CDB and CDB Pro on hybrid storage is presented. In this article, the hybrid architecture constructed with XL and QLC flash is taken as an example, but the design is not tied to this configuration: CDB and CDB Pro require no changes as long as the fast and slow memory in a hybrid architecture differ in performance and access granularity. For example, Optane memory [14, 17] can also be used as the fast memory; its access granularity is 4 KB and its performance is better than that of TLC or QLC flash. For the implementation of CDB, three schemes are added to the flash controller. First, for the accumulation, the mapping table and critical data are maintained in XL. The host system needs to add a tag to indicate critical requests. Note that current interfaces, such as NVMe, have many reserved bits, one of which can be used as a critical tag [7]. The changes to the host system are minimal: only the requests for file system metadata and the operating system need to be tagged, and the tag is not set by default. Second, the migration is processed inside the storage. Note that we try to delay the migration if XL has free space; however, if the threshold is approached and the storage is idle, the migration can be activated. Finally, for the recovery, the storage is recovered first if the mapping table is lost, and then the system is recovered.
The proposed schemes have storage and computation costs, which are discussed as follows. The first is the storage cost, which includes two parts. The first part is the space overhead of the mapping table stored in XL. The LPN and PPN are set to 4 bytes and the logical page size is 4 KB, which can represent a 16-TB (\(2^{32}\times 2^{12}\) byte) storage. As a 4-byte entry represents a 4-KB logical page, the original size of the whole mapping table is about \(\frac{storage\ capacity}{1000}\). To store the backup data mapping entries in XL, the size of the whole mapping table doubles to about \(\frac{storage\ capacity}{500}\). As the XL capacity in hybrid storage is generally dozens of gigabytes, the overhead of the mapping table, several hundred megabytes, is acceptable. The second part is the space overhead of the cached mapping tables in the mapping cache. Each entry in the traditional cached mapping table becomes 2 bits (critical bit and backup bit) larger than before, which is negligible. For the BMT, its capacity can be limited by manufacturers based on the mapping table cache space. Since each BMT entry is composed of one LPN and one PPN of the duplicate, which is 8 bytes, the cached BMT size is about one five-hundredth of the accumulated critical data, which is acceptable. Once it is full, migration can be activated to release the space. The computation overhead mainly comes from the lookup of accumulated LPNs and the addition operations of the counter. Both operations are simple; thus, the computation cost is marginal. Because only critical data, a small part of the whole, are backed up, the additional energy overhead is expected to be negligible.
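The arithmetic behind these storage-overhead estimates can be checked with the short program below for an assumed 512-GB device; note that the exact ratios are capacity/1024 and capacity/512, which the text rounds to capacity/1000 and capacity/500.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative 512-GB device, 4-KB logical pages, 4-byte PPN entries. */
    const double capacity_kb = 512.0 * 1024.0 * 1024.0;   /* 512 GB in KB */
    const double page_kb     = 4.0;
    const double pages       = capacity_kb / page_kb;

    /* Original TMT: 4 bytes/page; with the duplicate's PPN: 8 bytes/page. */
    printf("original mapping table: %.0f MB (~capacity/1024)\n",
           pages * 4.0 / (1024.0 * 1024.0));
    printf("with backup PPNs      : %.0f MB (~capacity/512)\n",
           pages * 8.0 / (1024.0 * 1024.0));
    return 0;
}
```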
For the implementation of CDB Pro, two schemes, pSLC region management and critical data-guided greedy eviction, are added. In pSLC region management, four components are proposed. Among them, the Cd-Detector needs no additional cost to identify critical data on the basis of CDB, whereas the amount of critical data stored in XL additionally needs to be recorded. Assuming that the XL capacity is 32 GB and the page size is 4 KB, a 4-byte counter (\(\gt log_2{\frac{32\ GB}{4\ KB}}\) bits) is enough, which is negligible. Since the amount of free space is already recorded in the current flash controller, the Fs-Monitor causes no additional cost. The Rq-Scheduler only needs two comparisons, whose computation cost is acceptable. Thanks to the pSLC technique, which can adjust the pSLC region capacity flexibly [40], the pSLC-Regulator is low cost. In critical data-guided greedy eviction, each XL plane needs two additional write heads to direct different kinds of data to the corresponding blocks. Each write head requires \(log_2{N_{block\_per\_plane}}\) bits, which is 2 bytes at most. Therefore, the maximal overall overhead is \(2 \times N_{XL\_plane}\) bytes, which is negligible.
CDB and CDB Pro improve data reliability by duplicating data, as existing RAID-1-like approaches do. To better differentiate CDB and CDB Pro from existing RAID-1-like approaches, we list the key differences as follows. First, RAID-1-like approaches duplicate all data, whereas CDB and CDB Pro back up only the designated data. Second, RAID-1-like methods store data and their duplicates in different channels at the same time without considering the performance and granularity differences between the different storage types in the hybrid architecture. In contrast, CDB avoids the performance and endurance losses caused by the differences in performance and access granularity between the fast memory and the slow memory through the accumulation and migration processes during backup. Furthermore, pSLC region management and critical data-guided greedy eviction in CDB Pro further mitigate the negative impacts caused by the space occupied by backup data. Therefore, CDB and CDB Pro match the features of hybrid architectures well.

7 Experiment

7.1 Experimental Setup

To evaluate the performance and lifetime of CDB and CDB Pro, a simulator for consumer devices equipped with hybrid storage, HSsim, is developed. HSsim is significantly extended from SSDsim [12]. It can emulate the storage of the consumer devices used in this article, including the storage layout, mapping table, accumulation, migration, and recovery. Table 2 shows the key parameters of the simulation [22, 26]. In the evaluation, workloads collected from real devices are used. Specifically, we use a mobile workload collected from mobile devices over two hours; the other workloads are collected from the corresponding applications. The characteristics of the workloads are shown in Table 2, including the number of reads and writes, the average request sizes, and the total request sizes. For each workload, 40% of the LPNs are randomly labeled as critical by default. Specifically, the LPNs of all data are first collected and deduplicated; then, 40% of the unique LPNs are selected as critical LPNs. As a result, the data corresponding to the critical LPNs are always the critical data. To avoid complexity in the experiment, we assume that XL and QLC have enough space for all data writes and backups.
Simulation parameters:
             XL         QLC
  Channel    1          1
  Chip       2          4
  Page       4 KB       16 KB
  tR         4 \(\mu\)s       85 \(\mu\)s
  tPROG      75 \(\mu\)s      1630 \(\mu\)s

Workload characteristics:
  Workload    # of Reads   # of Writes   Avg. R. size   Avg. W. size   Total R. size   Total W. size
  Mobile      151797       48202         19.47 KB       9.76 KB        2886.22 MB      459.42 MB
  Earth       7590         11980         50.32 KB       89.62 KB       372.98 MB       1048.48 MB
  Facebook    11270        24130         22.93 KB       13.16 KB       252.36 MB       310.11 MB
  Twitter     6800         66260         21.74 KB       8.23 KB        144.37 MB       532.54 MB
  WeChat      14700        65500         8.84 KB        7.62 KB        126.9 MB        487.41 MB

Table 2. Simulation Parameters and Workload Characteristics [22, 26]

7.2 Experimental Results

7.2.1 When the Device Utilization Is Low.

To show the advantages of CDB, the following schemes are evaluated. Baseline: no data are backed up and all data are written to XL. SIRF-1 [32]: the most related work, extended to hybrid storage by writing the critical data to XL and its duplicate to QLC, as shown in Figure 1(b). CDB: the proposed work, which backs up designated critical data on hybrid storage, including the accumulation process and the migration process. CDB+CDB Pro: the proposed work with all schemes, including the accumulation process, the migration process, pSLC region management, and critical data-guided greedy eviction.
Performance: Figure 6 shows the normalized write latency of the evaluated schemes when the device utilization is low. First, compared with the baseline, SIRF-1 degrades the write performance by 3.63\(\times\). This demonstrates that the write performance of SIRF-1 is limited by the poor performance of QLC, since one copy of the critical data is written directly to QLC. Second, CDB decreases the write latency by 1.94\(\times\) compared with SIRF-1. For one thing, CDB removes the bottleneck resulting from the poor performance of QLC by accumulating two copies of critical data in XL. For another, thanks to the parallel write mechanism of flash memory, CDB requires only marginal performance overhead to back up data; its write latency is only around 28% longer than that of the baseline. Third, as the average write request size decreases, the write performance gap between SIRF-1 and CDB grows. For example, the performance gap between SIRF-1 and CDB on Earth, which has the largest average write request size among all evaluated workloads, is only 47%, whereas it is 345% on WeChat, whose average request size is the smallest. Since the parallel unit of QLC is larger than that of XL, small requests cannot make full use of the parallel capability of QLC to narrow the performance gap between them. Fourth, CDB+CDB Pro performs similarly to CDB because XL has sufficient space to store data in this case. Specifically, for pSLC region management, the device is in Stage I and all data are written to XL, which is the same as CDB. For critical data-guided greedy eviction, since low device utilization means there is a lot of invalid data, data classification provides only a slight improvement in GC efficiency. Additionally, the normalized read latencies of the evaluated schemes are similar, as shown in Figure 7, because SIRF-1, CDB, and CDB Pro have no impact on the data reading process.
Fig. 6.
Fig. 6. Normalized write latency when the device utilization is low.
Fig. 7.
Fig. 7. Normalized read latency when the device utilization is low.
WAF: In this part, the WAF is measured by dividing the total size of data written internally by the total size of host write requests. Figure 8 shows the results. The WAF of SIRF-1 reaches 2.13 on average, whereas the ideal value is 1.4 given that 40% of requests are backed up. The main reason is that one copy of each small critical request is rescheduled to the QLC, which requires padding with a lot of invalid data. In contrast, CDB performs much better, with a WAF of 1.59 on average. This is because the backup data are accumulated in XL into a large request whose size aligns with the parallel unit of the QLC before being written to the QLC. Note that SIRF-1 and CDB have a similar WAF for the Earth workload. This is because Earth’s request sizes are large, so backing up data to the QLC flash does not incur large write amplification. Overall, compared with SIRF-1, CDB reduces the WAF by 25% on average and 40% at most. For the same reason as in the performance analysis, CDB+CDB Pro has a WAF similar to that of CDB.
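As a quick check of the ideal value quoted above, the following sketch computes the WAF from byte counters. Only the formula (internal bytes written divided by host bytes written) comes from the text; the counter names are hypothetical.

```python
def write_amplification_factor(host_bytes, internal_bytes):
    """WAF = total bytes written to flash internally / bytes written by the host."""
    return internal_bytes / host_bytes

# Ideal case from the text: every host write lands once, and the 40% of the
# traffic marked critical is written a second time as a backup copy, with no
# padding or garbage-collection traffic.
host = 100.0
ideal_internal = host * (1.0 + 0.4)
print(write_amplification_factor(host, ideal_internal))  # 1.4
```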
Fig. 8.
Fig. 8. Normalized WAF when the device utilization is low.

7.2.2 When the Device Utilization Is High.

To explore the effect of the proposed design in this case, we set the XL capacity to 80% of the write footprint of each workload. To demonstrate the effectiveness of CDB Pro, three additional schemes are evaluated:
CDB+PSLC: CDB with pSLC region management added on top.
CDB+GE: CDB with critical data-guided greedy eviction added on top.
CDB+CDB Pro: The proposed scheme that combines both pSLC region management and critical data-guided greedy eviction with CDB.
Performance: Figure 9 shows the average write latency of all schemes, normalized to the baseline. CDB still performs significantly better than SIRF-1: the write performance improvement is 53.5% on average and 1.27\(\times\) at most. This demonstrates that the accumulation process of CDB contributes little to filling up XL, because the accumulated data are migrated in units of the QLC parallel unit within a short time. However, CDB does not perform best, since one copy of critical data is stored in XL permanently, which decreases the available space of XL. CDB Pro solves this problem well. Compared with CDB, CDB+CDB Pro improves the write performance by 22.8% on average and reduces the write latency to almost that of the baseline. The results of CDB+PSLC and CDB+GE break down the contribution of CDB Pro. On the one hand, compared with CDB, CDB+PSLC reduces the write latency by 8.81% on average. This reveals that pSLC region management improves write performance by scheduling data directly to the pSLC region, which avoids frequent data eviction from XL. On the other hand, CDB+GE improves the write performance by 16.73% on average compared with CDB. This indicates that critical data-guided greedy eviction reduces the cost of data eviction from XL by isolating different kinds of data. The read performance of all schemes is also evaluated and presented in Figure 10. Since the data reading process remains the same regardless of whether the data are critical, the read performance of all schemes is similar.
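The write-path behavior described above can be illustrated with a small sketch. It captures only what this section states: writes go to XL while it has free space (Stage I), and once XL is full, pSLC region management schedules incoming data to the pSLC region instead of forcing an immediate eviction from XL to QLC. The class and page-granularity model are our simplifications, not the paper's implementation.

```python
class HybridWritePath:
    """Toy model of where an incoming page write lands under pSLC region management."""

    def __init__(self, xl_free_pages, pslc_free_pages):
        self.xl_free_pages = xl_free_pages      # free pages in XL (fast flash)
        self.pslc_free_pages = pslc_free_pages  # free pages in the pSLC region

    def place_write(self):
        """Return the destination of the next page write."""
        if self.xl_free_pages > 0:
            # Stage I: XL still has space, so all data go to XL.
            self.xl_free_pages -= 1
            return "XL"
        if self.pslc_free_pages > 0:
            # XL is full: schedule the write directly to the pSLC region,
            # avoiding a synchronous eviction from XL to QLC.
            self.pslc_free_pages -= 1
            return "pSLC"
        # Both fast regions are full: an eviction to QLC is unavoidable.
        return "evict-then-XL"

path = HybridWritePath(xl_free_pages=1, pslc_free_pages=1)
print([path.place_write() for _ in range(3)])  # ['XL', 'pSLC', 'evict-then-XL']
```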
Fig. 9.
Fig. 9. The write latency when the device utilization is high.
Fig. 10.
Fig. 10. The read latency when the device utilization is high.
WAF: When the device utilization is high, the available space of XL is insufficient, which increases the WAF of all schemes. Figure 11 shows the WAF of the evaluated schemes, normalized to the baseline. Compared with the write amplification caused by accumulation and migration, the benefit of CDB is still dominant: CDB reduces the WAF by 22.5% on average and 46.64% at most compared with SIRF-1. However, CDB still faces the problem of critical data occupying XL space, which decreases its available space. CDB Pro alleviates this problem. Compared with CDB, CDB+CDB Pro reduces the WAF by 35.25% on average and 46.38% at most. The reason comes from two aspects. First, pSLC region management supplies extra space for data updates before eviction is executed. More invalid data are generated, which decreases the amount of data that must be evicted. Therefore, CDB+PSLC reduces the WAF by 12.67% on average and 24.58% at most compared with CDB. Second, critical data-guided greedy eviction improves eviction efficiency by putting data with similar invalidation times into the same block. On the one hand, blocks with little valid data are erased first, which reduces the amount of evicted data. On the other hand, blocks holding data that needs to be evicted are erased next, which avoids moving data that should be kept in XL. As a result, CDB+GE further reduces the WAF by 27.4% on average and 33.56% at most on the basis of CDB. Note that although CDB+CDB Pro reduces the WAF significantly compared with SIRF-1 and CDB, its WAF is still 84% larger than that of the baseline. Fortunately, the endurance of flash storage in consumer devices typically exceeds what users need, so sacrificing a little endurance to protect critical data and the user experience is worthwhile.
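A minimal sketch of the eviction order described above is given below, assuming each XL block is tagged with a data category and a count of valid pages. The greedy rule prefers blocks with the fewest valid pages and, among equals, blocks whose data are bound for QLC anyway, so that data meant to stay in XL are not copied around. The data structures and category names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    category: str    # "to-evict" (backup copy bound for QLC), "hot", or "keep-in-XL"
    valid_pages: int

def pick_victim(blocks):
    """Critical data-guided greedy victim selection.

    Blocks with the fewest valid pages are erased first (classic greedy GC);
    ties are broken in favor of blocks whose contents must leave XL anyway,
    so fewer pages are copied back into XL during eviction.
    """
    category_rank = {"to-evict": 0, "hot": 1, "keep-in-XL": 2}
    return min(blocks, key=lambda b: (b.valid_pages, category_rank[b.category]))

blocks = [
    Block(0, "keep-in-XL", valid_pages=10),
    Block(1, "to-evict", valid_pages=10),
    Block(2, "hot", valid_pages=40),
]
# Block 1 wins: same valid count as block 0, but its data leave XL anyway.
print(pick_victim(blocks).block_id)  # 1
```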
Fig. 11.
Fig. 11. The normalized WAF when the device utilization is high.

7.2.3 Sensitivity Analysis.

In this part, the percentage of critical data and the accumulation threshold \(T_A\) are studied. Since they are highly related to CDB, we analyze them when the device utilization is low. Since CDB and CDB+CDB Pro produce similar results for the reason given in Section 7.2.1, the results of CDB+CDB Pro are omitted to avoid redundancy. First, to show how performance and the WAF depend on the percentage of critical data, we vary the ratio of critical data from 0% to 100% for all evaluated workloads. Figure 12(a) presents the normalized write performance, where the write latency of SIRF-1 rises much more rapidly than that of CDB. Figure 12(b) shows that the WAF of CDB increases gradually whereas that of SIRF-1 grows more steeply. Both results indicate that CDB is less sensitive to the amount of critical data. Second, to find the optimal \(T_A\), which decides the maximal size of backup data that will be accumulated in XL, we vary \(T_A\) from 0 to no limit (NL). As shown in Figure 13, both performance and the WAF are best when \(T_A\) is 64 KB.
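To make the role of \(T_A\) in this study concrete, the following sketch shows the accumulation decision as summarized in the Conclusion: critical requests smaller than \(T_A\) keep both copies in XL (the backup copy is accumulated for later batched migration), while larger requests send one copy directly to QLC. The function name and return convention are hypothetical.

```python
T_A = 64 * 1024  # 64 KB, the best-performing threshold in Figure 13

def dispatch_critical_write(request_size):
    """Return the destinations of the two copies of a critical write request.

    Small requests keep both copies in the fast storage (XL); the backup copy
    is accumulated there and later migrated to QLC in batches aligned with
    the QLC parallel unit. Large requests write the backup copy to QLC directly.
    """
    if request_size < T_A:
        return ("XL", "XL")   # accumulate the backup copy in XL
    return ("XL", "QLC")      # back up to QLC directly

print(dispatch_critical_write(16 * 1024))   # ('XL', 'XL')
print(dispatch_critical_write(128 * 1024))  # ('XL', 'QLC')
```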
Fig. 12.
Fig. 12. The sensitivity study of the amount of backup data for (a) write latency and (b) WAF.
Fig. 13.
Fig. 13. The sensitivity study of the accumulation threshold for (a) write latency and (b) WAF.

8 Related Work

To improve the reliability of SSDs, parity-based RAID methods (RAID-4, RAID-5, and RAID-6) are widely used due to their low space overhead. Some of these works focus on reducing the overhead of computing and updating parity [27, 36, 42, 46]. For example, Li et al. [27] proposed using a buffer to store the modified part of data to avoid frequent parity updates. The authors of [46] designed an age-driven parity distribution scheme to guarantee wear-leveling among flash SSDs. The remaining works aim to improve the reliability of SSDs further [1, 33, 45]. For example, the works in [1] and [33] proposed distributing parity unevenly so that all devices do not wear out at the same time, since simultaneous wear-out would leave corrupted data unrecoverable. However, most of these methods target large systems with SSD arrays and cannot be used in consumer devices [34], because consumer devices are generally equipped with a single SSD and a weak controller that cannot afford parity computation.
RAID-1, which is simple to implement, is better suited to consumer devices. Current studies on adopting RAID-1-like methods on SSDs to improve data reliability [32, 36, 42] focus on two aspects. First, some works back up all data within a single SSD or an SSD array. By making full use of their parallelism, these works provide high data reliability with only slight performance degradation. For example, MacFadden et al. [32] introduced SIRF-1, which backs up data across channels within a single SSD. By storing two copies of data on different channels at the same time, SIRF-1 improves read performance with little write performance degradation. Second, some works mirror part of the data to reduce the parity cost of RAID-4 or RAID-5. For example, Wang et al. [42] and Pan and Xie [36] adopted the CR5M scheme, which mirrors data when the dedicated mirroring chip is ready to serve; otherwise, the corresponding RAID-5 parity is updated. However, both kinds of work are impractical for consumer devices. The first kind sacrifices half of the capacity, which is too much for users to accept; moreover, with hybrid storage now widely used in consumer devices, it also degrades performance and lifetime. The second kind still requires many chips and a strong controller to compute parity, which consumer devices cannot provide.

9 Conclusion

This work proposed a critical data backup scheme, CDB, to improve data reliability in consumer devices equipped with hybrid storage. CDB backs up designated data while respecting the characteristics of the different storage media in hybrid storage. Specifically, CDB separates the backup process into an accumulation process and a migration process. During the accumulation process, if the request size is smaller than the accumulation threshold \(T_A\), two copies of the designated critical data are written to the fast storage; otherwise, the two copies of the critical data are written to the fast and slow storage, respectively. During the migration process, one copy of the critical data stored in the fast storage is migrated to the slow storage in batches. By exploiting the differences between the two kinds of storage in performance and access granularity, CDB improves performance and lifetime compared with the state-of-the-art. Furthermore, to avoid the problems caused by too much critical data occupying the fast storage, CDB Pro is proposed. pSLC region management converts part of the slow storage into fast storage; by scheduling some data to this space, it forms a larger cache that reduces the eviction cost between the fast and slow storage. Critical data-guided greedy eviction places data with similar invalidation times into the same block, which improves eviction efficiency. Experimental results with real workloads show that CDB improves performance and lifetime dramatically compared with the state-of-the-art, and that CDB Pro brings further benefits on top of CDB.

References

[1]
Mahesh Balakrishnan, Asim Kadav, Vijayan Prabhakaran, and Dahlia Malkhi. 2010. Differential raid: Rethinking raid for SSD reliability. ACM Transactions on Storage (TOS) 6, 2 (2010), 1–22.
[2]
Janki Bhimani, Jingpei Yang, Zhengyu Yang, Ningfang Mi, N. H. V. Krishna Giri, Rajinikanth Pandurangan, Changho Choi, and Vijay Balakrishnan. 2017. Enhancing SSDs with multi-stream: What? why? how?. In IEEE 36th International Performance Computing and Communications Conference (IPCCC’17). IEEE, 1–2.
[3]
Werner Bux and Ilias Iliadis. 2010. Performance of greedy garbage collection in flash-based solid-state drives. Elsevier Performance Evaluation 67, 11 (2010), 1172–1186.
[4]
Yu Cai, Saugata Ghose, Yixin Luo, Ken Mai, Onur Mutlu, and Erich F. Haratsch. 2017. Vulnerabilities in MLC NAND flash memory programming: Experimental analysis, exploits, and mitigation techniques. In IEEE 23rd International Symposium on High Performance Computer Architecture (HPCA’17). 49–60.
[5]
Lanlan Cui, Fei Wu, Xiaojian Liu, Meng Zhang, Renzhi Xiao, and Changsheng Xie. 2021. Improving LDPC decoding performance for 3D TLC NAND flash by LLR optimization scheme for hard and soft decision. ACM Trans. Des. Autom. Electron. Syst. 27, 1, Article 5 (2021), 20 pages.
[6]
David A. Patterson, Garth Gibson, and Randy H. Katz. 1988. A case for redundant arrays of inexpensive disks (RAID). In ACM Special Interest Group on Management of Data (SIGMOD’88). 109–116.
[8]
Wu Fei, Lu Zuo, Zhou You, X. He, and C. Xie. 2018. OSPADA: One-shot programming aware data allocation policy to improve 3D NAND flash read performance. In IEEE 36th International Conference on Computer Design (ICCD’18). 51–58.
[9]
Congming Gao, Liang Shi, Cheng Ji, Yejia Di, Kaijie Wu, Chun Jason Xue, and Edwin Hsing-Mean Sha. 2018. Exploiting parallelism for access conflict minimization in flash-based solid state drives. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (TCAD) 37, 1 (2018), 168–181.
[10]
Congming Gao, Liang Shi, Chun Jason Xue, Cheng Ji, Jun Yang, and Youtao Zhang. 2019. Parallel all the time: Plane level parallelism exploration for high performance SSDs. In IEEE 35th Symposium on Mass Storage Systems and Technologies (MSST’19). 172–184.
[11]
Ben Gu, Longfei Luo, Yina Lv, Changlong Li, and Liang Shi. 2021. Dynamic file cache optimization for hybrid SSDs with high-density and low-cost flash memory. In IEEE 39th International Conference on Computer Design (ICCD’21). 170–173.
[12]
Yang Hu, Hong Jiang, Dan Feng, Lei Tian, Hao Luo, and Shuping Zhang. 2011. Performance impact and interplay of SSD parallelism through advanced commands, allocation strategy and data granularity. In ACM 11th Proceedings of the International Conference on Supercomputing (ICS’11). 96–107.
[13]
Hyperstone. 2019. How pseudo-SLC mode can make 3D NAND flash more reliable. https://www.hyperstone.com/en/How-pseudo-SLC-mode-can-make-3D-NAND-flash-more-reliable-2524.html
[19]
Micron Inc. 2018. Micron Crucial P1 Product. https://www.crucial.com/products/ssd/p1-ssd-series
[20]
Cheng Ji, Li-Pin Chang, Liang Shi, Chao Wu, Qiao Li, and Chun Jason Xue. 2016. An empirical study of file-system fragmentation in mobile storage systems. USENIX 8th Workshop on Hot Topics in Storage and File Systems (HotStorage), 76–80.
[21]
Jeong-Uk Kang, Jeeseok Hyun, Hyunjoo Maeng, and Sangyeun Cho. 2014. The multi-streamed solid-state drive. In USENIX 6th Workshop on Hot Topics in Storage and File Systems (HotStorage’14). 1–5.
[22]
A. Khakifirooz, S. Balasubrahmanyam, R. Fastow, K. H. Gaewsky, and P. Kalavade. 2021. 30.2 A 1Tb 4b/Cell 144-Tier floating-gate 3D-NAND flash memory with 40MB/s program throughput and 13.8 Gb/mm² bit density. In IEEE 16th International Solid-State Circuits Conference (ISSCC’21). 1–3.
[23]
D. Kim, H. Kim, et al. 2020. A 1Tb 4b/cell NAND flash memory with tPROG=2ms, tR=110µs and 1.2Gb/s high-speed IO rate. In IEEE 15th International Solid-State Circuits Conference (ISSCC’20). 218–220.
[24]
Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu. 2012. Revisiting storage for smartphones. In ACM Transactions on Storage (TOS’12). 1–25.
[25]
Jaeho Kim, Jongmin Lee, Jongmoo Choi, Donghee Lee, and Sam H. Noh. 2013. Improving SSD reliability with RAID via elastic striping and anywhere parity. In IEEE 43rd International Conference on Dependable Systems and Networks (DSN’13). 1–12.
[26]
Toshiyuki Kouchi, Noriyasu Kumazaki, et al. 2020. 13.5 A 128Gb 1b/Cell 96-word-line-layer 3D flash memory to improve random read latency with tPROG=75µs and tR=4µs. In IEEE 15th International Solid-State Circuits Conference (ISSCC’20). 226–228.
[27]
Jun Li, Zhibing Sha, Zhigang Cai, François Trahay, and Jianwei Liao. 2020. Patch-based data management for dual-copy buffers in RAID-enabled SSDs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 11 (2020), 3956–3967.
[28]
Qiao Li, Liang Shi, Yufei Cui, and Chun Jason Xue. 2020. Exploiting asymmetric errors for LDPC decoding optimization on 3D NAND flash memory. IEEE Transactions on Computers (TC) 69, 4 (2020), 475–488.
[29]
Qiao Li, Min Ye, Tei-Wei Kuo, and Chun Jason Xue. 2021. How the common retention acceleration method of 3D NAND flash memory goes wrong?. In ACM 13th Workshop on Hot Topics in Storage and File Systems (HotStorage’21). 1–7.
[30]
Yongkun Li, Biaobiao Shen, Yubiao Pan, Yinlong Xu, Zhipeng Li, and John C. S. Lui. 2017. Workload-aware elastic striping with hot data identification for SSD RAID arrays. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (TCAD) 36, 5 (2017), 815–828.
[31]
Longfei Luo, Dingcui Yu, Liang Shi, Chuanmin Ding, Changlong Li, and Edwin H.-M. Sha. 2022. CDB: Critical data backup design for consumer devices with high-density flash based hybrid storage. In 59th ACM/IEEE Design Automation Conference (DAC’22). 391–396.
[32]
Michael S. MacFadden, Richard Shelby, and Tao Xie. 2015. SIRF-1: Enhancing reliability of single flash SSD through internal mirroring for mission-critical mobile applications. In IEEE/ACM 15th International Symposium on Cluster, Cloud and Grid Computing (CCGrid’15). 343–351.
[33]
Alistair A. McEwan and Irfan F. Mir. 2012. Age distribution convergence mechanisms for flash based file systems. J. Comput. (JCP) 7, 4 (2012), 988–997.
[34]
Sangwhan Moon and A. L. Narasimha Reddy. 2013. Don’t let RAID raid the lifetime of your SSD array. In USENIX 5th Workshop on Hot Topics in Storage and File Systems (HotStorage’13). 1–5.
[35]
Yongseok Oh, Jongmoo Choi, Donghee Lee, and Sam H. Noh. 2014. Improving performance and lifetime of the SSD RAID-based host cache through a log-structured approach. ACM Special Interest Group on Operating Systems (SIGOPS) Operating Systems Review (OSR) 48, 1 (2014), 90–97.
[36]
Wen Pan and Tao Xie. 2018. A mirroring-assisted channel-RAID5 SSD for mobile applications. ACM Transactions on Embedded Computing Systems (TECS) 17, 4 (2018), 1–27.
[37]
Yubiao Pan, Yongkun Li, Yinlong Xu, and Zhipeng Li. 2015. Grouping-based elastic striping with hotness awareness for improving SSD RAID performance. In IEEE 45th International Conference on Dependable Systems and Networks (DSN’15). 160–171.
[38]
Zhaoyan Shen, Lei Han, Chenlin Ma, Zhiping Jia, Tao Li, and Zili Shao. 2021. Leveraging the interplay of RAID and SSD for lifetime optimization of flash-based SSD RAID. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (TCAD) 40, 7 (2021), 1395–1408.
[39]
Liang Shi, Jianhua Li, Chun Jason Xue, Chengmo Yang, and Xuehai Zhou. 2011. ExLRU: A unified write buffer cache management for flash memory. In ACM 9th International Conference on Embedded Software (EMSOFT’11). 339–348.
[40]
Liang Shi, Longfei Luo, Yina Lv, Shicheng Li, Changlong Li, and Edwin Hsing-Mean Sha. 2021. Understanding and optimizing hybrid SSD with high-density and low-cost flash memory. In IEEE 39th International Conference on Computer Design (ICCD’21). 236–243.
[41]
Yi Wang, Jiangfan Huang, Jing Yang, and Tao Li. 2019. A temperature-aware reliability enhancement strategy for 3-D charge-trap flash memory. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (TCAD) 38, 2 (2019), 234–244.
[42]
Yu Wang, Wei Wang, Tao Xie, Wen Pan, Yanyan Gao, and Yiming Ouyang. 2014. CR5M: A mirroring-powered channel-RAID5 architecture for an SSD. In IEEE 30th Symposium on Mass Storage Systems and Technologies (MSST’14). 1–10.
[43]
Chun-Feng Wu, Martin Kuo, Ming-Chang Yang, and Yuan-Hao Chang. 2022. Performance enhancement of SMR-based deduplication systems. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (TCAD) 41, 9 (2022), 2835–2848.
[44]
Yudong Yang, Vishal Misra, and Dan Rubenstein. 2015. On the optimality of greedy garbage collection for SSDs. ACM SIGMETRICS Performance Evaluation Review (PER) 43, 2 (2015), 63–65.
[45]
Wei Yi, Zhaolin Sun, Hui Xu, Jietao Diao, Nan Li, and Mingqian Wang. 2014. Dual RAID: A scheme for high reliable all flash array. In IEEE 11th International Joint Conference on Computer Science and Software Engineering (JCSSE’14). 218–222.
[46]
Du Yimo, Liu Fang, Chen Zhiguang, and Ma Xin. 2011. WeLe-RAID: A SSD-based RAID for system endurance and performance. In Springer 8th Network and Parallel Computing (NPC’11). 248–262.
[47]
You Zhou, Qiulin Wu, Fei Wu, Hong Jiang, Jian Zhou, and Changsheng Xie. 2021. Remap-SSD: Safely and efficiently exploiting SSD address remapping to eliminate duplicate writes. In USENIX 19th Conference on File and Storage Technologies (FAST’21). 187–202.
