ReductStore: An Efficient Time-Series Database for IoT and Edge Computing
Abstract:
The article explores various data storage solutions, focusing on their suitability for different
applications, particularly in the context of time series data and unstructured data management.
Traditional time series databases like InfluxDB, OpenTSDB, and TimescaleDB are highlighted for
their strengths in handling structured data with high write throughput and complex queries. InfluxDB,
for instance, is noted for its analytics and real-time monitoring capabilities, while OpenTSDB excels
in long-term storage and analysis of massive datasets. The aim of this paper is to explore a solution
that supports unstructured data for IoT, edge computing, and AI applications. An important challenge
in these sorts of networks is effectively managing an increasing amount of data from many sources
and diverse forms of time series data in order to meet the performance demands of applications. Time
series data management in the Internet of Things (IoT) is crucial for optimizing operations and has
emerged as a significant area of research. The major objective of this work is to conduct a
comprehensive assessment and analysis of the current data management methods utilized in the IoT field.
Introduction:
Every day, enormous volumes of raw data are produced by the continuing expansion of
applications such as the World Wide Web, e-commerce, and the Internet of Things (IoT). A
typical industrial situation involves the interaction of hundreds of devices, each fitted with
numerous sensors, which together continuously generate billions of data points [1]. Sensors
embedded in a wide variety of IoT devices provide massive amounts of time series data. Data
management is essential for intelligent analysis, and it should be done both in the cloud and
at the edge for real-time control [2]. As a result, time series data management faces new
requirements. At the same time, 80% of the available data is unstructured, with a significant
portion generated in real time [3]. However, only 18% of organizations are presently capable
of making use of this opportunity.
Intelligent infrastructure requires resilient data storage, because new applications will
query this data over time. Time-series databases usually handle these queries; however,
cloud-based time-series storage can be prohibitively expensive because of its inherent
complexity [4]. The proliferation of smart devices presents an opportunity to move resilient
time-series data storage and analytics to the edge of the network. This is made possible by
the computational power and memory that are now available in edge computing.
The edge paradigm is more suitable than the cloud for infrastructure like the smart grid [5].
This is an especially promising area for IoT devices to monitor, improve, and sustain the
essential infrastructure of contemporary civilization [6]. For data-intensive applications,
Binary Large Objects, or blobs, are an increasingly common storage paradigm [7]. The
low-level, fine-grained binary access mechanisms that blobs provide give the user extensive
control over the data layout. This makes it possible to optimize for individual applications,
which is harder to achieve with structured storage systems such as relational databases or
key-value stores.
There is currently no simple solution for managing linked data streams and sets in complex,
large-scale distributed applications and ensuring that they remain synchronized with their
respective indexes [8]. In an effort to meet these new problems, storage and file systems are
undergoing a new generation of development and optimization. It is of the utmost importance
to find reliable ways to handle the abundance of data linked to AI models. When metadata is
well-structured and readily available, it improves the overall performance of AI systems by
making data retrieval and use more efficient.
Traditional DB systems:
The use of SQL-based data querying has become less efficient as the volume of data
continues to increase. In particular, the management of larger databases has become a
significant difficulty [9]. A growing number of businesses are relying on non-relational
databases (NoSQL) to store their massive volumes of unpredictable data. NoSQL is becoming
increasingly popular among businesses that collect vast amounts of unstructured data for the
following reasons:
➢ The primary motivation for transitioning to NoSQL databases is the requirement for
efficient storage, scalability, and performance in handling large volumes of data.
➢ NoSQL databases provide better scalability than traditional databases by
concentrating on the analytical processing of massive datasets on commodity
hardware [10].
➢ Data created by sophisticated apps, cloud computing, and intelligent devices can be
stored and analyzed using NoSQL through a variety of interfaces [9]. This data can
include user-generated content, private information, and spatial data.
➢ Big Data analysis with these NoSQL data models is straightforward and easy to
implement, and it doesn't require complex SQL optimization techniques.
➢ Due to the inefficiency of relational databases in handling clusters and huge data
analytics, corporations are looking into the necessity of adopting the NoSQL
revolution.
➢ The primary catalyst for the emergence of NoSQL databases was the availability of
numerous databases that developers could utilize independently of traditional legacy
systems.
Problems with latency and bandwidth could arise from centralized data processing. One way
to address these difficulties and guarantee faster and more secure data processing is by
distributing processing jobs closer to the data source [11]. It is frequently not feasible to store
massive volumes of unstructured data. Filtering and other data reduction procedures are
crucial for efficient long-term storage management because they allow just the most relevant
information to be retained.
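As a rough illustration of the kind of filtering described above, the sketch below keeps a sample only when it differs from the last retained value by more than a threshold (a dead-band filter). The function name, the threshold, and the sample values are hypothetical; the text does not prescribe a particular reduction method, so this is only one possible approach.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, float]  # (timestamp in microseconds, value)


def deadband_filter(samples: Iterable[Sample], threshold: float) -> List[Sample]:
    """Keep a sample only if it moved more than `threshold` from the last kept one."""
    kept: List[Sample] = []
    for ts, value in samples:
        if not kept or abs(value - kept[-1][1]) > threshold:
            kept.append((ts, value))
    return kept


# Hypothetical temperature readings, one per second.
readings = [(0, 20.0), (1_000_000, 20.1), (2_000_000, 20.05),
            (3_000_000, 21.0), (4_000_000, 21.2), (5_000_000, 19.5)]
reduced = deadband_filter(readings, threshold=0.5)
print(reduced)  # [(0, 20.0), (3000000, 21.0), (5000000, 19.5)]
```

With these made-up readings, six samples shrink to three, and the retained points still mark every change larger than the threshold.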
Analysis of Existing Solutions:
InfluxDB [12] is widely regarded as the most prominent centralized time-series database.
It stores timestamped data points in series. A data point can contain both fields and tags:
fields are required for every data point, whereas tags are optional. Furthermore, InfluxDB
supports series-specific retention policies, which cover the data retention duration, data
transfer to long-term storage, and data replication frequency. InfluxData provides a free
open-source core that contains Telegraf, InfluxDB, Chronograf, and Kapacitor (referred to as
the TICK Stack) [13]. InfluxDB does not support joins, so the schema must be designed in such
a way that they are not required. If joins cannot be avoided, InfluxDB is not a practical
choice. Because InfluxDB does not offer full CRUD support, it is also a poor fit for use
cases that require frequent updates and deletions.
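To make the tag and field distinction concrete, the sketch below formats one data point in a simplified form of InfluxDB's line protocol (`measurement,tags fields timestamp`). The helper names and the example values are hypothetical, and escaping of spaces and commas in names is deliberately left out, so treat it as an illustration rather than a complete serializer.

```python
from typing import Dict, Optional, Union

FieldValue = Union[float, int, str, bool]


def format_field(value: FieldValue) -> str:
    """Render one field value in line-protocol syntax."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    return '"' + value.replace('"', '\\"') + '"'  # strings are double-quoted


def to_line_protocol(measurement: str,
                     fields: Dict[str, FieldValue],
                     tags: Optional[Dict[str, str]] = None,
                     timestamp_ns: Optional[int] = None) -> str:
    """Build one line: tags are optional, at least one field is required."""
    if not fields:
        raise ValueError("a data point needs at least one field")
    tags = tags or {}
    head = measurement + "".join(f",{k}={tags[k]}" for k in sorted(tags))
    body = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


point = to_line_protocol("vibration", {"rms": 0.42, "count": 3},
                         tags={"machine": "press1"},
                         timestamp_ns=1721218293000000000)
print(point)  # vibration,machine=press1 rms=0.42,count=3i 1721218293000000000
```

A point without tags is still valid here, whereas a point without fields raises an error, which mirrors the rule stated in the paragraph above.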
OpenTSDB [14] uses tags to add dimensionality to its interface. Its storage layer is built
on HBase, but it cannot perform analytics during the writing process. In addition, the
system only provides millisecond resolution, which is inadequate for real-time applications
such as power management. TimescaleDB [14], which functions as an extension of PostgreSQL,
splits tables into small chunks to make them more manageable. Additionally, it processes the
data in a column-oriented style, which enables the data from each data source (column) to be
managed individually. Storage overhead is further reduced by compressing data chunks
according to time ranges before storing them.
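The idea of splitting a table into time-range chunks and compressing each chunk as a unit can be sketched in a few lines. This is not TimescaleDB's implementation, which works inside PostgreSQL; the function names, the 60-second interval, and the JSON plus zlib encoding are assumptions chosen only to keep the example self-contained.

```python
import json
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, float]  # (timestamp in seconds, value)


def chunk_by_time(rows: List[Row], interval_s: int) -> Dict[int, List[Row]]:
    """Group rows into chunks keyed by the start of their time interval."""
    chunks: Dict[int, List[Row]] = defaultdict(list)
    for ts, value in rows:
        chunks[ts - ts % interval_s].append((ts, value))
    return dict(chunks)


def compress_chunk(rows: List[Row]) -> bytes:
    """Serialize one chunk and compress it as a unit."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def decompress_chunk(blob: bytes) -> List[Row]:
    """Reverse compress_chunk; JSON lists are turned back into tuples."""
    return [(int(ts), float(v)) for ts, v in json.loads(zlib.decompress(blob))]


rows = [(0, 1.0), (30, 1.5), (60, 2.0), (119, 2.5), (120, 3.0)]
chunks = chunk_by_time(rows, interval_s=60)
stored = {start: compress_chunk(chunk) for start, chunk in chunks.items()}
print(sorted(stored))  # [0, 60, 120]
```

A query for a given time range then only needs to decompress the chunks whose start keys overlap that range.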
Table 1: Comparison of Traditional Systems

| System | Data Structure | Query Language | Performance | Application |
|---|---|---|---|---|
| InfluxDB | Structured time series data (TSD) with tags and fields | InfluxQL, a SQL-like query language, and Flux, a powerful data scripting language | Good at high write throughput and complex queries | Best for systems that need analytics and real-time monitoring with structured data |
| OpenTSDB | Stores data in a key-value format using HBase, which is suitable for large-scale data storage | HTTP API queries with support for aggregations and downsampling | Scales well horizontally, making it ideal for very large datasets | Suitable for long-term storage and analysis of massive time series datasets, such as server metrics |
| TimescaleDB | Relational data model of PostgreSQL | Uses SQL for querying, leveraging the full capabilities of PostgreSQL | Optimized for high ingestion rates and complex queries through hypertables and chunking | Ideal for applications needing relational data features along with time series data management |
| Graphite | Uses a flat file storage system for time series data | Provides a simple API for querying and rendering data | Good for small to big datasets with simple visualization | Best for monitoring and visualization of application performance metrics |
| MongoDB | Uses a flexible, document-oriented data model | Employs a rich query language that supports ad-hoc queries and indexing | Can handle large datasets and high throughput with appropriate indexing | Fits flexible schema and different data storage demands, including time series data |
| MinIO | Stores data as objects in buckets, similar to AWS S3 | S3-compatible API for object access | Optimized for large-scale object storage with high availability and redundancy | Ideal for applications needing scalable and resilient object storage |
| OpenIO | Stores data as objects with a flat namespace | Provides an API for data access but no native query language | Scales horizontally and supports high availability and redundancy | Suitable for cloud storage, big data, and backup solutions |

Challenging the Traditional Systems:
The above-mentioned systems are all effective but lack specificity; they allow for versatile
access techniques such as tags, which result in a notable decrease in performance.
Additionally, several systems cannot accurately capture and respond to events that occur in
real time within a timeframe of less than one millisecond, such as those in the power grid.
For the most part, traditional cloud computing privacy and security mechanisms do not apply
to edge devices, and edge nodes are unable to carry out algorithmic operations. Existing
related projects have not been able to successfully merge cloud computing and edge computing.
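To see why sub-millisecond data strains a store with millisecond resolution, consider a hypothetical sensor sampled at 10 kHz: samples arrive every 100 microseconds, so several of them share the same millisecond. One way around this, sketched below under that assumption, is to pack a run of samples into a single binary chunk that carries only its start timestamp; the 10 kHz rate, the float32 encoding, and the function names are illustrative choices, not something the systems above mandate.

```python
import struct
from typing import List, Tuple

SAMPLE_RATE_HZ = 10_000                           # assumed example rate
SAMPLE_INTERVAL_US = 1_000_000 // SAMPLE_RATE_HZ  # 100 microseconds


def pack_chunk(start_us: int, samples: List[float]) -> Tuple[int, bytes]:
    """Pack samples as little-endian float32 under a single start timestamp."""
    return start_us, struct.pack(f"<{len(samples)}f", *samples)


def sample_time_us(start_us: int, index: int) -> int:
    """Recover the timestamp of the sample at `index` from the chunk start."""
    return start_us + index * SAMPLE_INTERVAL_US


one_second = [0.0] * SAMPLE_RATE_HZ  # placeholder signal, one second long
start, blob = pack_chunk(1_721_218_293_000_000, one_second)
print(len(blob))                 # 40000 bytes for 10,000 float32 samples
print(sample_time_us(start, 3))  # 1721218293000300
```

Individual sample times stay recoverable from the chunk start and the sample index, so one write per second replaces 10,000 separate writes.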
ReductStore is a time series object store developed with the goal of achieving high write
throughput without compromising read efficiency. It is a specialized system created to
effectively handle and organize time-ordered data. The system is composed of buckets, entries,
blocks, and records. Buckets are containers that store data entries and have configurable
attributes such as maximum block size, maximum record count, and quota type to govern storage
limits. ReductStore is designed to manage massive volumes of unstructured "blob" data. Its
support for an HTTP API makes it adaptable and easy to integrate with other applications.
Performance-sensitive edge computing applications benefit greatly from ReductStore's design,
which prioritizes efficient storage and retrieval of binary data.
ReductStore focuses on unstructured time-series data storage for the deployment of
artificial intelligence and edge computing services in order to boost overall performance.
Its programmers, designers, and operations engineers collaborated with end users, especially
in IoT-powered industry, to build the platform from the ground up. The platform gives
customers and organizations access to a wide variety of opportunities, such as increased
efficiency and shorter query times for unstructured time series data. The data of
high-quality prospects can be accessed 10 times faster due to decentralized edge computing
and AI.
ReductStore's engineers built a custom time series database for unstructured data. Compared
with TimescaleDB and InfluxDB, it stands out with significantly better performance for
record sizes of 10 kB and above (the optimal range is 10 kB to 10 MB). Its time-based indexing
sets it apart from S3-style object storage options such as MinIO. ReductStore also offers
the ability to filter data by labels or metadata, a strict FIFO quota that is well suited to
edge computing, and automatic replication of data to a remote instance.
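The combination of a FIFO quota and label-based filtering can be pictured with a toy in-memory bucket. This is a simplified sketch, not ReductStore's implementation or client API: the class and method names are hypothetical, eviction works per record rather than per block, and records are assumed to be written in time order.

```python
from collections import deque
from typing import Deque, Dict, List, NamedTuple


class Record(NamedTuple):
    timestamp_us: int
    data: bytes
    labels: Dict[str, str]


class FifoBucket:
    """Toy bucket that evicts the oldest records once a size quota is reached."""

    def __init__(self, quota_bytes: int) -> None:
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self.records: Deque[Record] = deque()

    def write(self, record: Record) -> None:
        if len(record.data) > self.quota_bytes:
            raise ValueError("record is larger than the whole quota")
        # Evict from the front (oldest) until the new record fits.
        while self.used_bytes + len(record.data) > self.quota_bytes:
            oldest = self.records.popleft()
            self.used_bytes -= len(oldest.data)
        self.records.append(record)
        self.used_bytes += len(record.data)

    def query(self, start_us: int, stop_us: int, **labels: str) -> List[Record]:
        """Return records in [start_us, stop_us) whose labels match all filters."""
        return [r for r in self.records
                if start_us <= r.timestamp_us < stop_us
                and all(r.labels.get(k) == v for k, v in labels.items())]


bucket = FifoBucket(quota_bytes=30)
for i in range(5):
    sensor = "a" if i % 2 == 0 else "b"
    bucket.write(Record(i * 1000, b"x" * 10, {"sensor": sensor}))

print([r.timestamp_us for r in bucket.records])                     # [2000, 3000, 4000]
print([r.timestamp_us for r in bucket.query(0, 10_000, sensor="a")])  # [2000, 4000]
```

Because the quota is enforced on every write, disk usage never exceeds the configured limit, which is the property that matters on edge devices with small disks.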
Analysis of ReductStore:
ReductStore is optimized for unstructured binary data, whereas traditional solutions like
InfluxDB, OpenTSDB, and TimescaleDB are designed for structured time series data. Solutions
like MinIO and OpenIO are geared towards object storage rather than time series data
specifically. While other databases offer more complicated query languages or APIs that are
suited to particular data models, ReductStore uses a plain HTTP application programming
interface (API) for data access. Its forward writing and batching make ReductStore well
suited to fast writes of binary data. Object storage systems, by contrast, prioritise
scalability and redundancy, while traditional time series databases shine when it comes to
querying and analysing structured data.
ReductStore efficiently manages large datasets with features that optimize both storage and
access. The solution provides robust data replication, ensuring data availability and
consistency across different storage buckets. It also minimizes network overhead, particularly
in environments with high latency, by batching data retrieval operations. Additionally,
ReductStore’s real-time FIFO quota system ensures that disk space is optimally managed,
preventing shortages that could disrupt operations.
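The benefit of batching on a high-latency link can be estimated with simple arithmetic: each request costs one round trip, so fewer requests means less time spent waiting. The model below is a deliberately crude sketch; the function name, the link parameters, and the assumption that requests are issued one after another are all hypothetical.

```python
def transfer_time_s(n_records: int, record_bytes: int, rtt_s: float,
                    bandwidth_bytes_per_s: float, batch_size: int = 1) -> float:
    """Estimate total time: one round trip per request plus payload transfer time."""
    n_requests = -(-n_records // batch_size)  # ceiling division
    payload_s = n_records * record_bytes / bandwidth_bytes_per_s
    return n_requests * rtt_s + payload_s


# Hypothetical link: 100 ms round trip, 10 MB/s, 1,000 records of 100 kB each.
unbatched = transfer_time_s(1000, 100_000, 0.1, 10_000_000)
batched = transfer_time_s(1000, 100_000, 0.1, 10_000_000, batch_size=50)
print(f"unbatched: {unbatched:.1f} s, batched: {batched:.1f} s")
```

Under these assumed numbers the payload time is the same in both cases, and the entire difference comes from the number of round trips.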
Table 2: Comparative Analysis of ReductStore with Other Solutions

| Record Size | vs TimescaleDB: Read Speed (%) | vs TimescaleDB: Write Speed (%) | vs MongoDB: Read Speed (%) | vs MongoDB: Write Speed (%) | vs MinIO: Read Speed (%) | vs MinIO: Write Speed (%) |
|---|---|---|---|---|---|---|
| 10 MB | +850% | +1300% | +65% | +158% | +52% | +217% |
| 1 MB | +855% | +1075% | +112% | +137% | +170% | +333% |
| 100 KB | +217% | +205% | +198% | +155% | +548% | +214% |
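The percentages in Table 2 can be turned into multiplicative factors, which are often easier to compare. The sketch below assumes that "+P%" means ReductStore's speed is P% higher than the baseline, so the factor is 1 + P/100; the table does not state this definition explicitly, and the function name is hypothetical.

```python
def speedup_factor(percent_gain: float) -> float:
    """Convert a '+P%' gain into a multiplicative factor relative to the baseline."""
    return 1.0 + percent_gain / 100.0


# Read-speed gains against TimescaleDB, taken from Table 2.
gains = {"10 MB": 850, "1 MB": 855, "100 KB": 217}
factors = {size: speedup_factor(p) for size, p in gains.items()}
for size, factor in factors.items():
    print(f"{size}: about {factor:.2f}x the baseline read speed")
```

Under that reading, a gain of +850% corresponds to roughly 9.5 times the baseline speed rather than 8.5 times.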
Conclusion:
After careful consideration, we have determined that the systems that we have selected are
representative of the most advanced unstructured storage solutions for cloud computing.
ReductStore is well-suited for scenarios involving edge computing and the Internet of Things
(IoT) that require the effective storage and rapid retrieval of massive quantities of binary data
because of its emphasis on unstructured data, efficient batching, and FIFO quota
management. ReductStore offers a unique solution tailored for unstructured time series data,
making it a strong contender for edge computing applications where managing large blobs
efficiently is critical. Its simplicity, performance optimizations, and focus on unstructured
data set it apart from more traditional time series databases like InfluxDB, OpenTSDB, and
TimescaleDB, as well as object storage solutions like MinIO and OpenIO.
For applications requiring efficient handling of binary data with straightforward integration
and high write performance, ReductStore is an excellent choice. However, for scenarios
needing detailed analytics and real-time monitoring of structured data, traditional time series
databases remain highly effective solutions. Similarly, for large-scale object storage needs,
MinIO and OpenIO offer robust and scalable options. As companies face new challenges in
efficiently managing and analyzing massive amounts of unstructured data, they are
increasingly turning to unstructured databases.
References:
1. Munirathinam, Sathyan. "Industry 4.0: Industrial internet of things (IIOT)." In Advances in
computers, vol. 117, no. 1, pp. 129-164. Elsevier, 2020.
2. Ghosh, Ananda Mohon, and Katarina Grolinger. "Edge-cloud computing for Internet of Things
data analytics: Embedding intelligence in the edge with deep learning." IEEE Transactions on
Industrial Informatics 17, no. 3 (2020): 2191-2200.
3. Mishra, Suyash, and Anuranjan Misra. "Structured and unstructured big data analytics."
In 2017 International Conference on Current Trends in Computer, Electrical, Electronics and
Communication (CTCEEC), pp. 740-746. IEEE, 2017.
4. Wang, Zhiqi, and Zili Shao. "TimeUnion: An Efficient Architecture with Unified Data Model for
Timeseries Management Systems on Hybrid Cloud Storage." In Proceedings of the 2022
International Conference on Management of Data, pp. 1418-1432. 2022.
5. Li, Junlong, Chenghong Gu, Yue Xiang, and Furong Li. "Edge-cloud computing systems for
smart grid: state-of-the-art, architecture, and applications." Journal of Modern Power Systems
and Clean Energy 10, no. 4 (2022): 805-817.
6. Krentz, Timothy, Abhishek Dubey, and Gabor Karsai. "Short paper: Towards an edge-located
time-series database." In 2019 IEEE 22nd International Symposium on Real-Time Distributed
Computing (ISORC), pp. 151-154. IEEE, 2019.
7. Matri, Pierre, Alexandru Costan, Gabriel Antoniu, Jesús Montes, and María S. Pérez. "Týr:
Efficient Transactional Storage for Data-Intensive Applications." PhD diss., Inria Rennes
Bretagne Atlantique; Universidad Politécnica de Madrid, 2016.
8. Wu, Lengdong, Liyan Yuan, and Jiahuai You. "Survey of large-scale data management
systems for big data applications." Journal of computer science and technology 30, no. 1
(2015): 163-183.
9. Ali, Wajid, Muhammad Usman Shafique, Muhammad Arslan Majeed, and Ali Raza.
"Comparison between SQL and NoSQL databases and their relationship with big data
analytics." Asian Journal of Research in Computer Science 4, no. 2 (2019): 1-10.
10. Raj, Pethuru. "A detailed analysis of nosql and newsql databases for bigdata analytics and
distributed computing." In Advances in Computers, vol. 109, pp. 1-48. Elsevier, 2018.
11. Stolpe, Marco. "The internet of things: Opportunities and challenges for distributed data
analysis." ACM SIGKDD Explorations Newsletter 18, no. 1 (2016): 15-34.
12. Jama Mohamud, Nuh, and Mikael Söderström Broström. "Assessing Query Execution Time
and Implementational Complexity in Different Databases for Time Series Data." (2024).
13. Naqvi, Syeda Noor Zehra, Sofia Yfantidou, and Esteban Zimányi. "Time series databases and
influxdb." Studienarbeit, Université Libre de Bruxelles 12 (2017): 1-44.
14. Wang, Chen, Jialin Qiao, Xiangdong Huang, Shaoxu Song, Haonan Hou, Tian Jiang, Lei Rui,
Jianmin Wang, and Jiaguang Sun. "Apache IoTDB: A time series database for IoT
applications." Proceedings of the ACM on Management of Data 1, no. 2 (2023): 1-27.