Casper Monitoring
All content following this page was uploaded by Amar Abane on 03 September 2023.
Abstract—Network management relies on extensive monitoring of network state to analyze network behavior, design optimizations, plan upgrades, and conduct troubleshooting. Network monitoring collects various data from network devices through different protocols and interfaces such as NETCONF and Syslog, and from monitoring tools such as Zeek and Osquery. To unify and automate the monitoring workflow across the network, this paper identifies and discusses the data collection requirements for network management, reviews different monitoring approaches, and proposes an efficient data collection platform that addresses the requirements through an extensible and lightweight protocol. The platform design is demonstrated through an adaptive collection of data for network management based on digital twin technology.

Index Terms—network monitoring, network management, publish-subscribe, digital twin network

I. INTRODUCTION

Recent advancements in network management have led to the development of Network Management Systems (NMS) with inventory management, network topology visualization, configuration assistance, and network diagnostics. However, as network complexity and service diversity increase, configuration and management errors become more frequent and identifying the root cause of issues becomes more challenging.

To address these issues, various network analysis and troubleshooting tools have emerged to improve network management by processing all kinds of network data with artificial intelligence (AI) and machine learning (ML) techniques. The data collected from the network is therefore crucial for the effectiveness of these techniques. This data may include device configuration and status, alarms, topology, port/link status, activity logs, traffic and flow statistics, user information, and service performance. Network data is typically collected using monitoring tools through a multistage process that involves measuring, transmitting, aggregating, presenting, and storing the data [1]. However, current monitoring approaches have limitations in gathering network data (see Section III). For example, measurement platforms [2] concentrate solely on communication performance metrics, such as end-to-end latency. Telemetry interfaces such as NetFlow are inadequate for capturing device configurations. Management protocols are restricted to device configuration data and do not cater to devices' performance data, or have limited support for telemetry, as in gNMI [3]. Furthermore, the frequency and conditions for collecting each type of data are different, as is the data format, as discussed later.

Hence, a need arises for a more general and flexible network monitoring methodology. In this paper, we propose a network monitoring platform design that addresses these requirements. The platform makes it possible to gather all necessary network data from diverse sources, without the bottlenecks posed by centralized monitoring, and enables flexible and automated data collection with minimal communication overhead.

This paper is structured as follows. Section II identifies the main requirements in modern network monitoring. In Section III, popular monitoring solutions and approaches are discussed. Section IV presents the design of the proposed data collection platform. Section V discusses a use case for the platform considering the emerging concept of Digital Twin for network management. Section VI concludes the paper.

II. NETWORK MONITORING REQUIREMENTS

Network monitoring starts with data collection, where information is requested from network devices and mapped into an information model, either tool-specific or general (such as JSON). The formatted data is transmitted to the management station, where it undergoes aggregation, filtering, and representation according to the network data model. The network data is then used for various management purposes, and some of it may be stored for auditing and long-term analysis.

A network monitoring platform should facilitate data collection, aggregation, and storage, including integration of the tools that request the data [1]. Moreover, the data to collect varies in type, frequency, volume, and sources [4]. Hence, a suitable monitoring platform must be able to handle these diverse data types with a uniform workflow for efficient processing and presentation of data. The workflow should also be extensible to support new monitoring tools and parameters.

To minimize resource consumption, the platform must have a lightweight design with minimal communication and processing overheads. Increasing the monitoring frequency can lead to higher resource consumption and data generation. Hence, the platform should be able to dynamically adapt monitoring frequency and metrics based on resource availability. Monitoring flexibility includes the scheduling of probes and
the ability to choose between periodic, on-demand, and event-based monitoring. To improve efficiency, the data delivery model should be leveraged to avoid duplicate messages.

Each data item obtained from monitoring should have a unique identification and associated metadata about the originating request [4]. This information is used by storage systems to retrieve data when needed. Enriching collected data with user-specific labels is also useful to improve search capabilities.

Security is critical and encompasses authenticating data sources and consumers, and managing which users are authorized to access each set of collected data. Whereas securing the monitoring workflow is manageable in environments with a limited set of data sources and consumers, it becomes more challenging as platform flexibility increases. A scheme to define and enforce policies is required to provide fine-grained authorization and access control while keeping certificate and key management reasonably simple.

III. BACKGROUND

Four broad categories of monitoring approaches have gained recognition in recent years. These approaches inform the design of a network monitoring platform that meets the demands outlined above.

A. Internet measurement

Several platforms offer public monitoring at the global Internet scale. One such platform is RIPE Atlas [5], which leverages probe devices hosted by users across the Internet to collect data on network connectivity using predefined probes. The collected data is made publicly accessible, and users can conduct custom measurements. Each probe relies on registration servers to identify its controller, which manages the probe by sending a schedule for measurements and receiving results. The results are then centrally processed, enriched, and stored.

Another popular network measurement toolkit is PerfSONAR [2], designed to identify end-to-end network problems through measures such as bandwidth utilization, latency, and packet loss. Its architecture is divided into three layers, with various types of probes at the lowest layer, web services to invoke probes in the middle layer, and a user API at the highest layer to trigger measurements and access results.

While RIPE Atlas and PerfSONAR offer valuable network monitoring capabilities, their functionality is limited by predefined probes, and they do not offer an architectural foundation for efficient data distribution among multiple producers and consumers.

B. Measurement facilitators

Several platforms provide flexible large-scale network monitoring solutions by addressing specific aspects such as storage, interoperability, or scalability.

M-Lab [6] is a server infrastructure that facilitates measurement data exchange through effective resource allocation policies. Several network monitoring tools leverage M-Lab servers for measurement coordination and data ingestion.

mPlane [7] is a scalable infrastructure for distributed Internet measurement. The platform offers flexibility in monitoring through its support for single, iterative, and coordinated measurements, and enables dynamic integration of user-defined measurements through a probe's capability description and request mechanism. However, its point-to-point communication design limits the potential of its workflow and message scheme.

The authors in [4] propose a data collection method for the Digital Twin Network (DTN), where the data streaming component informs the DTN of the data it can collect from network devices. The DTN sends commands to the data streaming component to request the desired data. However, this approach does not address other critical considerations such as efficient data delivery and data identification.

C. Standardization efforts

The standardization of network monitoring is being advanced through the efforts of consortiums and working groups. One such effort is the gNMI protocol [3], which offers a vendor-neutral interface for device management. It provides a unified service for both configuration and telemetry, enabling clients to exchange capabilities and retrieve data or subscribe to events from devices. However, the use of the same interface for both telemetry and configuration may result in suboptimal data delivery. While monitoring data can tolerate best-effort delivery with some data loss and duplication, management commands necessitate reliable and consistent data delivery. Additionally, gNMI currently lacks support for essential network diagnostic tools such as Ping and Traceroute, despite their availability through the gNOI protocol, the gNMI complement for network operation.

D. Cloud monitoring and logging

Monitoring tools integrated within cloud platforms [8] have become widespread for monitoring applications and resources, including virtual private cloud (VPC) networks. These monitoring tools gather performance data, resource utilization metrics, and logs from various sources, including the cloud provider's systems, managed products, applications, and VMs with agents installed. The collected data is pushed and processed through a monitoring suite, where it undergoes filtering, ingestion, labeling, and storage. The stored data can be further analyzed, visualized, and processed through user-defined alerts and metrics.

The data collection process is typically achieved via HTTP endpoints to which the monitored sources continuously push data. Although this approach is simple, with a centralized REST API, it does not offer control over data collection beyond filtering the data at the ingestion stage. On the other hand, cloud monitoring tools benefit from the security provided by cloud platforms through flexible identity and access management (IAM). IAM allows for precise control over user and service access to data and resources.
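The push-based collection pattern described above can be sketched as follows. This is a minimal illustration of the pattern, not any specific cloud API: the ingestion URL, metric name, and label names are hypothetical assumptions.

```python
import json
import time
import urllib.request

INGEST_URL = "https://monitoring.example.com/v1/ingest"  # hypothetical endpoint

def build_sample(metric, value, labels):
    """Package one measurement the way push-based collectors typically do:
    a metric name, a value, a timestamp, and user-defined labels."""
    return {
        "metric": metric,
        "value": value,
        "timestamp": int(time.time()),
        "labels": labels,
    }

def push(sample, url=INGEST_URL):
    """POST the sample as JSON. The source decides unilaterally what and
    when to send; the ingestion endpoint can only filter what arrives."""
    req = urllib.request.Request(
        url,
        data=json.dumps(sample).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

sample = build_sample("vpc.port.rx_bytes", 1843201,
                      {"vm": "web-1", "zone": "us-east1"})
```

The one-way direction of this exchange is exactly the limitation noted above: the collector cannot negotiate frequency, metrics, or delivery mode with the source.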
IV. PROPOSED PLATFORM

The proposed platform is named "CaSpeR", which stands for Capability-Specification-Result/Receipt, reflecting the message sequence that outlines the data collection workflow. This section describes the design of the platform. For clarity, technical considerations such as the encoding format and a detailed discussion of the message structure are omitted.

A. Overview

The data collection platform design aims to streamline the acquisition of heterogeneous data through three key features. Firstly, it encompasses source discovery, data request and retrieval, and automated data processing to efficiently describe and integrate the data. Secondly, the design offers a scalable solution through a flexible scheme that balances the level of granularity in data requests against the associated overhead. Lastly, the architecture is designed for easy implementation and minimal impact on network resources by allowing for seamless integration with the management and control plane.

The architecture of the proposed platform shares the design principles of mPlane [7]. These principles include adopting a unified protocol for data description, requests, and results. Its protocol facilitates the discovery of monitoring capabilities and enables seamless coordination of their execution. Additionally, the architecture leverages self-contained and idempotent messages, ensuring that every message carries sufficient information to identify the monitoring task it relates to and can be easily detected and ignored if duplicated.

While these design principles simplify the architecture and provide flexibility in controlling monitoring tasks, they do not address all the requirements for effective data collection. To fully realize the potential of this approach, crucial enhancements have been introduced to increase flexibility, enhance data semantics, and improve data distribution. These enhancements include: (i) the use of a publish-subscribe model for exchanging messages, which reduces communication overhead and enables diverse data dissemination options compared to point-to-point protocols; (ii) allowing data sources to manage the local execution of monitoring tasks through request aggregation and adjustment based on the solicitation level; and (iii) providing expressive data description through the use of semantics and application-defined labels.

B. Workflow

The platform comprises two main components that communicate through messaging: services and clients. The service collects data and the client requests it.

Three types of services are considered in the platform:
• Probe services (or agents) perform basic data collection tasks, such as tracking the status of a component, running measurements, or reading data from a device.
• Sink services interface with a data store to save and retrieve data results, or provide graphical visualization.
• Aggregators are services that also act as clients to other services. They collect data from multiple services, breaking down a complex data collection task into simpler ones, and producing aggregated results. Depending on their level of intelligence, aggregators may also provide automated iterative monitoring, data transformation and correlation, etc.

In this platform, services broadcast capability messages to describe the data they are able to collect and the information required for data retrieval. Each data collection task should be represented by a separate capability. Clients receive capabilities and use them to request data by sending a specification message to the relevant service. The service responds with a receipt message indicating acceptance or rejection. If the specification is accepted, the collected data is disseminated through one or multiple result messages. The service executes the specification to the best of its ability and may adjust the execution.

Clients and services interact in the platform without establishing end-to-end sessions. Messages are exchanged via publish-subscribe topics. This model is chosen for its efficiency in disseminating messages to large groups of clients and services, reducing data duplication, and minimizing control messages.

Multiple services can offer the same capability, and a single client can submit specifications to multiple services. Similarly, a single service can distribute results to multiple clients. This decoupled interaction allows each service to manage the local execution of specifications to optimize resource utilization.

C. Message types

Each message conveys all necessary information for its processing, including the derivation of the topic name on which to receive or publish the next message (see Section IV-F). Figure 1 depicts an abstracted structure of the message types. The type attribute refers to the nature of the data collection task being described by the capability, which can range from real-time measurements (measure) to reading static data (collect) or database retrieval (query). The endpoint is a structured name that contains the namespace in which the capability is defined (e.g., /casper/useast-1/datacenter-1), the name of the capability (e.g., probe-port), and the identifier of the service or group of services providing the capability (e.g., switch-1).

Execution parameters supported by the capability are listed in the parameters section, which is a map containing parameter names and types. The allowed temporal scope is defined in the schedule section, which is a formatted string indicating start and stop time, period, etc. Parameters and schedule are filled with actual values by the specification message. The result-keys section defines the metrics or attributes that can be returned by the capability, and the specification message selects the metrics requested from the service. In result messages, result-values is a two-dimensional array containing values corresponding to the result-keys. The remaining fields will be introduced in later sections.
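To make these fields concrete, the following sketch represents a capability and a specification derived from it as plain Python dicts. The paper deliberately omits the encoding format, so the dict representation, the composed endpoint string, and the schedule string are illustrative assumptions.

```python
# A capability advertises what a service can collect, using the field
# names of Section IV-C; all values here are illustrative.
capability = {
    "type": "measure",                                    # measure | collect | query
    "endpoint": "/casper/useast-1/datacenter-1/probe-port/switch-1",
    "parameters": {"port": "string", "interval": "int"},  # names and types
    "schedule": "start|stop|period",                      # allowed temporal scope
    "result-keys": ["rx-bytes", "tx-bytes", "status"],    # metrics on offer
}

def make_specification(cap, params, schedule, keys):
    """Fill a capability's parameters and schedule with actual values and
    select the requested result-keys, as the specification message does."""
    assert set(params) <= set(cap["parameters"]), "unknown parameter"
    assert set(keys) <= set(cap["result-keys"]), "unknown result-key"
    return {
        "type": cap["type"],
        "endpoint": cap["endpoint"],
        "parameters": params,
        "schedule": schedule,
        "result-keys": keys,
    }

spec = make_specification(
    capability,
    params={"port": "eth0", "interval": 60},
    schedule="now + 1h / 60s",   # illustrative schedule string
    keys=["rx-bytes", "status"],
)
```

Note how the specification is a filled-in copy of the capability, which is what allows the receipt and result messages further along the workflow to be derived from it mechanically.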
Fig. 1. Structure of the main messages. (=) means that the field and its value are copied from the previous message, (+) means that the field can be added in the current message, (|) means that the field is kept from the previous message and its value is defined/updated in the current message, and (∧) denotes a field with a specific value for each message. A combination of signs represents an alternative.

Figure 2 displays the relationships between messages. A specification carries all relevant information from the referenced capability. A receipt includes information from the linked specification and updated information about the expected result messages. A result contains information from the specification used to generate it. The interrupt, redemption, result, and termination messages all provide information on the task requested by the original specification. Besides capability, specification, result, and receipt, the workflow includes other message types. An interrupt is used to inform a service to terminate the execution of a specification. A client asks a service to resend the results of a specification by sending a redemption message. A termination message is published by a service to inform clients that the execution of a specification has terminated. An exception is sent by clients/services to signal workflow errors.

D. Data Collection Management

An operation is the data collection process requested by a specification. Each operation has a unique fingerprint, which is a hash of the type, endpoint, parameters, and result-keys defined in the specification. The fingerprint is used to group messages related to a specific operation. If the fingerprint cannot be computed from a message due to a modified field, its value is explicitly included in the message.

Each operation has an implicit id, which is generated by combining the fingerprint with a client-generated nonce, allowing the client to differentiate between multiple executions of the same operation. The combination of the fingerprint and id, along with the client's identity information, is known as the session, and is used by the service to manage the execution of operations.

The use of the fingerprint, id, and session in messages between the service and client allows for a balance between resource consumption, monitoring accuracy, and scalability. The service can adjust the requested operation as long as it complies with the specification. For example, if a specification requests a probe every 10 minutes, the service can fulfill it with an operation that produces results every 5 minutes. The service can determine whether a similar operation is already running by using the fingerprint, and if so, adjust its schedule to meet the new specification. This is known as schedule adjustment.

The receipt informs the client of the expected result-keys and the topic on which the result messages will be published. If the service performs schedule adjustment, it updates the nonce in the receipt and publishes the results for all specifications that are aggregated in the same task, either via the same topic or in separate topics for each specification. The service has the option to skip schedule adjustment, or to perform it and still publish results in separate topics.

E. Result Management

The receipt is used by clients to associate the results with a specific operation id. To present results in a concise format, the service may opt to split the result-keys across multiple result messages, a process referred to as result splitting. In this case, the service updates the result-keys in each message to match the corresponding result-values and includes the original operation fingerprint for identification purposes (see Figure 1).

The flow section is used to control the publishing of results. The service can set the flow to "stream" in the capability to indicate real-time streaming of results, or to "batch" to indicate that results will be published once the operation is completed, either through a single message or multiple messages. Depending on the nature of the operation and the available resources, a service can enforce one flow option or allow the client to select the delivery mode in the specification. Upon receipt of result messages, the client can organize and reassemble the results based on the operation type, fingerprint, and id.

The metadata section helps in handling the results. The metadata type can be set to "point" to indicate that each result message represents a single point of data from the operation. In this case, the client can reconstruct the full operation data using the operation id. If the metadata type is set to "table", it indicates that each result message contains the complete data collected during the operation. The metadata format specifies how result-keys and result-values should be displayed in a chart, using chart definition languages such as Vega-Lite [9]. The metadata labels carry user-defined key-value information for tagging results. Labels defined by the service are included in subsequent specifications, receipts, and results. User-defined labels are set in the specification and kept in the corresponding receipt, but not in the results, as results may be shared among multiple clients.

F. Messaging topics

Figure 2 displays the relationships between message types and the topics where they are published. Capabilities are published in the "capability" topic, while the specification, interrupt, and redemption messages are published in the topic derived from the capability's endpoint (i.e., "<endpoint>.control"). The receipt is published in the topic derived from the specification (i.e., "<endpoint>.receipt.<fingerprint>.<nonce>.<timestamp>"). The topics where results and termination messages are published are derived from the receipt (i.e., "<endpoint>.results.<fingerprint>").
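The fingerprint, implicit id, and topic-derivation rules can be sketched as follows. The hash function, the truncation, and the separator characters are assumptions for illustration: the paper only states that the fingerprint is a hash of the type, endpoint, parameters, and result-keys, and gives the topic name templates.

```python
import hashlib
import json
import secrets

def fingerprint(msg):
    """Hash the fields that define an operation (Section IV-D).
    SHA-256 over a canonical JSON serialization is an assumption."""
    material = json.dumps(
        [msg["type"], msg["endpoint"], msg["parameters"], msg["result-keys"]],
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()[:16]

def operation_id(fp, nonce):
    """Implicit operation id: the fingerprint combined with a client nonce."""
    return f"{fp}.{nonce}"

# Topic derivation following the templates of Section IV-F.
def control_topic(endpoint):
    return f"{endpoint}.control"

def receipt_topic(endpoint, fp, nonce, timestamp):
    return f"{endpoint}.receipt.{fp}.{nonce}.{timestamp}"

def results_topic(endpoint, fp):
    return f"{endpoint}.results.{fp}"

spec = {
    "type": "measure",
    "endpoint": "/casper/useast-1/datacenter-1/probe-port/switch-1",
    "parameters": {"port": "eth0"},
    "result-keys": ["rx-bytes"],
}
fp = fingerprint(spec)
nonce = secrets.token_hex(4)  # client-generated
```

Note that the schedule is deliberately excluded from the fingerprint: two specifications that differ only in schedule hash to the same value, which is what lets a service recognize an equivalent running operation and perform schedule adjustment.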
G. Security

The security of CaSpeR communications is independent of the messaging system being used. As depicted in Figure 3, an Authorization Server (AS) enables the administrator/owner to control which permissions (publish-specification, read-result, publish-result) are granted to each identity (client and service) for each capability. Clients and services are authenticated through their own accounts managed by the administrator on the AS. Access control is based on roles. A role groups together a set of permissions necessary for participating in the workflow.

Three base roles are defined. The specification-sender role enables clients to request new operations and includes the publish-specification and read-result permissions. The result-reader role allows clients to only access data from ongoing specifications and includes only the read-result permission. The result-publisher role enables services to publish data and includes the publish-result permission. Additional roles can be created for more fine-grained access control. The administrator/owner can grant and revoke roles for each identity. A policy links an identity, a role, and a set of capabilities using the hierarchical naming structure of the endpoint.

The security scheme combines HTTPS between the AS and clients/services with self-secured encrypted messages to provide authentication and authorization (see Figure 3). The self-secured encryption scheme is similar to the role-based security framework demonstrated in Named Data Networking [10].

Each client and service has a certificate signed by the AS. Clients and services sign each message along with its endpoint, allowing the receiver to verify its authenticity. A service checks the signature of a specification (or interrupt, redemption) and the client's certificate, and uses the verified identity to retrieve the client's role from the AS. The service can then accept or reject the specification (or interrupt, redemption) based on the permissions allowed for the namespace to which the specification endpoint belongs. Similarly, a client checks the message signature and verifies that the service is authorized to produce messages for a given endpoint.

The AS manages symmetric content encryption/decryption keys (CK) for each namespace. Services and clients retrieve the CKs for the namespaces they have access to, based on their roles. Note that, with this scheme, if a client has a result-reader role, it can also decrypt specifications related to the capability. However, this does not pose a significant security threat, since message derivation from a capability is clearly defined in the protocol.

Fig. 3. CaSpeR security scheme.

V. CASE STUDY: ADAPTIVE DATA COLLECTION FOR DIGITAL TWIN

The concept of the Digital Twin Network (DTN) has emerged to improve network management and automation using modeling, emulation, and AI/ML techniques [11]. A DTN is a real-time digital representation of a physical network, which can be used to design and evaluate network optimizations, plan network upgrades, conduct "what-if" analysis, and troubleshoot the network [12].

We discuss in the following how the CaSpeR platform can be used to collect network data for a DTN. In this case study, the DTN is a client and the data is produced by various sources in the network acting as services.

A. Basic data collection

The DTN collects a variety of data from network equipment from different vendors, which use different protocols. The platform provides a uniform interface for DTN services and applications to access this data, hiding the protocol specifics.

To build a digital version of the network, a DTN needs to continuously collect the network topology (via port and link status), device configuration, alarms and logs, and various measurements reflecting network performance, such as service Key Performance Indicators (KPIs) and device telemetry.

Network performance data is collected periodically. The capability advertised by the corresponding service has the type "measure" and uses the "point" metadata type. Data is sent to the DTN as a stream, with the collection frequency defined in the specification's schedule section.

Real-time updates of network topology changes are critical for effective operation of the DTN. The corresponding capability has the type "measure" and uses the "point"
metadata type. The "on-event" option in the "schedule" section of the specification allows for real-time reception of topology changes.

Device configuration data is stored locally on the device and collected by a service, which advertises it using a "collect" capability with a "table" metadata type. In a device, some configurations change on a daily basis, while others change rarely [13]. To handle this, the DTN sends two specifications for the same capability, one for the infrequent changes and one for the frequent changes. Both specifications set the flow section to "batch". To collect data from all devices while reducing the number of exchanged messages, one capability can be implemented to collect data from more than one device, using one column in the result-keys for the device name and specifying the device(s) to collect data from in the parameters section.

Logs and alarms are parsed at the service and described using a "collect" capability with the "table" metadata type. The DTN can have logs published periodically and alarms received in real time.

While a DTN system can automate monitoring and operation, human expertise is still crucial in production networks. To aid in this, sink services with graphical user interfaces can be deployed with minimal overhead, as they consume copies of the data that is sent to the DTN.

VI. CONCLUSION

Collecting large amounts of heterogeneous data has become necessary for modern network management tools, whereas it used to be an additional feature in traditional NMS and Software-Defined Networking (SDN) solutions. Therefore, data collection needs a dedicated workflow instead of being implemented alongside control protocols, as it has been so far. This need is addressed by proposing an extensible data collection platform that encapsulates the various interfaces used in network monitoring.

The platform can also be used for other telemetry purposes. For example, it is currently used for optical quantum network metrology [15].