Simplify/remove /persist/status/zedagent/* #5584

Merged
eriknordmark merged 14 commits into lf-edge:master from eriknordmark:persist2memory
Apr 3, 2026

Contributor

@eriknordmark eriknordmark commented Jan 31, 2026

Description

We currently have a checkpoint of the protobuf config in /persist/checkpoint/lastconfig, which is signed by the controller. Its signature is verified before it is used, and then it is used to populate zedagent's publications even if there is no connection to the controller.
Thus the persistent publications (with files in /persist/status/zedagent) should not be needed, and getting rid of them simplifies analyzing the security impact of unauthorized modifications to such files.

However, we need to have the ControllerCerts available since the CipherContext (used for object encryption) depends on them. This is addressed by introducing a new /persist/checkpoint/controllercerts file, which contains the protobuf objects received from the controller. This file is verified at boot the same way as when we receive an update: the certificate chain must verify all the way to the cert in /config/root-certificate.pem.
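
As an illustration of the "chain verifies all the way to the root" check, here is a minimal Go sketch using the standard crypto/x509 package. It is not the code in this PR: verifyChain and issue are hypothetical names, the certificates are throwaway ones generated in memory, and in the real flow the root would be parsed from /config/root-certificate.pem rather than passed in.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// verifyChain (hypothetical helper) reports whether leaf chains up to root,
// optionally through the given intermediates.
func verifyChain(root, leaf *x509.Certificate, intermediates ...*x509.Certificate) error {
	roots := x509.NewCertPool()
	roots.AddCert(root)
	inter := x509.NewCertPool()
	for _, c := range intermediates {
		inter.AddCert(c)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:         roots,
		Intermediates: inter,
		// The demo certs carry no extended key usage, so accept any usage.
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err
}

// issue creates a throwaway certificate for the demo. With a nil parent the
// certificate is a self-signed CA; otherwise it is signed by parentKey.
func issue(cn string, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
	}
	if parent == nil {
		tmpl.IsCA = true
		tmpl.KeyUsage |= x509.KeyUsageCertSign
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func main() {
	root, rootKey, err := issue("demo-root", nil, nil)
	if err != nil {
		panic(err)
	}
	signer, _, err := issue("demo-signer", root, rootKey)
	if err != nil {
		panic(err)
	}
	otherRoot, _, err := issue("other-root", nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("chains to trusted root:", verifyChain(root, signer) == nil)
	fmt.Println("chains to other root:", verifyChain(otherRoot, signer) == nil)
}
```

The point of running the same check at boot is that a checkpoint file modified on disk fails verification just like a bad update from the network would.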

Both lastconfig and controllercerts have a .bak file, which should ensure that even if there is a power outage while the file is being written, we will have a valid backup to use.
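
The .bak fallback can be sketched roughly as follows. This is a simplified illustration and not the PR's implementation: saveCheckpoint, loadCheckpoint, and demo are hypothetical names, the validate callback stands in for the signature and chain verification, and a production version would also sync the file and directory to disk.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// saveCheckpoint (hypothetical) keeps the previous checkpoint as path+".bak"
// and then installs the new content via a temporary file and a rename.
func saveCheckpoint(path string, data []byte) error {
	if _, err := os.Stat(path); err == nil {
		if err := os.Rename(path, path+".bak"); err != nil {
			return err
		}
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// loadCheckpoint (hypothetical) returns the first of path and path+".bak"
// that can be read and passes validate.
func loadCheckpoint(path string, validate func([]byte) error) ([]byte, error) {
	var errs []error
	for _, p := range []string{path, path + ".bak"} {
		data, err := os.ReadFile(p)
		if err == nil {
			err = validate(data)
		}
		if err == nil {
			return data, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", p, err))
	}
	return nil, errors.Join(errs...)
}

// demo writes two versions, truncates the primary file to mimic an
// interrupted write, and returns what loadCheckpoint recovers.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "checkpoint")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "lastconfig")

	// Stand-in for signature verification: accept only a known prefix.
	validate := func(b []byte) error {
		if !bytes.HasPrefix(b, []byte("cfg:")) {
			return errors.New("invalid checkpoint")
		}
		return nil
	}
	for _, v := range []string{"cfg:v1", "cfg:v2"} {
		if err := saveCheckpoint(path, []byte(v)); err != nil {
			return "", err
		}
	}
	if err := os.WriteFile(path, nil, 0o600); err != nil {
		return "", err
	}
	data, err := loadCheckpoint(path, validate)
	return string(data), err
}

func main() {
	got, err := demo()
	fmt.Println(got, err) // prints "cfg:v1 <nil>": the backup was used
}
```

Validating on load, not just on write, is what lets the reader pick the backup when the primary is truncated or otherwise fails verification.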

Note that more of the publications in /persist/status/zedagent need to be addressed.
Next is ConfigItemValueMap, which might have some chicken-and-egg problems at bootup; it needs to be published based on the checkpointed lastconfig as the agents start.

How to test and validate this PR

Since we are touching code that relates to rolling the controller certificates, that needs to be tested very carefully, including any corner cases.
And since the purpose of the checkpoints is to allow the device (including datastore and WiFi credentials) and app instances (including those with cloud-init) to boot even if there is no network, that needs to be carefully tested.

It is not clear whether we can test the corruption of /persist/checkpoint/lastconfig or /persist/checkpoint/controllercerts, but that is why we have the .bak files (to handle inopportune power outages while the checkpoint file(s) are updated.)

Changelog notes

Removed extra copies of checkpointed state from /persist/status/ and instead rely solely on the signed protobuf-encoded checkpoint in /persist/checkpoint/lastconfig. This also required checkpointing the controller certs (in /persist/checkpoint/controllercerts), and providing a backup copy of both checkpoints. The backups are needed to handle both the case of inopportune power outages when the checkpoints are written, and the fact that the lastconfig depends on the controllercerts yet those two separate checkpoint files are not updated atomically.

PR Backports

Here is the list of current LTS branches (it should be always up to date):

  • 16.0-stable To be backported.
  • 14.5-stable To be backported.
  • 13.4-stable To be backported.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

codecov Bot commented Jan 31, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 28.34%. Comparing base (2281599) to head (1a507ea).
⚠️ Report is 435 commits behind head on master.

Files with missing lines               Patch %   Lines
pkg/pillar/cmd/vcomlink/vcomlink.go    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5584      +/-   ##
==========================================
+ Coverage   19.52%   28.34%   +8.81%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      685      +95     
+ Misses       2310     1588     -722     
- Partials      121      144      +23     

Comment thread pkg/pillar/cmd/zedagent/handleconfig.go Outdated
Contributor

Should we strip the "maybe" prefix from the function name now that we reload bootstrap config on every boot?

Contributor Author

Done

@github-actions github-actions Bot requested a review from milan-zededa March 25, 2026 05:08
@eriknordmark eriknordmark force-pushed the persist2memory branch 2 times, most recently from c6b38e8 to 48efd8a Compare April 1, 2026 13:28
This is used as the startup config if we crash and don't have
controller connectivity. It is saved after we have been running
with an updated config for at least 10 minutes.

Signed-off-by: eriknordmark <erik@zededa.com>
And its .bak file. Those are checkpointed protobuf files which
have their chains verified before writing them, but also when they
are read after a reboot. When we load from
/persist/checkpoint/controllercerts we also publish ControllerCerts for
use by others.
This will remove the need for /persist/certs/server-signing-cert.pem

Signed-off-by: eriknordmark <erik@zededa.com>
Instead it is only kept in memory/pubsub and lookupControllerSigningCert
can be used to fetch it.

Signed-off-by: eriknordmark <erik@zededa.com>
The need for Touch went away when we started accepting arbitrarily old
checkpoints.

Signed-off-by: eriknordmark <erik@zededa.com>
And make sure we update the checkpoint when there are real
changes to the controller certs. This requires comparing the
set of Keys aka hashes of the certificates to avoid falsely
detecting changes due to ordering differences in the protobuf
bytes.

Signed-off-by: eriknordmark <erik@zededa.com>
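
To illustrate the order-insensitive comparison described in the commit message above, here is a minimal Go sketch. The controllerCert type and the keySet and certsChanged helpers are hypothetical stand-ins, not the types or functions used in the PR; the only assumption taken from the text is that each certificate is keyed by its hash.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// controllerCert is a hypothetical stand-in for a checkpointed certificate.
type controllerCert struct {
	CertHash []byte // hash of the certificate, used as its key
	Cert     []byte // certificate bytes, not needed for the comparison
}

// keySet collects the hex-encoded hashes into a set, discarding order.
func keySet(certs []controllerCert) map[string]struct{} {
	set := make(map[string]struct{}, len(certs))
	for _, c := range certs {
		set[hex.EncodeToString(c.CertHash)] = struct{}{}
	}
	return set
}

// certsChanged reports whether the two lists hold different sets of keys.
// Reordering the same certificates is not treated as a change.
func certsChanged(prev, next []controllerCert) bool {
	a, b := keySet(prev), keySet(next)
	if len(a) != len(b) {
		return true
	}
	for k := range a {
		if _, ok := b[k]; !ok {
			return true
		}
	}
	return false
}

func main() {
	x := controllerCert{CertHash: []byte{0x01}}
	y := controllerCert{CertHash: []byte{0x02}}
	fmt.Println(certsChanged([]controllerCert{x, y}, []controllerCert{y, x})) // false
	fmt.Println(certsChanged([]controllerCert{x}, []controllerCert{x, y}))    // true
}
```

Comparing serialized protobuf bytes directly would flag the first case as a change and rewrite the checkpoint unnecessarily.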
They are created from the checkpointed controllercerts and lastconfig
when zedagent starts, and then they are published to other agents.

Signed-off-by: eriknordmark <erik@zededa.com>
The ConfigItemValueMap will no longer be a persistent publication,
hence there will be no need to convert from old to new formats,
nor set default values. The defaults will be applied by zedagent
on startup.

A follow-on commit adds back the handling of /config/GlobalConfig/global.json
in zedagent.

Signed-off-by: eriknordmark <erik@zededa.com>
Zedagent initializes it from /persist/checkpoint/lastconfig
on startup so that other agents can get their global config.

Signed-off-by: eriknordmark <erik@zededa.com>
Since some persistent publications are no longer persistent

Signed-off-by: eriknordmark <erik@zededa.com>
And remove some use of file access in favor of pubsub calls.

Signed-off-by: eriknordmark <erik@zededa.com>
But do this in zedagent similar to how it handles bootstrap config
(the old code was in upgradeconverter).

A follow-up commit removes the use of /persist/ingested and related
sha256 comparisons.

Signed-off-by: eriknordmark <erik@zededa.com>
This no longer makes sense since we ingest into memory and not into
files in /persist. Only the DevicePortConfig gets ingested by nim
into a persistent publication and that can handle multiple ingestions
without indigestion.

Signed-off-by: eriknordmark <erik@zededa.com>
If we have a /persist/checkpoint/lastconfig we ignore
re-reading those files. They will have been taken into account
in forming /persist/status/nim/DevicePortConfigList/

Signed-off-by: eriknordmark <erik@zededa.com>
Since the load is no longer conditional on /persist/ingested/

Signed-off-by: eriknordmark <erik@zededa.com>
Labels

security: Provides a security fix
stable: Should be backported to stable release(s)

5 participants