
ipsec: concurrent vti-up-down hook invocations lost-update /tmp/ipsec_vti_interfaces, stranding VTI interfaces admin-down
In progress, Normal | Public | BUG

Description

I run a VyOS 2026.05.26-1327-rolling router as the WAN edge for my home network, virtualized on Proxmox, with 12 IPsec/VTI site-to-site tunnels (PSK) and iBGP-over-VTI with BFD. After a coordinated IPsec reinit storm (triggered by a DHCP restart cascade, T8950), 9 of the 12 VTI interfaces were stranded admin-DOWN (vti101-106, vti203, vti204, vti206). Routing was dead on those 9 tunnels. swanctl --list-sas showed CHILD_SAs INSTALLED on all 12; the tunnels looked healthy at the ESP/xfrm layer. journalctl -t vti-up-down showed, for each stranded interface, a down-client event with no following up-client. No later event re-added the entries those hooks had lost.

Root cause

python/vyos/utils/vti_updown_db.py maintains a flat-file DB at /tmp/ipsec_vti_interfaces (space-separated "ifspecs") that tracks which VTI interfaces should be up. Three context managers (open_vti_updown_db_for_create_or_update, open_vti_updown_db_for_update, open_vti_updown_db_readonly) open the file and construct VTIUpDownDB(f). __init__ reads the whole file into a set self._ifspecs. add()/remove() mutate it in memory. commit() does seek(0); write(' '.join(self._ifspecs)); truncate(), then brings interfaces up or down accordingly.

The updown hook src/etc/ipsec.d/vti-up-down runs once per CHILD_SA event per VTI: on up-client it calls db.add() then db.commit(); on down-client it calls db.remove() then db.commit(). src/conf_mode/vpn_ipsec.py also opens the DB and calls remove_vti_updown_db(). The wait_for_commit_lock() call in the hook is the VyOS config-commit lock; it serialises the hook against config commits, not against other hook processes.

The read-modify-write cycle has no inter-process lock. During a coordinated reinit, charon fires the hook for all N VTIs concurrently (one process per CHILD_SA event). Two processes that both reach __init__ before either calls commit() each read the same stale copy of the file. Process A writes its updated set (with vtiX added); process B then overwrites A's result with its own set, which was built from the stale read and therefore lacks vtiX. A's add of vtiX is lost. B's commit() then treats vtiX as not wanted and runs ip link set vtiX down. No later event re-adds it, so vtiX stays admin-down.
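
The interleaving above can be reproduced deterministically in a single process. The following is a minimal sketch, not the VyOS code: FlatFileDB is a hypothetical stand-in that copies only the read-in-__init__ / rewrite-in-commit() shape described above, and the interface handling that the real commit() performs is left out. Two handles on the same file play the roles of processes A and B.

```python
import os
import tempfile


class FlatFileDB:
    """Hypothetical stand-in for VTIUpDownDB: a set of names in a flat file."""

    def __init__(self, f):
        self._f = f
        # The whole file is read once, at construction time.
        self._ifspecs = set(f.read().split())

    def add(self, ifspec):
        self._ifspecs.add(ifspec)

    def commit(self):
        # Rewrite the file in place from the in-memory set.
        self._f.seek(0)
        self._f.write(' '.join(sorted(self._ifspecs)))
        self._f.truncate()
        self._f.flush()


def simulate_lost_update(path):
    """Interleave two read-modify-write cycles the way two hooks can."""
    open(path, 'w').close()  # start from an empty DB
    with open(path, 'r+') as fa, open(path, 'r+') as fb:
        a = FlatFileDB(fa)   # "process A" reads an empty set
        b = FlatFileDB(fb)   # "process B" reads before A commits
        a.add('vti101')
        b.add('vti102')
        a.commit()           # file now holds vti101
        b.commit()           # file now holds vti102 only
    with open(path) as f:
        return set(f.read().split())


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_lost_update(os.path.join(tmp, 'ipsec_vti_interfaces')))
```

Only vti102 survives: B's commit is based on a read that predates A's write, so A's entry is silently dropped.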

The same lost-update window applies to remove_vti_updown_db(), which in the original code opens the DB via open_vti_updown_db_for_update() and then calls os.unlink() as a separate step outside any lock, leaving a create/delete race with a concurrent open_vti_updown_db_for_create_or_update().

Why admin-DOWN breaks routing

The CHILD_SA stays INSTALLED (xfrm is independent of the VTI interface admin state), so swanctl --list-sas reports the tunnel healthy. On an admin-down VTI, IPv6 DAD never completes, the VTI's address stays tentative, there is no usable source address, BFD reports no local address, and the iBGP-over-VTI session stays in Active/Connect. ESP encrypts; routing is dead.
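
Because swanctl --list-sas does not reveal this state, the admin flag has to be checked on the interface itself. The sketch below is one way to do that, assuming the standard Linux sysfs layout where /sys/class/net/<ifname>/flags holds the interface flags in hex; the function names are hypothetical and not part of VyOS. ip link show gives the same information interactively.

```python
import os

IFF_UP = 0x1  # admin-up bit, as defined in <linux/if.h>


def is_admin_up(ifname, sysfs_root='/sys/class/net'):
    """Return True/False for the IFF_UP flag, or None if the interface is absent."""
    try:
        with open(os.path.join(sysfs_root, ifname, 'flags')) as f:
            flags = int(f.read().strip(), 16)
    except FileNotFoundError:
        return None
    return bool(flags & IFF_UP)


def stranded(ifnames, sysfs_root='/sys/class/net'):
    """Return the names that exist but are admin-down."""
    return [n for n in ifnames if is_admin_up(n, sysfs_root) is False]


if __name__ == '__main__':
    # Interface names taken from this report; adjust to the local config.
    print(stranded(['vti101', 'vti203', 'vti204']))
```

Comparing this list against the tunnels that swanctl reports as INSTALLED identifies tunnels that encrypt but cannot route.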

Evidence from production

9 of 12 VTI interfaces were stranded admin-DOWN: vti101, vti102, vti103, vti104, vti105, vti106, vti203, vti204, vti206. swanctl --list-sas showed CHILD_SAs INSTALLED on all 12 throughout. journalctl -t vti-up-down showed for each stranded interface a down-client event with no following up-client within the reinit window, consistent with those up-client adds being overwritten by a concurrent process before they were committed to disk.

Reproduction

The attached reproduce.py drives exactly what the up-client hook does (open_vti_updown_db_for_create_or_update() -> db.add() -> db.commit()) from 64 processes released at the same instant by a barrier, with a no-op interface supplier so it touches no real interface (the named vtiN interfaces do not exist, so commit() only rewrites the state file and runs no ip link operations). The barrier makes the lost-update fire on every run rather than occasionally.

Run python3 reproduce.py as root on a stock VyOS instance. The stock python/vyos/utils/vti_updown_db.py reports (exit code 1):

fired 64 concurrent up-client events; DB retained 1 of 64 interfaces
LOST 63 of 64 updates (these interfaces would be stranded admin-down): vti0, vti1, vti2, ... vti63

Dropping in the patched vti_updown_db.py (and clearing its __pycache__) and re-running gives (exit code 0):

fired 64 concurrent up-client events; DB retained 64 of 64 interfaces
0 lost: all updates serialised correctly

I validated this on a fresh VyOS 2026.05.26-1327-rolling VM (the live qemu ISO, 4 vCPU).

Validation

After deploying the patch described below as a local hotpatch on the home-edge router, I restarted strongswan twice (two full reinit storms across all 12 tunnels). Both times: zero VTI interfaces stranded admin-down, all 12 iBGP-over-VTI sessions re-established, BFD came up on all 12, and 12 IKE SAs came back with no duplicates. Before the patch, the same restart reliably stranded multiple VTIs.

Fix

Reuse vyos.utils.locking.Lock (a dedicated lock file under /run/vyos/lock/<name>.lock) to serialise all access to the VTI up/down DB. A new _vti_updown_db_lock() context manager acquires the lock with timeout=0 (block until acquired, so hooks wait rather than fail) and releases it in a finally block; all three public context managers wrap their bodies in it. remove_vti_updown_db() is rewritten to acquire the lock once and hold it across both the DB processing and the os.unlink() call, closing the create/delete race with a concurrent open_vti_updown_db_for_create_or_update(). Every access to the DB file goes through the three public context managers (VTIUpDownDB is never constructed outside this module), so wrapping those context managers covers all callers.
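
The shape of the fix can be illustrated without the VyOS tree. The sketch below is not the patch: it substitutes fcntl.flock on a separate lock file for vyos.utils.locking.Lock, and db_lock, locked_add and locked_remove_db are hypothetical names. What it shares with the patch is the structure: the whole read-modify-write runs inside the lock, and the delete takes the same lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def db_lock(lock_path):
    """Hold an exclusive advisory lock for the duration of the with-block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is acquired
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the flock


def locked_add(db_path, lock_path, ifspec):
    """Read, modify and rewrite the flat file entirely inside the lock."""
    with db_lock(lock_path):
        fd = os.open(db_path, os.O_CREAT | os.O_RDWR, 0o600)
        with os.fdopen(fd, 'r+') as f:
            ifspecs = set(f.read().split())
            ifspecs.add(ifspec)
            f.seek(0)
            f.write(' '.join(sorted(ifspecs)))
            f.truncate()


def locked_remove_db(db_path, lock_path):
    """Delete the DB under the same lock so no writer can interleave."""
    with db_lock(lock_path):
        try:
            os.unlink(db_path)
        except FileNotFoundError:
            pass
```

With the lock in place, a second caller cannot read the file until the first has finished writing it, so every add starts from the previous caller's committed state. Keeping the lock on a separate file, as the patch does, means the lock outlives deletion and re-creation of the DB file itself.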

PR: (GitHub link, added after filing)

Unit tests

src/tests/test_vti_updown_db.py was added alongside the patch (10 tests covering the DB logic and the lock wiring; this module previously had no test coverage). Per-test breakdown is in the PR.

Related

  • T8950: vyos-netlinkd DHCP restart on UP-to-UP re-notifications. The DHCP restart cascade that triggers the coordinated reinit storms which expose this race.
  • T8952: IPsec site-to-site peers lack per-peer unique knob. Duplicate SA accumulation triggered by the same reinit storms, different root cause in the SA layer.
  • T6544: Added vyos.utils.locking.Lock. The lock primitive this fix reuses.
  • T7062: Strong Swan IPsec, VTI Admin Down. Open. Same symptom (VTI admin-down with no auto-recovery); this lost-update race is a likely root cause.
  • T6574: vti-up-down script brings down VTIs when Child SA is renegotiated or reestablished. Open. Same script and symptom class, on a related but distinct trigger.
  • T1876: IPSec VTI tunnels deleted after rekey and dangling around as A/D. Resolved (historical, Azure era).

Environment

  • VyOS 2026.05.26-1327-rolling
  • strongSwan 5.9.11
  • 12 IPsec/VTI site-to-site tunnels with PSK authentication
  • iBGP-over-VTI with BFD
  • Running as KVM guest on Proxmox VE

Details

Version: 2026.05.26-1327-rolling
Is it a breaking change?: Perfectly compatible
Issue type: Bug (incorrect behavior)