vyos-netlinkd: DHCP restart on every RTM_NEWLINK(UP) without tracking previous state
In progress, High, Public, BUG

Assigned To

Authored By

	cr0ntab
	Sun, May 31, 8:12 PM

Description

I run a VyOS 2026.05.26-1327-rolling router as the WAN edge for my home network, virtualized on a Proxmox cluster. The WAN interface is configured with 'set interfaces ethernet eth2 address dhcp' and 'set interfaces ethernet eth2 address dhcpv6'. After a routine KVM live migration of the router VM, the house lost internet for approximately 2 hours. I traced the outage to vyos-netlinkd restarting dhclient@eth2.service and dhcp6c@eth2.service 10 times in 15 seconds.

Live migration does not flap the guest link. However, the kernel's post-migration NOTIFY PEERS event (the gratuitous-ARP announcement) is reported over netlink as an RTM_NEWLINK message that carries IFLA_OPERSTATE=UP. vyos-netlinkd sees operstate=UP and restarts DHCP, even though the interface was already UP. I tested single-queue and multiqueue VMs on the same cluster: all generate the same NOTIFY PEERS events. Only VyOS reacts to them because only VyOS runs vyos-netlinkd.

_handle_dhcp_events() does not track per-interface previous operstate. Every UP event triggers systemctl restart dhclient@eth2.service, which sends DHCPRELEASE, flushes the address, deletes the default route via vtysh, and does a fresh DHCPDISCOVER. This also creates a feedback loop: dhclient-script-vyos runs ip link set dev eth2 up during PREINIT, emitting another RTM_NEWLINK(UP), which triggers another restart. One seed event becomes 10+ restarts in 15 seconds.

Root cause

In src/services/vyos-netlinkd, _handle_dhcp_events() handles operstate == 'UP' unconditionally:

elif operstate == 'UP':
    v6_restart = False
    interface_path = Section.get_config_path(ifname, delimiter='.')
    config_dict = op_mode_config_dict(
        ['interfaces'], key_mangling=('-', '_'), get_first_key=True
    )
    if tmp := dict_search(f'{interface_path}.address', config_dict):
        if 'dhcp' in tmp:
            cmd(f'systemctl restart {systemdV4_service}')

There is no check for whether the interface was previously DOWN. The daemon processes every RTM_NEWLINK with operstate='UP' on a DHCP-configured interface as a reason to restart, including UP-to-UP re-notifications from normal kernel events.

The feedback loop is between two VyOS components:

1. vyos-netlinkd restarts dhclient@eth2.service.
2. dhclient-script-vyos runs ip link set dev eth2 up during PREINIT (standard ISC dhclient behavior).
3. The ip link set up emits a new RTM_NEWLINK(UP) to the netlink socket.
4. vyos-netlinkd receives it and restarts dhclient again.
5. The loop repeats until timing breaks the cycle.
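
The loop above can be modeled with a small simulation. This is only a sketch, not code from vyos-1x: simulate() and its parameters are hypothetical names, each restart is assumed to emit exactly one extra UP event from PREINIT, and max_restarts stands in for "timing breaks the cycle".

```python
from collections import deque
from typing import Optional


def simulate(seed_events: list[str], track_state: bool,
             initial_prev: Optional[str] = None,
             max_restarts: int = 10) -> int:
    """Count DHCP restarts caused by a list of seed operstate events.

    Model assumptions (not taken from vyos-1x):
    - every restart makes the dhclient PREINIT step emit one more UP event
    - max_restarts stands in for "timing breaks the cycle"
    """
    queue = deque(seed_events)
    prev = initial_prev
    restarts = 0
    while queue and restarts < max_restarts:
        operstate = queue.popleft()
        # With tracking enabled, an UP that follows an UP is a re-notification.
        already_up = track_state and operstate == 'UP' and prev == 'UP'
        prev = operstate
        if operstate != 'UP' or already_up:
            continue
        restarts += 1
        queue.append('UP')  # PREINIT feedback: ip link set dev <ifname> up
    return restarts


if __name__ == '__main__':
    # One migration seed event on an interface that is already UP.
    print('unpatched:', simulate(['UP'], track_state=False))
    print('patched:  ', simulate(['UP'], track_state=True, initial_prev='UP'))
```

In this model the unconditional handler only stops at the artificial cap, while the state-tracking variant absorbs both the seed event and the PREINIT echo.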

Reproduction

Any event that generates RTM_NEWLINK with IFLA_OPERSTATE=UP on a DHCP-configured interface will trigger this. The easiest to reproduce:

KVM live migration (tested on Proxmox VE 8.4, QEMU 11.0.0): live-migrate a VyOS VM that has a DHCP-configured interface. The standard post-migration NOTIFY PEERS gratuitous-ARP event carries IFLA_OPERSTATE=UP in the netlink message. vyos-netlinkd restarts dhclient, the restart's PREINIT emits another UP, and the loop runs 10+ cycles.

I confirmed this affects all VMs, not just multiqueue. A single-queue virtio-net VM also generates NOTIFY PEERS with operstate=UP during live migration. The difference is that only VyOS runs vyos-netlinkd to react to these events.

I also observed a second trigger class: a transient promiscuous-mode toggle on the WAN interface (from a tc ingress qdisc with mirred redirect to ifb0). The promisc toggle generated a single RTM_NEWLINK(UP) that seeded the same feedback loop. The initial trigger for the promisc toggle is still under investigation.

Confirmed from journald

08:15 event (ProxLB live migration, 10 restarts in 15 seconds):

08:15:21  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP      (migration seed)
08:15:21  vyos-netlinkd: Restarting dhclient@eth2.service...
08:15:24  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP      (from dhclient-script PREINIT)
08:15:24  vyos-netlinkd: Restarting dhclient@eth2.service...
   ... repeats 10x until 08:15:36 ...
08:16:03  dhclient: bound to <WAN IP>

10:08 event (spontaneous promisc toggle, 4 restarts):

10:08:12  kernel: virtio_net virtio3 eth2: entered promiscuous mode
10:08:12  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP
10:08:12  vyos-netlinkd: Restarting dhclient@eth2.service...
10:08:16  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP      (PREINIT feedback)
10:08:16  vyos-netlinkd: Restarting dhclient@eth2.service...
10:09:57  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP
10:09:57  vyos-netlinkd: Restarting dhclient@eth2.service...
10:10:00  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP
10:10:00  vyos-netlinkd: Restarting dhclient@eth2.service...

After hotpatching vyos-netlinkd with the state tracker described below, a controlled live migration produced zero DHCP restarts:

12:31:30  vyos-netlinkd: DHCP event: eth2 operstate=UP prev=UP
12:31:30  vyos-netlinkd: Suppressing DHCP restart for eth2: already UP
   ... 20+ suppressed events across eth0/eth1/eth2, zero restarts ...
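
To compare restart counts before and after the hotpatch, a log excerpt can be tallied per interface. The sketch below assumes the abbreviated line format used in the excerpts above (real journalctl output has additional prefix fields); count_restarts() and RESTART_RE are hypothetical names, and the sample text is the 10:08 event from this report.

```python
import re
from collections import Counter

# Matches the abbreviated restart lines quoted in this report (assumed format).
RESTART_RE = re.compile(
    r'vyos-netlinkd: Restarting dhclient@(?P<ifname>\S+?)\.service'
)


def count_restarts(log_text: str) -> Counter:
    """Return the number of dhclient restarts per interface in log_text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = RESTART_RE.search(line)
        if match:
            counts[match.group('ifname')] += 1
    return counts


SAMPLE = """\
10:08:12  kernel: virtio_net virtio3 eth2: entered promiscuous mode
10:08:12  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP
10:08:12  vyos-netlinkd: Restarting dhclient@eth2.service...
10:08:16  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP      (PREINIT feedback)
10:08:16  vyos-netlinkd: Restarting dhclient@eth2.service...
10:09:57  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP
10:09:57  vyos-netlinkd: Restarting dhclient@eth2.service...
10:10:00  vyos-netlinkd: RTM_NEWLINK -> eth2, state=UP
10:10:00  vyos-netlinkd: Restarting dhclient@eth2.service...
"""

if __name__ == '__main__':
    for ifname, n in count_restarts(SAMPLE).items():
        print(f'{ifname}: {n} restarts')
```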

Fix

Track per-interface previous operstate. Only restart DHCP on DOWN-to-UP transitions (or first boot where previous state is unknown), not on UP-to-UP re-notifications:

import syslog
from typing import Optional

# Last operstate seen per interface; empty after a daemon (re)start.
_iface_prev_state: dict[str, str] = {}

def _handle_dhcp_events(operstate: Optional[str], ifname: str) -> None:
    systemdV4_service = f'dhclient@{ifname}.service'
    systemdV6_service = f'dhcp6c@{ifname}.service'

    if operstate not in ['UP', 'DOWN']:
        return None

    # Record the new state before deciding, so every event updates the tracker.
    prev = _iface_prev_state.get(ifname)
    _iface_prev_state[ifname] = operstate

    if operstate == 'UP' and prev == 'UP':
        syslog.syslog(syslog.LOG_NOTICE,
                      f'Suppressing DHCP restart for {ifname}: already UP')
        return None

    if operstate == 'DOWN':
        # ... existing DOWN handler unchanged ...
        pass  # placeholder so this excerpt parses

    elif operstate == 'UP':
        # First UP after DOWN (or boot where prev=None) -- restart as before
        # ... existing UP handler unchanged ...
        pass  # placeholder so this excerpt parses

Edge cases:

First boot (prev=None, operstate=UP): restarts DHCP. Correct, this is the initial UP.
DOWN-to-UP (prev='DOWN', operstate=UP): restarts DHCP. Correct, this is a real link recovery.
UP-to-UP (prev='UP', operstate=UP): suppressed. This is the fix.
vyos-netlinkd restart: the dict resets to empty, so the next UP triggers one DHCP restart, as on first boot. Safe.

I have been running this hotpatch on my production WAN router since 2026-05-31. A controlled live migration immediately after deployment produced zero DHCP restarts (all UP-to-UP events suppressed) with no impact on normal DHCP operation.

Related tasks

T3852: duplicate dhclient processes on link replug (same root cause area, closed as resolved, predates vyos-netlinkd)
T5686: loss of connectivity on DHCP interfaces after link flap (same symptom class)
T8486: vyos-netlinkd high CPU (different issue, same daemon, recently fixed)
T8781: vyos-netlinkd high CPU with route updates (different issue, same daemon)
T3876/T5476: design and implementation of vyos-netlinkd replacing netplug

Environment

VyOS 2026.05.26-1327-rolling
Running as KVM guest on Proxmox VE 8.4 (QEMU 11.0.0, kernel 6.8)
WAN interface configured with address dhcp and address dhcpv6 with prefix delegation
ISC dhclient 4.4.3-P1 (the version shipped with this rolling build)
Python 3.12 (pyroute2 for netlink)

Details

Version: 2026.05.26-1327-rolling
Is it a breaking change?: Perfectly compatible
Issue type: Bug (incorrect behavior)

Related Objects

Mentioned In: T8975: ipsec: concurrent vti-up-down hook invocations lost-update /tmp/ipsec_vti_interfaces, stranding VTI interfaces admin-down
T8952: ipsec: expose per-peer unique setting for site-to-site connections
Mentioned Here: T3852: DHCP client issue - interface has two dhclient processes when link is unpluged and then plug again
T3876: Replace vyos-netplug with a VyOS link state monitor service
T5476: netplug: replace Perl helper scripts with a Python equivalent
T5686: Loss of connectivity on dhcp enabled ethernet interfaces after abrupt link restarts
T8486: vyos-netlinkd causes high CPU usage
T8781: vyos-netlinkd high CPU usage with lots of route updates

Event Timeline

cr0ntab created this task. Sun, May 31, 8:12 PM

Improved reproduction steps (no virtualization infrastructure needed):

On any VyOS instance with a DHCP-configured interface (VM or bare metal):

# Watch vyos-netlinkd in one terminal:
journalctl -fu vyos-netlinkd

# In another terminal, toggle promisc mode on the DHCP interface:
sudo ip link set dev eth0 promisc on && sudo ip link set dev eth0 promisc off

Replace eth0 with whichever interface is configured with address dhcp. On an unpatched system, vyos-netlinkd restarts dhclient; the restart's PREINIT runs ip link set dev eth0 up, which emits another RTM_NEWLINK(UP) and triggers another restart. The loop runs 10+ cycles.

PR with the fix: https://github.com/vyos/vyos-1x/pull/5242

cr0ntab mentioned this in T8952: ipsec: expose per-peer unique setting for site-to-site connections. Mon, Jun 1, 3:45 AM

pasik subscribed. Mon, Jun 1, 6:22 AM

Viacheslav changed the task status from Open to In progress. Mon, Jun 1, 11:51 AM

Viacheslav assigned this task to cr0ntab.

Viacheslav triaged this task as High priority.

cr0ntab mentioned this in T8975: ipsec: concurrent vti-up-down hook invocations lost-update /tmp/ipsec_vti_interfaces, stranding VTI interfaces admin-down. Wed, Jun 10, 12:40 AM