Skip to content

Watch dhcpd_leases changes#23072

Draft
nirs wants to merge 6 commits into
kubernetes:masterfrom
nirs:watch-dhcpd-leases
Draft

Watch dhcpd_leases changes#23072
nirs wants to merge 6 commits into
kubernetes:masterfrom
nirs:watch-dhcpd-leases

Conversation

@nirs

@nirs nirs commented May 30, 2026

Copy link
Copy Markdown
Collaborator

Replace polling when waiting for a DHCP lease with fsnotify-based file
watching. Parse the leases file only when it changes, log only on events.

Motivation

The old code polled /var/db/dhcpd_leases every 2 seconds while waiting for a
lease:

  1. Noisy logs — logs every 2 seconds, every noisy in the CI
  2. Wasted work — re-parsed the entire file even when nothing changed
  3. Still slow — a new lease could go undetected for up to 2 seconds

How it works

watcher

The watcher monitors /var/db/dhcpd_leases via fsnotify. It has no logging and
no polling loop — it just returns events to the caller.

Each call to watch() either returns immediately or blocks:

  • File missing: tries to start watching. If the file doesn't exist, returns
    fileMissing immediately. The caller decides how long to wait before retrying.
  • File exists: blocks on fsnotify events, a periodic fallback ticker (10s),
    or the deadline timer.

Events:

  • fileMissing — file does not exist, or was removed during watching (returned immediately, non-blocking)
  • fileCreated — file appeared, watching started
  • fileChanged — fsnotify detected a change (including atomic writes via rename/remove)
  • periodicCheck — 10s fallback in case fsnotify events are missed
  • deadlineExceeded — timeout

WaitForLease

WaitForLease drives the loop, handles all logging, and searches the leases
file only when the watcher reports a change.

On fileMissing: logs "waiting until file is created" once, then a "still
waiting" heartbeat every minute. Sleeps 1s before retrying.

On fileCreated / fileChanged / periodicCheck: searches the leases file
for the MAC address.

Flows

First boot, file exists (local)

The leases file exists with entries from other VMs, but our lease isn't there yet.

  1. fileCreated → search leases file → no match
  2. fileChanged → search leases file → found (dhcpd wrote the lease)

First boot, file missing (CI)

The leases file doesn't exist yet — common on CI runners.

  1. fileMissing → log "Leases file removed, waiting until … is created"
  2. fileMissing × N → silent (log "Still waiting …" every minute)
  3. fileCreated → search leases file → found

Second boot

The leases file exists and our lease is already in it.

  1. fileCreated → search leases file → found immediately

Example logs

First boot (CI) — file missing ~100s

File doesn't exist for 117 seconds — only 2 status lines during the wait
instead of "Searching for …" every 2 seconds.

I0601 21:58:49.682947   33845 main.go:144] libmachine: Nested VM detected, increasing timeout from 2m0s to 6m0s
I0601 21:58:49.682970   33845 main.go:144] libmachine: Waiting for DHCP lease for e6:f6:65:cb:17:82 (timeout 6m0s)
I0601 21:58:49.683129   33845 main.go:144] libmachine: Leases file removed, waiting until /var/db/dhcpd_leases is created ...
I0601 21:59:49.803267   33845 main.go:144] libmachine: Still waiting until /var/db/dhcpd_leases is created ...
I0601 22:00:46.881589   33845 main.go:144] libmachine: Leases file created, searching for e6:f6:65:cb:17:82 in /var/db/dhcpd_leases
I0601 22:00:46.881723   33845 main.go:144] libmachine: Found DHCP lease for e6:f6:65:cb:17:82: 192.168.65.2 in 117.140 seconds```

## Example logs - second boot (CI)

Lease already in the file from the first boot — found immediately.

I0601 22:01:45.849152 38317 main.go:144] libmachine: Nested VM detected, increasing timeout from 2m0s to 6m0s
I0601 22:01:45.849168 38317 main.go:144] libmachine: Waiting for DHCP lease for e6:f6:65:cb:17:82 (timeout 6m0s)
I0601 22:01:45.849344 38317 main.go:144] libmachine: Leases file created, searching for e6:f6:65:cb:17:82 in /var/db/dhcpd_leases
I0601 22:01:45.849425 38317 main.go:144] libmachine: Found DHCP lease for e6:f6:65:cb:17:82: 192.168.65.2 in 0.000 seconds```

Example logs - first start (local)

File exists with ~200 entries but our lease isn't there yet — we search once,
start watching, and find it ~1.3s later on the first fileChanged instead of
polling every 2 seconds.

I0531 23:59:41.611070   37847 main.go:144] libmachine: Waiting for DHCP lease for 5a:6b:43:02:61:96 (timeout 2m0s)
I0531 23:59:41.611182   37847 main.go:144] libmachine: Leases file created, searching for 5a:6b:43:02:61:96 in /var/db/dhcpd_leases
[snipped ~200 lease entries]
I0531 23:59:42.913415   37847 main.go:144] libmachine: Leases file changed, searching for 5a:6b:43:02:61:96 in /var/db/dhcpd_leases
I0531 23:59:42.913803   37847 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.27 HWAddress:5a:6b:43:02:61:96 ID:1,5a:6b:43:2:61:96 Lease:0x6a1ca396}
I0531 23:59:42.913811   37847 main.go:144] libmachine: Found DHCP lease for 5a:6b:43:02:61:96: 192.168.65.27 in 1.302 seconds

Example logs - second start (local)

Lease already in the file — found on the initial check, no watcher needed.

I0601 00:17:18.063291   38043 main.go:144] libmachine: Waiting for DHCP lease for 5a:6b:43:02:61:96 (timeout 2m0s)
I0601 00:17:18.063389   38043 main.go:144] libmachine: Leases file created, searching for 5a:6b:43:02:61:96 in /var/db/dhcpd_leases
I0601 00:17:18.063905   38043 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.27 HWAddress:5a:6b:43:02:61:96 ID:1,5a:6b:43:2:61:96 Lease:0x6a1ca529}
I0601 00:17:18.063911   38043 main.go:144] libmachine: Found DHCP lease for 5a:6b:43:02:61:96: 192.168.65.27 in 0.001 seconds

Based on #23084

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nirs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 30, 2026
@nirs nirs force-pushed the watch-dhcpd-leases branch 9 times, most recently from 11d9595 to 7d5d4ff Compare June 1, 2026 12:56
nirs added 5 commits June 1, 2026 20:54
Move DHCP lease types and functions from pkg/drivers/common into a
dedicated dhcp subpackage, preparing for deduplication of the setupIP
polling loop across vfkit, krunkit, qemu, and hyperkit drivers (kubernetes#21093)

Pure code movement with renames to fit the new package context:
- DHCPEntry → dhcp.Entry
- GetIPAddressByMACAddress → dhcp.IPAddressForMAC
- parseDHCPdLeasesFile → parseLeases
- getIPAddressFromFile → ipAddressFromFile
Replace the duplicated DHCP lease polling loops in vfkit, krunkit, and
qemu with a single dhcp.WaitForLease call.

Changes from the original polling loops:

- Switch from retry count to a timeout parameter, so the waiting
  strategy can be changed later (e.g. to fsnotify) without affecting
  callers.
- Include the NestedVM timeout multiplier inside WaitForLease so
  callers don't need to duplicate it.
- Keep firewall.UnblockBootpd in each driver's setupIP since this
  code is wrong and we don't want to move it into the new package.

How drivers are affected:

- vfkit, qemu: no behavior change — same polling logic moved to
  dhcp.WaitForLease (2s interval, 2min timeout, 3x for nested VMs).
- krunkit: now uses the same NestedVM timeout multiplier that vfkit
  and qemu had. Previously it was missing.
- hyperkit: no change — driver is deprecated and untested.
- Remove per-entry debug logging from ipAddressFromFile (200+ lines per
  poll on a typical Mac)
- Remove redundant "Found N entries" and "Found match" logs from
  ipAddressFromFile
- Move all logging to WaitForLease: one debug line per attempt with
  the MAC, path, and attempt number
- Add Infof at start ("Waiting for DHCP lease...") and on success
  ("Found DHCP lease... in N seconds"), matching WaitForSSHAccess style

Before (truncated):

    I0601 18:39:17.078574   63890 main.go:144] libmachine: Attempt 0
    I0601 18:39:17.078618   63890 main.go:144] libmachine: Searching for ca:69:e8:b4:7e:92 in /var/db/dhcpd_leases ...
    I0601 18:39:17.080875   63890 main.go:144] libmachine: unable to parse hw_address in dhcp leases file: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74": invalid MAC address: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74"
    I0601 18:39:17.082822   63890 main.go:144] libmachine: Found 212 entries in /var/db/dhcpd_leases!
    I0601 18:39:17.082832   63890 main.go:144] libmachine: dhcp entry: {Name:podman-krunkit IPAddress:192.168.64.2 HWAddress:7a:65:90:0f:bb:1c ID:1,7a:65:90:f:bb:1c Lease:0x6a1da917}
    I0601 18:39:17.082835   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.31 HWAddress:52:b7:40:f6:a0:76 ID:1,52:b7:40:f6:a0:76 Lease:0x6a1d93cd}
    I0601 18:39:17.082839   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.30 HWAddress:b2:77:60:58:94:56 ID:1,b2:77:60:58:94:56 Lease:0x6a1cd015}
    I0601 18:39:17.082845   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.28 HWAddress:de:00:f0:01:c5:b9 ID:1,de:0:f0:1:c5:b9 Lease:0x6a1ccf3e}
    I0601 18:39:17.082848   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.27 HWAddress:5a:6b:43:02:61:96 ID:1,5a:6b:43:2:61:96 Lease:0x6a1ca8e4}
    I0601 18:39:17.082851   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.26 HWAddress:a2:af:e7:af:d3:cd ID:1,a2:af:e7:af:d3:cd Lease:0x6a1ca2e8}
    I0601 18:39:17.082854   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.25 HWAddress:3e:0c:9c:58:67:a5 ID:1,3e:c:9c:58:67:a5 Lease:0x6a1ca0aa}
    I0601 18:39:17.082857   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.24 HWAddress:1e:07:57:0d:cf:60 ID:1,1e:7:57:d:cf:60 Lease:0x6a1a5f61}
    I0601 18:39:17.082860   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.23 HWAddress:66:b5:45:72:d3:5c ID:1,66:b5:45:72:d3:5c Lease:0x6a1a59f1}
    I0601 18:39:17.082863   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.22 HWAddress:26:ff:06:8b:a7:b2 ID:1,26:ff:6:8b:a7:b2 Lease:0x6a19a8ab}
    I0601 18:39:17.082868   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.21 HWAddress:c6:ec:f2:7e:b6:b4 ID:1,c6:ec:f2:7e:b6:b4 Lease:0x6a19a881}

    [snipped unhelpful 202 dhcp entires ...]

    I0601 18:39:19.084157   63890 main.go:144] libmachine: Attempt 1
    I0601 18:39:19.084210   63890 main.go:144] libmachine: Searching for ca:69:e8:b4:7e:92 in /var/db/dhcpd_leases ...
    I0601 18:39:19.084901   63890 main.go:144] libmachine: unable to parse hw_address in dhcp leases file: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74": invalid MAC address: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74"
    I0601 18:39:19.085992   63890 main.go:144] libmachine: Found 213 entries in /var/db/dhcpd_leases!
    I0601 18:39:19.086026   63890 main.go:144] libmachine: dhcp entry: {Name:minikube IPAddress:192.168.65.32 HWAddress:ca:69:e8:b4:7e:92 ID:1,ca:69:e8:b4:7e:92 Lease:0x6a1da9fe}
    I0601 18:39:19.086034   63890 main.go:144] libmachine: Found match: ca:69:e8:b4:7e:92
    I0601 18:39:19.086064   63890 main.go:144] libmachine: IP: 192.168.65.32

After:

    I0601 18:47:33.625633   64305 main.go:144] libmachine: Waiting for DHCP lease for ea:d0:8c:4d:f7:b7 (timeout 2m0s)
    I0601 18:47:33.625646   64305 main.go:144] libmachine: Searching for ea:d0:8c:4d:f7:b7 in /var/db/dhcpd_leases (attempt 0) ...
    I0601 18:47:33.625819   64305 main.go:144] libmachine: unable to parse hw_address in dhcp leases file: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74": invalid MAC address: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74"
    I0601 18:47:35.627333   64305 main.go:144] libmachine: Searching for ea:d0:8c:4d:f7:b7 in /var/db/dhcpd_leases (attempt 1) ...
    I0601 18:47:35.628264   64305 main.go:144] libmachine: unable to parse hw_address in dhcp leases file: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74": invalid MAC address: ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74"
    I0601 18:47:35.629414   64305 main.go:144] libmachine: Found DHCP lease for ea:d0:8c:4d:f7:b7: 192.168.65.33 in 2.004 seconds
    I0601 18:47:35.629452   64305 main.go:144] libmachine: Waiting until SSH server "192.168.65.33:22" is accessible
The macOS /var/db/dhcpd_leases file uses a type prefix in hw_address:

- "1,..." is an Ethernet MAC address (ARP hardware type 1)
- "ff,..." is a DUID (DHCP Unique Identifier)

The old code blindly skipped the first 2 characters (assuming "1,"),
which failed to parse DUID entries and logged a warning on every poll:

    unable to parse hw_address in dhcp leases file:
    ",f1:f5:dd:7f:0:2:0:0:ab:11:ad:75:7f:d2:b5:8d:7a:74":
    invalid MAC address: ...

Fix by splitting on "," and handling each type:

- "1": parse as Ethernet MAC address
- "ff": skip silently — these are created by VMs using systemd-networkd
  which sends a DUID instead of a MAC. Minikube VMs always use Ethernet
  so they never match these entries.
- other: warn about unknown type so we learn about new formats

Add a DUID entry to the test data (between two Ethernet entries) and
a test case verifying that Ethernet entries after a DUID are found.

Logs after this change (no more DUID warnings):

    Waiting for DHCP lease for ce:1b:6b:eb:28:a3 (timeout 2m0s)
    Searching for ce:1b:6b:eb:28:a3 in /var/db/dhcpd_leases (attempt 0) ...
    Searching for ce:1b:6b:eb:28:a3 in /var/db/dhcpd_leases (attempt 1) ...
    Found DHCP lease for ce:1b:6b:eb:28:a3: 192.168.65.34 in 2.004 seconds
Previously, parseMAC was called inside ipAddressFromFile on every poll
attempt. If the MAC was invalid, the parse error was returned, the loop
continued, and WaitForLease would keep retrying for 2 minutes (6 minutes
on nested VMs) before finally returning the parse error.

Fix by parsing the MAC once upfront in both WaitForLease and
IPAddressForMAC. An invalid MAC now fails immediately with a clear
error instead of silently retrying for minutes.
@nirs nirs force-pushed the watch-dhcpd-leases branch from 7d5d4ff to e1e239b Compare June 1, 2026 21:47
@nirs nirs changed the title wip: Watch dhcpd_leases changes Watch dhcpd_leases changes Jun 1, 2026
Replace DHCP lease polling with fsnotify file watching

Replace the 2-second polling loop in WaitForLease with event-driven
file watching using fsnotify. The leases file is now parsed only when
it actually changes, and the watcher includes a 10-second periodic
fallback in case fsnotify events are missed.

On CI first boot the file may not exist for minutes — the old code
logged every 2 seconds even though os.Open failed immediately.

Watcher
-------

The watcher monitors /var/db/dhcpd_leases via fsnotify. It has no
logging and no polling loop — it just returns events to the caller.

We watch the file directly rather than its parent directory (/var/db)
because macOS SIP protects some entries under /var/db, causing
fsnotify to fail with "operation not permitted".

Atomic writes (write tmp + rename) cause kqueue to lose the watch on
the original inode. We handle this by re-watching on Remove/Rename.

Events:
- fileMissing       — file does not exist (returned immediately)
- fileCreated       — file appeared, watching started
- fileChanged       — fsnotify detected a change
- periodicCheck     — 10s fallback in case events are missed
- deadlineExceeded  — timeout

WaitForLease
------------

Drives the loop, handles logging, searches the leases file only when
the watcher reports a change.

On fileMissing: logs "waiting until file is created" once, then
"still waiting" every minute. Sleeps 1s before retrying.

On fileCreated/fileChanged/periodicCheck: searches for the MAC.

Flows
-----

First boot, file exists (local):
1. fileCreated → search → no match
2. fileChanged → search → found (dhcpd wrote the lease)

First boot, file missing (CI):
1. fileMissing → log "waiting until file is created"
2. fileMissing × N → silent ("still waiting" every minute)
3. fileCreated → search → found

Second boot:
1. fileCreated → search → found immediately
@nirs nirs force-pushed the watch-dhcpd-leases branch from e1e239b to 4f33373 Compare June 1, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants