drivers: Allow parallel start for safe vm drivers#23112

Open: nirs wants to merge 3 commits into kubernetes:master from nirs:parallel-drivers

Conversation

@nirs

@nirs nirs commented Jun 6, 2026

Collaborator

Currently acquireMachinesLock uses a single lock per driver, serializing all "minikube start" commands for profiles using the same driver. This was needed for VirtualBox (VBoxManage cannot handle concurrent calls) but is unnecessary for drivers where each profile creates an independent VM or container with no shared global state.

In drenv (RamenDR test framework) we create 3 minikube clusters in parallel using kvm2. Due to the shared lock, cluster creation is serialized for ~25 seconds per cluster:

[hub] Cluster started in 41.09 seconds
[dr2] Cluster started in 64.46 seconds
[dr1] Cluster started in 89.48 seconds

All three clusters are requested at the same time, but the lock forces them to wait. The ~48 second spread between the first and last cluster is pure lock serialization overhead. The random lock acquisition order causes up to 50 seconds of runtime variance between runs (p95: 6m50s when both large clusters start first vs 7m40s when they start last).

Add a Parallel property to the driver registry. Drivers that set Parallel to true use a per-profile lock, allowing concurrent creation of multiple profiles. Drivers that do not set it (default false) keep the current serialized behavior.
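
The lock selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `driverDef` type, the `machinesLockName` function, and the lock name format are hypothetical; only the `Parallel` flag and the per-profile versus per-driver behavior come from the description.

```go
package main

import "fmt"

// driverDef is a hypothetical, trimmed-down stand-in for a driver registry
// entry; only the Parallel flag described above is modeled.
type driverDef struct {
	Name     string
	Parallel bool
}

// machinesLockName returns the name of the lock that a "minikube start" for
// the given profile would take. Parallel drivers get one lock per profile;
// all other drivers share a single lock per driver (the current behavior).
// The name format is an assumption made for this example.
func machinesLockName(d driverDef, profile string) string {
	if d.Parallel {
		return fmt.Sprintf("machines-%s-%s", d.Name, profile)
	}
	return fmt.Sprintf("machines-%s", d.Name)
}

func main() {
	kvm := driverDef{Name: "kvm2", Parallel: true}
	vbox := driverDef{Name: "virtualbox"} // Parallel defaults to false

	// Different profiles of a parallel driver take different locks.
	fmt.Println(machinesLockName(kvm, "dr1"))
	fmt.Println(machinesLockName(kvm, "dr2"))

	// Profiles of a serialized driver contend on the same lock.
	fmt.Println(machinesLockName(vbox, "dr1"))
	fmt.Println(machinesLockName(vbox, "dr2"))
}
```

Because the zero value of `Parallel` is false, a driver that does not opt in keeps the shared lock, which matches the conservative default described above.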

Enable parallel start for kvm2, qemu2, vfkit, krunkit, docker, and podman. These drivers create fully independent VMs or containers and have no shared state that requires serialization. Other drivers (virtualbox, vmware, hyperv, hyperkit) are left serialized until someone tests them and confirms they are safe.

vfkit test results

Tested with drenv, creating 6 small clusters in parallel. The mean start time for 6 clusters decreased from 36.7 seconds to 22.7 seconds (1.61x faster).

| Metric | master | parallel-drivers | Change |
|---|---|---|---|
| Mean | 36.7 s | 22.7 s | 0.62x |
| Median | 36.7 s | 22.8 s | 0.62x |
| Min | 33.1 s | 20.6 s | 0.62x |
| Max | 41.6 s | 24.4 s | 0.59x |
| Std Dev | 1.7 s | 0.8 s | 0.47x |
| p95 | 39.5 s | 24.1 s | 0.61x |
| Passed | 100 | 100 | - |
| Failed | 0 | 0 | - |

Before

% grep started out/master/050.log
2026-06-07 00:25:17,973 INFO    [c1] Cluster started in 13.44 seconds
2026-06-07 00:25:22,599 INFO    [c4] Cluster started in 18.07 seconds
2026-06-07 00:25:25,972 INFO    [c6] Cluster started in 21.43 seconds
2026-06-07 00:25:33,481 INFO    [c5] Cluster started in 28.95 seconds
2026-06-07 00:25:38,059 INFO    [c3] Cluster started in 33.53 seconds
2026-06-07 00:25:42,184 INFO    [c2] Cluster started in 37.65 seconds
2026-06-07 00:25:43,665 INFO    [provider] Environment started in 39.19 seconds

After

% grep started out/parallel-drivers/050.log 
2026-06-06 23:29:32,178 INFO    [c5] Cluster started in 20.28 seconds
2026-06-06 23:29:32,406 INFO    [c3] Cluster started in 20.52 seconds
2026-06-06 23:29:32,640 INFO    [c2] Cluster started in 20.75 seconds
2026-06-06 23:29:32,808 INFO    [c6] Cluster started in 20.92 seconds
2026-06-06 23:29:33,146 INFO    [c1] Cluster started in 21.25 seconds
2026-06-06 23:29:33,295 INFO    [c4] Cluster started in 21.39 seconds
2026-06-06 23:29:34,711 INFO    [provider] Environment started in 22.87 seconds

kvm test results

Tested with drenv, creating 6 small clusters in parallel. Testing shows that start time for 6 clusters decreased from 141.1 seconds to 63.3 seconds (2.2x faster).

| Metric | master | parallel-drivers | Change |
|---|---|---|---|
| Mean | 141.1 s | 63.3 s | 0.45x |
| Median | 140.5 s | 62.9 s | 0.45x |
| Min | 131.6 s | 60.6 s | 0.46x |
| Max | 159.7 s | 69.7 s | 0.44x |
| Std Dev | 4.3 s | 2.2 s | 0.51x |
| p95 | 148.6 s | 68.0 s | 0.46x |
| Passed | 98 | 98 | - |
| Failed | 2 | 2 | - |

Before

$ grep started out/master/010.log 
2026-06-07 17:15:18,653 INFO    [c4] Cluster started in 36.85 seconds
2026-06-07 17:15:37,320 INFO    [c2] Cluster started in 55.51 seconds
2026-06-07 17:15:58,168 INFO    [c3] Cluster started in 76.36 seconds
2026-06-07 17:16:16,929 INFO    [c5] Cluster started in 95.12 seconds
2026-06-07 17:16:35,841 INFO    [c6] Cluster started in 114.03 seconds
2026-06-07 17:16:55,975 INFO    [c1] Cluster started in 134.16 seconds
2026-06-07 17:16:58,029 INFO    [provider] Environment started in 136.29 seconds

After

$ grep started out/parallel-drivers/010.log 
2026-06-07 09:00:02,909 INFO    [c3] Cluster started in 58.72 seconds
2026-06-07 09:00:02,942 INFO    [c6] Cluster started in 58.75 seconds
2026-06-07 09:00:03,005 INFO    [c4] Cluster started in 58.81 seconds
2026-06-07 09:00:03,425 INFO    [c1] Cluster started in 59.24 seconds
2026-06-07 09:00:03,492 INFO    [c5] Cluster started in 59.30 seconds
2026-06-07 09:00:04,661 INFO    [c2] Cluster started in 60.47 seconds
2026-06-07 09:00:07,576 INFO    [provider] Environment started in 63.46 seconds

drenv stress test results

The drenv stress test starts 3 minikube clusters using the kvm driver and provisions them for DR testing. Tested with the latest minikube release (1.38.1) and with this PR.

The test shows that the mean start time decreased from 426.5 s to 406.7 s (1.05x faster), and the standard deviation decreased from 16.9 s to 10.0 s (1.69x better).

| Metric | 1.38.1 | parallel-drivers | Change |
|---|---|---|---|
| Mean | 426.5 s | 406.7 s | 0.95x |
| Median | 432.0 s | 404.8 s | 0.94x |
| Min | 396.4 s | 392.5 s | 0.99x |
| Max | 463.3 s | 460.9 s | 0.99x |
| Std Dev | 16.9 s | 10.0 s | 0.59x |
| p95 | 454.1 s | 417.5 s | 0.92x |
| Passed | 99 | 97 | - |
| Failed | 1 | 3 | - |

Notes: all failures are known issues, not related to this change.

Before - drenv addon times

With the latest release, the cluster start order is random, which has a big effect on addon deployment times.

[Image: addons-trimmed]

After - drenv addon times

With this change all clusters start in parallel, which makes addon start times much more consistent and predictable.

[Image: addons-trimmed]

@nirs nirs added this to the minikube v1.39 milestone Jun 6, 2026
@k8s-ci-robot

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2026
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nirs

@k8s-ci-robot k8s-ci-robot requested a review from medyagh June 6, 2026 16:26
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 6, 2026
@nirs

nirs commented Jun 6, 2026

Collaborator Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 6, 2026
@nirs

nirs commented Jun 6, 2026

Collaborator Author

/retest-required

nirs added a commit to nirs/ramen that referenced this pull request Jun 6, 2026
Add envs/provider.yaml creating 6 tiny clusters in parallel for testing
minikube parallel clusters creation.

Related-to: kubernetes/minikube#23112
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 7, 2026
@nirs nirs force-pushed the parallel-drivers branch 2 times, most recently from 54ddce7 to e761875 Compare June 7, 2026 04:20
@nirs nirs force-pushed the parallel-drivers branch from e761875 to 85ef98f Compare June 7, 2026 04:41
@nirs nirs force-pushed the parallel-drivers branch from ce061a9 to 4bc7966 Compare June 7, 2026 06:03

Replace exponential backoff retry with a simple 2s polling loop,
consistent with other drivers (vfkit, krunkit, qemu).

Before (exponential backoff, actual timestamps from logs):

   0.000s  attempt 1
   0.219s  attempt 2
   0.492s  attempt 3
   0.900s  attempt 4
   1.284s  attempt 5
   2.016s  attempt 6
   2.952s  attempt 7
   4.025s  attempt 8
   5.103s  attempt 9
   6.477s  attempt 10
   8.066s  attempt 11
  10.230s  attempt 12
  12.566s  attempt 13
  15.990s  attempt 14
          (retry.go: will retry after 5.3s)
  21.285s  attempt 15  <-- found IP

After (fixed 2s interval, actual timestamps from logs):

   0.000s  attempt 1
   2.005s  attempt 2
   4.010s  attempt 3
   6.015s  attempt 4
   8.020s  attempt 5
  10.025s  attempt 6
  12.031s  attempt 7
  14.037s  attempt 8
  16.042s  attempt 9
  18.046s  attempt 10  <-- found IP

The first 7 attempts in the old schedule fire within 2.9 seconds at
sub-second intervals. DHCP negotiation cannot complete this fast, so
every one of these attempts is guaranteed to fail — they only add load
on libvirt.

The later attempts (12-15) each waited more than 2 seconds, delaying IP
detection; in the runs above the IP was found 3.2 seconds later than with
the fixed interval. Exponential backoff is not the right tool for this job.

- Pass the domain object to avoid LookupDomainByName on every iteration
- Use d.PrivateMAC directly instead of parsing domain XML for the MAC
- Log start with timeout and finish with elapsed time
- Log attempt count for easier debugging
- Extract lookupIP and reserveIP to clarify the wait loop flow
@nirs nirs force-pushed the parallel-drivers branch from 4bc7966 to 5e6c2a9 Compare June 7, 2026 16:29
The KVM driver's GetIP() re-queries libvirt for the IP address on every
call instead of returning the stored d.IPAddress. This causes a bug
under parallel stress:

Call chain:
  1. waitForStaticIP() finds IP, sets d.IPAddress = "192.168.x.x"
  2. Start() returns successfully
  3. saveHost() calls api.Save(h) — persists driver JSON (with IP set)
  4. saveHost() calls h.Driver.GetIP() to store in node config (n.IP)
  5. GetIP() ignores d.IPAddress, opens 2 new libvirt connections,
     calls ipFromXML() which queries dhcpLease() for the MAC
  6. Under stress, dhcpLease() returns nil (lease not yet visible in
     the network's DHCP database) — returns ("", nil)
  7. saveHost() stores n.IP = ""
  8. setupKubeconfig() calls ControlPlaneEndpoint() with cp.IP = ""
  9. net.LookupIP("") fails → "failed to lookup ip for \"\""

Fix by returning the stored d.IPAddress when available (fast path),
falling back to querying libvirt only when the IP is not stored and
the domain is running. This is consistent with vfkit and krunkit which
also return the stored IP directly.

The fallback queries the domain's live interface addresses
(ListAllInterfaceAddresses) rather than the network's DHCP lease
database (GetDHCPLeases). This is more reliable because:
- The domain's interfaces reflect the actual assigned IP
- The lease database may be stale when addStaticIP races with
  concurrent network updates from other VMs
- No need to look up the MAC from domain XML (d.PrivateMAC is known)
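
The lookup order described above can be sketched as follows. This is a simplified model, not the KVM driver: the `driver` type, its `running` field, and the `queryLive` callback are hypothetical stand-ins for the real driver state and the libvirt interface query, and returning an error for a stopped domain is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// driver is a hypothetical stand-in for the KVM driver; only the fields
// needed to show the lookup order are modeled.
type driver struct {
	IPAddress string                 // IP stored by the wait loop, if any
	running   bool                   // whether the domain is running
	queryLive func() (string, error) // stand-in for the libvirt fallback query
}

// GetIP returns the stored IP when available and falls back to a live query
// only when nothing is stored and the domain is running.
func (d *driver) GetIP() (string, error) {
	if d.IPAddress != "" {
		return d.IPAddress, nil // fast path: no libvirt round trip
	}
	if !d.running {
		// Assumption for this sketch: report an error for a stopped domain.
		return "", errors.New("domain is not running")
	}
	return d.queryLive()
}

func main() {
	// Simulate the failure mode from the call chain: the live query
	// returns an empty IP and no error.
	flaky := func() (string, error) { return "", nil }

	d := &driver{IPAddress: "192.168.39.10", running: true, queryLive: flaky}
	ip, err := d.GetIP()
	fmt.Printf("%q %v\n", ip, err)
}
```

Because the stored address wins, a transiently empty answer from the live query can no longer overwrite a known-good IP in the node config.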
@nirs nirs force-pushed the parallel-drivers branch from 5e6c2a9 to df40b8b Compare June 7, 2026 17:20
@minikube-pr-bot


kvm2 driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 23112 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 39.4s    │ 36.8s                  │
│ enable ingress │ 18.3s    │ 17.7s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube start: 40.8s 36.9s 40.2s 39.5s 39.8s
Times for minikube (PR 23112) start: 37.2s 36.2s 36.9s 37.4s 36.3s

Times for minikube ingress: 15.3s 19.2s 18.8s 18.7s 19.3s
Times for minikube (PR 23112) ingress: 15.8s 18.7s 19.8s 15.3s 18.8s

docker driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 23112 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 18.7s    │ 20.4s                  │
│ enable ingress │ 12.6s    │ 12.3s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube ingress: 12.6s 12.6s 12.6s 12.6s 12.6s
Times for minikube (PR 23112) ingress: 13.1s 10.6s 12.6s 12.6s 12.6s

Times for minikube start: 18.5s 17.9s 21.6s 18.1s 17.5s
Times for minikube (PR 23112) start: 18.3s 20.8s 21.3s 20.9s 20.5s

docker driver with containerd runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 23112 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 17.5s    │ 17.5s                  │
│ enable ingress │ 23.8s    │ 23.4s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube start: 16.4s 19.3s 16.5s 19.5s 15.9s
Times for minikube (PR 23112) start: 19.7s 19.0s 16.2s 16.4s 16.4s

Times for minikube ingress: 23.6s 24.1s 24.1s 23.6s 23.6s
Times for minikube (PR 23112) ingress: 23.6s 23.1s 23.6s 23.6s 23.1s

@nirs

nirs commented Jun 7, 2026

Collaborator Author

/retest

@nirs nirs marked this pull request as ready for review June 7, 2026 22:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2026
@k8s-ci-robot k8s-ci-robot requested a review from prezha June 7, 2026 22:22
@k8s-ci-robot

k8s-ci-robot commented Jun 7, 2026

Contributor

@nirs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-minikube-docker-crio-linux-x86 | df40b8b | link | false | /test pull-minikube-docker-crio-linux-x86 |

@nirs

nirs commented Jun 8, 2026

Collaborator Author

/retest-required

@nirs

nirs commented Jun 8, 2026

Collaborator Author

@obnoxxx can you review?

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

3 participants