Feature Request: VM Stop/Resume with CPU/Memory Release for Cost Optimization #4471

@vivek7405

Description

Currently, Ubicloud VMs can only be in "running" or "deleted" states from a user's perspective. There's no way to temporarily stop a VM to save costs while preserving its storage and configuration.

I noticed there's an internal stopped state and incr_stop semaphore in prog/vm/metal/nexus.rb, but it:

  1. Doesn't release CPU cores or hugepages back to the host
  2. Continues billing for the stopped VM
  3. Has no corresponding resume capability
  4. Isn't exposed via the public API

Proposed Solution

Implement full VM stop/resume functionality that:

  1. Stops the VM - Shuts down Cloud Hypervisor process via systemd
  2. Releases resources - Frees CPU cores and hugepages on the host
  3. Finalizes billing - Ends current billing records
  4. Preserves state - Keeps storage volumes, network config, firewall rules
  5. Resumes on demand - Reallocates resources and restarts VM
  6. Handles edge cases - Queues or migrates if original host lacks capacity

Technical Approach

Based on my exploration of the codebase, here's a proposed implementation:

1. Extend stopped state to release resources

# prog/vm/metal/nexus.rb

label def stopped
  when_stop_set? do
    # Stop the VM
    host.sshable.cmd("sudo systemctl stop :vm_name", vm_name:)

    # Release CPU and memory back to host
    vm_host.update(
      used_cores: Sequel[:used_cores] - vm.cores,
      used_hugepages_1g: Sequel[:used_hugepages_1g] - vm.memory_gib
    )

    # For sliced VMs
    if vm.vm_host_slice
      vm.vm_host_slice.update(
        used_cpu_percent: Sequel[:used_cpu_percent] - vm.cpu_percent_limit,
        used_memory_gib: Sequel[:used_memory_gib] - vm.memory_gib
      )
    end

    # Finalize billing records
    active_billing_records.each(&:finalize)

    vm.update(display_state: "stopped")
  end

  decr_stop

  # Check for resume signal
  when_resume_set? do
    hop_resuming
  end

  nap 60
end

2. Add new resuming state

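label def resuming
  # Consume the resume signal so this label is not re-triggered
  decr_resume

  # If the original host is full, hand off before touching any counters.
  # Option 1: Wait for capacity
  # Option 2: Migrate to different host with capacity
  # hop_* transfers control, so nothing below runs in that case.
  hop_find_new_host unless host_has_capacity?

  # Reallocate resources once (mirror of the release in stopped)
  vm_host.update(
    used_cores: Sequel[:used_cores] + vm.cores,
    used_hugepages_1g: Sequel[:used_hugepages_1g] + vm.memory_gib
  )

  # For sliced VMs, restore what stopped released
  if vm.vm_host_slice
    vm.vm_host_slice.update(
      used_cpu_percent: Sequel[:used_cpu_percent] + vm.cpu_percent_limit,
      used_memory_gib: Sequel[:used_memory_gib] + vm.memory_gib
    )
  end

  # Restart systemd service
  host.sshable.cmd("sudo systemctl start :vm_name", vm_name:)

  # Create new billing records
  create_billing_records

  vm.update(display_state: "starting")

  hop_wait_sshable
end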

def host_has_capacity?
  vm_host.used_cores + vm.cores <= vm_host.total_cores &&
    vm_host.used_hugepages_1g + vm.memory_gib <= vm_host.total_hugepages_1g
end
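
The release-on-stop and reserve-on-resume accounting can be exercised in isolation. The sketch below is a minimal in-memory model, not Ubicloud code: `HostAccounting` is a hypothetical stand-in whose fields only mirror the column names used above, and a real implementation would need the check and the update to happen atomically in the database.

```ruby
# Hypothetical in-memory stand-in for the vm_host counters (not a Ubicloud model).
HostAccounting = Struct.new(:total_cores, :used_cores,
                            :total_hugepages_1g, :used_hugepages_1g) do
  # Same condition as host_has_capacity? above.
  def capacity_for?(cores, memory_gib)
    used_cores + cores <= total_cores &&
      used_hugepages_1g + memory_gib <= total_hugepages_1g
  end

  # Reserve only when both resources fit; returns true on success.
  def reserve(cores, memory_gib)
    return false unless capacity_for?(cores, memory_gib)
    self.used_cores += cores
    self.used_hugepages_1g += memory_gib
    true
  end

  # Give resources back to the host (what stop would do).
  def release(cores, memory_gib)
    self.used_cores -= cores
    self.used_hugepages_1g -= memory_gib
  end
end

host = HostAccounting.new(8, 6, 32, 24)
host.release(2, 8)                   # a 2-core / 8 GiB VM is stopped
other_vm_placed = host.reserve(4, 16) # another VM takes the freed capacity
resume_ok = host.reserve(2, 8)        # the stopped VM tries to resume
```

In this run the second `reserve` returns `false`, which is exactly the "no capacity on resume" case discussed under Edge Cases.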

3. Add resume semaphore

# model/vm.rb
semaphore :resume  # Add to existing semaphores

4. Add API endpoints

# routes/project/location/vm.rb

# Stop VM - releases CPU/memory, stops billing
post "/:vm_name/stop" do
  authorize("Vm:stop", @location.id)

  vm = @project.vms_dataset.where(location: @location.name, name: params[:vm_name]).first
  raise ResourceNotFound, "VM not found" unless vm
  raise InvalidRequest, "VM is not running" unless vm.display_state == "running"

  vm.incr_stop

  serialize(vm, :detailed)
end

# Resume VM - reallocates resources, resumes billing
post "/:vm_name/resume" do
  authorize("Vm:resume", @location.id)

  vm = @project.vms_dataset.where(location: @location.name, name: params[:vm_name]).first
  raise ResourceNotFound, "VM not found" unless vm
  raise InvalidRequest, "VM is not stopped" unless vm.display_state == "stopped"

  vm.incr_resume

  serialize(vm, :detailed)
end

5. Update VM serializer

# serializers/vm.rb
def self.serialize_detailed(vm)
  {
    # ... existing fields ...
    can_stop: vm.display_state == "running",
    can_resume: vm.display_state == "stopped",
  }
end

State Diagram

                         User Actions
                              │
              ┌───────────────┴───────────────┐
              │                               │
              ▼                               ▼
         POST /stop                    POST /resume
              │                               │
              ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │     running     │             │     stopped     │
    │     (wait)      │◄───────────►│                 │
    │                 │             │  • No CPU used  │
    │  • CPU allocated│             │  • No RAM used  │
    │  • RAM allocated│             │  • No billing   │
    │ • Billing active│             │  • Storage kept │
    └────────┬────────┘             └────────┬────────┘
             │                               │
             │      POST /delete             │
             ▼                               ▼
    ┌─────────────────────────────────────────────────┐
    │                    destroyed                     │
    │         (all resources released)                 │
    └─────────────────────────────────────────────────┘
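
The user-visible transitions in the diagram can also be written as a lookup table. This is an illustrative sketch only: `TRANSITIONS` and `next_state` are hypothetical names, and the intermediate internal states (such as resuming) are deliberately left out.

```ruby
# Hypothetical transition table for the user-visible states in the diagram.
TRANSITIONS = {
  "running" => {stop: "stopped", delete: "destroyed"},
  "stopped" => {resume: "running", delete: "destroyed"},
  "destroyed" => {}
}.freeze

# Returns the state reached by applying action, or raises if the
# action is not allowed from the current state.
def next_state(state, action)
  TRANSITIONS.fetch(state).fetch(action) do
    raise ArgumentError, "cannot #{action} a VM that is #{state}"
  end
end

after_stop = next_state("running", :stop)
after_resume = next_state(after_stop, :resume)
```

Rejecting invalid actions here corresponds to the `InvalidRequest` checks in the proposed API endpoints.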

Edge Cases to Handle

| Scenario | Proposed Behavior |
|---|---|
| Resume but host is full | Queue and retry, or offer migration |
| Host rebooted while VM stopped | VM stays stopped (no auto-start) |
| Stop during active SSH session | Warn user, proceed with stop |
| Network/IP address | Keep allocated (user expectation) |
| Attached volumes | Keep attached, unmount cleanly |
| Firewall rules | Preserve, reapply on resume |
| Load balancer membership | Remove on stop, re-add on resume |
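
For the "host is full" row, one possible decision order is: resume in place if the original host fits, otherwise migrate if allowed and a candidate fits, otherwise queue. The sketch below only illustrates that ordering. `Host`, `plan_resume`, and `allow_migration` are hypothetical names, and it ignores the cost of moving storage, which a real migration would have to address.

```ruby
# Hypothetical view of a host's free capacity (not a Ubicloud model).
Host = Struct.new(:name, :free_cores, :free_memory_gib) do
  def fits?(cores, memory_gib)
    free_cores >= cores && free_memory_gib >= memory_gib
  end
end

# Returns [decision, host_name]; host_name is nil when the VM is queued.
def plan_resume(original, candidates, cores:, memory_gib:, allow_migration: true)
  return [:resume_in_place, original.name] if original.fits?(cores, memory_gib)

  if allow_migration
    target = candidates.find { |h| h.fits?(cores, memory_gib) }
    return [:migrate, target.name] if target
  end

  [:queue, nil]
end

full_host = Host.new("host-a", 1, 4)
spare_host = Host.new("host-b", 8, 64)

with_migration = plan_resume(full_host, [spare_host], cores: 2, memory_gib: 8)
without_migration = plan_resume(full_host, [spare_host], cores: 2, memory_gib: 8,
                                allow_migration: false)
```

Here `with_migration` selects `host-b`, while `without_migration` falls back to queueing until `host-a` has room.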

Billing Considerations

Proposed billing model for stopped VMs:

| Resource | While Running | While Stopped |
|---|---|---|
| vCPU | Billed | Not billed |
| Memory | Billed | Not billed |
| Storage | Billed | Billed (still allocated) |
| IPv4 address | Billed | Billed (still reserved) |
| IPv6 address | Free | Free |

This matches user expectations - you pay for what you're using, but reserved resources (storage, IP) still cost money.
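
To make the table concrete, here is a small worked example. The hourly rates are invented for illustration and are not Ubicloud's prices. vCPU and memory are folded into a single hypothetical `compute` rate, and IPv6 is omitted because it is free in both states.

```ruby
# Hypothetical hourly rates in micro-dollars; NOT Ubicloud's actual prices.
RATES = {compute: 40_000, storage: 5_000, ipv4: 4_000}.freeze

# Resources that keep accruing charges while the VM is stopped.
BILLED_WHILE_STOPPED = %i[storage ipv4].freeze

# Total cost in micro-dollars for a period split into running and stopped hours.
def monthly_cost(running_hours:, stopped_hours:)
  RATES.sum do |resource, rate|
    hours = running_hours
    hours += stopped_hours if BILLED_WHILE_STOPPED.include?(resource)
    rate * hours
  end
end

# 30-day month = 720 hours; "work hours" = 8 h/day on 22 days = 176 hours.
always_on = monthly_cost(running_hours: 720, stopped_hours: 0)
work_hours_only = monthly_cost(running_hours: 176, stopped_hours: 544)
```

With these assumed rates, `always_on` comes to 35,280,000 micro-dollars ($35.28) and `work_hours_only` to 13,520,000 ($13.52). The remaining cost in the second case is mostly storage and the IPv4 address, which stay billed throughout.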

Benefits

  1. Cost savings for users - Significant savings for dev/staging workloads
  2. Better resource utilization - Stopped VMs free up host capacity for others
  3. CI/CD optimization - Runners can be stopped between jobs

Alternatives Considered

  1. Snapshot and delete - More complex, longer resume time, loses ephemeral state
  2. Hibernate to disk - Cloud Hypervisor doesn't support this well
  3. Keep billing while stopped - Poor user experience, not competitive

Willingness to Contribute

I'm happy to implement this feature and submit a PR. I've reviewed the codebase and believe the implementation is straightforward given the existing stopped state foundation.

Before starting, I'd like to:

  1. Confirm this aligns with Ubicloud's roadmap
  2. Discuss the billing model for stopped VMs
  3. Get guidance on handling the "no capacity on resume" edge case

Related Code

  • prog/vm/metal/nexus.rb - VM state machine (existing stopped state at line 241)
  • model/vm.rb - VM model with semaphores
  • model/vm_host.rb - Host resource tracking
  • model/billing_record.rb - Billing record management
  • routes/project/location/vm.rb - API routes
  • rhizome/host/lib/cloud_hypervisor.rb - Cloud Hypervisor integration
