Feature Request: VM Stop/Resume with CPU/Memory Release for Cost Optimization #4471

Currently, Ubicloud VMs can only be in "running" or "deleted" states from a user's perspective. There's no way to temporarily stop a VM to save costs while preserving its storage and configuration.

I noticed there's an internal `stopped` state and `incr_stop` semaphore in `prog/vm/metal/nexus.rb`, but it:
1. Doesn't release CPU cores or hugepages back to the host
2. Continues billing for the stopped VM
3. Has no corresponding resume capability
4. Isn't exposed via the public API

### Proposed Solution

Implement full VM stop/resume functionality that:

1. **Stops the VM** - Shuts down the Cloud Hypervisor process via systemd
2. **Releases resources** - Frees CPU cores and hugepages on the host
3. **Finalizes billing** - Ends the current billing records
4. **Preserves state** - Keeps storage volumes, network config, and firewall rules
5. **Resumes on demand** - Reallocates resources and restarts the VM
6. **Handles edge cases** - Queues the resume or migrates the VM if the original host lacks capacity

### Technical Approach

Based on my exploration of the codebase, here's a proposed implementation:

#### 1. Extend `stopped` state to release resources

```ruby
# prog/vm/metal/nexus.rb

label def stopped
  when_stop_set? do
    # Stop the VM's systemd unit (this ends the Cloud Hypervisor process)
    host.sshable.cmd("sudo systemctl stop :vm_name", vm_name:)

    # Release CPU and memory back to the host. Updating through the row's
    # dataset (Sequel's Model#this) keeps the arithmetic in SQL, so a
    # concurrent change to the same host row is not overwritten.
    host.this.update(
      used_cores: Sequel[:used_cores] - vm.cores,
      used_hugepages_1g: Sequel[:used_hugepages_1g] - vm.memory_gib
    )

    # For sliced VMs
    if vm.vm_host_slice
      vm.vm_host_slice.this.update(
        used_cpu_percent: Sequel[:used_cpu_percent] - vm.cpu_percent_limit,
        used_memory_gib: Sequel[:used_memory_gib] - vm.memory_gib
      )
    end

    # Finalize billing records
    vm.active_billing_records.each(&:finalize)

    vm.update(display_state: "stopped")
  end

  decr_stop

  # Check for resume signal
  when_resume_set? do
    hop_resuming
  end

  nap 60
end
```

#### 2. Add new `resuming` state

```ruby
label def resuming
  # If the original host is full, do not allocate here.
  # Option 1: Wait for capacity
  # Option 2: Migrate to different host with capacity
  # (find_new_host is a label that would still need to be written)
  hop_find_new_host unless host_has_capacity?

  # Reallocate resources on the current host
  host.this.update(
    used_cores: Sequel[:used_cores] + vm.cores,
    used_hugepages_1g: Sequel[:used_hugepages_1g] + vm.memory_gib
  )
  # Sliced VMs would need the matching vm_host_slice increments here.

  # Restart systemd service
  host.sshable.cmd("sudo systemctl start :vm_name", vm_name:)

  # Create new billing records
  create_billing_records

  vm.update(display_state: "starting")

  # Clear the signal so the next stop does not resume immediately
  decr_resume

  hop_wait_sshable
end

def host_has_capacity?
  host.used_cores + vm.cores <= host.total_cores &&
    host.used_hugepages_1g + vm.memory_gib <= host.total_hugepages_1g
end
```
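
One thing to watch in the `resuming` sketch: the capacity check and the update are two separate steps, so two VMs resuming at the same time could both pass the check. Below is a minimal sketch of doing the check and the reservation as one step. `HostCapacity` is a hypothetical in-memory class, not Ubicloud code, and the mutex stands in for what would more likely be a single conditional SQL update in the real system.

```ruby
# Hypothetical in-memory stand-in for a host row. Field names mirror the
# columns used in the proposal, but this class is not part of Ubicloud.
class HostCapacity
  attr_reader :used_cores, :used_hugepages_1g

  def initialize(total_cores:, total_hugepages_1g:)
    @total_cores = total_cores
    @total_hugepages_1g = total_hugepages_1g
    @used_cores = 0
    @used_hugepages_1g = 0
    @lock = Mutex.new
  end

  # Checks and reserves inside one critical section, so two concurrent
  # resumes cannot both claim the last free cores. Returns true on success.
  def reserve(cores:, memory_gib:)
    @lock.synchronize do
      fits = @used_cores + cores <= @total_cores &&
        @used_hugepages_1g + memory_gib <= @total_hugepages_1g
      return false unless fits

      @used_cores += cores
      @used_hugepages_1g += memory_gib
      true
    end
  end

  # Gives resources back when a VM stops.
  def release(cores:, memory_gib:)
    @lock.synchronize do
      @used_cores -= cores
      @used_hugepages_1g -= memory_gib
    end
  end
end

host = HostCapacity.new(total_cores: 4, total_hugepages_1g: 16)
host.reserve(cores: 2, memory_gib: 8)  # => true
host.reserve(cores: 4, memory_gib: 8)  # => false, would exceed 4 cores
```

A failed `reserve` is the point where the "wait for capacity" or "migrate" options would kick in.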

#### 3. Add `resume` semaphore

```ruby
# model/vm.rb
semaphore :resume  # Add to existing semaphores
```

#### 4. Add API endpoints

```ruby
# routes/project/location/vm.rb

# Stop VM - releases CPU/memory, stops billing
post "/:vm_name/stop" do
  authorize("Vm:stop", @location.id)

  vm = @project.vms_dataset.where(location: @location.name, name: params[:vm_name]).first
  raise ResourceNotFound, "VM not found" unless vm
  raise InvalidRequest, "VM is not running" unless vm.display_state == "running"

  vm.incr_stop

  serialize(vm, :detailed)
end

# Resume VM - reallocates resources, resumes billing
post "/:vm_name/resume" do
  authorize("Vm:resume", @location.id)

  vm = @project.vms_dataset.where(location: @location.name, name: params[:vm_name]).first
  raise ResourceNotFound, "VM not found" unless vm
  raise InvalidRequest, "VM is not stopped" unless vm.display_state == "stopped"

  vm.incr_resume

  serialize(vm, :detailed)
end
```
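
For illustration, here is roughly how a client might call the proposed endpoints. This is a sketch only: the base URL, the `/project/.../location/.../vm/...` path layout, and the bearer-token header are assumptions for the example, and `vm_action_request` is a hypothetical helper. It builds the request without sending it.

```ruby
require "net/http"
require "uri"

# Builds (but does not send) a POST for the proposed stop/resume endpoints.
# Base URL, path layout, and auth header are assumptions, not a documented API.
def vm_action_request(base_url, project_id:, location:, vm_name:, action:, token:)
  raise ArgumentError, "unknown action: #{action}" unless %w[stop resume].include?(action)

  uri = URI.join(base_url, "/project/#{project_id}/location/#{location}/vm/#{vm_name}/#{action}")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{token}"
  request["Accept"] = "application/json"
  [uri, request]
end

uri, request = vm_action_request(
  "https://api.example.com",
  project_id: "my-project", location: "example-location",
  vm_name: "dev-box", action: "stop", token: "placeholder-token"
)

# Sending is left to the caller, for example:
# Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
```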

#### 5. Update VM serializer

```ruby
# serializers/vm.rb
def self.serialize_detailed(vm)
  {
    # ... existing fields ...
    can_stop: vm.display_state == "running",
    can_resume: vm.display_state == "stopped",
  }
end
```

### State Diagram

```
    ┌──────────────────────┐   POST /stop    ┌──────────────────────┐
    │       running        │────────────────►│       stopped        │
    │        (wait)        │                 │                      │
    │                      │◄────────────────│  • No CPU used       │
    │  • CPU allocated     │  POST /resume   │  • No RAM used       │
    │  • RAM allocated     │                 │  • Compute unbilled  │
    │  • Billing active    │                 │  • Storage kept      │
    └──────────┬───────────┘                 └──────────┬───────────┘
               │              POST /delete              │
               ▼                                        ▼
    ┌───────────────────────────────────────────────────────────────┐
    │                           destroyed                           │
    │                   (all resources released)                    │
    └───────────────────────────────────────────────────────────────┘
```
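
The user-facing transitions in the diagram can also be written down as a small lookup table, which makes the "not running" / "not stopped" guards in the API fall out naturally. The state and action names below are illustrative symbols, not Ubicloud identifiers.

```ruby
# Allowed user-triggered transitions, one entry per state in the diagram.
# Names are illustrative only.
TRANSITIONS = {
  running: {stop: :stopped, delete: :destroyed},
  stopped: {resume: :running, delete: :destroyed},
  destroyed: {}
}.freeze

# Returns the next state, or raises if the action is not valid in this state.
def next_state(state, action)
  TRANSITIONS.fetch(state).fetch(action) do
    raise ArgumentError, "cannot #{action} a VM that is #{state}"
  end
end

next_state(:running, :stop)    # => :stopped
next_state(:stopped, :resume)  # => :running
```

Calling `next_state(:stopped, :stop)` raises `ArgumentError`, which corresponds to the `InvalidRequest` case in the route sketch.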

### Edge Cases to Handle

| Scenario | Proposed Behavior |
|----------|-------------------|
| Resume but host is full | Queue and retry, or offer migration |
| Host rebooted while VM stopped | VM stays stopped (no auto-start) |
| Stop during active SSH session | Warn user, proceed with stop |
| Network/IP address | Keep allocated (user expectation) |
| Attached volumes | Keep attached, unmount cleanly |
| Firewall rules | Preserve, reapply on resume |
| Load balancer membership | Remove on stop, re-add on resume |
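
For the "host is full" row, the decision could be sketched as a three-way choice: stay on the original host, move to another host, or queue and retry. `HostInfo` and `plan_resume` are hypothetical names for this example, and the sketch ignores how the VM's storage would follow it to a new host.

```ruby
# Hypothetical host record, for illustration only.
HostInfo = Struct.new(:name, :free_cores, :free_memory_gib, keyword_init: true) do
  def fits?(cores, memory_gib)
    free_cores >= cores && free_memory_gib >= memory_gib
  end
end

# Decides where a resuming VM should go:
#   [:original, host]  the original host still has room
#   [:migrate, host]   another host has room
#   [:queue, nil]      nothing fits, retry the resume later
def plan_resume(original, others, cores:, memory_gib:)
  return [:original, original] if original.fits?(cores, memory_gib)

  candidate = others.find { |h| h.fits?(cores, memory_gib) }
  candidate ? [:migrate, candidate] : [:queue, nil]
end

original = HostInfo.new(name: "host-a", free_cores: 1, free_memory_gib: 4)
spare = HostInfo.new(name: "host-b", free_cores: 8, free_memory_gib: 32)

plan_resume(original, [spare], cores: 2, memory_gib: 8)   # => [:migrate, spare]
plan_resume(original, [spare], cores: 16, memory_gib: 8)  # => [:queue, nil]
```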

### Billing Considerations

**Proposed billing model for stopped VMs:**

| Resource | While Running | While Stopped |
|----------|---------------|---------------|
| vCPU | Billed | Not billed |
| Memory | Billed | Not billed |
| Storage | Billed | Billed (still allocated) |
| IPv4 address | Billed | Billed (still reserved) |
| IPv6 address | Free | Free |

This matches user expectations - you pay for what you're using, but reserved resources (storage, IP) still cost money.
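
To make the table concrete, here is a small worked example. The rates are made-up numbers in micro-USD per hour, chosen only so the arithmetic is easy to follow. They are not Ubicloud prices, and `hourly_cost` is a hypothetical helper.

```ruby
# Hypothetical hourly rates in micro-USD (integers avoid float rounding).
# These numbers are invented for illustration; they are not real prices.
RATES = {vcpu: 10_000, memory_gib: 5_000, storage_gib: 100, ipv4: 4_000}.freeze

# Which line items keep accruing in each state, following the table above.
BILLED_ITEMS = {
  running: %i[vcpu memory_gib storage_gib ipv4],
  stopped: %i[storage_gib ipv4]
}.freeze

# Hourly cost of a VM in the given state, in micro-USD.
def hourly_cost(state, quantities)
  BILLED_ITEMS.fetch(state).sum { |item| RATES.fetch(item) * quantities.fetch(item) }
end

vm = {vcpu: 2, memory_gib: 8, storage_gib: 40, ipv4: 1}
hourly_cost(:running, vm)  # => 68000
hourly_cost(:stopped, vm)  # => 8000
```

With these example rates, stopping the VM would leave only the storage and IPv4 charges.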

### Benefits

1. **Cost savings for users** - Dev/staging VMs that sit idle overnight or on weekends would stop accruing vCPU and memory charges
2. **Better resource utilization** - Stopped VMs free up host capacity for others
3. **CI/CD optimization** - Runners can be stopped between jobs

### Alternatives Considered

1. **Snapshot and delete** - More complex, longer resume time, loses ephemeral state
2. **Hibernate to disk** - Cloud Hypervisor doesn't support this well
3. **Keep billing while stopped** - Poor user experience, not competitive

### Willingness to Contribute

I'm happy to implement this feature and submit a PR. I've reviewed the codebase and believe the implementation is straightforward given the existing `stopped` state foundation.

Before starting, I'd like to:
1. Confirm this aligns with Ubicloud's roadmap
2. Discuss the billing model for stopped VMs
3. Get guidance on handling the "no capacity on resume" edge case

### Related Code

- `prog/vm/metal/nexus.rb` - VM state machine (existing `stopped` state at line 241)
- `model/vm.rb` - VM model with semaphores
- `model/vm_host.rb` - Host resource tracking
- `model/billing_record.rb` - Billing record management
- `routes/project/location/vm.rb` - API routes
- `rhizome/host/lib/cloud_hypervisor.rb` - Cloud Hypervisor integration
