[HIGH] Implement health checks and graceful shutdown for production readiness #30

@superninja-app

Priority

🔴 P1 (High)

Description

The application currently lacks health check endpoints and graceful shutdown handling, which are essential for production deployment with load balancers and orchestrators like Kubernetes.

Missing Features:

  1. Health check endpoint for load balancers
  2. Readiness probe for Kubernetes
  3. Liveness probe for Kubernetes
  4. Graceful shutdown on SIGTERM/SIGINT
  5. Active session completion before shutdown

Acceptance Criteria

  • Add /health endpoint returning overall health status
  • Add /health/ready endpoint for readiness checks
  • Add /health/live endpoint for liveness checks
  • Implement graceful shutdown with configurable timeout
  • Wait for active sessions to complete before shutdown
  • Add metrics for shutdown duration
  • Document health check behavior
  • Test with Kubernetes deployment
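For the "configurable timeout" criterion, one possible shape is sketched below. It is only an illustration: the environment variable name `SHUTDOWN_TIMEOUT_SECS` and both function names are assumptions, not part of the existing codebase. The 30-second default mirrors the value used in the implementation section.

```rust
use std::time::Duration;

/// Default used when no override is supplied (the 30 s used in this issue).
const DEFAULT_SHUTDOWN_TIMEOUT_SECS: u64 = 30;

/// Parse an optional raw setting into a shutdown timeout.
/// Missing, non-numeric, or zero values fall back to the default.
fn parse_shutdown_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_SHUTDOWN_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read the timeout from the hypothetical SHUTDOWN_TIMEOUT_SECS variable.
fn shutdown_timeout_from_env() -> Duration {
    let raw = std::env::var("SHUTDOWN_TIMEOUT_SECS").ok();
    parse_shutdown_timeout(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback rules unit-testable without touching process state.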

Implementation

1. Health Check Endpoints

use axum::{extract::State, http::StatusCode, Json};
use std::collections::HashMap;
use std::sync::atomic::Ordering;
use std::sync::Arc;

/// GET /health - Overall health status
async fn health_check(
    State(state): State<Arc<AppState>>,
) -> (StatusCode, Json<HealthStatus>) {
    let mut status = HealthStatus {
        status: "healthy".to_string(),
        version: env!("CARGO_PKG_VERSION").to_string(),
        uptime_seconds: state.start_time.elapsed().as_secs(),
        checks: HashMap::new(),
    };
    
    // Check Redis
    let redis_check = check_redis(&state.session_manager).await;
    if redis_check.status != "ok" {
        status.status = "degraded".to_string();
    }
    status.checks.insert("redis".to_string(), redis_check);
    
    // Check NATS
    let nats_check = check_nats(&state.media_metrics).await;
    if nats_check.status != "ok" {
        status.status = "degraded".to_string();
    }
    status.checks.insert("nats".to_string(), nats_check);
    
    // Check FairPlay SDK
    #[cfg(feature = "fairplay")]
    if let Some(handler) = &state.media_api_state.fairplay_handler {
        let fairplay_check = check_fairplay(handler).await;
        if fairplay_check.status != "ok" {
            status.status = "degraded".to_string();
        }
        status.checks.insert("fairplay".to_string(), fairplay_check);
    }
    
    // Always return the body so a 503 still shows which check failed
    let code = if status.status == "healthy" {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };
    (code, Json(status))
}

/// GET /health/ready - Readiness probe
async fn readiness_check(
    State(state): State<Arc<AppState>>,
) -> (StatusCode, Json<ReadinessStatus>) {
    // Check if application is ready to serve traffic.
    // Kubernetes treats any 2xx/3xx response as "ready", so a not-ready
    // state must be reported with 503, not only with `ready: false`.
    let not_ready = |reason: &str| {
        (
            StatusCode::SERVICE_UNAVAILABLE,
            Json(ReadinessStatus {
                ready: false,
                reason: Some(reason.to_string()),
            }),
        )
    };
    
    // Check Redis connection
    if state.session_manager.health_check().await.is_err() {
        return not_ready("Redis not available");
    }
    
    // Check if shutting down
    if state.is_shutting_down.load(Ordering::Relaxed) {
        return not_ready("Server is shutting down");
    }
    
    (
        StatusCode::OK,
        Json(ReadinessStatus {
            ready: true,
            reason: None,
        }),
    )
}

/// GET /health/live - Liveness probe
async fn liveness_check() -> Json<LivenessStatus> {
    Json(LivenessStatus { alive: true })
}
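The aggregation rule used by /health (any failing dependency turns the overall status into "degraded" and the response into a 503) can be isolated into a pure function and tested without Redis or NATS. The sketch below is std-only; the `CheckResult` shape and the `overall_status` helper are assumptions, since the issue only shows that each check exposes a `status` string.

```rust
use std::collections::HashMap;

/// Result of probing one dependency (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct CheckResult {
    status: String,
    message: Option<String>,
}

impl CheckResult {
    /// A passing check.
    fn ok() -> Self {
        Self { status: "ok".to_string(), message: None }
    }

    /// A failing check with a human-readable reason.
    fn failed(message: &str) -> Self {
        Self { status: "error".to_string(), message: Some(message.to_string()) }
    }
}

/// Fold per-dependency checks into an overall label and HTTP status code.
/// An empty map counts as healthy because no dependency reported a failure.
fn overall_status(checks: &HashMap<String, CheckResult>) -> (&'static str, u16) {
    if checks.values().all(|c| c.status == "ok") {
        ("healthy", 200)
    } else {
        ("degraded", 503)
    }
}
```

A handler could call `overall_status` after collecting all checks, which keeps the status decision in one place instead of repeating it per dependency.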

2. Graceful Shutdown

use std::net::SocketAddr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::signal;

async fn shutdown_signal(is_shutting_down: Arc<AtomicBool>) {
    let ctrl_c = async {
        signal::ctrl_c()
            .await
            .expect("failed to install Ctrl+C handler");
    };
    
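    #[cfg(unix)]
    let terminate = async {
        signal::unix::signal(signal::unix::SignalKind::terminate())
            .expect("failed to install signal handler")
            .recv()
            .await;
    };
    
    // Non-Unix targets have no SIGTERM; a never-resolving future keeps the
    // select! below compiling, so only Ctrl+C triggers shutdown there.
    #[cfg(not(unix))]
    let terminate = std::future::pending::<()>();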
    
    tokio::select! {
        _ = ctrl_c => {
            log::info!("Received Ctrl+C signal");
        },
        _ = terminate => {
            log::info!("Received SIGTERM signal");
        },
    }
    
    is_shutting_down.store(true, Ordering::Relaxed);
    log::info!("Shutdown signal received, starting graceful shutdown");
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // ... initialization ...
    
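    // Must be the same Arc stored in AppState so /health/ready sees the flag
    let is_shutting_down = Arc::new(AtomicBool::new(false));
    
    // Start server with graceful shutdown (axum 0.6 API; axum 0.7+ uses axum::serve)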
    let server = axum::Server::bind(&addr)
        .serve(app.into_make_service_with_connect_info::<SocketAddr>())
        .with_graceful_shutdown(shutdown_signal(is_shutting_down.clone()));
    
    log::info!("Server listening on {}", addr);
    
    if let Err(e) = server.await {
        log::error!("Server error: {}", e);
    }
    
    // Graceful shutdown sequence
    log::info!("Server stopped accepting new connections");
    
    // Wait for active sessions to complete (with timeout)
    let shutdown_timeout = Duration::from_secs(30);
    log::info!("Waiting up to {:?} for active sessions to complete...", shutdown_timeout);
    
    tokio::select! {
        _ = session_manager.wait_for_completion() => {
            log::info!("All sessions completed gracefully");
        }
        _ = tokio::time::sleep(shutdown_timeout) => {
            let remaining = session_manager.get_active_session_count().await?;
            log::warn!("Shutdown timeout reached, {} sessions still active", remaining);
        }
    }
    
    log::info!("Graceful shutdown complete");
    Ok(())
}
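The `session_manager.wait_for_completion()` call above is not defined in this issue. One way such a drain could work is a counter of in-flight sessions plus a notification when it reaches zero. The std-only sketch below uses a `Mutex` and `Condvar`; `SessionTracker`, `SessionGuard`, and `wait_for_drain` are hypothetical names, and an async implementation would more likely use `tokio::sync::Notify` or a watch channel. The returned elapsed time could also feed the shutdown-duration metric from the acceptance criteria.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

/// Counts in-flight sessions and lets the shutdown path wait for zero.
#[derive(Default)]
struct SessionTracker {
    active: Mutex<usize>,
    drained: Condvar,
}

/// RAII guard: the session counts as active until the guard is dropped.
struct SessionGuard {
    tracker: Arc<SessionTracker>,
}

impl SessionTracker {
    /// Register a new session and return its guard.
    fn start_session(self: &Arc<Self>) -> SessionGuard {
        *self.active.lock().unwrap() += 1;
        SessionGuard { tracker: Arc::clone(self) }
    }

    /// Number of sessions currently in flight.
    fn active_count(&self) -> usize {
        *self.active.lock().unwrap()
    }

    /// Block until no sessions remain or `timeout` elapses.
    /// Returns the time spent waiting and the sessions still active.
    fn wait_for_drain(&self, timeout: Duration) -> (Duration, usize) {
        let started = Instant::now();
        let guard = self.active.lock().unwrap();
        let (guard, _timed_out) = self
            .drained
            .wait_timeout_while(guard, timeout, |active| *active > 0)
            .unwrap();
        (started.elapsed(), *guard)
    }
}

impl Drop for SessionGuard {
    fn drop(&mut self) {
        let mut active = self.tracker.active.lock().unwrap();
        *active -= 1;
        if *active == 0 {
            // Wake any thread blocked in wait_for_drain.
            self.tracker.drained.notify_all();
        }
    }
}
```

Tying the count to a guard means a session that ends through an early return or a panic unwinding still decrements the counter, so the drain cannot hang on a forgotten "session finished" call.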

3. Kubernetes Configuration

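# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: arkavo-media-drm
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  # (the label value below is an assumption)
  selector:
    matchLabels:
      app: arkavo-media-drm
  template:
    metadata:
      labels:
        app: arkavo-media-drm
    spec:
      # Pod-level field, not a container field. It must cover the
      # preStop sleep (15s) plus the shutdown timeout (30s).
      terminationGracePeriodSeconds: 45
      containers:
      - name: arkavo
        image: arkavo/media-drm:latest
        ports:
        - containerPort: 9443
          name: http
        livenessProbe:
          httpGet:
            path: /health/live
            port: 9443
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 9443
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]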
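The timing values in this manifest interact with the 30-second shutdown timeout in the server code. The termination grace period starts counting when the pod begins terminating, so it includes the preStop sleep, and SIGTERM is only delivered after that hook finishes. A small check like the following sketch (all names hypothetical) makes the budget explicit:

```rust
use std::time::Duration;

/// Timing knobs taken from the manifest and the shutdown code in this issue.
struct ShutdownBudget {
    pre_stop_sleep: Duration,
    shutdown_timeout: Duration,
    termination_grace_period: Duration,
}

impl ShutdownBudget {
    /// Time left before the kubelet sends SIGKILL once the preStop hook and
    /// the application's own drain timeout have both run to completion.
    /// Returns None when the grace period is too short to cover both.
    fn headroom(&self) -> Option<Duration> {
        self.termination_grace_period
            .checked_sub(self.pre_stop_sleep + self.shutdown_timeout)
    }
}
```

With the values in this issue (15 s sleep, 30 s timeout, 45 s grace period) the headroom is exactly zero, so raising the shutdown timeout without also raising `terminationGracePeriodSeconds` would let Kubernetes kill the process mid-drain.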

Testing Requirements

  • Test health check endpoints return correct status
  • Test readiness probe during startup
  • Test readiness probe during shutdown
  • Test liveness probe
  • Test graceful shutdown with active sessions
  • Test shutdown timeout behavior
  • Test Kubernetes deployment with probes
  • Load test to verify no dropped connections during shutdown

Documentation Requirements

  • Document health check endpoints
  • Document shutdown behavior
  • Document Kubernetes configuration
  • Add operational runbook
