@sxy-trans-n commented Jun 19, 2025

Integrate WhisperKit and Enhance Swama CLI and Server

Summary

This PR integrates WhisperKit for local speech recognition, adds a transcribe CLI command, and extends the server API with an OpenAI-compatible audio transcription endpoint. The new code paths include error handling and performance optimizations such as model caching.

Key Changes

WhisperKit Integration

  • Added WhisperKit dependency to Package.swift
  • Implemented WhisperKitRunner with clean audio validation and transcription
  • Added model caching and concurrency control via ModelPool
  • Support for multiple response formats (simple, JSON, verbose JSON)

CLI Enhancements

  • New transcribe command with flexible options (model, language, format, temperature, prompt)
  • Enhanced pull command to support WhisperKit models
  • Updated model alias system with WhisperKit model mappings

Server API Extension

  • Added /v1/audio/transcriptions endpoint with OpenAI-compatible interface
  • Multipart form data parsing for audio file uploads
  • Proper error handling and HTTP status codes
  • Support for multiple output formats

Model Management

  • Unified model downloading for WhisperKit models
  • Added WhisperKit-specific model validation and storage
  • Enhanced model metadata generation

Files Changed

Core Package Files

  • swama/Package.swift - Added WhisperKit dependency
  • swama/Package.resolved - Updated dependencies

CLI Commands

  • swama/Sources/Swama/CLI/Command.swift - Registered Transcribe subcommand
  • swama/Sources/Swama/CLI/Pull.swift - Enhanced to support WhisperKit models
  • swama/Sources/Swama/CLI/Transcribe.swift - New transcription CLI command

Model Management

  • swama/Sources/SwamaKit/Model/ModelAliases.swift - Added WhisperKit aliases
  • swama/Sources/SwamaKit/Model/ModelDownloader.swift - WhisperKit download support
  • swama/Sources/SwamaKit/Model/ModelPool.swift - WhisperKit caching and concurrency
  • swama/Sources/SwamaKit/Model/WhisperKitRunner.swift - New WhisperKit runner

Server Components

  • swama/Sources/SwamaKit/Server/HTTPHandler.swift - Added transcription endpoint
  • swama/Sources/SwamaKit/Server/TranscriptionsHandler.swift - New transcription API handler

Testing

  • All existing unit tests pass
  • CLI help and basic functionality verified
  • Code formatted with SwiftFormat
  • Successful build in release mode

Usage Examples

CLI Transcription

# Basic transcription
swama transcribe audio.wav

# With specific model and language
swama transcribe audio.wav -m whisper-base -l en

# Verbose output with timestamps
swama transcribe audio.wav --verbose

# JSON output
swama transcribe audio.wav -f json
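
For batch work, the JSON form of the command can be wrapped in a small loop. The following is a minimal sketch, not part of this PR. It assumes that `swama transcribe <file> -f json` writes its result to standard output and exits non-zero on failure, which the description above does not state explicitly. The function name `transcribe_dir` and the output naming scheme are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: transcribe every .wav file in a directory to JSON.
# Assumption: `swama transcribe <file> -f json` prints the result to stdout.
transcribe_dir() {
  dir=$1
  for f in "$dir"/*.wav; do
    # Skip the unexpanded pattern when the directory has no .wav files.
    [ -e "$f" ] || continue
    out="${f%.wav}.json"
    if swama transcribe "$f" -f json > "$out"; then
      echo "wrote $out"
    else
      echo "failed: $f" >&2
      rm -f "$out"
    fi
  done
}

# Usage: transcribe_dir ./recordings
```

Each result is written next to its source file, and a partial output file is removed if the command fails.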

Model Download

# Download WhisperKit models
swama pull whisper-base
swama pull whisper-small
swama pull whisper-large

Server API

# Transcribe audio via API
curl -X POST http://localhost:28100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-base" \
  -F "response_format=json"
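
For scripted use, the request above can be wrapped so that failures are visible to the caller. This is a rough sketch under stated assumptions: that the endpoint answers a successful request with a 2xx status, and that error details (if any) arrive in the response body. `transcribe_api` and the `SWAMA_URL` variable are hypothetical names; only the URL, path, and form fields come from the example above.

```shell
#!/bin/sh
# Hypothetical wrapper around the transcription endpoint shown above.
# SWAMA_URL is an assumed variable name; the default matches the example.
transcribe_api() {
  file=$1
  model=${2:-whisper-base}
  base=${SWAMA_URL:-http://localhost:28100}
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  body=$(mktemp)
  # -w prints only the HTTP status code; the response body goes to $body.
  status=$(curl -sS -o "$body" -w '%{http_code}' \
    -X POST "$base/v1/audio/transcriptions" \
    -F "file=@$file" \
    -F "model=$model" \
    -F "response_format=json")
  rc=$?
  case $status in
    2??) ok=1 ;;   # assumption: success is reported as a 2xx status
    *)   ok=0 ;;
  esac
  if [ "$rc" -ne 0 ] || [ "$ok" -ne 1 ]; then
    echo "request failed (curl exit $rc, HTTP $status)" >&2
    cat "$body" >&2
    rm -f "$body"
    return 1
  fi
  cat "$body"
  rm -f "$body"
}

# Usage: transcribe_api audio.wav whisper-base
```

Separating the status code from the body lets a calling script distinguish transport errors, HTTP errors, and successful responses without parsing the JSON.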

Backward Compatibility

This PR is backward compatible: existing CLI commands and server endpoints continue to work as before.

@syh-trans-n syh-trans-n left a comment

👏

@sxy-trans-n merged commit da109de into main on Jun 19, 2025
2 checks passed
@sxy-trans-n deleted the audio branch on June 19, 2025 09:02