STT Server - Apple M4 MLX Accelerated Speech-to-Text Server


A high-performance, production-grade Speech-to-Text server built in Rust, specifically optimized for Apple M4 chips with MLX acceleration. This project provides a modern, concurrent alternative to the original whisper.cpp server with enhanced performance and observability.

πŸš€ Features

🏎️ High Performance

  • Apple M4 MLX Acceleration: Native support for Apple's Machine Learning framework
  • True Concurrency: Async/await architecture with configurable worker pools
  • Zero-Copy Audio Processing: Minimal memory allocations in hot paths
  • Connection Pooling: Efficient resource management for high-load scenarios

πŸ“Š Production Ready

  • Comprehensive Observability: Prometheus metrics, structured logging, OpenTelemetry integration
  • Graceful Shutdown: Clean resource cleanup and request completion
  • Configuration Management: YAML configuration with environment variable overrides
  • Health Checks: Built-in health endpoints for load balancers
  • Rate Limiting: Configurable rate limiting and backpressure handling

🎡 Audio Support

  • Multiple Formats: WAV, MP3, FLAC, OGG, M4A support
  • Automatic Conversion: Built-in audio format conversion
  • Quality Enhancement: Noise reduction and audio normalization
  • Streaming Support: Real-time audio processing capabilities

🧠 ML Features

  • Whisper Integration: Based on OpenAI's Whisper model
  • Multiple Model Sizes: Support for tiny, base, small, medium, large models
  • Language Detection: Automatic language detection and multi-language support
  • Confidence Scores: Word-level confidence scores and timestamps
  • Beam Search: Configurable beam search for improved accuracy

πŸ“‹ Requirements

System Requirements

  • OS: macOS 11.0+ (for MLX acceleration) or Linux
  • CPU: Apple Silicon (M1/M2/M3/M4) recommended, x86_64 supported
  • Memory: 4GB+ RAM (8GB+ recommended for large models)
  • Storage: 1GB+ for models and logs

Software Requirements

  • Rust: 1.70+ with Cargo
  • MLX Framework: Automatically handled on Apple Silicon
  • FFmpeg: Optional, for advanced audio format conversion

πŸ› οΈ Installation

Quick Start (Apple Silicon)

# Clone the repository
git clone https://github.com/arkCyber/STT-Server-Rust.git
cd STT-Server-Rust

# Install dependencies and build
cargo build --release --features=production

# Download a Whisper model
mkdir -p models
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O models/ggml-base.en.bin

# Run the server
./target/release/stt-server-rust
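
# In another terminal, check the server is up (health endpoint described below)
curl http://localhost:8080/health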

Development Setup

# Install development dependencies
cargo build --features=dev

# Run with development settings
cargo run -- --dev --log-level debug

Docker Installation with NGINX (Recommended)

The project includes a complete Docker setup with an NGINX reverse proxy, automatic port-conflict resolution, and a monitoring stack:

# Clone the repository
git clone https://github.com/arkCyber/STT-Server-Rust.git
cd STT-Server-Rust

# Download required models
mkdir -p models
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin -O models/ggml-base.bin

# Deploy complete stack with automatic port cleanup
./deploy.sh deploy --rebuild

# Or deploy with backup
./deploy.sh deploy --backup --rebuild

Docker Stack Services

The stack deployed by deploy.sh runs the STT server behind the NGINX reverse proxy, together with the monitoring services; the commands below manage the whole stack.

Docker Management Commands

# Check service status
./deploy.sh status

# View logs
./deploy.sh logs

# Monitor services in real-time
./deploy.sh monitor

# Test deployment
./deploy.sh test

# Stop services
./deploy.sh stop

# Restart services
./deploy.sh restart

# Clean up everything
./deploy.sh cleanup

Simple Docker Installation

# Build Docker image (Apple Silicon)
docker build -t stt-server --platform linux/arm64 .

# Run container (docker run requires an absolute host path for -v bind mounts)
docker run -p 8080:8080 -p 9090:9090 -v "$(pwd)/models:/app/models" stt-server

βš™οΈ Configuration

The server supports multiple configuration methods:

Configuration File (config.yaml)

server:
  host: "0.0.0.0"
  port: 8080
  max_concurrent_requests: 100

model:
  path: "models/ggml-base.en.bin"
  language: "en"

hardware:
  enable_mlx: true
  mlx_device: "auto"
  max_gpu_memory_percent: 80

Environment Variables

export STT_SERVER_HOST=0.0.0.0
export STT_SERVER_PORT=8080
export STT_MODEL_PATH=models/ggml-large.bin
export STT_LOG_LEVEL=info
export STT_ENABLE_MLX=true

Command Line Arguments

./stt-server-rust \
  --host 0.0.0.0 \
  --port 8080 \
  --model models/ggml-base.en.bin \
  --threads 4 \
  --enable-mlx \
  --metrics

🌐 API Usage

Health Check

curl http://localhost:8080/health

Response:

{
  "status": "ok",
  "version": "0.1.0",
  "uptime": "12h 34m 56s",
  "models_loaded": 1
}

Speech-to-Text Transcription

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "response_format=json"

Response:

{
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, this is a test transcription.",
      "confidence": 0.98,
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.99},
        {"word": "this", "start": 0.6, "end": 0.8, "confidence": 0.97}
      ]
    }
  ]
}

Supported Response Formats

  • JSON (default): Structured response with segments and timestamps
  • Text: Plain text transcription
  • SRT: SubRip subtitle format
  • VTT: WebVTT subtitle format
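
Any of these can be selected with the response_format form field. For example, a subtitle request (same endpoint and multipart fields as the transcription example above) might look like:

# Request SRT subtitles instead of JSON
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "response_format=srt" \
  -o audio.srt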

Advanced Options

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "language=auto" \
  -F "response_format=json" \
  -F "temperature=0.0" \
  -F "beam_size=5" \
  -F "enable_timestamps=true" \
  -F "enable_word_confidence=true"

πŸ“Š Monitoring and Observability

Prometheus Metrics

Access metrics at http://localhost:9090/metrics:

# HELP stt_requests_total Total number of STT requests
# TYPE stt_requests_total counter
stt_requests_total{status="success"} 1234
stt_requests_total{status="error"} 5

# HELP stt_inference_duration_seconds Time spent on STT inference
# TYPE stt_inference_duration_seconds histogram
stt_inference_duration_seconds_bucket{le="0.1"} 100
stt_inference_duration_seconds_bucket{le="0.5"} 800
stt_inference_duration_seconds_bucket{le="1.0"} 950

# HELP stt_mlx_memory_usage_bytes MLX memory usage in bytes
# TYPE stt_mlx_memory_usage_bytes gauge
stt_mlx_memory_usage_bytes 2147483648
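
To watch a single series during a load test instead of the full scrape, a plain curl-and-grep loop is enough (the metric name is one of those listed above):

# Print the request counters once per second
while true; do
  curl -s http://localhost:9090/metrics | grep '^stt_requests_total'
  sleep 1
done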

Structured Logging

{
  "timestamp": "2024-01-20T12:00:00.000Z",
  "level": "INFO",
  "target": "stt_server::inference",
  "message": "STT inference completed successfully",
  "request_id": "req_1234567890",
  "audio_duration": 3.5,
  "inference_time": 0.85,
  "model_name": "base.en",
  "language": "en"
}

Health Monitoring

# Basic health check
curl http://localhost:8080/health

# Detailed health with metrics
curl http://localhost:8080/health/detailed

πŸš€ Performance Optimization

Apple M4 Specific Optimizations

hardware:
  enable_mlx: true
  mlx_device: "gpu"  # Use M4 GPU
  enable_metal_performance_shaders: true
  enable_neural_engine: true
  max_gpu_memory_percent: 90

  memory:
    max_memory_mb: 16384  # 16GB for M4 Max
    enable_preallocation: true
    enable_memory_mapping: true

Performance Tuning

server:
  max_concurrent_requests: 200  # Increase for high load
  request_timeout_seconds: 120

inference:
  num_threads: 8  # Match M4 performance cores
  batch_size: 4   # Process multiple requests together
  enable_beam_search: true
  beam_width: 5

audio:
  enable_format_conversion: true
  enable_noise_reduction: false  # Disable for speed

Benchmarking

# Install k6 for load testing
brew install k6

# Run benchmark
k6 run bench.js
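
For a rough smoke test without k6, a sequential curl loop against the transcription endpoint works too; a minimal sketch, assuming a local sample file audio.wav:

# Time 10 sequential transcription requests
time for i in $(seq 1 10); do
  curl -s -o /dev/null -X POST http://localhost:8080/v1/audio/transcriptions \
    -F "file=@audio.wav" -F "response_format=text"
done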

πŸ”§ Development

Building from Source

# Debug build
cargo build

# Release build with full optimization
cargo build --release --features=production

# Apple M4 optimized build
RUSTFLAGS="-C target-cpu=apple-m4" cargo build --release
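
# Note: target-cpu=apple-m4 needs a recent Rust toolchain whose LLVM knows the M4;
# if the flag is rejected, target-cpu=native is a safe fallback on Apple Silicon
RUSTFLAGS="-C target-cpu=native" cargo build --release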

Running Tests

# Unit tests
cargo test

# Integration tests
cargo test --test integration

# Benchmark tests
cargo bench

Code Quality

# Format code
cargo fmt

# Lint code
cargo clippy -- -D warnings

# Security audit (requires cargo-audit: cargo install cargo-audit)
cargo audit

πŸ“š Architecture

High-Level Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   HTTP Client   β”‚ => β”‚   Axum Server   β”‚ => β”‚ Request Router  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                        β”‚
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
                       β”‚ Rate Limiter    β”‚ <===========β”˜
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚ Audio Processor β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚ MLX Inference   β”‚
                       β”‚ Engine          β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚ Response        β”‚
                       β”‚ Formatter       β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Concurrency Model

  • Async HTTP Server: Axum with Tokio runtime
  • Worker Pool: Configurable number of inference workers
  • Request Queue: Bounded queue with backpressure
  • Resource Pool: Model and buffer pooling for efficiency

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Guidelines

  1. Follow the .cursorrules coding standards
  2. Add comprehensive tests for new features
  3. Update documentation for API changes
  4. Ensure all CI checks pass

Reporting Issues

Please use GitHub Issues to report bugs or request features.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ“ž Support


Built with ❀️ by arkSong for the Apple M4 era
