A high-performance, production-grade Speech-to-Text server built in Rust, specifically optimized for Apple M4 chips with MLX acceleration. This project provides a modern, concurrent alternative to the original whisper.cpp server with enhanced performance and observability.
- Apple M4 MLX Acceleration: Native support for Apple's MLX machine learning framework
- True Concurrency: Async/await architecture with configurable worker pools
- Zero-Copy Audio Processing: Minimal memory allocations in hot paths
- Connection Pooling: Efficient resource management for high-load scenarios
- Comprehensive Observability: Prometheus metrics, structured logging, OpenTelemetry integration
- Graceful Shutdown: Clean resource cleanup and request completion (see the sketch after this list)
- Configuration Management: YAML configuration with environment variable overrides
- Health Checks: Built-in health endpoints for load balancers
- Rate Limiting: Configurable rate limiting and backpressure handling
- Multiple Formats: WAV, MP3, FLAC, OGG, M4A support
- Automatic Conversion: Built-in audio format conversion
- Quality Enhancement: Noise reduction and audio normalization
- Streaming Support: Real-time audio processing capabilities
- Whisper Integration: Based on OpenAI's Whisper model
- Multiple Model Sizes: Support for tiny, base, small, medium, large models
- Language Detection: Automatic language detection and multi-language support
- Confidence Scores: Word-level confidence scores and timestamps
- Beam Search: Configurable beam search for improved accuracy
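The concurrency and graceful-shutdown features above are built on the Axum/Tokio stack described under Architecture. As a minimal, illustrative sketch of that pattern (the route, handler, and bind address here are placeholders, not this project's actual code):

```rust
use axum::{routing::get, Router};
use tokio::{net::TcpListener, signal};

// Placeholder handler; the real server exposes /health and the /v1 audio API.
async fn health() -> &'static str {
    "ok"
}

// Resolves on Ctrl-C, telling Axum to stop accepting new connections
// while in-flight requests are allowed to finish.
async fn shutdown_signal() {
    signal::ctrl_c().await.expect("failed to install Ctrl-C handler");
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new().route("/health", get(health));
    let listener = TcpListener::bind("0.0.0.0:8080").await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown_signal())
        .await?;
    Ok(())
}
```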
- OS: macOS 11.0+ (for MLX acceleration) or Linux
- CPU: Apple Silicon (M1/M2/M3/M4) recommended, x86_64 supported
- Memory: 4GB+ RAM (8GB+ recommended for large models)
- Storage: 1GB+ for models and logs
- Rust: 1.70+ with Cargo
- MLX Framework: Automatically handled on Apple Silicon
- FFmpeg: Optional, for advanced audio format conversion
# Clone the repository
git clone https://github.com/arkCyber/STT-Server-Rust.git
cd STT-Server-Rust
# Install dependencies and build
cargo build --release --features=production
# Download a Whisper model
mkdir -p models
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O models/ggml-base.en.bin
# Run the server
./target/release/stt-server-rust

# Install development dependencies
cargo build --features=dev
# Run with development settings
cargo run -- --dev --log-level debug

The project includes a complete Docker setup with NGINX reverse proxy, automatic port conflict resolution, and a monitoring stack:
# Clone the repository
git clone https://github.com/arkCyber/STT-Server-Rust.git
cd STT-Server-Rust
# Download required models
mkdir -p models
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin -O models/ggml-base.bin
# Deploy complete stack with automatic port cleanup
./deploy.sh deploy --rebuild
# Or deploy with backup
./deploy.sh deploy --backup --rebuild

Once deployed, the following services are available:

- STT Server: http://localhost:8080 (NGINX proxy)
- Prometheus: http://localhost:9090 (metrics)
- Grafana: http://localhost:3000 (dashboards, admin/admin123)
- Redis: localhost:6379 (caching)
# Check service status
./deploy.sh status
# View logs
./deploy.sh logs
# Monitor services in real-time
./deploy.sh monitor
# Test deployment
./deploy.sh test
# Stop services
./deploy.sh stop
# Restart services
./deploy.sh restart
# Clean up everything
./deploy.sh cleanup

# Build Docker image (Apple Silicon)
docker build -t stt-server --platform linux/arm64 .
# Run container
docker run -p 8080:8080 -p 9090:9090 -v ./models:/app/models stt-server

The server supports multiple configuration methods: a YAML configuration file, environment variable overrides, and command-line flags.
server:
  host: "0.0.0.0"
  port: 8080
  max_concurrent_requests: 100

model:
  path: "models/ggml-base.en.bin"
  language: "en"

hardware:
  enable_mlx: true
  mlx_device: "auto"
  max_gpu_memory_percent: 80

Environment variable overrides:

export STT_SERVER_HOST=0.0.0.0
export STT_SERVER_PORT=8080
export STT_MODEL_PATH=models/ggml-large.bin
export STT_LOG_LEVEL=info
export STT_ENABLE_MLX=true

Command-line flags:

./stt-server-rust \
--host 0.0.0.0 \
--port 8080 \
--model models/ggml-base.en.bin \
--threads 4 \
--enable-mlx \
--metrics

Health check:

curl http://localhost:8080/health

{
"status": "ok",
"version": "0.1.0",
"uptime": "12h 34m 56s",
"models_loaded": 1
}

Transcription request:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "language=en" \
-F "response_format=json"Response:
{
"text": "Hello, this is a test transcription.",
"language": "en",
"duration": 3.5,
"segments": [
{
"id": 0,
"start": 0.0,
"end": 3.5,
"text": "Hello, this is a test transcription.",
"confidence": 0.98,
"words": [
{"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.99},
{"word": "this", "start": 0.6, "end": 0.8, "confidence": 0.97}
]
}
]
}

Supported response formats:

- JSON (default): Structured response with segments and timestamps
- Text: Plain text transcription
- SRT: SubRip subtitle format
- VTT: WebVTT subtitle format
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "language=auto" \
-F "response_format=json" \
-F "temperature=0.0" \
-F "beam_size=5" \
-F "enable_timestamps=true" \
-F "enable_word_confidence=true"Access metrics at http://localhost:9090/metrics:
# HELP stt_requests_total Total number of STT requests
# TYPE stt_requests_total counter
stt_requests_total{status="success"} 1234
stt_requests_total{status="error"} 5
# HELP stt_inference_duration_seconds Time spent on STT inference
# TYPE stt_inference_duration_seconds histogram
stt_inference_duration_seconds_bucket{le="0.1"} 100
stt_inference_duration_seconds_bucket{le="0.5"} 800
stt_inference_duration_seconds_bucket{le="1.0"} 950
# HELP stt_mlx_memory_usage_bytes MLX memory usage in bytes
# TYPE stt_mlx_memory_usage_bytes gauge
stt_mlx_memory_usage_bytes 2147483648
Structured log entries are emitted as JSON, for example:

{
"timestamp": "2024-01-20T12:00:00.000Z",
"level": "INFO",
"target": "stt_server::inference",
"message": "STT inference completed successfully",
"request_id": "req_1234567890",
"audio_duration": 3.5,
"inference_time": 0.85,
"model_name": "base.en",
"language": "en"
}

# Basic health check
curl http://localhost:8080/health
# Detailed health with metrics
curl http://localhost:8080/health/detailed

Hardware acceleration settings for Apple Silicon:

hardware:
  enable_mlx: true
  mlx_device: "gpu"                # Use M4 GPU
  enable_metal_performance_shaders: true
  enable_neural_engine: true
  max_gpu_memory_percent: 90

memory:
  max_memory_mb: 16384             # 16GB for M4 Max
  enable_preallocation: true
  enable_memory_mapping: true

Performance tuning for high-throughput workloads:

server:
  max_concurrent_requests: 200     # Increase for high load
  request_timeout_seconds: 120

inference:
  num_threads: 8                   # Match M4 performance cores
  batch_size: 4                    # Process multiple requests together
  enable_beam_search: true
  beam_width: 5

audio:
  enable_format_conversion: true
  enable_noise_reduction: false    # Disable for speed

# Install k6 for load testing
brew install k6
# Run benchmark
k6 run bench.js

# Debug build
cargo build
# Release build with full optimization
cargo build --release --features=production
# Apple M4 optimized build
RUSTFLAGS="-C target-cpu=apple-m4" cargo build --release

# Unit tests
cargo test
# Integration tests
cargo test --test integration
# Benchmark tests
cargo bench

# Format code
cargo fmt
# Lint code
cargo clippy -- -D warnings
# Security audit
cargo audit

Request flow through the server:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   HTTP Client   │ => │   Axum Server   │ => │ Request Router  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
┌─────────────────┐                                    │
│  Rate Limiter   │ <─────────────────────────────────┘
└─────────────────┘
         │
┌─────────────────┐
│ Audio Processor │
└─────────────────┘
         │
┌─────────────────┐
│  MLX Inference  │
│     Engine      │
└─────────────────┘
         │
┌─────────────────┐
│    Response     │
│    Formatter    │
└─────────────────┘
- Async HTTP Server: Axum with Tokio runtime
- Worker Pool: Configurable number of inference workers
- Request Queue: Bounded queue with backpressure
- Resource Pool: Model and buffer pooling for efficiency
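A rough sketch of how a bounded request queue and a fixed worker pool can provide backpressure with Tokio; the `Job` type, queue size, worker count, and `run_inference` stub are illustrative assumptions rather than the server's actual internals:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot, Mutex};

// Hypothetical unit of work: raw audio in, transcript out.
struct Job {
    audio: Vec<u8>,
    respond_to: oneshot::Sender<String>,
}

// Stand-in for the MLX-backed Whisper inference call.
async fn run_inference(_audio: &[u8]) -> String {
    "transcript".to_string()
}

#[tokio::main]
async fn main() {
    // Bounded channel: when it is full, send().await waits, which is the
    // backpressure signal handlers use to slow down or reject new requests.
    let (tx, rx) = mpsc::channel::<Job>(100);
    let rx = Arc::new(Mutex::new(rx));

    // Fixed pool of inference workers pulling from the shared queue.
    for _ in 0..4 {
        let rx = Arc::clone(&rx);
        tokio::spawn(async move {
            loop {
                // Lock only long enough to pull one job off the queue,
                // so the other workers keep running concurrently.
                let job = { rx.lock().await.recv().await };
                let Some(job) = job else { break }; // queue closed: stop worker
                let text = run_inference(&job.audio).await;
                let _ = job.respond_to.send(text);
            }
        });
    }

    // An HTTP handler would enqueue work roughly like this:
    let (resp_tx, resp_rx) = oneshot::channel();
    tx.send(Job { audio: vec![0u8; 16_000], respond_to: resp_tx })
        .await
        .expect("request queue closed");
    println!("{}", resp_rx.await.expect("worker dropped the job"));
}
```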
We welcome contributions! Please see our Contributing Guide for details.
- Follow the .cursorrules coding standards
- Add comprehensive tests for new features
- Update documentation for API changes
- Ensure all CI checks pass
Please use GitHub Issues to report bugs or request features.
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper for the base STT model
- whisper.cpp for the C++ implementation inspiration
- Apple MLX for machine learning acceleration
- Rust Community for the amazing ecosystem
- π§ Email: arksong2018@gmail.com
- π¬ Discord: STT Server Community
- π Documentation: docs.stt-server.example.com
Built with ❤️ by arkSong for the Apple M4 era