A high-performance, production-grade Speech-to-Text server built in Rust, specifically optimized for Apple M4 chips with MLX acceleration. This project provides a modern, concurrent alternative to the original whisper.cpp server with enhanced performance and observability.
- Apple M4 MLX Acceleration: Native support for Apple's MLX machine learning framework
- True Concurrency: Async/await architecture with configurable worker pools
- Zero-Copy Audio Processing: Minimal memory allocations in hot paths
- Connection Pooling: Efficient resource management for high-load scenarios
- Comprehensive Observability: Prometheus metrics, structured logging, OpenTelemetry integration
- Graceful Shutdown: Clean resource cleanup and request completion (see the sketch after this list)
- Configuration Management: YAML configuration with environment variable overrides
- Health Checks: Built-in health endpoints for load balancers
- Rate Limiting: Configurable rate limiting and backpressure handling
- Multiple Formats: WAV, MP3, FLAC, OGG, M4A support
- Automatic Conversion: Built-in audio format conversion
- Quality Enhancement: Noise reduction and audio normalization
- Streaming Support: Real-time audio processing capabilities
- Whisper Integration: Based on OpenAI's Whisper model
- Multiple Model Sizes: Support for tiny, base, small, medium, large models
- Language Detection: Automatic language detection and multi-language support
- Confidence Scores: Word-level confidence scores and timestamps
- Beam Search: Configurable beam search for improved accuracy
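The concurrency and graceful-shutdown features above are built on the Axum/Tokio stack described under Architecture. As a minimal, illustrative sketch of that pattern (the route, handler, and bind address here are placeholders, not this project's actual code):

```rust
use axum::{routing::get, Router};
use tokio::{net::TcpListener, signal};

// Placeholder handler; the real server exposes /health and the /v1 audio API.
async fn health() -> &'static str {
    "ok"
}

// Resolves on Ctrl-C, telling Axum to stop accepting new connections
// while in-flight requests are allowed to finish.
async fn shutdown_signal() {
    signal::ctrl_c().await.expect("failed to install Ctrl-C handler");
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new().route("/health", get(health));
    let listener = TcpListener::bind("0.0.0.0:8080").await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown_signal())
        .await?;
    Ok(())
}
```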
- OS: macOS 11.0+ (for MLX acceleration) or Linux
- CPU: Apple Silicon (M1/M2/M3/M4) recommended, x86_64 supported
- Memory: 4GB+ RAM (8GB+ recommended for large models)
- Storage: 1GB+ for models and logs
- Rust: 1.70+ with Cargo
- MLX Framework: Automatically handled on Apple Silicon
- FFmpeg: Optional, for advanced audio format conversion
# Clone the repository
git clone https://github.com/arkCyber/STT-Server-Rust.git
cd STT-Server-Rust
# Install dependencies and build
cargo build --release --features=production
# Download a Whisper model
mkdir -p models
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O models/ggml-base.en.bin
# Run the server
./target/release/stt-server-rust

# Install development dependencies
cargo build --features=dev
# Run with development settings
cargo run -- --dev --log-level debug

The project includes a complete Docker setup with NGINX reverse proxy, automatic port conflict resolution, and a monitoring stack:
# Clone the repository
git clone https://github.com/arkCyber/STT-Server-Rust.git
cd STT-Server-Rust
# Download required models
mkdir -p models
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin -O models/ggml-base.bin
# Deploy complete stack with automatic port cleanup
./deploy.sh deploy --rebuild
# Or deploy with backup
./deploy.sh deploy --backup --rebuild

Once deployed, the following services are available:

- STT Server: http://localhost:8080 (NGINX proxy)
- Prometheus: http://localhost:9090 (metrics)
- Grafana: http://localhost:3000 (dashboards, admin/admin123)
- Redis: localhost:6379 (caching)
# Check service status
./deploy.sh status
# View logs
./deploy.sh logs
# Monitor services in real-time
./deploy.sh monitor
# Test deployment
./deploy.sh test
# Stop services
./deploy.sh stop
# Restart services
./deploy.sh restart
# Clean up everything
./deploy.sh cleanup

# Build Docker image (Apple Silicon)
docker build -t stt-server --platform linux/arm64 .
# Run container
docker run -p 8080:8080 -p 9090:9090 -v ./models:/app/models stt-server

The server supports multiple configuration methods: a YAML configuration file, environment variable overrides, and command-line flags.
server:
  host: "0.0.0.0"
  port: 8080
  max_concurrent_requests: 100

model:
  path: "models/ggml-base.en.bin"
  language: "en"

hardware:
  enable_mlx: true
  mlx_device: "auto"
  max_gpu_memory_percent: 80

Environment variable overrides:

export STT_SERVER_HOST=0.0.0.0
export STT_SERVER_PORT=8080
export STT_MODEL_PATH=models/ggml-large.bin
export STT_LOG_LEVEL=info
export STT_ENABLE_MLX=true

Command-line flags:

./stt-server-rust \
--host 0.0.0.0 \
--port 8080 \
--model models/ggml-base.en.bin \
--threads 4 \
--enable-mlx \
--metrics

Health check:

curl http://localhost:8080/health

{
"status": "ok",
"version": "0.1.0",
"uptime": "12h 34m 56s",
"models_loaded": 1
}

Transcription request:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "language=en" \
-F "response_format=json"Response:
{
"text": "Hello, this is a test transcription.",
"language": "en",
"duration": 3.5,
"segments": [
{
"id": 0,
"start": 0.0,
"end": 3.5,
"text": "Hello, this is a test transcription.",
"confidence": 0.98,
"words": [
{"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.99},
{"word": "this", "start": 0.6, "end": 0.8, "confidence": 0.97}
]
}
]
}

Supported response formats:

- JSON (default): Structured response with segments and timestamps
- Text: Plain text transcription
- SRT: SubRip subtitle format
- VTT: WebVTT subtitle format
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "language=auto" \
-F "response_format=json" \
-F "temperature=0.0" \
-F "beam_size=5" \
-F "enable_timestamps=true" \
-F "enable_word_confidence=true"Access metrics at http://localhost:9090/metrics:
# HELP stt_requests_total Total number of STT requests
# TYPE stt_requests_total counter
stt_requests_total{status="success"} 1234
stt_requests_total{status="error"} 5
# HELP stt_inference_duration_seconds Time spent on STT inference
# TYPE stt_inference_duration_seconds histogram
stt_inference_duration_seconds_bucket{le="0.1"} 100
stt_inference_duration_seconds_bucket{le="0.5"} 800
stt_inference_duration_seconds_bucket{le="1.0"} 950
# HELP stt_mlx_memory_usage_bytes MLX memory usage in bytes
# TYPE stt_mlx_memory_usage_bytes gauge
stt_mlx_memory_usage_bytes 2147483648
Structured log entries are emitted as JSON, for example:

{
"timestamp": "2024-01-20T12:00:00.000Z",
"level": "INFO",
"target": "stt_server::inference",
"message": "STT inference completed successfully",
"request_id": "req_1234567890",
"audio_duration": 3.5,
"inference_time": 0.85,
"model_name": "base.en",
"language": "en"
}

# Basic health check
curl http://localhost:8080/health
# Detailed health with metrics
curl http://localhost:8080/health/detailed

Hardware acceleration settings for Apple Silicon:

hardware:
  enable_mlx: true
  mlx_device: "gpu"                # Use M4 GPU
  enable_metal_performance_shaders: true
  enable_neural_engine: true
  max_gpu_memory_percent: 90

memory:
  max_memory_mb: 16384             # 16GB for M4 Max
  enable_preallocation: true
  enable_memory_mapping: true

Performance tuning for high-throughput workloads:

server:
  max_concurrent_requests: 200     # Increase for high load
  request_timeout_seconds: 120

inference:
  num_threads: 8                   # Match M4 performance cores
  batch_size: 4                    # Process multiple requests together
  enable_beam_search: true
  beam_width: 5

audio:
  enable_format_conversion: true
  enable_noise_reduction: false    # Disable for speed

# Install k6 for load testing
brew install k6
# Run benchmark
k6 run bench.js

# Debug build
cargo build
# Release build with full optimization
cargo build --release --features=production
# Apple M4 optimized build
RUSTFLAGS="-C target-cpu=apple-m4" cargo build --release

# Unit tests
cargo test
# Integration tests
cargo test --test integration
# Benchmark tests
cargo bench

# Format code
cargo fmt
# Lint code
cargo clippy -- -D warnings
# Security audit
cargo audit

Request flow through the server:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   HTTP Client   │ => │   Axum Server   │ => │ Request Router  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
┌─────────────────┐                                    │
│  Rate Limiter   │ <─────────────────────────────────┘
└─────────────────┘
         │
┌─────────────────┐
│ Audio Processor │
└─────────────────┘
         │
┌─────────────────┐
│  MLX Inference  │
│     Engine      │
└─────────────────┘
         │
┌─────────────────┐
│    Response     │
│    Formatter    │
└─────────────────┘
- Async HTTP Server: Axum with Tokio runtime
- Worker Pool: Configurable number of inference workers
- Request Queue: Bounded queue with backpressure
- Resource Pool: Model and buffer pooling for efficiency
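A rough sketch of how a bounded request queue and a fixed worker pool can provide backpressure with Tokio; the `Job` type, queue size, worker count, and `run_inference` stub are illustrative assumptions rather than the server's actual internals:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot, Mutex};

// Hypothetical unit of work: raw audio in, transcript out.
struct Job {
    audio: Vec<u8>,
    respond_to: oneshot::Sender<String>,
}

// Stand-in for the MLX-backed Whisper inference call.
async fn run_inference(_audio: &[u8]) -> String {
    "transcript".to_string()
}

#[tokio::main]
async fn main() {
    // Bounded channel: when it is full, send().await waits, which is the
    // backpressure signal handlers use to slow down or reject new requests.
    let (tx, rx) = mpsc::channel::<Job>(100);
    let rx = Arc::new(Mutex::new(rx));

    // Fixed pool of inference workers pulling from the shared queue.
    for _ in 0..4 {
        let rx = Arc::clone(&rx);
        tokio::spawn(async move {
            loop {
                // Lock only long enough to pull one job off the queue,
                // so the other workers keep running concurrently.
                let job = { rx.lock().await.recv().await };
                let Some(job) = job else { break }; // queue closed: stop worker
                let text = run_inference(&job.audio).await;
                let _ = job.respond_to.send(text);
            }
        });
    }

    // An HTTP handler would enqueue work roughly like this:
    let (resp_tx, resp_rx) = oneshot::channel();
    tx.send(Job { audio: vec![0u8; 16_000], respond_to: resp_tx })
        .await
        .expect("request queue closed");
    println!("{}", resp_rx.await.expect("worker dropped the job"));
}
```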
We welcome contributions! Please see our Contributing Guide for details.
- Follow the .cursorrules coding standards
- Add comprehensive tests for new features
- Update documentation for API changes
- Ensure all CI checks pass
Please use GitHub Issues to report bugs or request features.
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper for the base STT model
- whisper.cpp for the C++ implementation inspiration
- Apple MLX for machine learning acceleration
- Rust Community for the amazing ecosystem
- π§ Email: arksong2018@gmail.com
- π¬ Discord: STT Server Community
- π Documentation: docs.stt-server.example.com
Built with ❤️ by arkSong for the Apple M4 era