A fast, containerized, and scalable API service for automated OCR and PII redaction on document batches.
Redact is a production-ready microservice built with FastAPI for secure, asynchronous data redaction. It is designed to be easily deployed with Docker and scaled horizontally via a worker-based architecture (Redis/RQ).
Showcases:
- A robust, modern Python API using FastAPI (complete with automatic documentation).
- A clear separation of concerns (API, Core, Services, Workers).
- Scalable asynchronous task processing.
- Containerization with Docker.
- Database interaction via SQLAlchemy.
- RESTful API: Clear endpoints for submitting and retrieving redaction tasks.
- Asynchronous Processing: Long-running redaction tasks are handled by a dedicated worker pool, keeping the API fast and responsive.
- Persistent Storage: Uses SQLAlchemy for task metadata and Redis for task queuing.
- Containerized: Built for easy deployment with a Dockerfile.
- Benchmarked: Includes load testing scripts using Locust and benchmarks for performance analysis.
Current: Model development in PyTorch (building the redaction model).
Next: Integration and end-to-end testing.
Process batches of document images, automatically detect and redact sensitive information (PII).
[Architecture diagram/description: TBD]
- Batch image upload
- Asynchronous processing
- Custom redaction model
- Enable batch prediction
- REST API
- Add auth
- Results retrieval
- Deploy to Docker/Kubernetes
- Phase 1: Basic upload
- Phase 2: Job tracking
- Phase 3: Async infrastructure
- Phase 4: Model development (IN PROGRESS)
- Phase 5: Integration
- Phase 6: Deployment
[Add benchmarks as you build]
| Endpoint | Operation | Payload Size | Concurrent Users | Requests/sec | Avg Latency (ms) | P95 Latency (ms) | Error Rate | Notes |
|---|---|---|---|---|---|---|---|---|
| POST /predict | Create | 1MB image | 1 | 0.66 | 1450 | 1600 | 0% | Includes file validation, disk write, Redis |
| GET /predict/{id} | Read | N/A | 10 | | | | | Fetches job status from Redis |
| POST /items | Create | 512B JSON | | | | | | Basic DB insert |
| GET /items/{id} | Read | N/A | | | | | | Add caching notes if applicable |
| PUT /items/{id} | Update | 1KB JSON | | | | | | |
| DELETE /items/{id} | Delete | N/A | | | | | | |
Legend:
- Payload Size: Size of file or JSON sent in the request.
- Concurrent Users: Simulated users (e.g., in Locust).
- Requests/sec: Throughput under load.
- Latency: Time from request to response (P95 = 95th percentile).
- [Model: TBD post Phase 4]
| Model Name | Input Size | Avg Inference Time (ms) | Throughput (req/sec) | Peak RAM Usage | Device | Notes |
|---|---|---|---|---|---|---|
| simulate_model_work | N/A (simulated) | 30000 (sleep) | 0.03 | N/A | CPU | Simulated workload |
| real_model.onnx | 224x224 image | | | | CPU | Replace with real benchmark |
| resnet50 | 512x512 image | | | | GPU | ONNXRuntime on GPU |
| custom_model.pt | Variable | | | | CPU/GPU | Fill in after deployment |
Legend:
- Inference Time: Time to run prediction (ms).
- Throughput: How many predictions/sec the model can handle.
- Peak RAM: Max memory used during inference.
- Device: CPU or GPU.
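The inference-time and throughput columns can be filled in with a small timing harness like the one below (the workload passed to `benchmark` is a placeholder; swap in the real ONNX/PyTorch forward pass):

```python
import time

def benchmark(run_inference, n_runs: int = 50) -> dict:
    """Measure average latency (ms) and throughput (req/sec) of a callable."""
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    elapsed = time.perf_counter() - start
    return {"avg_ms": elapsed / n_runs * 1000, "throughput": n_runs / elapsed}

# Placeholder standing in for a model call (e.g. an ONNXRuntime session.run).
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
```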
```bash
git clone https://github.com/yourname/redaction-api.git
cd redaction-api

python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
cp .env.example .env

uvicorn app.main:app --reload

curl -X POST http://localhost:8000/predict \
  -F "files=@cat.jpg"
curl http://localhost:8000/predict/abc123

pytest tests/
./scripts/benchmark
```
[Link to /docs once deployed]
When the server is running locally, visit:
- http://localhost:8000/docs → Swagger UI
- http://localhost:8000/redoc → ReDoc
These provide interactive documentation of all available endpoints with live testing.
Technology choices:
- FastAPI: mainly because it's lighter than Django.
- Postgres: already familiar with the DBMS, so no external onboarding needed.
- Redis: chosen for its persistence, speed, and distributed support over plain multiprocessing. As a queue backend it scales and integrates well with Celery and RQ; Kafka and RabbitMQ were ruled out as overkill for a project of this scope.
- RQ: initially leaned toward Celery, but after investigation it proved too heavyweight for this use case. RQ avoids the setup overhead and keeps things simple and quick.
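With RQ, the worker side reduces to a plain function that the API enqueues. A sketch (the redaction logic is a placeholder, and the commented enqueue call assumes a reachable Redis instance):

```python
def redact_batch(job_id: str, image_paths: list[str]) -> dict:
    # Worker-side task: run the redaction model over each image.
    # Placeholder logic; the real worker loads the model and writes
    # redacted outputs to storage.
    results = {path: "redacted" for path in image_paths}
    return {"job_id": job_id, "status": "done", "results": results}

# API side (illustration only; requires a running Redis server):
# from redis import Redis
# from rq import Queue
#
# q = Queue(connection=Redis())
# q.enqueue(redact_batch, "abc123", ["cat.jpg"])
```

Keeping the task a plain importable function is what lets RQ pickle and dispatch it to any worker process.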
Main directories and their purpose for anyone looking to understand or extend the codebase.
redact/
├── benchmarks/        # Load testing and performance scripts (Locust, custom benchmarks)
├── redact/            # Main source code package
│   ├── api/           # FastAPI router definitions and main application entrypoint
│   ├── core/          # Configuration, logging, database/redis connection handling
│   ├── services/      # Business logic layer (e.g., storage abstraction)
│   ├── sqlschema/     # SQLAlchemy model definitions
│   └── workers/       # Asynchronous task processing logic (inference, main worker)
├── scripts/           # Helper scripts for development (benchmarking, service setup)
├── tests/             # Unit and integration tests (uses pytest)
└── requirements.txt   # All project dependencies
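A minimal sketch of what a task-metadata model in `sqlschema/` might look like (the table and column names are illustrative, not the actual schema; SQLite is used here only for demonstration, while the service itself targets Postgres):

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class RedactionJob(Base):
    # Illustrative task-metadata table; the real columns may differ.
    __tablename__ = "redaction_jobs"
    id = Column(String, primary_key=True)
    status = Column(String, default="queued")

# In-memory SQLite just to exercise the model end to end.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(RedactionJob(id="abc123"))
    session.commit()
    job_status = session.get(RedactionJob, "abc123").status
```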
- Web UI
- Multiple output formats
- Batch job scheduling
This project is licensed under the MIT License β see the LICENSE file for details.