Redact

A fast, containerized, and scalable API service for automated OCR and PII redaction on document batches.

Redact is a production-ready microservice built with FastAPI that handles the secure, asynchronous processing of data redaction tasks. It is designed to be deployed with Docker and scaled via a worker-based architecture (Redis/RQ).

Showcases:

  • A robust, modern Python API using FastAPI (complete with automatic documentation); a minimal endpoint sketch appears below this list.
  • A clear separation of concerns (API, Core, Services, Workers).
  • Scalable asynchronous task processing.
  • Containerization with Docker.
  • Database interaction via SQLAlchemy.
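
For orientation, here is a minimal sketch of how the /predict endpoints could be wired up in FastAPI. It is illustrative only: the handler names and response shapes are assumptions, not the actual code in redact/api/.

# Minimal FastAPI sketch; handler names and response fields are hypothetical.
from typing import List

from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Redact")

@app.post("/predict")
async def submit_batch(files: List[UploadFile] = File(...)):
    # The real service validates the files, writes them to storage,
    # and enqueues a redaction job; this stub just echoes the filenames.
    return {"received": [f.filename for f in files]}

@app.get("/predict/{job_id}")
async def job_status(job_id: str):
    # The real endpoint looks the job up in Redis; this stub returns a placeholder.
    return {"job_id": job_id, "status": "pending"}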

Features

  • RESTful API: Clear endpoints for submitting and retrieving redaction tasks.
  • Asynchronous Processing: Long-running redaction tasks are handled by a dedicated worker pool, keeping the API fast and responsive.
  • Persistent Storage: Uses SQLAlchemy for task metadata and Redis for task queuing (a possible task model is sketched after this list).
  • Containerized: Built for easy deployment with a Dockerfile.
  • Benchmarked: Includes load testing scripts using Locust and benchmarks for performance analysis.
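
As an illustration of the persistence side, here is a minimal sketch of what a task-metadata model might look like; the actual table and columns in redact/sqlschema/ may differ.

# Hypothetical task-metadata model; the real schema in redact/sqlschema/ may differ.
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class RedactionJob(Base):
    __tablename__ = "redaction_jobs"

    id = Column(Integer, primary_key=True)
    job_id = Column(String, unique=True, index=True)    # ID returned to the client
    status = Column(String, default="queued")           # queued / processing / done / failed
    created_at = Column(DateTime, default=datetime.utcnow)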

🚧 Project Status: Phase 4 of 6

Current: making PyTorch go brrrrrr, i.e. model creation. Next: integration and testing.

Goal

Process batches of document images, automatically detecting and redacting sensitive information (PII).

Architecture

[Architecture diagram and description to be added.]

Feature Checklist

  • Batch image upload
  • Asynchronous processing
  • Custom redaction model
  • Enable batch prediction
  • REST API
  • Add auth
  • Results retrieval
  • Deploy to Docker/Kubernetes

Phases

  • Phase 1: Basic upload
  • Phase 2: Job tracking
  • Phase 3: Async infrastructure
  • Phase 4: Model development (IN PROGRESS)
  • Phase 5: Integration
  • Phase 6: Deployment

Tech Stack

  • API: Python, FastAPI, Uvicorn
  • Data and queueing: PostgreSQL, SQLAlchemy, Redis, RQ
  • Model: PyTorch
  • Containers: Docker, Docker Compose
  • Quality: tests, linter, load testing

Performance

[Benchmarks to be added as the project progresses.]

API Benchmarking Table

| Endpoint | Operation | Payload Size | Concurrent Users | Requests/sec | Avg Latency (ms) | P95 Latency (ms) | Error Rate | Notes |
|---|---|---|---|---|---|---|---|---|
| POST /predict | Create | 1 MB image | 1 | 0.66 | 1450 | 1600 | 0% | Includes file validation, disk write, Redis |
| GET /predict/{id} | Read | N/A | 10 | | | | | Fetches job status from Redis |
| POST /items | Create | 512 B JSON | | | | | | Basic DB insert |
| GET /items/{id} | Read | N/A | | | | | | Add caching notes if applicable |
| PUT /items/{id} | Update | 1 KB JSON | | | | | | |
| DELETE /items/{id} | Delete | N/A | | | | | | |

Legend:

  • Payload Size: Size of file or JSON sent in the request.
  • Concurrent Users: Simulated users (e.g., in Locust; a minimal locustfile sketch follows this legend).
  • Requests/sec: Throughput under load.
  • Latency: Time from request to response (P95 = 95th percentile).
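
The load tests live under benchmarks/; the exact scripts may differ, but a minimal locustfile for the /predict endpoint could look like the sketch below (sample.jpg is a placeholder test image).

# Minimal Locust sketch; adjust the image path and task weights to match benchmarks/.
from locust import HttpUser, between, task

class RedactUser(HttpUser):
    wait_time = between(1, 2)  # seconds a simulated user waits between tasks

    @task
    def submit_batch(self):
        # POST a small test image to /predict, mirroring the curl example in the Quickstart
        with open("sample.jpg", "rb") as f:
            self.client.post("/predict", files={"files": f})

Run it against a local server with, for example: locust -f locustfile.py --host http://localhost:8000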

Model Inference Benchmark Table

[Model: TBD post Phase 4]

| Model Name | Input Size | Avg Inference Time (ms) | Throughput (req/sec) | Peak RAM Usage | Device | Notes |
|---|---|---|---|---|---|---|
| simulate_model_work | N/A (simulated) | 30000 (sleep) | 0.03 | N/A | CPU | Simulated workload |
| real_model.onnx | 224x224 image | | | | CPU | Replace with real benchmark |
| resnet50 | 512x512 image | | | | GPU | ONNXRuntime on GPU |
| custom_model.pt | Variable | | | | CPU/GPU | Fill in after deployment |

Legend:

  • Inference Time: Time to run a prediction (ms); a measurement sketch follows this legend.
  • Throughput: How many predictions/sec the model can handle.
  • Peak RAM: Max memory used during inference.
  • Device: CPU or GPU.
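
Until the real model lands, rows like the ones above can be produced with a small harness. A rough sketch of measuring average inference time, throughput, and peak RAM (psutil is an extra dependency here, and run_inference stands in for the real model call):

# Rough measurement sketch; run_inference is a placeholder for the real model call.
import time

import psutil

def benchmark(run_inference, batch, runs=50):
    process = psutil.Process()
    peak_rss = 0
    start = time.perf_counter()
    for _ in range(runs):
        run_inference(batch)
        peak_rss = max(peak_rss, process.memory_info().rss)
    elapsed = time.perf_counter() - start
    print(f"avg inference time: {elapsed / runs * 1000:.1f} ms")
    print(f"throughput: {runs / elapsed:.2f} req/sec")
    print(f"peak RSS: {peak_rss / 1024 ** 2:.0f} MiB")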

🔧 Quickstart

Clone the repo

git clone https://github.com/fw7th/redact.git
cd redact

Create and activate virtualenv (optional)

python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

Install dependencies

pip install -r requirements.txt

Copy example env and configure

cp .env.example .env

Start the app

uvicorn app.main:app --reload

Client request example

curl -X POST http://localhost:8000/predict \
  -F "files=@cat.jpg"

Check job status

curl http://localhost:8000/predict/abc123
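
The same flow from Python, using requests; the job_id field name and status values are assumptions about the response schema, so adjust them to match the actual API.

# Submit an image and poll for the result; response field names are assumptions.
import time

import requests

BASE = "http://localhost:8000"

with open("cat.jpg", "rb") as f:
    resp = requests.post(f"{BASE}/predict", files={"files": f})
resp.raise_for_status()
job_id = resp.json().get("job_id")  # adjust to the actual response schema

while True:
    status = requests.get(f"{BASE}/predict/{job_id}").json()
    if status.get("status") in ("done", "failed"):
        break
    time.sleep(1)
print(status)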

Running tests

pytest tests/

Optionally benchmark

./scripts/benchmark

API Documentation

[Link to the hosted /docs will be added once deployed.]

When the server is running locally, visit FastAPI's automatically generated docs:

  • http://localhost:8000/docs (Swagger UI)
  • http://localhost:8000/redoc (ReDoc)

These provide interactive documentation of all available endpoints with live testing.

Design Decisions

  • FastAPI: mainly because it's lighter than Django.
  • Postgres: I'm already familiar with the DBMS, so there's no onboarding overhead.
  • Redis: chosen over a plain multiprocessing.Queue for its persistence, speed, and distributed support; it scales and integrates well with Celery and RQ. Kafka or RabbitMQ would be overkill for a project of this size.
  • RQ: I initially wanted Celery, but after some investigation decided it was more than this use case needs; RQ keeps the setup simple and quick (a minimal enqueue sketch follows this list).
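
Part of RQ's appeal is how little code the queueing side needs. A minimal sketch, where process_batch stands in for the real worker task in redact/workers/:

# Enqueue a redaction job with RQ; process_batch is a stand-in for the real task.
from redis import Redis
from rq import Queue

def process_batch(job_id: str) -> None:
    ...  # OCR and PII redaction happen here, inside a worker process

queue = Queue("redaction", connection=Redis())
job = queue.enqueue(process_batch, "abc123")
print(job.get_status())  # queued -> started -> finished

A worker started with rq worker redaction picks jobs off the queue and runs them outside the API process.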

Project Structure

Main directories and their purpose for anyone looking to understand or extend the codebase.

redact/
├── benchmarks/         # Load testing and performance scripts (Locust, custom benchmarks)
├── redact/             # Main source code package
│   ├── api/            # FastAPI router definitions and main application entrypoint
│   ├── core/           # Configuration, logging, database/redis connection handling
│   ├── services/       # Business logic layer (e.g., storage abstraction)
│   ├── sqlschema/      # SQLAlchemy model definitions
│   └── workers/        # Asynchronous task processing logic (inference, main worker)
├── scripts/            # Helper scripts for development (benchmarking, service setup)
├── tests/              # Unit and integration tests (uses pytest)
└── requirements.txt    # All project dependencies

Future Improvements

  • Web UI
  • Multiple output formats
  • Batch job scheduling

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.
