A fast, containerized, and scalable API service for automated OCR and PII redaction on document batches.
Redact is a production-ready microservice built with FastAPI for secure, asynchronous data redaction. It is designed to be easily deployed with Docker and scaled horizontally via a worker-based architecture (Redis/RQ).
Showcases:
- A robust, modern Python API using FastAPI (complete with automatic documentation).
- A clear separation of concerns (API, Core, Services, Workers).
- Scalable asynchronous task processing.
- Containerization with Docker.
- Database interaction via SQLAlchemy.
- RESTful API: Clear endpoints for submitting and retrieving redaction tasks.
- Asynchronous Processing: Long-running redaction tasks are handled by a dedicated worker pool, keeping the API fast and responsive.
- Persistent Storage: Uses SQLAlchemy for task metadata and Redis for task queuing.
- Containerized: Built for easy deployment with a Dockerfile.
- Benchmarked: Includes load testing scripts using Locust and benchmarks for performance analysis.
Current: Model development in PyTorch (building the redaction model).
Next: Integration and end-to-end testing.
Process batches of document images, automatically detect and redact sensitive information (PII).
[Architecture diagram/description: TBD]
- Batch image upload
- Asynchronous processing
- Custom redaction model
- Enable batch prediction
- REST API
- Add auth
- Results retrieval
- Deploy to Docker/Kubernetes
- Phase 1: Basic upload
- Phase 2: Job tracking
- Phase 3: Async infrastructure
- Phase 4: Model development (IN PROGRESS)
- Phase 5: Integration
- Phase 6: Deployment
[Add benchmarks as you build]
| Endpoint | Operation | Payload Size | Concurrent Users | Requests/sec | Avg Latency (ms) | P95 Latency (ms) | Error Rate | Notes |
|---|---|---|---|---|---|---|---|---|
| POST /predict | Create | 1MB image | 1 | 0.66 | 1450 | 1600 | 0% | Includes file validation, disk write, Redis |
| GET /predict/{id} | Read | N/A | 10 | | | | | Fetches job status from Redis |
| POST /items | Create | 512B JSON | | | | | | Basic DB insert |
| GET /items/{id} | Read | N/A | | | | | | Add caching notes if applicable |
| PUT /items/{id} | Update | 1KB JSON | | | | | | |
| DELETE /items/{id} | Delete | N/A | | | | | | |
Legend:
- Payload Size: Size of file or JSON sent in the request.
- Concurrent Users: Simulated users (e.g., in Locust).
- Requests/sec: Throughput under load.
- Latency: Time from request to response (P95 = 95th percentile).
- [Model: TBD post Phase 4]
| Model Name | Input Size | Avg Inference Time (ms) | Throughput (req/sec) | Peak RAM Usage | Device | Notes |
|---|---|---|---|---|---|---|
| simulate_model_work | N/A (simulated) | 30000 (sleep) | 0.03 | N/A | CPU | Simulated workload |
| real_model.onnx | 224x224 image | | | | CPU | Replace with real benchmark |
| resnet50 | 512x512 image | | | | GPU | ONNXRuntime on GPU |
| custom_model.pt | Variable | | | | CPU/GPU | Fill in after deployment |
Legend:
- Inference Time: Time to run prediction (ms).
- Throughput: How many predictions/sec the model can handle.
- Peak RAM: Max memory used during inference.
- Device: CPU or GPU.
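The inference-time and throughput columns can be filled in with a small timing harness like the one below (the workload passed to `benchmark` is a placeholder; swap in the real ONNX/PyTorch forward pass):

```python
import time

def benchmark(run_inference, n_runs: int = 50) -> dict:
    """Measure average latency (ms) and throughput (req/sec) of a callable."""
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    elapsed = time.perf_counter() - start
    return {"avg_ms": elapsed / n_runs * 1000, "throughput": n_runs / elapsed}

# Placeholder standing in for a model call (e.g. an ONNXRuntime session.run).
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
```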
```bash
git clone https://github.com/yourname/redaction-api.git
cd redaction-api

python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
cp .env.example .env

uvicorn app.main:app --reload

curl -X POST http://localhost:8000/predict \
  -F "files=@cat.jpg"
curl http://localhost:8000/predict/abc123

pytest tests/
./scripts/benchmark
```
[Link to /docs once deployed]
When the server is running locally, visit:
- http://localhost:8000/docs → Swagger UI
- http://localhost:8000/redoc → ReDoc
These provide interactive documentation of all available endpoints with live testing.
Technology choices:
- FastAPI: mainly because it's lighter than Django.
- Postgres: already familiar with the DBMS, so no external onboarding needed.
- Redis: chosen for its persistence, speed, and distributed support over plain multiprocessing. As a queue backend it scales and integrates well with Celery and RQ; Kafka and RabbitMQ were ruled out as overkill for a project of this scope.
- RQ: initially leaned toward Celery, but after investigation it proved too heavyweight for this use case. RQ avoids the setup overhead and keeps things simple and quick.
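With RQ, the worker side reduces to a plain function that the API enqueues. A sketch (the redaction logic is a placeholder, and the commented enqueue call assumes a reachable Redis instance):

```python
def redact_batch(job_id: str, image_paths: list[str]) -> dict:
    # Worker-side task: run the redaction model over each image.
    # Placeholder logic; the real worker loads the model and writes
    # redacted outputs to storage.
    results = {path: "redacted" for path in image_paths}
    return {"job_id": job_id, "status": "done", "results": results}

# API side (illustration only; requires a running Redis server):
# from redis import Redis
# from rq import Queue
#
# q = Queue(connection=Redis())
# q.enqueue(redact_batch, "abc123", ["cat.jpg"])
```

Keeping the task a plain importable function is what lets RQ pickle and dispatch it to any worker process.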
Main directories and their purpose for anyone looking to understand or extend the codebase.
redact/
├── benchmarks/        # Load testing and performance scripts (Locust, custom benchmarks)
├── redact/            # Main source code package
│   ├── api/           # FastAPI router definitions and main application entrypoint
│   ├── core/          # Configuration, logging, database/redis connection handling
│   ├── services/      # Business logic layer (e.g., storage abstraction)
│   ├── sqlschema/     # SQLAlchemy model definitions
│   └── workers/       # Asynchronous task processing logic (inference, main worker)
├── scripts/           # Helper scripts for development (benchmarking, service setup)
├── tests/             # Unit and integration tests (uses pytest)
└── requirements.txt   # All project dependencies
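A minimal sketch of what a task-metadata model in `sqlschema/` might look like (the table and column names are illustrative, not the actual schema; SQLite is used here only for demonstration, while the service itself targets Postgres):

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class RedactionJob(Base):
    # Illustrative task-metadata table; the real columns may differ.
    __tablename__ = "redaction_jobs"
    id = Column(String, primary_key=True)
    status = Column(String, default="queued")

# In-memory SQLite just to exercise the model end to end.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(RedactionJob(id="abc123"))
    session.commit()
    job_status = session.get(RedactionJob, "abc123").status
```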
- Web UI
- Multiple output formats
- Batch job scheduling
This project is licensed under the MIT License β see the LICENSE file for details.