Maskify

A document redaction system that automatically detects and masks Aadhaar UIDs (Unique Identification Numbers) in PDF and image files, using OCR to find candidate numbers and the Verhoeff algorithm to validate them.

Overview

Maskify is a document processing tool developed for SIH-2024 that protects sensitive personal information by automatically detecting and redacting Aadhaar numbers in documents. It combines Tesseract OCR with Verhoeff checksum validation so that only numbers that pass as genuine UIDs are masked.

Features

Automatic UID Detection: Uses Tesseract OCR to extract text from documents
Verhoeff Algorithm Validation: Validates detected UIDs using the Verhoeff checksum algorithm
Multi-Format Support: Processes both PDF and image files (JPG/JPEG)
Smart Rotation Detection: Automatically detects document orientation
Precise Masking: Redacts UIDs by masking individual digits with black rectangles
REST API: Easy-to-use HTTP endpoint for document processing
Docker Support: Containerized deployment for consistent environments

Technology Stack

Python 3.11 with FastAPI — REST API server
Tesseract OCR — Optical character recognition
OpenCV — Image processing and manipulation
Poppler — PDF to image conversion

Key Libraries

fastapi / uvicorn — ASGI web framework and server
pytesseract — OCR text extraction
opencv-python-headless — Image processing
pdf2image — PDF conversion
img2pdf — PDF generation
pillow — Image manipulation
numpy — Numerical operations
regex — Pattern matching

Installation

Prerequisites

Python (v3.11+)
Tesseract OCR
Poppler utilities

Local Setup

Clone the repository

git clone https://github.com/funinkina/Maskify.git
cd Maskify

Install Python dependencies
```
pip install -r requirements.txt
```

Install system dependencies

For Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils

For macOS:

brew install tesseract poppler

Docker Setup

Build and run using Docker:

docker build -t maskify .
docker run -p 8000:8000 maskify

Usage

Starting the Server

Development mode (with auto-reload):

uvicorn app:app --reload

Production mode:

uvicorn app:app --host 0.0.0.0 --port 8000

The server will start on http://localhost:8000

API Endpoint

POST /process-file

Upload a document for UID redaction.

Parameters:

file (file, required): PDF or JPG/JPEG file to process
level (string, required): Redaction level (currently supports standard redaction)

Example using cURL:

curl -X POST http://localhost:8000/process-file \
  -F "file=@/path/to/document.pdf" \
  -F "level=standard" \
  --output redacted_document.pdf

Example using Python (requests):

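import requests

# Upload the document and request standard redaction
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/process-file",
        files={"file": f},
        data={"level": "standard"},
    )

# Stop on a 4xx/5xx status so an error body is not saved as a PDF
response.raise_for_status()

with open("redacted_document.pdf", "wb") as f:
    f.write(response.content)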

Response:

Success: Returns the redacted file (same format as input)
400: Unsupported file type
422: No UIDs found in the document
500: Processing error
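
A client will usually want to write the response body to disk only when the request succeeded. The following is a minimal sketch of that check using only the status codes listed above. The helper name `save_if_success` and the returned messages are illustrative and are not part of Maskify itself.

```python
from pathlib import Path

# Meanings copied from the response list above.
ERROR_MEANINGS = {
    400: "Unsupported file type",
    422: "No UIDs found in the document",
    500: "Processing error",
}


def save_if_success(status_code: int, body: bytes, out_path: str) -> str:
    """Write body to out_path on a 2xx status; otherwise return the error meaning."""
    if 200 <= status_code < 300:
        Path(out_path).write_bytes(body)
        return f"saved {out_path}"
    # Nothing is written for error responses.
    return ERROR_MEANINGS.get(status_code, f"unexpected status {status_code}")
```

With `requests`, this could be called as `save_if_success(response.status_code, response.content, "redacted_document.pdf")`.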

How It Works

1. File Upload: User uploads a PDF or image file through the API
2. PDF Conversion: If PDF, converts first page to high-resolution image (300 DPI)
3. OCR Processing: Tesseract extracts text with bounding box coordinates
4. Pattern Matching: Regex searches for 12-digit number sequences
5. UID Validation:
   - Applies Verhoeff algorithm to validate checksum
   - Filters out numbers ending in 1947 (Aadhaar year filter)
6. Rotation Detection: Tests multiple orientations and blur levels
7. Redaction: Masks validated UIDs with black rectangles
8. Output Generation: Returns processed file in original format
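
The pattern-matching and redaction steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code in redaction.py: the regex assumes the 12 digits may be printed as three groups of four separated by single spaces, and `digit_boxes` assumes a word's bounding box (left, top, width, height, as OCR tools commonly report it) can be split into equal-width slices, one per digit. Both function names are hypothetical.

```python
import re

# Assumption: 12 digits, optionally split into groups of four by single
# spaces, and not embedded in a longer run of digits.
UID_PATTERN = re.compile(r"(?<!\d)\d{4} ?\d{4} ?\d{4}(?!\d)")


def find_uid_candidates(text: str) -> list[str]:
    """Return every 12-digit candidate in text with the spaces removed."""
    return [m.group().replace(" ", "") for m in UID_PATTERN.finditer(text)]


def digit_boxes(
    left: int, top: int, width: int, height: int, n_digits: int
) -> list[tuple[int, int, int, int]]:
    """Split one word box into n_digits rectangles given as (x, y, w, h)."""
    boxes = []
    for i in range(n_digits):
        # Integer slice edges, so the slices tile the word box exactly.
        x0 = left + (width * i) // n_digits
        x1 = left + (width * (i + 1)) // n_digits
        boxes.append((x0, top, x1 - x0, height))
    return boxes
```

Each rectangle from `digit_boxes` could then be filled in black on the page image. Candidates from `find_uid_candidates` still need checksum validation before anything is masked.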

Verhoeff Algorithm

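The system uses the Verhoeff algorithm to validate Aadhaar numbers. The last digit of a number is a check digit computed from the others, and the algorithm detects every single-digit error and every transposition of two adjacent digits. Simpler schemes such as the Luhn checksum miss some adjacent transpositions (for example, swapping 09 and 90).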
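
For reference, here is a compact sketch of Verhoeff validation using the standard published tables. The function names are illustrative and the code is not taken from redaction.py. `is_valid_uid` combines the checks described in this README (12 digits, not ending in 1947, valid checksum) and is an assumption about how they fit together.

```python
# Standard Verhoeff tables: D is the multiplication table of the dihedral
# group D5, P the position-dependent permutation, INV the group inverse.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_is_valid(number: str) -> bool:
    """Check a non-empty digit string whose last digit is the check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to payload."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        # Positions shift by one because the check digit will occupy slot 0.
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def is_valid_uid(candidate: str) -> bool:
    """Apply the README's checks: 12 digits, not ending in 1947, checksum OK."""
    return (
        len(candidate) == 12
        and candidate.isascii()
        and candidate.isdigit()
        and not candidate.endswith("1947")
        and verhoeff_is_valid(candidate)
    )
```

As a small worked case, `verhoeff_check_digit("236")` gives 3, so `verhoeff_is_valid("2363")` is true while `verhoeff_is_valid("2364")` is false.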

Project Structure

Maskify/
├── app.py               # FastAPI server and API endpoint
├── redaction.py          # Python engine for UID detection and masking
├── requirements.txt      # Python dependencies
├── Dockerfile            # Docker configuration
├── vercel.json           # Vercel deployment config
└── Readme.md             # Project documentation

Security Considerations

Temporary files are automatically cleaned up after processing
Each request uses an isolated temp directory (concurrent-safe)
No persistent storage of sensitive documents
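
One common way to get per-request isolation and automatic cleanup in Python is `tempfile.TemporaryDirectory`. The sketch below shows the pattern only; `run_isolated` and its `process` callback are hypothetical names, and the actual handling in app.py may differ.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_isolated(
    upload: bytes, filename: str, process: Callable[[Path], Path]
) -> bytes:
    """Process one upload inside its own temp directory and return the output.

    The directory and everything in it are deleted when the block exits,
    including when process raises.
    """
    with tempfile.TemporaryDirectory(prefix="maskify-") as workdir:
        # Keep only the final path component of the client-supplied name.
        src = Path(workdir) / Path(filename).name
        src.write_bytes(upload)
        out = process(src)  # expected to write its result inside workdir
        return out.read_bytes()
```

Because each call creates a fresh directory, two concurrent requests never share file paths, and the result is read into memory before the directory is removed.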

Known Limitations

Currently processes only the first page of multi-page PDFs
Optimized for standard Aadhaar card formats
Requires clear, readable text for accurate OCR
Processing time varies based on image quality and resolution

Building the Report

The project report is written in LaTeX and can be compiled to PDF using XeLaTeX.

Prerequisites

XeLaTeX (part of TeX Live or MacTeX)

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install -y texlive-xetex

Linux (Fedora/RHEL)

sudo dnf install -y texlive-xetex

Windows

Download and install TeX Live or MiKTeX
During installation, ensure XeLaTeX is selected
Add the TeX bin directory to your system PATH

Compilation

To compile the report, change to the Report directory and run XeLaTeX twice. The second pass resolves cross-references, citations, and the table of contents:

cd Report
xelatex -interaction=nonstopmode main.tex
xelatex -interaction=nonstopmode main.tex

The compiled PDF will be generated as main.pdf in the Report directory.

Report Structure

Report/
├── main.tex              # Main document
├── chapter1.tex          # Introduction
├── chapter2.tex          # Literature Review
├── chapter3.tex          # System Design
├── chapter4.tex          # Implementation
├── chapter5.tex          # Conclusion
├── abstract.tex          # Abstract
├── cover.tex             # Cover page
├── references.bib        # Bibliography
└── main.pdf              # Compiled output
