A powerful document redaction system that automatically detects and masks Aadhaar UIDs (Unique Identification Numbers) from PDF and image files using OCR and the Verhoeff algorithm for validation.
Maskify is a document processing tool developed for SIH-2024 that protects sensitive personal information by automatically detecting and redacting Aadhaar numbers in documents. It combines Tesseract OCR with Verhoeff checksum validation, so only numbers that pass the checksum are treated as genuine UIDs and masked.
- Automatic UID Detection: Uses Tesseract OCR to extract text from documents
- Verhoeff Algorithm Validation: Validates detected UIDs using the Verhoeff checksum algorithm
- Multi-Format Support: Processes both PDF and image files (JPG/JPEG)
- Smart Rotation Detection: Automatically detects document orientation
- Precise Masking: Redacts UIDs by masking individual digits with black rectangles
- REST API: Easy-to-use HTTP endpoint for document processing
- Docker Support: Containerized deployment for consistent environments
- Python 3.11 with FastAPI — REST API server
- Tesseract OCR — Optical character recognition
- OpenCV — Image processing and manipulation
- Poppler — PDF to image conversion
- fastapi/uvicorn: ASGI web framework and server
- pytesseract: OCR text extraction
- opencv-python-headless: Image processing
- pdf2image: PDF conversion
- img2pdf: PDF generation
- pillow: Image manipulation
- numpy: Numerical operations
- regex: Pattern matching
- Python (v3.11+)
- Tesseract OCR
- Poppler utilities
- Clone the repository
git clone https://github.com/funinkina/Maskify.git
cd Maskify
- Install Python dependencies
pip install -r requirements.txt
- Install system dependencies
For Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
For macOS:
brew install tesseract poppler
Build and run using Docker:
docker build -t maskify .
docker run -p 8000:8000 maskify
Development mode (with auto-reload):
uvicorn app:app --reload
Production mode:
uvicorn app:app --host 0.0.0.0 --port 8000
The server will start on http://localhost:8000
POST /process-file
Upload a document for UID redaction.
Parameters:
- file (file, required): PDF or JPG/JPEG file to process
- level (string, required): Redaction level (currently supports standard redaction)
Example using cURL:
curl -X POST http://localhost:8000/process-file \
-F "file=@/path/to/document.pdf" \
-F "level=standard" \
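--output redacted_document.pdf
Example using Python (requests):
import requests

# Upload the document and request standard redaction
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/process-file",
        files={"file": f},
        data={"level": "standard"},
    )
# Stop on 4xx/5xx so an error body is not saved as the redacted file
response.raise_for_status()
with open("redacted_document.pdf", "wb") as out:
    out.write(response.content)
Response: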
- Success: Returns the redacted file (same format as input)
- 400: Unsupported file type
- 422: No UIDs found in the document
- 500: Processing error
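A client may want to branch on these status codes instead of writing every response body to disk. The helper below is only a sketch: the name `interpret_status` and the returned labels are hypothetical, and treating any 2xx code as success is an assumption, since the exact success code is not stated here.

```python
def interpret_status(status_code: int) -> str:
    """Map a /process-file HTTP status code to a short outcome label."""
    # Assumption: any 2xx response carries the redacted file.
    if 200 <= status_code < 300:
        return "ok"
    # The three error codes documented for the endpoint.
    known_errors = {
        400: "unsupported file type",
        422: "no UIDs found in the document",
        500: "processing error",
    }
    # Anything else is not documented, so report it as unexpected.
    return known_errors.get(status_code, "unexpected status")
```

With `requests`, this could be called as `interpret_status(response.status_code)` before deciding whether to save `response.content`.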
- File Upload: User uploads a PDF or image file through the API
- PDF Conversion: If PDF, converts first page to high-resolution image (300 DPI)
- OCR Processing: Tesseract extracts text with bounding box coordinates
- Pattern Matching: Regex searches for 12-digit number sequences
- UID Validation:
  - Applies Verhoeff algorithm to validate checksum
  - Filters out numbers ending in 1947 (Aadhaar year filter)
- Rotation Detection: Tests multiple orientations and blur levels
- Redaction: Masks validated UIDs with black rectangles
- Output Generation: Returns processed file in original format
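The pattern matching and the 1947 filter from the steps above can be sketched as follows. This is an illustration, not the code in `redaction.py`: the names `UID_PATTERN` and `find_uid_candidates` are hypothetical, the sketch uses the standard-library `re` module rather than the `regex` package, and it assumes the 12 digits appear either as one run or as three groups of four separated by single spaces.

```python
import re

# Assumption: a UID is printed as 12 digits, optionally split into
# three groups of four by single spaces. The lookarounds stop the
# pattern from matching inside a longer run of digits.
UID_PATTERN = re.compile(r"(?<!\d)\d{4} ?\d{4} ?\d{4}(?!\d)")


def find_uid_candidates(text: str) -> list[str]:
    """Return 12-digit candidates found in OCR text, spaces removed."""
    candidates = []
    for match in UID_PATTERN.finditer(text):
        digits = match.group().replace(" ", "")
        # Mirror the filter described above: skip numbers ending in 1947.
        if digits.endswith("1947"):
            continue
        candidates.append(digits)
    return candidates
```

The candidates returned here are not yet confirmed UIDs; they would still need to pass the checksum validation step before being masked.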
The system uses the Verhoeff algorithm to validate Aadhaar numbers. This checksum detects all single-digit errors and all transpositions of adjacent digits, which simpler checksum methods do not fully catch, so a 12-digit sequence that is not a genuine UID usually fails the check and is left unmasked.
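For reference, here is a minimal sketch of Verhoeff validation using the standard multiplication, permutation, and inverse tables. The function names `verhoeff_is_valid` and `verhoeff_check_digit` are chosen for illustration and are not taken from `redaction.py`.

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# Position-dependent permutation table (repeats every 8 positions).
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

# Inverse of each element under D.
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_is_valid(number: str) -> bool:
    """Return True if the digit string ends in a correct Verhoeff check digit."""
    if not (number.isascii() and number.isdigit()):
        return False
    c = 0
    # Process digits right to left; the check digit sits at position 0.
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to a digit string."""
    c = 0
    # Positions are shifted by one because the check digit will occupy position 0.
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]
```

For example, `verhoeff_check_digit("236")` returns `3`, so `verhoeff_is_valid("2363")` is `True` while `verhoeff_is_valid("2364")` is `False`. For Aadhaar numbers the input would be the full 12-digit string, with the last digit acting as the check digit.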
Maskify/
├── app.py # FastAPI server and API endpoint
├── redaction.py # Python engine for UID detection and masking
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
├── vercel.json # Vercel deployment config
└── Readme.md # Project documentation
- Temporary files are automatically cleaned up after processing
- Each request uses an isolated temp directory (concurrent-safe)
- No persistent storage of sensitive documents
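One way to get this kind of per-request isolation and cleanup is Python's `tempfile.TemporaryDirectory`. The sketch below is an assumption about how it could be done, not a copy of `app.py`; the function name `process_in_isolated_dir` and the `worker` callback are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_in_isolated_dir(
    data: bytes, filename: str, worker: Callable[[Path], bytes]
) -> bytes:
    """Write an upload into a fresh temp directory, process it, then clean up."""
    # Each call gets its own directory, so concurrent requests never share files.
    with tempfile.TemporaryDirectory(prefix="maskify_") as tmp:
        # Keep only the base name so a crafted filename cannot escape the directory.
        input_path = Path(tmp) / Path(filename).name
        input_path.write_bytes(data)
        # The result is returned as bytes, so nothing needs to outlive the directory.
        return worker(input_path)
    # Leaving the with-block deletes the directory and everything inside it.
```

Returning the processed bytes rather than a path means the caller never holds a reference to a file that has already been deleted.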
- Currently processes only the first page of multi-page PDFs
- Optimized for standard Aadhaar card formats
- Requires clear, readable text for accurate OCR
- Processing time varies based on image quality and resolution
The project report is written in LaTeX and can be compiled to PDF using XeLaTeX.
- XeLaTeX (part of TeX Live or MacTeX)
For Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y texlive-xetex
For Fedora:
sudo dnf install -y texlive-xetex
- Download and install TeX Live or MiKTeX
- During installation, ensure XeLaTeX is selected
- Add the TeX bin directory to your system PATH
To compile the report, navigate to the Report directory and run:
cd Report
xelatex -interaction=nonstopmode main.tex
Run the command twice to ensure all cross-references, citations, and table of contents are resolved correctly:
xelatex -interaction=nonstopmode main.tex
xelatex -interaction=nonstopmode main.tex
The compiled PDF will be generated as main.pdf in the Report directory.
Report/
├── main.tex # Main document
├── chapter1.tex # Introduction
├── chapter2.tex # Literature Review
├── chapter3.tex # System Design
├── chapter4.tex # Implementation
├── chapter5.tex # Conclusion
├── abstract.tex # Abstract
├── cover.tex # Cover page
├── references.bib # Bibliography
└── main.pdf # Compiled output