GitHub link: https://github.com/ravi-ivar-7/hilabs
- OS: Linux 6.14.0-29-generic #29~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64
- CPU: 12th Gen Intel(R) Core(TM) i5-1240P
- RAM: 7.4Gi total
- Node.js: v20.16.0
- Python: 3.12.3
- Docker: 28.4.0
- Docker and Docker Compose installed
- Git (for cloning the repository)
- 8GB+ RAM recommended
Clone the repository:
git clone https://github.com/ravi-ivar-7/hilabs.git
cd hilabs
Choose your setup method:
# Copy environment configuration
cp .env.example .env
# Start all services locally
./services.sh start
Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
# Alternatively, build and start all services with Docker Compose
docker-compose up --build
Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Redis: localhost:6379
Healthcare Contract Analysis Challenge:
┌─────────────────────────────────────────────────────────────┐
│ INPUT: Healthcare contract PDFs (TN/WA states) │
│ │
│ GOAL: Classify clauses as Standard/Non-Standard/Ambiguous │
│ │
│ METHOD: Compare against state-specific template standards │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────┬─────────────────────────────────────────┐
│ Attribute │ Description │
├─────────────────────┼─────────────────────────────────────────┤
│ Medicaid Timely │ Claims submission deadlines (120 days) │
│ Filing │ │
├─────────────────────┼─────────────────────────────────────────┤
│ Medicare Timely │ Medicare claims deadlines (365 days) │
│ Filing │ │
├─────────────────────┼─────────────────────────────────────────┤
│ No Steerage/SOC │ Network participation rules │
├─────────────────────┼─────────────────────────────────────────┤
│ Medicaid Fee │ Payment methodology for Medicaid │
│ Schedule │ │
├─────────────────────┼─────────────────────────────────────────┤
│ Medicare Fee │ Payment methodology for Medicare │
│ Schedule │ │
└─────────────────────┴─────────────────────────────────────────┘
Multi-Step Analysis Approach:
┌─────────────────────────────────────────────────────────────┐
│ 1. Exception Detection → Conditional clauses = NON-STANDARD │
│ 2. Exact Text Matching → Perfect match = STANDARD │
│ 3. Placeholder Substitution → Structure match = STANDARD │
│ 4. Fuzzy String Matching → 70%+ similarity = STANDARD │
│ 5. Semantic Analysis → SBERT embeddings for meaning │
│ 6. Methodology Detection → Different methods = NON-STANDARD │
└─────────────────────────────────────────────────────────────┘
*Detailed implementation flowchart available below in Processing Pipeline section*
AI/ML Components:
• spaCy NLP → Text processing and analysis
• SBERT → Semantic similarity embeddings
• RapidFuzz → Lexical string matching
• Local Models → No external API dependencies
*Full technical architecture details below in System Architecture section*
Variable Content Handling:
┌─────────────────────────────────────────────────────────────┐
│ Original: "Provider accepts 95% of eligible charges" │
│ Template: "Provider accepts XX% of eligible charges" │
│ │
│ Normalization Process: │
│ 1. Replace "95%" → "<PCT>" │
│ 2. Replace "XX%" → "<PCT>" │
│ 3. Compare normalized versions │
│ 4. Match: STANDARD classification │
└─────────────────────────────────────────────────────────────┘
Placeholder Patterns:
• <PCT>: Percentages (95%, XX%, one hundred percent)
• <ORG>: Organizations (Plan, Company, Network)
• <MEMBER>: Member types (Enrollee, Subscriber)
• <DATE>: Dates and timeframes
• <FEE_SCHEDULE>: Payment references
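The normalization idea above can be sketched with a couple of regular expressions. This is a minimal illustration, not the project's actual code: the two patterns below are simplified assumptions, and the real `PLACEHOLDER_MAP` in `worker/classification_parameters.py` contains more entries.

```python
import re

# Illustrative patterns only (assumed for this sketch); the project's own
# PLACEHOLDER_MAP is larger and may differ in detail.
PLACEHOLDER_PATTERNS = [
    # "95%", "95 %" and the template marker "XX%" all become <PCT>.
    (re.compile(r"\b(?:\d{1,3}|XX)\s*%", re.IGNORECASE), "<PCT>"),
    # "Fee Schedule" / "Compensation Schedule" become <FEE_SCHEDULE>.
    (re.compile(r"\b(?:Fee|Compensation)\s+Schedule\b", re.IGNORECASE), "<FEE_SCHEDULE>"),
]


def normalize(text: str) -> str:
    """Collapse whitespace, then replace variable content with placeholders."""
    text = " ".join(text.split())
    for pattern, token in PLACEHOLDER_PATTERNS:
        text = pattern.sub(token, text)
    return text


original = "Provider accepts 95% of eligible charges"
template = "Provider accepts XX% of eligible charges"

# Both sides normalize to the same string, so the structure matches.
print(normalize(original))
print(normalize(original) == normalize(template))
```

Normalizing both the contract clause and the template before comparing keeps the comparison symmetric, so a concrete value on one side and a placeholder on the other do not count as a difference.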
Meaning-Based Classification:
• SBERT embeddings capture clause intent beyond exact wording
• Cosine similarity measures semantic closeness to templates
• Configurable thresholds balance precision vs coverage
• Handles paraphrased clauses with same legal meaning
*Implementation details and thresholds below in Processing Pipeline section*
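To make the cosine-similarity step concrete, here is a small sketch in plain Python. The vectors are toy three-dimensional stand-ins; in the real pipeline they would be SBERT embeddings of the clause and of each template. The function names `cosine_similarity` and `best_template` are hypothetical and chosen only for this example.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_template(
    clause_vec: Sequence[float],
    template_vecs: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the template name with the highest similarity and its score."""
    name = max(
        template_vecs,
        key=lambda k: cosine_similarity(clause_vec, template_vecs[k]),
    )
    return name, cosine_similarity(clause_vec, template_vecs[name])


# Toy vectors standing in for SBERT embeddings (made up for illustration).
templates = {
    "Medicaid Timely Filing": [1.0, 0.0, 0.0],
    "Medicare Fee Schedule": [0.0, 1.0, 0.0],
}
name, score = best_template([0.9, 0.1, 0.0], templates)
print(name, round(score, 3))
```

The resulting score is what gets compared against the configurable thresholds to decide between Standard and Ambiguous.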
Feedback Loop:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Ambiguous │───▶│ Human Review │───▶│ Corrected │
│ Classification │ │ Interface │ │ Classification │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Show clause + │ │ User selects │ │ Store feedback │
│ template match │ │ correct class │ │ for learning │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Key Innovations:
✓ Multi-step classification combining lexical + semantic analysis
✓ State-specific template management (TN/WA)
✓ Advanced placeholder normalization
✓ Local AI processing (no external APIs)
✓ Real-time progress tracking
✓ Human-in-the-loop feedback system
✓ Local data processing only
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Frontend │ │ Backend │ │ Worker │ │ Classification │
│ (Next.js) │◄──►│ (FastAPI) │◄──►│ (Celery) │◄──►│ Engine │
│ │ │ │ │ │ │ (spaCy+SBERT) │
│ • Upload UI │ │ • REST API │ │ • Async Tasks │ │ • NLP Models │
│ • Dashboard │ │ • Database │ │ • Redis Queue │ │ • Templates │
│ • Results │ │ • File Storage │ │ • Processing │ │ • Classification│
└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
:3000 :8000 Redis:6379 Local AI
Pages:
├── / (Home)
├── /upload (PDF Upload)
└── /analysis (Results, Analysis & Review Dashboard)
├── Results Display
├── Classification Analysis
└── Human Feedback Modals
Tech Stack:
• Next.js
• TypeScript + Tailwind CSS
• Real-time status updates
API Endpoints:
├── POST /api/v1/contracts/upload
├── GET /api/v1/contracts/{id}/status
├── GET /api/v1/contracts/{id}/results
├── GET /api/v1/contracts/{id} (contract details)
├── POST /api/v1/contracts/clauses/feedback
├── GET /api/v1/health (health check)
└── GET /docs (Swagger UI)
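A client would typically upload a contract and then poll the status endpoint until processing finishes. The sketch below shows only the polling part, using the standard library. The response shape is an assumption: the `status` field and the terminal values `completed` and `failed` are hypothetical, so check the Swagger UI at `/docs` for the actual schema.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000/api/v1"


def status_url(contract_id: str, base_url: str = BASE_URL) -> str:
    """Build the status endpoint URL for one contract."""
    return f"{base_url}/contracts/{contract_id}/status"


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_contract(
    contract_id: str,
    fetch: Callable[[str], dict] = fetch_json,
    interval: float = 2.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the status endpoint until a terminal state is reported.

    The "status" key and its values are assumptions made for this sketch.
    """
    for _ in range(max_polls):
        payload = fetch(status_url(contract_id))
        if payload.get("status") in {"completed", "failed"}:
            return payload
        sleep(interval)
    raise TimeoutError(f"contract {contract_id} did not finish in time")
```

Passing `fetch` and `sleep` as parameters keeps the function easy to exercise without a running backend.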
Database Tables:
├── Contract (main contract data + processing status)
├── FileRecord (file storage tracking)
├── ContractClause (classified clauses + confidence scores)
├── ProcessingLog (audit trail + component logging)
└── ClauseFeedback (human corrections + ratings)
Task Queue Architecture:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │───►│ Redis │───►│ Worker │
│ Request │ │ Message │ │ Processes │
│ │ │ Broker │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Result │
│ Backend │
│ Storage │
└─────────────┘
NLP Pipeline:
PDF → PyMuPDF → Text Cleaning → Clause Extraction → spaCy Analysis → SBERT Similarity → Classification
Models Used:
• spaCy: en_core_web_sm (NLP pipeline)
• SBERT: all-MiniLM-L6-v2 (semantic similarity)
• RapidFuzz: string matching
• Local processing (no external APIs)
┌─────────────┐
│ User Upload │
│ PDF + State │
└──────┬──────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Frontend │────►│ Backend │────►│ Celery │
│ Validation │ │ File Save │ │ Queue │
└─────────────┘ └─────────────┘ └──────┬──────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Database │◄────│ Results │◄────│ Stage 1+2 │
│ Storage │ │ Assembly │ │ Processing │
└─────────────┘ └─────────────┘ └─────────────┘
upload/
├── contracts-tn/
│ ├── {uuid}.pdf # Original PDF
│ ├── {uuid}_clauses.json # Extracted clauses
│ └── {uuid}_results.json # Classification results
└── contracts-wa/
├── {uuid}.pdf
├── {uuid}_clauses.json
└── {uuid}_results.json
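The layout above maps directly to a small path helper. This is a sketch under the assumption that the state code selects the folder and the contract UUID selects the file names; `contract_paths` is a hypothetical name, not a function from the repository.

```python
from pathlib import PurePosixPath
from typing import Dict

SUPPORTED_STATES = {"TN", "WA"}


def contract_paths(
    contract_id: str, state: str, root: str = "upload"
) -> Dict[str, PurePosixPath]:
    """Return the three per-contract file paths for a given state."""
    state = state.upper()
    if state not in SUPPORTED_STATES:
        raise ValueError(f"unsupported state: {state!r}")
    folder = PurePosixPath(root) / f"contracts-{state.lower()}"
    return {
        "pdf": folder / f"{contract_id}.pdf",               # original upload
        "clauses": folder / f"{contract_id}_clauses.json",  # extracted clauses
        "results": folder / f"{contract_id}_results.json",  # classification results
    }


paths = contract_paths("1234", "tn")
print(paths["clauses"])
```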
docker-compose.yml:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ frontend │ │ backend │ │ worker │ │ redis │
│ :3000 │ │ :8000 │ │ (celery) │ │ :6379 │
│ │ │ │ │ │ │ │
│ Next.js │ │ FastAPI │ │ Processing │ │ Redis │
│ React UI │ │ SQLite DB │ │ Pipeline │ │ Message │
│ │ │ │ │ │ │ Broker │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
└────────────────┼────────────────┼────────────────┘
│ │
hilabs-network (bridge)
Shared Volumes:
• backend_data (SQLite DB + app data)
• worker_uploads (PDF files + processing results)
• redis_data (Redis persistence)
Scaling Strategy (code can be extended):
• Worker containers are the primary bottleneck (NLP processing: 30-120 sec)
• Horizontal scaling: docker-compose up -d --scale worker=N based on Redis queue depth
• Resource allocation: Worker (2 CPU, 4GB) > Backend (1 CPU, 1GB) > Frontend (0.5 CPU, 512MB)
• Auto-scaling triggers: Queue depth >50 tasks → scale up, <10 tasks → scale down
• Current: Single worker container, extensible to multi-worker deployment
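The queue-depth triggers above amount to a simple hysteresis rule. The sketch below expresses it as a pure function; the one-step-at-a-time policy and the `min_workers`/`max_workers` bounds are assumptions added for illustration, since the current deployment runs a single worker.

```python
def desired_workers(
    queue_depth: int,
    current: int,
    min_workers: int = 1,
    max_workers: int = 4,      # assumed upper bound for this sketch
    scale_up_at: int = 50,     # queue depth above this adds a worker
    scale_down_at: int = 10,   # queue depth below this removes a worker
) -> int:
    """Return the worker count to run next, changing by at most one."""
    if queue_depth > scale_up_at:
        return min(current + 1, max_workers)
    if queue_depth < scale_down_at:
        return max(current - 1, min_workers)
    # Between the two thresholds the count is left alone to avoid flapping.
    return current


print(desired_workers(queue_depth=60, current=1))
print(desired_workers(queue_depth=30, current=2))
```

The returned number could then be applied with `docker-compose up -d --scale worker=N`.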
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │───▶│ Stage 1 │───▶│ Stage 2 │───▶│ Results │
│ PDF File │ │Preprocessing│ │Classification│ │ Dashboard │
│ │ │ │ │ │ │ │
│ • File │ │ • Extract │ │ • Template │ │ • Standard │
│ • State │ │ • Clean │ │ • Compare │ │ • Non-Std │
│ • Validate │ │ • Clauses │ │ • Classify │ │ • Ambiguous │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Frontend Celery Task Celery Task Frontend
Input: PDF Contract + State (TN/WA)
┌─────────────┐
│ PDF Upload │
└──────┬──────┘
│ 20% - Loading PDF
▼
┌─────────────┐ PyMuPDF Extraction
│ Text Extract│───► Raw Text + Metadata
└──────┬──────┘ Fallback Methods
│ 60% - Cleaning Text
▼
┌─────────────┐ Remove Artifacts
│ Text Clean │───► Normalized Text
└──────┬──────┘ UTF-8 Encoding
│ 70% - Extracting Clauses
▼
┌─────────────┐ Sentence Splitting
│ Clause │───► Clause List + IDs
│ Extraction │ Context Preservation
└──────┬──────┘
│ 90% - Saving Data
▼
┌─────────────┐ JSON Serialization
│ Data Store │───► {id}_clauses.json
└─────────────┘ Database Records
│ 100% - Complete
▼
Queue Stage 2
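The cleaning and clause-extraction steps of Stage 1 can be approximated with a regex-based splitter. This is a simplified stand-in, not the project's extraction logic (which may rely on spaCy sentence boundaries); the `min_chars` filter and the output keys `id` and `text` are assumptions.

```python
import re
from typing import Dict, List


def extract_clauses(text: str, min_chars: int = 20) -> List[Dict[str, object]]:
    """Split cleaned text into sentence-like clauses with sequential IDs."""
    # Text cleaning: collapse line breaks and repeated spaces from PDF output.
    cleaned = " ".join(text.split())
    # Split after '.' or ';' when the next chunk starts like a new sentence.
    parts = re.split(r"(?<=[.;])\s+(?=[A-Z0-9(])", cleaned)
    clauses: List[Dict[str, object]] = []
    for part in parts:
        part = part.strip()
        # Drop fragments too short to be a meaningful clause.
        if len(part) >= min_chars:
            clauses.append({"id": len(clauses) + 1, "text": part})
    return clauses


raw = "Provider shall submit claims within 120 days.  Plan shall pay\nclean claims promptly. OK."
for clause in extract_clauses(raw):
    print(clause["id"], clause["text"])
```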
Input: Clause Data + Templates
┌─────────────┐
│ Load Data │
└──────┬──────┘
│ 20% - Loading Templates
▼
┌─────────────┐ State Detection (TN/WA)
│ Template │───► Load State Templates
│ Loading │ Initialize NLP Models
└──────┬──────┘
│ 40% - Starting Classification
▼
┌─────────────┐ 6-Step Analysis
│ Classify │───► Standard/Non-Standard/Ambiguous
│ Clauses │ Confidence Scoring
└──────┬──────┘
│ 80% - Saving Results
▼
┌─────────────┐ Database Storage
│ Store │───► {id}_results.json
│ Results │ Audit Logs
└─────────────┘
│ 100% - Complete
▼
Results Ready
For Each Clause:
Step 1: Exception Check
┌─────────────────────────────────────┐
│ Scan for: "except", "unless", │ ──► Non-Standard
│ "provided that", "subject to" │ (90% confidence)
└─────────────────────────────────────┘
Step 2: Exact Match
┌─────────────────────────────────────┐
│ Normalized text == Template text │ ──► Standard
│ (case insensitive, whitespace) │ (99% confidence)
└─────────────────────────────────────┘
Step 3: Placeholder Substitution
┌─────────────────────────────────────┐
│ Replace variables: %,dates,names │ ──► Standard
│ Check structure similarity │ (95% confidence)
└─────────────────────────────────────┘
Step 4: Fuzzy Matching
┌─────────────────────────────────────┐
│ RapidFuzz string similarity │ ──► Standard
│ Threshold: 70% │ (90% confidence)
└─────────────────────────────────────┘
Step 5: Semantic Similarity
┌─────────────────────────────────────┐
│ SBERT embeddings + cosine similarity│
│ ≥60%: Standard (85% confidence) │ ──► Standard/Ambiguous
│ 50-70%: Ambiguous │
└─────────────────────────────────────┘
Step 6: Methodology Detection
┌─────────────────────────────────────┐
│ Check for different payment methods │ ──► Non-Standard
│ "medicare rate", "billed charge" │ (85% confidence)
└─────────────────────────────────────┘
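The six steps can be read as an ordered cascade in which the first rule that fires decides the label. The sketch below follows that order under several assumptions: `difflib.SequenceMatcher` stands in for RapidFuzz, the semantic score is supplied by a caller-provided function instead of SBERT, the overlapping semantic bands are read as "at or above 0.60 is Standard, 0.50 up to 0.60 is Ambiguous", and the fall-through label is a guess.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

FUZZY_THRESHOLD = 70
SBERT_THRESHOLD = 0.60
SBERT_AMBIG_LOW = 0.50
EXCEPTION_TOKENS = ("except", "unless", "provided that", "subject to")
METHODOLOGY_TOKENS = ("medicare rate", "billed charge")


def _canon(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def _placeholders(text: str) -> str:
    """Replace percentages (including the 'xx%' marker) with <PCT>."""
    return re.sub(r"\b(?:\d{1,3}|xx)\s*%", "<PCT>", text)


def classify(
    clause: str,
    template: str,
    semantic: Optional[Callable[[str, str], float]] = None,
) -> Tuple[str, float]:
    c, t = _canon(clause), _canon(template)
    # Step 1: conditional language marks the clause as Non-Standard.
    if any(re.search(rf"\b{re.escape(tok)}\b", c) for tok in EXCEPTION_TOKENS):
        return "Non-Standard", 0.90
    # Step 2: exact match after case and whitespace normalization.
    if c == t:
        return "Standard", 0.99
    # Step 3: match after placeholder substitution.
    if _placeholders(c) == _placeholders(t):
        return "Standard", 0.95
    # Step 4: fuzzy match (difflib used here in place of RapidFuzz).
    if SequenceMatcher(None, c, t).ratio() * 100 >= FUZZY_THRESHOLD:
        return "Standard", 0.90
    # Step 5: semantic similarity, if a scoring function is provided.
    if semantic is not None:
        score = semantic(clause, template)
        if score >= SBERT_THRESHOLD:
            return "Standard", 0.85
        if score >= SBERT_AMBIG_LOW:
            return "Ambiguous", score  # confidence value is an assumption
    # Step 6: a different payment methodology marks Non-Standard.
    if any(tok in c for tok in METHODOLOGY_TOKENS):
        return "Non-Standard", 0.85
    return "Ambiguous", 0.0  # fall-through behavior is an assumption


print(classify("Provider accepts 95% of eligible charges",
               "Provider accepts XX% of eligible charges"))
```

Running the exception check first means a clause that otherwise matches the template word for word is still flagged when it carries a condition such as "unless".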
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Browser │───▶│ FastAPI │───▶│ Celery │
│ Upload │ │ Backend │ │ Worker │
└─────────────┘ └─────────────┘ └──────┬──────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SQLite │◄───│ Results │◄───│ spaCy+SBERT │
│ Database │ │ Assembly │ │ Classifier │
└─────────────┘ └─────────────┘ └─────────────┘
upload/contracts-{state}/
├── {uuid}.pdf ◄─── Original upload
├── {uuid}_clauses.json ◄─── Stage 1 output
└── {uuid}_results.json ◄─── Stage 2 output
Database Flow:
Contract Table ──► ProcessingLog ──► ContractClause ──► ClauseFeedback
│ │ │ │
Status Audit Trail Classifications Human Feedback
Stage 1 Progress:
0% ──► 20% ──► 60% ──► 70% ──► 90% ──► 100%
│ │ │ │ │ │
│ │ │ │ │ └─ Data Saved
│ │ │ │ └─ JSON Export
│ │ │ └─ Clause Extraction
│ │ └─ Text Cleaning
│ └─ PDF Loading
└─ Task Started
Stage 2 Progress:
0% ──► 20% ──► 40% ──► 60% ──► 80% ──► 100%
│ │ │ │ │ │
│ │ │ │ │ └─ Results Saved
│ │ │ │ └─ Database Storage
│ │ │ └─ Classification
│ │ └─ Template Loading
│ └─ Data Loading
└─ Task Started
Error Types:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ PDF Corrupt │ │ NLP Model │ │ Database │
│ File Issues │ │ Load Failed │ │ Connection │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Retry with │ │ Fallback to │ │ Retry with │
│ Alternative │ │ Basic NLP │ │ Backoff │
│ Extraction │ │ Processing │ │ Strategy │
└─────────────┘ └─────────────┘ └─────────────┘
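The "retry with backoff" strategy for database connection errors can be sketched as a small helper. The attempt count, base delay, and doubling schedule are assumptions chosen for illustration; the project's actual retry settings are not stated here.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on failure with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

A caller would wrap the fragile step, for example `retry_with_backoff(save_results, retry_on=(ConnectionError,))`, where `save_results` is a hypothetical function that writes to the database.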
- Used actual template PDFs as ground truth
- Tested template clauses against themselves to establish baseline accuracy
- Template clauses are pre-extracted and hardcoded in worker/classification_parameters.py
- This eliminates content variation and focuses purely on algorithm performance
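A baseline self-test of this kind can be sketched as follows: each template clause is matched against the whole template set, and it counts as a hit when its best match is itself and the score clears the fuzzy threshold. The template texts below are made up for the example, and `difflib` is used as a stand-in for RapidFuzz.

```python
from difflib import SequenceMatcher
from typing import Dict


def similarity(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale (difflib stand-in)."""
    return SequenceMatcher(None, a, b).ratio() * 100


def baseline_accuracy(templates: Dict[str, str], threshold: float = 70.0) -> float:
    """Fraction of template clauses whose best match is their own attribute."""
    if not templates:
        return 0.0
    hits = 0
    for name, text in templates.items():
        best = max(templates, key=lambda other: similarity(text, templates[other]))
        if best == name and similarity(text, templates[best]) >= threshold:
            hits += 1
    return hits / len(templates)


# Placeholder texts invented for this sketch, not the real template clauses.
sample_templates = {
    "Medicaid Timely Filing": "Claims must be submitted within 120 days of service.",
    "Medicare Fee Schedule": "Payment follows the applicable Medicare fee schedule.",
}
print(baseline_accuracy(sample_templates))
```

A result below 1.0 on the templates themselves would point to a problem in the matching logic and not in the contract text.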
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ User Contract PDFs │───▶│ PDF Text Extraction │───▶│ Extracted Clauses │
│ (Upload via UI) │ │ • pdfplumber (1st) │ │ (Dynamic parsing) │
└─────────────────────┘ │ • PyPDF2 (fallback)│ └─────────────────────┘
│ • Robust extraction │ │
└─────────────────────┘ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Static Templates │ │ Classification │
│ • TN_TEMPLATE_ │────────────────────────────▶│ • Fuzzy matching │
│ CLAUSES (5 attrs) │ │ • Semantic analysis │
│ • WA_TEMPLATE_ │ │ • Multi-step logic │
│ CLAUSES (5 attrs) │ │ • Confidence scores │
│ • Hardcoded in code │ └─────────────────────┘
└─────────────────────┘ │
▼
┌─────────────────────┐
│ Dual Storage │
│ • Database records │
│ • JSON backup files │
│ • Audit trails │
└─────────────────────┘
┌─────────────────────┬─────────┬─────────┬─────────────────────────┐
│ Attribute │ Before │ After │ Improvement Method │
├─────────────────────┼─────────┼─────────┼─────────────────────────┤
│ WA Medicare Timely │ 0% │ 40% │ Added XX-day language │
│ WA Medicaid Fee │ 11% │ 21% │ Structure alignment │
│ WA No Steerage/SOC │ 4% │ 9% │ Added legal phrases │
└─────────────────────┴─────────┴─────────┴─────────────────────────┘
Based on systematic testing and validation results, the worker/classification_parameters.py file has been continuously updated to improve classification accuracy:
Key Improvements Made:
- Threshold Tuning: Adjusted fuzzy matching threshold to 70% and semantic similarity to 0.60 based on real-world performance
- Template Clause Refinement: Updated actual template clauses extracted from TN/WA PDFs for better matching
- Exception Token Expansion: Added comprehensive list of conditional clause indicators
- Placeholder Normalization: Enhanced pattern matching for variables like percentages, dates, and organization names
- State-Specific Optimization: Separate template sets for TN and WA with state-specific legal language
Configuration Structure:
# worker/classification_parameters.py - Centralized Configuration
# Classification thresholds (optimized through testing)
FUZZY_THRESHOLD = 70 # RapidFuzz similarity threshold
SBERT_THRESHOLD = 0.60 # SBERT semantic similarity threshold
SBERT_AMBIG_LOW = 0.50 # Lower bound for ambiguous classification
SBERT_AMBIG_HIGH = 0.70 # Upper bound for ambiguous classification
# Template clauses (5 attributes per state)
TN_TEMPLATE_CLAUSES = {
"Medicaid Timely Filing": "...",
"Medicare Timely Filing": "...",
"Medicaid Fee Schedule": "...",
"Medicare Fee Schedule": "...",
"No Steerage/SOC": "..."
}
WA_TEMPLATE_CLAUSES = {
# Same 5 attributes with state-specific content
}
# Exception detection
EXCEPTION_TOKENS = ['except', 'unless', 'provided that', ...]
# Text normalization patterns
PLACEHOLDER_MAP = {
r"\b\d{1,3}\s*%": "<PCT>",  # no trailing \b: "%" is not a word character, so \b would block "95% of"
r"\b(Fee\s+Schedule|Compensation\s+Schedule)\b": "<FEE_SCHEDULE>",
# ... 20+ normalization patterns
}
Performance Impact:
- WA Medicare Timely Filing: accuracy improved from 0% to 40%
- WA Medicaid Fee Schedule: accuracy improved from 11% to 21%
- WA No Steerage/SOC: accuracy improved from 4% to 9%
- Overall System: Consistent 85%+ confidence scores on standard clauses
This centralized parameter approach allows rapid iteration and testing of classification improvements by editing a single configuration file, without changes to the classification logic.