A comprehensive Natural Language Processing (NLP) pipeline for Hinglish (Hindi-English code-mixed) text analysis, featuring text preprocessing, language detection, sentiment analysis, and a production-ready REST API.
Text Preprocessing:
- URL and mention removal
- Hashtag processing
- Emoji preservation
- Special character handling
- Smart tokenization
Language Detection:
- Token-level language identification (Hindi/English)
- Named entity recognition
- Code-mixing detection
- Dominant language identification
- Statistical analysis of language distribution
Sentiment Analysis:
- Advanced transformer-based sentiment analysis (DistilBERT)
- Fallback rule-based sentiment detection
- Binary sentiment labels (positive/negative) with per-class confidence scores
- Support for Hinglish text
- Batch processing capability
REST API:
- FastAPI-based REST API with auto-generated documentation
- 6 production-ready endpoints
- Request/response validation with Pydantic
- CORS support for web applications
- Comprehensive error handling
- Swagger UI & ReDoc documentation
Contents:
- Installation
- Quick Start
- API Documentation
- Usage Examples
- Project Structure
- Testing
- Model Information
- Development
Prerequisites:
- Python 3.8+
- pip package manager
- Virtual environment (recommended)
- Clone the repository:

```bash
git clone <repository-url>
cd Code-mixed-NLP
```

- Create and activate a virtual environment:

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# Linux/Mac
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Required Packages:

- `fastapi>=0.104.1` - Web framework
- `uvicorn>=0.24.0` - ASGI server
- `pydantic>=2.5.0` - Data validation
- `transformers>=4.35.0` - NLP models
- `torch>=2.0.0` - PyTorch
- `pytest>=7.4.3` - Testing framework
- `requests>=2.31.0` - HTTP library
Start the server:

Option A: Using the batch file (Windows)

```bash
start_server.bat
```

Option B: Direct Python command

```bash
python app/main.py
```

The server will start on http://localhost:8000.
Open your browser and navigate to:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
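To confirm from the command line that the server is running, you can also query the health endpoint (documented under the API section below):

```bash
curl http://localhost:8000/health
```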
Using curl:

```bash
curl -X POST "http://localhost:8000/api/v1/analyze" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Yeh movie bahut accha hai! I loved it!\"}"
```

Using Python:

```python
import requests

response = requests.post(
    "http://localhost:8000/api/v1/analyze",
    json={"text": "Yeh movie bahut accha hai! I loved it!"}
)
print(response.json())
```

Base URL: http://localhost:8000
GET /health

Response:

```json
{
  "status": "healthy",
  "version": "1.0.0",
  "modules": {
    "preprocessing": true,
    "language_detection": true,
    "sentiment_analysis": true
  }
}
```

POST /api/v1/preprocess

Request:

```json
{
  "text": "Check out https://example.com! 😊 #amazing"
}
```

Response:

```json
{
  "original": "Check out https://example.com! 😊 #amazing",
  "cleaned": "check out 😊 amazing",
  "tokens": ["check", "out", "😊", "amazing"],
  "token_count": 4
}
```

POST /api/v1/detect-language

Request:

```json
{
  "text": "Main bahut happy hoon today"
}
```

Response:

```json
{
  "tokens": ["main", "bahut", "happy", "hoon", "today"],
  "labels": ["lang2", "lang2", "lang1", "lang2", "lang1"],
  "statistics": {
    "lang1": {"count": 2, "percentage": 40.0},
    "lang2": {"count": 3, "percentage": 60.0}
  },
  "is_code_mixed": true,
  "dominant_language": "lang2"
}
```

Language Labels:
- `lang1` - English
- `lang2` - Hindi/Romanized Hindi
- `ne` - Named Entity
- `other` - Punctuation, numbers, special characters
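For illustration, here is a minimal sketch of how the `statistics`, `is_code_mixed`, and `dominant_language` fields can be derived from token labels; the project's actual logic lives in `app/language_detection/detector.py` and may differ:

```python
from collections import Counter

def label_statistics(labels):
    """Count lang1/lang2 tokens and express them in the percentage form shown above."""
    lang_labels = [l for l in labels if l in ("lang1", "lang2")]  # ignore ne/other
    counts = Counter(lang_labels)
    total = len(lang_labels) or 1
    return {
        label: {"count": n, "percentage": round(100 * n / total, 1)}
        for label, n in counts.items()
    }

stats = label_statistics(["lang2", "lang2", "lang1", "lang2", "lang1"])
# {'lang2': {'count': 3, 'percentage': 60.0}, 'lang1': {'count': 2, 'percentage': 40.0}}
is_code_mixed = len(stats) > 1                           # both languages present
dominant = max(stats, key=lambda k: stats[k]["count"])   # 'lang2' here; ties break arbitrarily
```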
POST /api/v1/analyze-sentiment

Request:

```json
{
  "text": "This is absolutely amazing!"
}
```

Response:

```json
{
  "label": "positive",
  "confidence": 0.9998,
  "scores": {
    "positive": 0.9998,
    "negative": 0.0002
  }
}
```

POST /api/v1/analyze

Request:

```json
{
  "text": "Yeh movie bahut accha hai! I loved it!"
}
```

Response:

```json
{
  "original_text": "Yeh movie bahut accha hai! I loved it!",
  "cleaned_text": "yeh movie bahut accha hai i loved it",
  "tokens": ["yeh", "movie", "bahut", "accha", "hai", "i", "loved", "it"],
  "token_count": 8,
  "language_detection": {
    "labels": ["lang2", "lang1", "lang2", "lang2", "lang2", "lang1", "lang1", "lang1"],
    "statistics": {
      "lang1": {"count": 4, "percentage": 50.0},
      "lang2": {"count": 4, "percentage": 50.0}
    },
    "is_code_mixed": true,
    "dominant_language": "lang1"
  },
  "sentiment": {
    "label": "positive",
    "confidence": 0.9998,
    "scores": {
      "positive": 0.9998,
      "negative": 0.0002
    }
  }
}
```

POST /api/v1/analyze/batch

Request:

```json
{
  "texts": [
    "This is amazing!",
    "Yeh bahut accha hai",
    "This is terrible"
  ]
}
```

Response:

```json
{
  "count": 3,
  "results": [
    {
      "original_text": "This is amazing!",
      "cleaned_text": "this is amazing",
      ...
    },
    ...
  ]
}
```

Validation Rules:
- Text length: 1-5000 characters
- Batch size: 1-100 texts
- Required fields: All fields marked as required must be provided
- Empty text: Not allowed (validation error)
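These constraints correspond to standard Pydantic field validation. A minimal sketch of request models that would enforce them (class and field names here are illustrative, not necessarily the project's actual models in `app/main.py`):

```python
from pydantic import BaseModel, Field

class TextRequest(BaseModel):
    # Required, non-empty, at most 5000 characters
    text: str = Field(..., min_length=1, max_length=5000)

class BatchRequest(BaseModel):
    # Between 1 and 100 texts per request
    texts: list[str] = Field(..., min_length=1, max_length=100)
```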
Error Responses:

400 Bad Request:

```json
{
  "detail": "Invalid input data"
}
```

422 Validation Error:

```json
{
  "detail": [
    {
      "loc": ["body", "text"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}
```

500 Internal Server Error:

```json
{
  "detail": "Processing failed: [error message]"
}
```

Example 1: Analyzing a Social Media Post

```python
import requests

post = "Aaj ka match dekha? Virat ne mara 6! What a shot! 🏏"

response = requests.post(
    "http://localhost:8000/api/v1/analyze",
    json={"text": post}
)
result = response.json()

print(f"Sentiment: {result['sentiment']['label']}")
print(f"Dominant language: {result['language_detection']['dominant_language']}")
print(f"Code-mixed: {result['language_detection']['is_code_mixed']}")
```

Example 2: Batch Processing Product Reviews

```python
reviews = [
"Yeh product bahut accha hai!",
"Delivery was terrible and late",
"Amazing quality, highly recommend",
"Waste of money, bilkul bakwas"
]
response = requests.post(
"http://localhost:8000/api/v1/analyze/batch",
json={"texts": reviews}
)
for i, result in enumerate(response.json()['results']):
print(f"Review {i+1}: {result['sentiment']['label']}")text = "Main bahut khush hoon because I got promoted!"
# First detect language
lang_response = requests.post(
    "http://localhost:8000/api/v1/detect-language",
    json={"text": text}
)
lang_data = lang_response.json()

if lang_data['is_code_mixed']:
    print("Code-mixed text detected!")
    print(f"Dominant: {lang_data['dominant_language']}")

# Then analyze sentiment
sent_response = requests.post(
    "http://localhost:8000/api/v1/analyze-sentiment",
    json={"text": text}
)
print(f"Sentiment: {sent_response.json()['label']}")
```

Project Structure:

```text
Code-mixed-NLP/
├── app/
│   ├── __init__.py
│   ├── main.py                        # FastAPI application
│   ├── pipeline.py                    # Integrated NLP pipeline
│   │
│   ├── preprocessing/
│   │   ├── __init__.py
│   │   └── cleaner.py                 # Text preprocessing module
│   │
│   ├── language_detection/
│   │   ├── __init__.py
│   │   └── detector.py                # Language detection module
│   │
│   ├── sentiment_analysis/
│   │   ├── __init__.py
│   │   └── analyzer.py                # Sentiment analysis module
│   │
│   └── tests/
│       ├── __init__.py
│       ├── test_preprocessing.py      # 19 tests
│       ├── test_language_detection.py # 23 tests
│       └── test_sentiment_analysis.py # 30 tests
│
├── test_api_integration.py            # 21 API integration tests
├── test_api.py                        # API testing client
├── start_server.bat                   # Server startup script
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
Unit Tests (72):

```powershell
$env:PYTHONPATH="."; pytest app/tests/ -v
```

Integration Tests (21):

```powershell
pytest test_api_integration.py -v
```

All Tests:

```powershell
$env:PYTHONPATH="."; pytest app/tests/ test_api_integration.py -v
```

Test coverage:

- ✅ 72 Unit Tests
  - 19 Preprocessing tests
  - 23 Language Detection tests
  - 30 Sentiment Analysis tests
- ✅ 21 Integration Tests
  - Health & info endpoints
  - Preprocessing endpoint
  - Language detection endpoint
  - Sentiment analysis endpoint
  - Full analysis endpoint
  - Batch processing endpoint
  - Error handling
Run the test client:

```bash
python test_api.py
```

This will test all endpoints and display results.
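For reference, integration tests of this kind typically use FastAPI's `TestClient`. A minimal sketch, assuming `app/main.py` exposes the application object as `app`:

```python
from fastapi.testclient import TestClient

from app.main import app  # assumption: the FastAPI instance is named `app`

client = TestClient(app)

def test_analyze_code_mixed_text():
    resp = client.post("/api/v1/analyze", json={"text": "Yeh movie bahut accha hai!"})
    assert resp.status_code == 200
    body = resp.json()
    assert body["language_detection"]["is_code_mixed"] is True
    assert body["sentiment"]["label"] in {"positive", "negative"}

def test_empty_text_is_rejected():
    # Empty text violates the validation rules above
    resp = client.post("/api/v1/analyze", json={"text": ""})
    assert resp.status_code == 422
```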
Sentiment Analysis Model: `distilbert-base-uncased-finetuned-sst-2-english`
- Architecture: DistilBERT (Distilled BERT)
- Size: 268 MB
- Training: Fine-tuned on SST-2 sentiment dataset
- Performance:
- High accuracy on English text
- Good performance on Hinglish text
- 99%+ confidence on clear sentiments
- Labels: positive, negative
Fallback: Rule-based sentiment analysis using:
- Positive word list (excellent, amazing, wonderful, etc.)
- Negative word list (terrible, awful, horrible, etc.)
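A minimal sketch of how a word-list fallback of this kind works (illustrative only; the project's implementation lives in `app/sentiment_analysis/analyzer.py`):

```python
# Toy word lists; the real lists are larger and include Hinglish terms.
POSITIVE = {"excellent", "amazing", "wonderful", "good", "accha", "loved"}
NEGATIVE = {"terrible", "awful", "horrible", "bad", "bakwas"}

def rule_based_sentiment(tokens):
    """Fallback scorer: count sentiment-bearing words, majority label wins."""
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return {"label": "positive", "confidence": 0.5}  # no signal found
    label = "positive" if pos >= neg else "negative"
    return {"label": label, "confidence": max(pos, neg) / (pos + neg)}
```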
Language Detection Approach: Rule-based detection using:
- Hindi word dictionary (200+ common words)
- English stopwords (NLTK corpus)
- Devanagari script detection
- Named entity recognition patterns
Features:
- Token-level language labeling
- Code-mixing detection
- Dominant language identification
- Statistical distribution analysis
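A condensed sketch of token-level labeling along these lines (illustrative; see `app/language_detection/detector.py` for the actual rules):

```python
# Toy dictionaries; the real Hindi list has 200+ words and the English
# stopwords come from the NLTK corpus.
HINDI_WORDS = {"main", "bahut", "hoon", "yeh", "hai", "accha"}
ENGLISH_STOPWORDS = {"the", "is", "a", "i", "it", "and"}

def label_token(token: str) -> str:
    # Devanagari script is unambiguously Hindi
    if any("\u0900" <= ch <= "\u097F" for ch in token):
        return "lang2"
    # Punctuation, numbers, emoji, and other non-alphabetic tokens
    if not any(ch.isalpha() for ch in token):
        return "other"
    low = token.lower()
    if low in HINDI_WORDS:
        return "lang2"
    if low in ENGLISH_STOPWORDS:
        return "lang1"
    if token.istitle():  # crude capitalization-based named-entity pattern
        return "ne"
    return "lang1"  # default: treat unknown Latin-script tokens as English
```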
To add a new module (see the sketch after this list):

- Create the module in the appropriate directory
- Write unit tests in `app/tests/`
- Add it to the pipeline in `pipeline.py`
- Create an API endpoint in `main.py`
- Add integration tests in `test_api_integration.py`
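As a sketch of the endpoint step, a new route usually just delegates to the pipeline. Here `extract_keywords` is a hypothetical example, not an existing method, and `app`/`pipeline` are assumed to be the module-level objects in `app/main.py`:

```python
# In app/main.py
from fastapi import HTTPException
from pydantic import BaseModel

class KeywordRequest(BaseModel):
    text: str

@app.post("/api/v1/extract-keywords")  # `app` is the existing FastAPI instance
def extract_keywords(request: KeywordRequest):
    try:
        # Hypothetical pipeline method added in pipeline.py
        return pipeline.extract_keywords(request.text)
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"Processing failed: {exc}")
```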
Code Style:
- Follow PEP 8 guidelines
- Use type hints where applicable
- Document functions with docstrings
- Write comprehensive tests
Run the development server with auto-reload:

```bash
# Enable auto-reload
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Testing via Swagger UI:

- Start the server
- Open http://localhost:8000/docs
- Click "Try it out" on any endpoint
- Enter request data and click "Execute"
- View response
Model Loading:
- First request: ~60 seconds (DistilBERT model download & load)
- Subsequent requests: instant (model cached in memory)

Response Times:
- Single text: ~100-300 ms
- Batch (10 texts): ~1-2 seconds
- Large text (1000 words): ~500 ms

Resource Usage:
- Memory: ~1.5 GB (with DistilBERT loaded)
- CPU: moderate (no GPU required)
- Disk: ~300 MB (model cache)
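Since the model loads lazily on first use, one common trick is to send a throwaway request right after startup so real traffic never pays the ~60-second load cost. A minimal sketch:

```python
import requests

# One-off warm-up call: forces the DistilBERT download/load up front so
# subsequent requests hit the in-memory cache (~100-300 ms).
requests.post(
    "http://localhost:8000/api/v1/analyze-sentiment",
    json={"text": "warm-up"},
    timeout=120,  # generous: the first call may download the model
)
```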
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run the test suite
- Submit a pull request
MIT License - see LICENSE file for details
- Yadnesh Teli - Initial work
- Hugging Face for transformer models
- FastAPI for the excellent web framework
- NLTK for NLP utilities
For issues, questions, or suggestions:
- Open an issue on GitHub
- Contact: [your-email@example.com]
Made with ❤️ for the Hinglish NLP community