# WhyFix

WhyFix is a system that explains lexical errors made by second language (L2) learners. It combines Retrieval-Augmented Generation (RAG) with large language models to deliver contextually appropriate feedback on academic writing.
## Table of Contents

- [Overview](#overview)
- [System Architecture](#system-architecture)
- [Installation](#installation)
- [Usage](#usage)
- [Data Sources](#data-sources)
- [Workflow](#workflow)
- [Configuration](#configuration)
- [Evaluation](#evaluation)
- [Contributing](#contributing)
## Overview

WhyFix addresses the challenge of providing meaningful explanations for lexical errors in L2 academic writing. The system combines:
- Retrieval-Augmented Generation (RAG) for context-aware explanations
- Multi-stage processing pipelines for comprehensive error analysis
- Batch processing capabilities for large-scale evaluation
- Interactive web interface for real-time feedback
- Comprehensive evaluation metrics for system assessment
### Key Features

- **Error Detection & Correction**: identifies and corrects lexical errors in academic writing
- **Contextual Explanations**: provides detailed explanations for why corrections are necessary
- **Multi-source RAG**: integrates multiple knowledge sources for enhanced accuracy
- **Scalable Processing**: supports batch processing for large datasets
- **Interactive Interface**: Streamlit-based web application for user interaction
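Conceptually, each request pairs a learner sentence and its correction with retrieved evidence, then synthesizes an explanation. The sketch below illustrates only that shape; all names are hypothetical placeholders, not the project's actual API (the real entry points live under `process/` and `models/`):

```python
# Illustrative sketch of the explanation pipeline's input/output shape.
from dataclasses import dataclass

@dataclass
class LexicalError:
    sentence: str    # learner sentence containing the error
    error_span: str  # the erroneous word or phrase
    correction: str  # the corrected word or phrase

def retrieve_context(error: LexicalError) -> list[str]:
    # Placeholder for the RAG retrieval step over dictionary,
    # collocation, and academic-writing sources.
    return ["'progress' is uncountable, so it takes no indefinite article."]

def explain(error: LexicalError) -> str:
    context = retrieve_context(error)
    # Placeholder for the LLM generation step.
    return f"Use '{error.correction}' instead of '{error.error_span}': {context[0]}"

print(explain(LexicalError(
    sentence="The experiment made a big progress.",
    error_span="a big progress",
    correction="great progress",
)))
```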
## System Architecture

```
WhyFix/
├── models/          # ML models and batch processing
├── process/         # Core processing pipeline
│   ├── input/       # Input data handling
│   ├── rag/         # RAG system implementations
│   ├── retrieval/   # Vector retrieval mechanisms
│   └── utils/       # Processing utilities
├── preprocess/      # Data preprocessing
├── postprocess/     # Result analysis and metrics
├── streamlit/       # Web interface
├── data/            # Training and evaluation data
└── scripts/         # Automation scripts
```
## Installation

### Prerequisites

- Python 3.10+
- Git
- Virtual environment (recommended)
### Setup

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd WhyFix
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv env
   source env/bin/activate  # On Windows: env\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your API keys and configurations
   ```
## Usage

1. Run the sampling workflow:

   ```bash
   ./run_sampling.sh A small
   ```

2. Execute batch processing:

   ```bash
   ./run_batch_jobs.sh A small $(date +"%m%d-%H%M")
   ```

3. Launch the web interface:

   ```bash
   cd streamlit/src/app
   streamlit run main.py
   ```

Individual pipeline steps can also be run directly, e.g. collocation retrieval:

```bash
python -m process.retrieval.retrieve_3_collocation gpt true A small
```
## Data Sources

### Knowledge Sources

- **Academic Writing Guidelines**
  - English Academic Writing for Students and Researchers
  - Cultural Issues in Academic Writing
  - Features of Academic Writing
- **Dictionary Resources**
  - Cambridge Dictionary (definitions, CEFR levels, examples)
  - Macmillan English Dictionary
  - Academic Keyword List (AKL)
- **Collocation Resources**
  - Collocation databases
### Datasets

- **CLC FCE Dataset**
  - 4,853 exam scripts from the Cambridge FCE exam
  - Multiple error types: Replace, False Friend, Collocation, etc.
  - Demographic and proficiency information
- **Longman Dictionary of Common Errors**
  - 1,342 sentence pairs
## Workflow

### 1. Preprocessing

- Text normalization and tokenization
- Error annotation parsing
- Context extraction and formatting
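A minimal sketch of what normalization and tokenization might look like; this simplified stand-in is not the code in `preprocess/`:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, collapse whitespace, and tokenize a learner sentence."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    # Split off punctuation so tokens align with error annotations.
    return re.findall(r"\w+|[^\w\s]", text)

print(normalize("The  experiment made a big progress ."))
# ['the', 'experiment', 'made', 'a', 'big', 'progress', '.']
```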
### 2. Indexing

- Embedding generation for knowledge sources
- Vector database indexing
- Similarity threshold optimization
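As a sketch of the embedding step, assuming the OpenAI embeddings API and the `text-embedding-3-large` model named in the configuration below (the project's actual indexing code lives under `process/`):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed knowledge-source passages for vector indexing."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in resp.data])

passages = [
    "In academic writing, 'progress' is an uncountable noun.",
    "'make progress' is a conventional collocation.",
]
index = embed(passages)  # one row per passage, ready for similarity search
```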
### 3. Retrieval

- Query formulation from error context
- Multi-source retrieval (academic writing, collocations, dictionary)
- Context ranking and selection
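Context ranking typically reduces to a nearest-neighbour search over the index. A minimal cosine-similarity sketch, illustrative rather than the project's retrieval code (the `k=5` default mirrors `top_k: 5` in the configuration below):

```python
import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k index rows most similar to the query vector."""
    sims = index @ query_vec / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:k]  # descending by cosine similarity
```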
### 4. Generation

- Error detection and classification
- Candidate correction generation
- Explanation synthesis
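Explanation synthesis assembles the error and the retrieved evidence into a prompt for the LLM. A hedged sketch using the `gpt-4.1-nano` model named in the configuration below; the prompt wording and function name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def synthesize_explanation(sentence: str, correction: str, contexts: list[str]) -> str:
    """Generate a learner-facing explanation from retrieved evidence."""
    prompt = (
        "Explain why the correction below improves the sentence, "
        "using the reference material.\n\n"
        f"Sentence: {sentence}\n"
        f"Correction: {correction}\n"
        "References:\n" + "\n".join(f"- {c}" for c in contexts)
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```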
### 5. Evaluation

- Automatic evaluation metrics
- Human evaluation protocols
- Performance analysis
## Configuration

Model, retrieval, and data settings (e.g. in `models/experiment.yaml`):

```yaml
model_settings:
  embedding_model: "text-embedding-3-large"
  llm_model: "gpt-4.1-nano"
  temperature: 0.0
  max_tokens: 40000

retrieval_settings:
  top_k: 5

data_settings:
  sample_type: "longman"  # or "fce"
  embedding_size: "text-embedding-3-small"
```
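Loading this configuration is straightforward with PyYAML; a minimal sketch (the file path is an assumption based on the project layout shown below):

```python
import yaml

# Path assumed from the project layout; adjust as needed.
with open("models/experiment.yaml") as f:
    config = yaml.safe_load(f)

print(config["model_settings"]["llm_model"])   # e.g. "gpt-4.1-nano"
print(config["retrieval_settings"]["top_k"])   # e.g. 5
```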
Create a `.env` file with:

```bash
OPENAI_API_KEY=your_openai_api_key

# Elastic Search
ES_USER=your_es_user
ES_PASSWORD=your_es_password
ES_ENDPOINT="localhost:9200"
ES_URL=your_es_url
ES_API_KEY=your_es_api_key
ES_INDEX_NAME="collocation"
```

## Evaluation

### Automatic Metrics

- **BLEU Score**: Translation quality assessment
- **ROUGE Score**: Summary quality evaluation
- **BERTScore**: Semantic similarity measurement
- **Exact Match**: Precision of corrections
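These metrics map onto standard libraries; a sketch assuming `nltk`, `rouge-score`, and `bert-score` (the project's own implementation lives in `postprocess/`):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The experiment made great progress."
hypothesis = "The experiment achieved great progress."

# BLEU over tokenized sentences (smoothing avoids zero n-gram counts).
bleu = sentence_bleu(
    [reference.split()], hypothesis.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L for longest-common-subsequence overlap.
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)

# BERTScore for semantic similarity; exact match is a plain comparison.
_, _, f1 = bert_score([hypothesis], [reference], lang="en")
exact = hypothesis == reference
```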
Run the automatic evaluation over a results directory:

```bash
python -m postprocess.automatic_metrics --input results/ --output evaluation/
```

## Project Structure

```
WhyFix/
├── README.md              # This file
├── .gitignore             # Git ignore patterns
├── requirements.txt       # Python dependencies
├── run_batch_jobs.sh      # Batch processing script
├── run_sampling.sh        # Sampling workflow script
├── models/                # ML models and batch processing
│   ├── __init__.py
│   ├── README.md
│   ├── batch_api.py
│   ├── experiment.yaml
│   ├── llm_setup.py
│   └── methods_combine_step*.py
├── process/               # Core processing pipeline
│   ├── __init__.py
│   ├── README.md
│   ├── input/
│   ├── rag/
│   ├── retrieval/
│   └── utils/
├── preprocess/            # Data preprocessing
├── postprocess/           # Result analysis
├── streamlit/             # Web interface
└── data/                  # Data storage
```
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is part of academic research. Please refer to the institution's guidelines for usage and distribution.

---

*Last updated: July 17, 2025*