# WhyFix

WhyFix is a system that explains lexical errors made by second language (L2) learners. It combines Retrieval-Augmented Generation (RAG) with large language models to deliver contextually appropriate feedback on academic writing.
## Table of Contents

- [Overview](#overview)
- [System Architecture](#system-architecture)
- [Installation](#installation)
- [Usage](#usage)
- [Data Sources](#data-sources)
- [Workflow](#workflow)
- [Configuration](#configuration)
- [Evaluation](#evaluation)
- [Contributing](#contributing)
## Overview

WhyFix addresses the challenge of providing meaningful explanations for lexical errors in L2 academic writing. The system combines:
- Retrieval-Augmented Generation (RAG) for context-aware explanations
- Multi-stage processing pipelines for comprehensive error analysis
- Batch processing capabilities for large-scale evaluation
- Interactive web interface for real-time feedback
- Comprehensive evaluation metrics for system assessment
### Key Features

- **Error Detection & Correction**: identifies and corrects lexical errors in academic writing
- **Contextual Explanations**: provides detailed explanations for why corrections are necessary
- **Multi-source RAG**: integrates multiple knowledge sources for enhanced accuracy
- **Scalable Processing**: supports batch processing for large datasets
- **Interactive Interface**: Streamlit-based web application for user interaction
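Conceptually, each request pairs a learner sentence and its correction with retrieved evidence, then synthesizes an explanation. The sketch below illustrates only that shape; all names are hypothetical placeholders, not the project's actual API (the real entry points live under `process/` and `models/`):

```python
# Illustrative sketch of the explanation pipeline's input/output shape.
from dataclasses import dataclass

@dataclass
class LexicalError:
    sentence: str    # learner sentence containing the error
    error_span: str  # the erroneous word or phrase
    correction: str  # the corrected word or phrase

def retrieve_context(error: LexicalError) -> list[str]:
    # Placeholder for the RAG retrieval step over dictionary,
    # collocation, and academic-writing sources.
    return ["'progress' is uncountable, so it takes no indefinite article."]

def explain(error: LexicalError) -> str:
    context = retrieve_context(error)
    # Placeholder for the LLM generation step.
    return f"Use '{error.correction}' instead of '{error.error_span}': {context[0]}"

print(explain(LexicalError(
    sentence="The experiment made a big progress.",
    error_span="a big progress",
    correction="great progress",
)))
```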
## System Architecture

```
WhyFix/
├── models/          # ML models and batch processing
├── process/         # Core processing pipeline
│   ├── input/       # Input data handling
│   ├── rag/         # RAG system implementations
│   ├── retrieval/   # Vector retrieval mechanisms
│   └── utils/       # Processing utilities
├── preprocess/      # Data preprocessing
├── postprocess/     # Result analysis and metrics
├── streamlit/       # Web interface
├── data/            # Training and evaluation data
└── scripts/         # Automation scripts
```
## Installation

### Prerequisites

- Python 3.10+
- Git
- Virtual environment (recommended)
### Setup

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd WhyFix
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv env
   source env/bin/activate  # On Windows: env\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your API keys and configurations
   ```
## Usage

1. Run the sampling workflow:

   ```bash
   ./run_sampling.sh A small
   ```

2. Execute batch processing:

   ```bash
   ./run_batch_jobs.sh A small $(date +"%m%d-%H%M")
   ```

3. Launch the web interface:

   ```bash
   cd streamlit/src/app
   streamlit run main.py
   ```

Individual pipeline steps can also be run directly, e.g. collocation retrieval:

```bash
python -m process.retrieval.retrieve_3_collocation gpt true A small
```
## Data Sources

### Knowledge Sources

- **Academic Writing Guidelines**
  - English Academic Writing for Students and Researchers
  - Cultural Issues in Academic Writing
  - Features of Academic Writing
- **Dictionary Resources**
  - Cambridge Dictionary (definitions, CEFR levels, examples)
  - Macmillan English Dictionary
  - Academic Keyword List (AKL)
- **Collocation Resources**
  - Collocation databases
### Datasets

- **CLC FCE Dataset**
  - 4,853 exam scripts from the Cambridge FCE exam
  - Multiple error types: Replace, False Friend, Collocation, etc.
  - Demographic and proficiency information
- **Longman Dictionary of Common Errors**
  - 1,342 sentence pairs
## Workflow

### 1. Preprocessing

- Text normalization and tokenization
- Error annotation parsing
- Context extraction and formatting
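A minimal sketch of what normalization and tokenization might look like; this simplified stand-in is not the code in `preprocess/`:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, collapse whitespace, and tokenize a learner sentence."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    # Split off punctuation so tokens align with error annotations.
    return re.findall(r"\w+|[^\w\s]", text)

print(normalize("The  experiment made a big progress ."))
# ['the', 'experiment', 'made', 'a', 'big', 'progress', '.']
```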
### 2. Indexing

- Embedding generation for knowledge sources
- Vector database indexing
- Similarity threshold optimization
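As a sketch of the embedding step, assuming the OpenAI embeddings API and the `text-embedding-3-large` model named in the configuration below (the project's actual indexing code lives under `process/`):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed knowledge-source passages for vector indexing."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in resp.data])

passages = [
    "In academic writing, 'progress' is an uncountable noun.",
    "'make progress' is a conventional collocation.",
]
index = embed(passages)  # one row per passage, ready for similarity search
```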
### 3. Retrieval

- Query formulation from error context
- Multi-source retrieval (academic writing, collocations, dictionary)
- Context ranking and selection
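Context ranking typically reduces to a nearest-neighbour search over the index. A minimal cosine-similarity sketch, illustrative rather than the project's retrieval code (the `k=5` default mirrors `top_k: 5` in the configuration below):

```python
import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k index rows most similar to the query vector."""
    sims = index @ query_vec / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:k]  # descending by cosine similarity
```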
### 4. Generation

- Error detection and classification
- Candidate correction generation
- Explanation synthesis
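Explanation synthesis assembles the error and the retrieved evidence into a prompt for the LLM. A hedged sketch using the `gpt-4.1-nano` model named in the configuration below; the prompt wording and function name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def synthesize_explanation(sentence: str, correction: str, contexts: list[str]) -> str:
    """Generate a learner-facing explanation from retrieved evidence."""
    prompt = (
        "Explain why the correction below improves the sentence, "
        "using the reference material.\n\n"
        f"Sentence: {sentence}\n"
        f"Correction: {correction}\n"
        "References:\n" + "\n".join(f"- {c}" for c in contexts)
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```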
### 5. Evaluation

- Automatic evaluation metrics
- Human evaluation protocols
- Performance analysis
## Configuration

Model, retrieval, and data settings (e.g. in `models/experiment.yaml`):

```yaml
model_settings:
  embedding_model: "text-embedding-3-large"
  llm_model: "gpt-4.1-nano"
  temperature: 0.0
  max_tokens: 40000

retrieval_settings:
  top_k: 5

data_settings:
  sample_type: "longman"  # or "fce"
  embedding_size: "text-embedding-3-small"
```
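Loading this configuration is straightforward with PyYAML; a minimal sketch (the file path is an assumption based on the project layout shown below):

```python
import yaml

# Path assumed from the project layout; adjust as needed.
with open("models/experiment.yaml") as f:
    config = yaml.safe_load(f)

print(config["model_settings"]["llm_model"])   # e.g. "gpt-4.1-nano"
print(config["retrieval_settings"]["top_k"])   # e.g. 5
```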
Create a `.env` file with:

```bash
OPENAI_API_KEY=your_openai_api_key

# Elastic Search
ES_USER=your_es_user
ES_PASSWORD=your_es_password
ES_ENDPOINT="localhost:9200"
ES_URL=your_es_url
ES_API_KEY=your_es_api_key
ES_INDEX_NAME="collocation"
```

## Evaluation

### Automatic Metrics

- **BLEU Score**: Translation quality assessment
- **ROUGE Score**: Summary quality evaluation
- **BERTScore**: Semantic similarity measurement
- **Exact Match**: Precision of corrections
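These metrics map onto standard libraries; a sketch assuming `nltk`, `rouge-score`, and `bert-score` (the project's own implementation lives in `postprocess/`):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The experiment made great progress."
hypothesis = "The experiment achieved great progress."

# BLEU over tokenized sentences (smoothing avoids zero n-gram counts).
bleu = sentence_bleu(
    [reference.split()], hypothesis.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L for longest-common-subsequence overlap.
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)

# BERTScore for semantic similarity; exact match is a plain comparison.
_, _, f1 = bert_score([hypothesis], [reference], lang="en")
exact = hypothesis == reference
```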
Run the automatic evaluation over a results directory:

```bash
python -m postprocess.automatic_metrics --input results/ --output evaluation/
```

## Project Structure

```
WhyFix/
├── README.md              # This file
├── .gitignore             # Git ignore patterns
├── requirements.txt       # Python dependencies
├── run_batch_jobs.sh      # Batch processing script
├── run_sampling.sh        # Sampling workflow script
├── models/                # ML models and batch processing
│   ├── __init__.py
│   ├── README.md
│   ├── batch_api.py
│   ├── experiment.yaml
│   ├── llm_setup.py
│   └── methods_combine_step*.py
├── process/               # Core processing pipeline
│   ├── __init__.py
│   ├── README.md
│   ├── input/
│   ├── rag/
│   ├── retrieval/
│   └── utils/
├── preprocess/            # Data preprocessing
├── postprocess/           # Result analysis
├── streamlit/             # Web interface
└── data/                  # Data storage
```
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is part of academic research. Please refer to the institution's guidelines for usage and distribution.

---

*Last updated: July 17, 2025*