A PyTorch implementation of the Internal Coherence Maximization algorithm from the paper "Unsupervised Elicitation of Language Models" by Wen et al. (2025). This implementation supports both vLLM and transformers backends for efficient inference.
ICM is an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels without external supervision. It works by:
- Mutual Predictability: Finding labels where the model can infer each label from all others
- Logical Consistency: Enforcing task-specific consistency constraints
- Simulated Annealing: Iteratively improving the label set using temperature-based acceptance
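To make these three ingredients concrete, here is a minimal sketch of a scoring function. It assumes the form described in the paper, U(D) = α · P(D) − I(D), where P sums the log-probability of each label given all the others and I counts inconsistent pairs. The names `icm_score`, `toy_logprob`, and `toy_consistent` are hypothetical and not part of this package; a real log-probability function would query the language model instead of using a majority-vote stand-in.

```python
import math
from typing import Callable, Sequence, Tuple

Example = Tuple[str, int]  # (input text, label index)


def icm_score(
    labeled: Sequence[Example],
    label_logprob: Callable[[Example, Sequence[Example]], float],
    is_consistent: Callable[[Example, Example], bool],
    alpha: float = 50.0,
) -> float:
    """Return alpha * mutual_predictability - inconsistency_count."""
    # Mutual predictability: score each label using all *other* examples as context.
    mutual_predictability = 0.0
    for i, example in enumerate(labeled):
        context = [other for j, other in enumerate(labeled) if j != i]
        mutual_predictability += label_logprob(example, context)

    # Logical consistency: count unordered pairs that violate the constraint.
    inconsistencies = 0
    for i in range(len(labeled)):
        for j in range(i + 1, len(labeled)):
            if not is_consistent(labeled[i], labeled[j]):
                inconsistencies += 1

    return alpha * mutual_predictability - inconsistencies


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_logprob(example: Example, context: Sequence[Example]) -> float:
    # Pretend the model gives probability 0.9 to a label that matches the
    # majority label in the context, and 0.1 otherwise.
    ones = sum(label for _, label in context)
    majority = 1 if ones * 2 >= len(context) else 0
    return math.log(0.9 if example[1] == majority else 0.1)


def toy_consistent(a: Example, b: Example) -> bool:
    # Identical inputs must not receive different labels.
    return not (a[0] == b[0] and a[1] != b[1])


coherent = [("claim A", 1), ("claim B", 1), ("claim C", 1)]
conflicting = [("claim A", 1), ("claim A", 0), ("claim C", 1)]
coherent_score = icm_score(coherent, toy_logprob, toy_consistent)
conflicting_score = icm_score(conflicting, toy_logprob, toy_consistent)
print(coherent_score > conflicting_score)
```

The coherent labeling scores higher because every label is predictable from the others and no pair violates the constraint; the conflicting labeling loses on both terms.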
- 🚀 Dual Backend Support: Optimized vLLM backend for production, transformers for compatibility
- 🔧 Modular Design: Easily extensible components for different tasks
- 📊 Built-in Tasks: Support for truthfulness, math correctness, and comparison tasks
- 🧪 Comprehensive Testing: Unit tests and integration tests included
- 📈 Performance Tracking: Detailed metrics and experiment logging
- 🌍 Real Data Support: Run experiments on actual datasets (TruthfulQA, GSM8K, HH-RLHF)
- 🤖 Unsupervised Learning: No labels needed - ICM discovers patterns automatically
- Python 3.9+
- PyTorch 2.0+
- CUDA-capable GPU (recommended)
- uv (for package management)
First, install uv if you haven't already:
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip
pip install uv
# Create and activate a virtual environment
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package in editable mode with core dependencies
uv pip install -e .
# For vLLM backend (recommended for performance)
uv pip install -e ".[vllm]"
# For all dependencies including development tools
uv pip install -e ".[all]"
# Dockerfile (alternative to a local uv install)
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
RUN pip install --upgrade pip && \
    pip install vllm==0.7.0 "transformers>=4.51.0" \
    tqdm numpy pandas psutil
# No need to activate venv when using uv run!
# Save this as quick_test.py and run with: uv run quick_test.py
from icm_implementation import ICM, ICMConfig, create_truthfulness_dataset
# Create dataset
data = [
    ("Is the Earth round?", "Yes, the Earth is spherical", None),
    ("Is the Earth flat?", "No, the Earth is round", None),
    ("Is 2+2=4?", "Yes, 2+2 equals 4", None),
    ("Is 2+2=5?", "No, 2+2 equals 4, not 5", None),
]
dataset = create_truthfulness_dataset(data)
# Configure ICM
config = ICMConfig(
    model_name="Qwen/Qwen3-4B",  # or any HF model
    backend="auto",  # uses vLLM if available
    initial_examples=2,
    alpha=50.0,
)
# Run ICM
icm = ICM(config)
labeled_data = icm.run(dataset)
# Check results
for data_point, label in labeled_data:
    print(f"Input: {data_point.input_text}")
    print(f"Label: {config.label_names[label]}\n")
# Run all tasks with default model
uv run icm_examples.py --task all
# Run specific task with custom model
uv run icm_examples.py --task math --model meta-llama/Llama-3.2-1B
# Quick test with small dataset
uv run icm_examples.py --task truthfulness --small
# Compare backends
uv run icm_examples.py --compare-backends
from icm_implementation import ICM, ICMConfig, DataPoint, LogicalConsistency

class CustomConsistency(LogicalConsistency):
    def check_consistency(self, x_i, y_i, x_j, y_j):
        # Implement your consistency logic
        return True  # or False based on your constraints
# Create custom dataset
dataset = [
    DataPoint(
        id=i,
        input_text="Your task-specific input",
        metadata={"custom_field": value},
    )
    for i, value in enumerate(your_data)
]
# Run with custom consistency
config = ICMConfig(num_labels=3, label_names=["A", "B", "C"])
icm = ICM(config)
icm.consistency_checker = CustomConsistency()
results = icm.run(dataset)
- ModelBackend: Abstract interface for model inference
  - VLLMBackend: High-performance batch inference
  - TransformersBackend: Compatible with any HuggingFace model
- LogicalConsistency: Handles task-specific consistency checking
  - General consistency (default)
  - Asymmetry consistency (for comparisons)
  - Math correctness consistency
- ICM Algorithm: Main algorithm implementation
  - Simulated annealing with temperature scheduling
  - Consistency fixing subroutine
  - Score calculation and tracking
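The annealing step can be sketched as follows. This is a minimal illustration rather than the package's actual loop: it assumes a Metropolis-style acceptance rule and the logarithmic cooling schedule T = max(T_min, T_0 / (1 + β log n)) described in the paper, with the default values T_0 = 10.0, T_min = 0.01, and β = 0.99. The function names `temperature_at` and `accept_proposal` are hypothetical.

```python
import math
import random


def temperature_at(step: int, t0: float = 10.0, t_min: float = 0.01,
                   beta: float = 0.99) -> float:
    """Assumed cooling schedule: T = max(T_min, T_0 / (1 + beta * log(step)))."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return max(t_min, t0 / (1.0 + beta * math.log(step)))


def accept_proposal(delta: float, temperature: float,
                    rng: random.Random) -> bool:
    """Metropolis rule: always accept improvements, sometimes accept regressions."""
    if delta > 0:
        return True
    # A worse score is accepted with probability exp(delta / T), which shrinks
    # as the temperature cools or as the regression grows.
    return rng.random() < math.exp(delta / temperature)


rng = random.Random(0)
for step in (1, 10, 100, 1000):
    t = temperature_at(step)
    accepted = accept_proposal(-5.0, t, rng)
    print(f"step={step:4d}  T={t:.3f}  accept(delta=-5)? {accepted}")
```

Here `delta` is the change in score from relabeling one example. Early on, a high temperature lets the search escape poor labelings; later, only improvements are likely to be kept.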
from dataclasses import dataclass

@dataclass
class ICMConfig:
    # Model settings
    model_name: str = "Qwen/Qwen3-4B"
    backend: str = "auto"  # "vllm", "transformers", or "auto"
    # Algorithm parameters
    initial_examples: int = 8  # K in the paper
    initial_temperature: float = 10.0  # T_0
    final_temperature: float = 0.01  # T_min
    cooling_rate: float = 0.99  # β
    alpha: float = 50.0  # Mutual predictability weight
    # Inference settings
    max_context_length: int = 32768
    max_new_tokens: int = 64
    temperature: float = 0.1
    top_p: float = 0.95
dataset = create_truthfulness_dataset([
    (question, claim, is_true),  # is_true can be None
    ...
])
dataset = create_math_correctness_dataset([
    (problem, solution, answer, is_correct),
    ...
])
dataset = create_comparison_dataset([
    (query, response_a, response_b, a_is_better),
    ...
])
- Use smaller `max_context_length` for limited GPU memory
- Adjust `initial_examples` based on dataset size
- Use `backend="transformers"` with CPU for testing
- Use vLLM backend for 5-10x speedup
- Batch size is automatically optimized
- Reduce `max_iterations` for faster results
- Qwen3-4B: Best balance of performance and efficiency
- Qwen3-1.7B: For resource-constrained environments
- Llama-3.2-1B: Alternative lightweight option
# Run all tests
uv run icm_test_suite.py
# Run specific test class
uv run python -m unittest icm_test_suite.TestLogicalConsistency
# Run with verbose output
uv run icm_test_suite.py -v
# Or use pytest if you have dev dependencies installed
uv run pytest icm_test_suite.py -v
ICM includes an experiment runner that works with real datasets from Hugging Face. You can evaluate ICM's unsupervised learning capabilities on actual benchmarks without any labeled data.
- Truthfulness (TruthfulQA) - Evaluate factual accuracy of claims
- Math Correctness (GSM8K) - Verify mathematical problem solutions
- Comparison (HH-RLHF) - Learn preferences between responses
# Run on a single task
uv run run_experiments.py --task truthfulness
# Run on all tasks
uv run run_experiments.py --task all
# Customize model and sample size
uv run run_experiments.py --task math --model Qwen/Qwen3-4B --sample-size 100
# Control iterations
uv run run_experiments.py --task comparison --max-iterations 200
# Quick test with small model
uv run run_experiments.py --task math --model Qwen/Qwen2.5-0.5B --sample-size 20
# Full experiment with Qwen3-4B
uv run run_experiments.py --task all --model Qwen/Qwen3-4B --sample-size 50
# Large-scale truthfulness evaluation
uv run run_experiments.py --task truthfulness --sample-size 200 --max-iterations 400
The experiment runner:
- Loads real data from Hugging Face datasets (TruthfulQA, GSM8K, HH-RLHF)
- Formats data into question-claim pairs suitable for ICM
- Runs ICM algorithm to label data without supervision
- Enforces consistency using task-specific logical constraints
- Saves detailed results including metrics, labels, and score history
Results are saved to icm_results/ with filenames like:
REAL_truthfulness_Qwen_Qwen3-4B_20250615_120000.json
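When collecting many runs, it can help to recover the task, model, and timestamp from the filename. The sketch below assumes every result file follows the pattern of the example above (`REAL_<task>_<model>_<YYYYMMDD>_<HHMMSS>.json`) and that task names contain no underscores. Both are inferences from that single example, and `parse_result_filename` is a hypothetical helper, not part of the package.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ResultName(NamedTuple):
    task: str
    model: str
    timestamp: datetime


# Assumed pattern, inferred from the single example filename shown above.
_RESULT_RE = re.compile(
    r"^REAL_(?P<task>[^_]+)_(?P<model>.+)_(?P<date>\d{8})_(?P<time>\d{6})\.json$"
)


def parse_result_filename(filename: str) -> ResultName:
    """Split a result filename into task, model string, and timestamp."""
    match = _RESULT_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized result filename: {filename!r}")
    stamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    # The model part is kept verbatim: which "_" stood for "/" in the model
    # name cannot be recovered unambiguously from the filename alone.
    return ResultName(match["task"], match["model"], stamp)


info = parse_result_filename("REAL_truthfulness_Qwen_Qwen3-4B_20250615_120000.json")
print(info.task, info.model, info.timestamp.isoformat())
# prints: truthfulness Qwen_Qwen3-4B 2025-06-15T12:00:00
```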
Each result file contains:
- Full configuration used
- Final metrics (score, mutual predictability, inconsistencies)
- All labeled examples with model's predictions
- Score history for analysis
- Runtime and acceptance rate statistics
Truthfulness (TruthfulQA)
- Tests ability to distinguish true/false claims
- Uses questions from TruthfulQA validation set
- No specific consistency constraints
Math Correctness (GSM8K)
- Verifies correct vs incorrect math solutions
- Enforces mathematical consistency: same problem can't have different correct answers
- Creates deliberate wrong answers for contrastive learning
Comparison (HH-RLHF)
- Learns preferences between helpful/harmless responses
- Uses Anthropic's HH-RLHF dataset
- Enforces asymmetry: if A>B then B cannot be >A
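The math and comparison constraints above can be expressed as pairwise checks. The sketch below mirrors the `check_consistency(x_i, y_i, x_j, y_j)` signature from the custom-consistency example, but the dictionary keys (`query`, `response_a`, `response_b`, `problem`, `answer`) and the label convention (1 means "A is better" or "correct") are assumptions made for illustration, not the package's actual data layout.

```python
from typing import Any, Dict

Item = Dict[str, Any]


def asymmetry_consistent(x_i: Item, y_i: int, x_j: Item, y_j: int) -> bool:
    """If x_j is x_i with the responses swapped, both cannot claim 'A is better'."""
    swapped = (
        x_i["query"] == x_j["query"]
        and x_i["response_a"] == x_j["response_b"]
        and x_i["response_b"] == x_j["response_a"]
    )
    # Label 1 means "response_a is better". A > B and B > A cannot both hold.
    return not (swapped and y_i == 1 and y_j == 1)


def math_consistent(x_i: Item, y_i: int, x_j: Item, y_j: int) -> bool:
    """Two different final answers to the same problem cannot both be correct."""
    same_problem = x_i["problem"] == x_j["problem"]
    different_answers = x_i["answer"] != x_j["answer"]
    # Label 1 means "this solution is correct".
    return not (same_problem and different_answers and y_i == 1 and y_j == 1)


ab = {"query": "q", "response_a": "r1", "response_b": "r2"}
ba = {"query": "q", "response_a": "r2", "response_b": "r1"}
print(asymmetry_consistent(ab, 1, ba, 1))  # False: A > B and B > A conflict
print(asymmetry_consistent(ab, 1, ba, 0))  # True

s1 = {"problem": "3 + 4 = ?", "answer": "7"}
s2 = {"problem": "3 + 4 = ?", "answer": "8"}
print(math_consistent(s1, 1, s2, 1))  # False: both cannot be correct
print(math_consistent(s1, 1, s2, 0))  # True
```

Each function returns `False` only for a pair of labels that cannot both hold, so the number of `False` results over all pairs gives an inconsistency count.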
Results are automatically saved to icm_results/ with:
- Detailed JSON logs for each experiment
- Summary CSV with key metrics
- Score history and acceptance rates
- Full labeled datasets for analysis
- Context Length: Limited by model's context window for in-context examples
- Concept Salience: Only works for concepts the model already understands
- Compute Requirements: Requires multiple forward passes per label
If you use this implementation, please cite the original paper:
@article{wen2025unsupervised,
title={Unsupervised Elicitation of Language Models},
author={Wen, Jiaxin and others},
journal={arXiv preprint arXiv:2505.15134},
year={2025}
}
- CUDA Out of Memory
  config.max_context_length = 4096  # Reduce context
  config.backend = "transformers"  # Use CPU
- vLLM Import Error
  # Install with specific CUDA version
  uv pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
- Slow Performance
  - Ensure vLLM backend is being used
  - Check GPU utilization with `nvidia-smi`
  - Reduce dataset size or `max_iterations`
Contributions are welcome! Areas for improvement:
- Additional consistency types
- Support for more model architectures
- Multi-GPU support
- Additional evaluation metrics
This implementation is provided for research purposes. Please ensure you comply with the licenses of the models you use (Qwen3, Llama, etc.).