SocraticBench is a research framework and benchmark dedicated to building and evaluating LLMs that teach by asking, not just answering, with a commitment to developing the best AI tutors for real learning.
Most AI tutors are built to answer questions, not to teach students how to think. While this makes them fast and useful, it also risks promoting shallow learning: students get answers, but not understanding.
The Socratic Method offers a powerful alternative: teaching through questions rather than answers. It encourages students to reflect, reason, and discover insights on their own. But today's language models are rarely evaluated on their ability to teach Socratically.
SocraticBench fills this gap. It is a research-grade library and benchmark for:
- Generating Socratic dialogues between a simulated teacher and student (both LLMs)
- Evaluating conversations using a calibrated LLM-as-a-judge, trained to assess Socratic quality
- Exploring how well different LLMs (e.g. GPT-4, Claude, LLaMA) perform as Socratic teachers
The goal is not to make chatbots more helpful; it's to make them better at fostering critical thinking and deep understanding.
- Modular pipeline with customizable Stage components.
- Supports multi-agent setups: Student, Teacher, and Judge agents with distinct roles.
- Built-in dataset loader for Princeton NLP's textbook chapters.
- Robust parsing and retry logic for handling unreliable LLM outputs.
- Evaluation agent that follows a strict Socratic rubric, checking:
  - Topic coverage
  - Adherence to Socratic principles (e.g. no direct answers)
  - Indicators of student understanding
The pipeline extracts data from educational content and produces conversation seeds, each containing a set of discussion topics and an opening question. Two LLMs, playing the roles of a teacher and a student, discuss the topic in a turn-taking loop. At the end of the conversation, an LLM-as-a-judge rates the conversation as successful (covering the topics, addressing the opening question, and respecting the Socratic principles) or failed.
You can use SocraticBench by installing it with:
pip install socratic-bench
Depending on the LLM provider you plan to use (OpenAI API, local models, etc.), you may need some additional setup. We provide three alternative LLM implementations, and you must install the dependencies for the ones you're using.
pip install socratic-bench[openai] # if using OpenAI as backend
pip install socratic-bench[ollama] # if using Ollama as backend
pip install socratic-bench[hf] # if self-hosting LLMs with HF transformers
pip install socratic-bench[openai,ollama] # if a mix of Ollama and OpenAI
You'll also need to set up the API token if using OpenAI.
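The OpenAI client picks up the token from the OPENAI_API_KEY environment variable by default; you can also pass it explicitly, as in this minimal illustration (not SocraticBench-specific):
import os
from openai import Client
# Either export OPENAI_API_KEY in your shell, or pass the key explicitly:
client = Client(api_key=os.environ["OPENAI_API_KEY"])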
from openai import Client
from socratic_bench import StudentAgent, TeacherAgent, JudgeAgent, ConversationSeederAgent, socratic_bench, Record
from socratic_bench.llms.openai import OpenAILLM
from socratic_bench.readers import PrincetonChapters
# Define the LLM
client = Client()
llm = OpenAILLM("gpt-4o-mini", client)
# Specify the agents
seeder = ConversationSeederAgent(llm)
student = StudentAgent(llm)
teacher = TeacherAgent(llm)
judge = JudgeAgent(llm)
# Specify your data source
data_source = PrincetonChapters(Record, num_conversations=2)
# load the pipeline
pipeline = socratic_bench(data_source, seeder, student, teacher, judge)
# run it !
results, stats = pipeline.run()
Each Record in results contains:
- seed.question: the student's initial question
- chat_history: the full dialogue
- assessment: whether the teacher passed the Socratic criteria
- feedback: detailed explanation of the Judge agent's decision
- failure: flag indicating whether the record encountered processing issues
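For instance, you can iterate over the results and inspect these fields (a minimal sketch; it assumes the fields listed above are plain attributes on Record):
for record in results:
    if record.failure:
        continue  # skip records that hit processing issues
    print(record.seed.question)  # the opening question
    print(record.assessment)     # did the teacher pass the Socratic criteria?
    print(record.feedback)       # the Judge agent's explanation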
Say, for example, you want to compose your own pipeline by removing certain steps and adding new ones. You can use the lower-level API and combine stages however you wish. For example,
from typing import List
from socratic_bench import Record, DataSource
from socratic_bench.agents import ConversationSeeder, Student, Teacher
from socratic_bench.pipeline import SocraticBench, Emitter
from socratic_bench.stages import SeedStage, Stage
data_source: DataSource[Record] = # ...
seeder: ConversationSeeder = # ...
students: List[Student] = # ...
teacher: Teacher = # ...
class GroupChatStage(Stage[List[Record], List[Record]]):
"""Chat stage that simulates a classroom"""
def __init__(self, students: List[Student], teacher: Teacher):
self._students = students
self._teacher = teacher
def process(self, sample: List[Record], emitter: Emitter[List[Record]]) -> None:
... # your custom implementation
my_bench = (
SocraticBench.from_data(data_source)
.apply(SeedStage(seeder)) # standard seed generation
.batch() # grouping records for batch processing
.apply(GroupChatStage(students, teacher)) # using a custom classroom for creating chat exchanges
.flatten() # flattening groups
# .apply(EvaluationStage(judge)) -> You removed the evaluation step
)
results, stats = my_bench.run()
The pipeline relies on the socratic_bench.agents.LLM abstraction. In principle, you can implement this class and plug in any LLM provider of your choice. We, however, provide three alternatives:
from openai import Client
from socratic_bench.llms.openai import OpenAILLM
client = Client()
llm = OpenAILLM("gpt-4o-mini", client)
from ollama import Client
from socratic_bench.llms.ollama import OllamaLLM
client = Client(host="http://localhost:11434")
llm = OllamaLLM("mistral-small3.1:24b", client, num_ctx=5120, temperature=0.15)
import torch
from socratic_bench.llms.hf import HFLLM
llm = HFLLM(
"mistralai/Mistral-Small-3.1-24B-Instruct-2503",
model_kwargs={
"trust_remote_code": True,
"device_map": "cuda",
"torch_dtype": torch.float16
}
)
It's possible to count statistics (e.g., errors) while processing pipeline steps. These statistics are reported at the end of the pipeline execution and can be used for pipeline debugging. To use those counters, you must declare them in the Stage implementation before using the emitter.increment method. For example,
from typing import Any, Tuple, List, Generator
from socratic_bench import DataSource
from socratic_bench.pipeline import SocraticBench, Stage, Emitter
class CountingStage(Stage[Any, Any]):
def counters(self) -> Tuple[str, ...]:
return ("processed",)
def process(self, sample: int, emitter: Emitter[Any]) -> None:
emitter.increment("processed")
emitter.emit(sample)
class DummySource(DataSource[int]):
def __init__(self, data: List[int]):
self.data = data
def read(self) -> Generator[int, None, None]:
for x in self.data:
yield x
source = DummySource([10, 20])
bench = SocraticBench.from_data(source).apply(CountingStage())
output, tracker = bench.run()
print(tracker["processed"])
# 2
The default pipeline (socratic_bench) produces the following counters:
| Counter Name | Stage | Description |
|---|---|---|
| seed.in | SeedStage | Number of input records |
| seed.missing | SeedStage | Number of input records with a missing seed (unprocessed) |
| seed.failure | SeedStage | Number of records for which the LLM failed to produce a valid seed (failure=True) |
| seed.out | SeedStage | Number of output records (including failures) |
| chat_stage.eligible | ChatStage | Number of valid input records for chat rollout |
| chat_stage.failure | ChatStage | Number of records for which the LLM failed to produce a valid conversation (failure=True) |
| chat_stage.success | ChatStage | Number of output records with a valid chat |
| judge.in | EvaluationStage | Number of input records (including rejected) |
| judge.accepted | EvaluationStage | Number of conversations with an accepted outcome |
| judge.rejected | EvaluationStage | Number of conversations with a rejected outcome |
| judge.failed_evaluation | EvaluationStage | Number of conversations for which the Judge LLM failed to produce a valid evaluation (failure=True) |
| judge.out | EvaluationStage | Number of output records (including rejected) |
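After a run of the default pipeline, these counters can be read from the returned tracker in the same way (a small illustration; it assumes dictionary-style access as in the CountingStage example above):
results, stats = pipeline.run()
# Print a few of the default counters listed above
for name in ("seed.in", "chat_stage.success", "judge.accepted", "judge.rejected"):
    print(name, stats[name])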
You can implement your own custom:
- LLM backend
- Student, Teacher, Judge agent behavior
- ConversationSeeder logic
- Evaluation rubrics
Simply subclass the relevant ABCs (abstract base classes) and plug them into the pipeline.
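The same subclass-and-plug-in pattern works for pipeline stages. The sketch below drops failed records before further processing (illustrative only: the class name is made up, and it assumes failure is exposed as an attribute on Record):
from socratic_bench import Record
from socratic_bench.pipeline import Stage, Emitter
class DropFailuresStage(Stage[Record, Record]):
    """Illustrative custom stage: forwards only records that did not fail upstream."""
    def process(self, sample: Record, emitter: Emitter[Record]) -> None:
        # Records flagged with failure=True are silently dropped here
        if not sample.failure:
            emitter.emit(sample)
Such a stage can then be inserted with .apply(DropFailuresStage()) just like the built-in stages.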
We evaluated how well different LLMs can lead Socratic teaching dialogues by generating over 350 multi-turn conversations between a simulated teacher and student.
Our experiments compared a range of models, including Gemma 3, LLaMA 3.3, Mistral-Small 3.1, GPT-4o, LearnLM 2.0, and EULER (a fine-tuned Socratic LLM), using the same seed questions and student model. An LLM-as-a-judge (Qwen 32B) evaluated whether each teacher successfully guided the student to a deep understanding. The results are displayed in the table below.
| LLM | Model Size (B) | Success Rate | 95% CI |
|---|---|---|---|
| EULER | 3.82 | 18.3% | [14.6%, 22.7%] |
| Gemma 3 | 27 | 22.9% | [18.8%, 27.5%] |
| GPT-4o | --- | 17.4% | [13.8%, 21.8%] |
| LLaMA 3.3 | 70 | 21.1% | [17.2%, 25.7%] |
| Mistral-Small 3.1 | 24 | 20.9% | [16.9%, 25.4%] |
| LearnLM 2.0 | --- | 21.1% | [17.2%, 25.7%] |
Key findings:
- Socratic teaching is hard: Success rates were low across all models, with the best reaching only ~23% success.
- Bigger isn't better: Model size did not correlate with better Socratic teaching. Smaller models like Mistral-Small matched or outperformed larger ones like LLaMA 3.3.
- Gemma 3 was the top performer, but its lead over others was marginal within the confidence interval.
- Even education-tuned LLMs underperformed: Both GPT-4o and LearnLM 2.0 scored below expectations.
- Conversations improve with more turns, but gains plateau around 8 dialogue rounds.
Why do LLMs fail? Common issues included not covering all topics or drifting off-topic. The core problem seems structural: LLMs are trained to respond, not to teach.
These results highlight a fundamental limitation of current LLMs: they're not yet designed to lead structured, goal-driven educational dialogues. Improving this may require moving from passive response models to ones trained for proactive, guided instruction.
To enable automatic evaluation of Socratic dialogues, we developed a prompt-guided LLM "judge" and validated it against expert annotations. Our goal was to select LLMs that best align with human judgment, using Cohen's κ (kappa), a metric that corrects for chance agreement in binary classification tasks.
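For reference, with $p_o$ the observed agreement rate and $p_e$ the agreement expected by chance given each rater's label frequencies, the statistic is defined as:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$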
We began by generating a small set of dialogues using Mistral-Nemo, then asked two experts to independently rate 30 samples (Accept/Reject + justification). This yielded a Cohen's κ = 0.50 (95% CI: [-0.047, 0.889]), indicating moderate agreement but with wide uncertainty due to the small sample size.
| Rater 1 \ Rater 2 | Reject | Accept |
|---|---|---|
| Reject | 23 | 1 |
| Accept | 3 | 3 |
Expert-annotated datasets: expert-1.json and expert-2.json.
After resolving disagreements, the experts produced a 30-sample consensus dataset and revised the judge LLM prompt. The final prompt incorporates a short summary and checklist before giving a final verdict, drawing from Chain-of-Thought prompting.
We then collected 50 more independent ratings from each expert, yielding a 130-sample labeled dataset used to evaluate judge LLMs (see merged.json).
We benchmarked various open- and closed-source LLMs, comparing their outputs to expert judgments using Cohen's κ. The best-performing model, Qwen 3 (32B), achieved a κ of 0.26, modest agreement but sufficient for approximate assessment.
| Model Name | Size | Cohen's κ | 95% CI |
|---|---|---|---|
| Gemma 3 | 27B | 0.255 | [0.082, 0.419] |
| GPT-4o | N/A | 0.212 | [0.021, 0.405] |
| GPT-4o-mini | N/A | 0.113 | [0.047, 0.188] |
| LLaMA 3.3 | 70B | 0.063 | [-0.052, 0.177] |
| Mistral-Small 3.1 | 24B | 0.226 | [0.073, 0.378] |
| Phi-4 | 14B | 0.057 | [-0.036, 0.149] |
| Qwen 3 (1.7B) | 1.7B | 0.100 | [-0.090, 0.295] |
| Qwen 3 (14B) | 14B | 0.215 | [0.027, 0.400] |
| Qwen 3 (32B) | 32B | 0.260 | [0.077, 0.438] |
| Qwen 3 (4B) | 4B | 0.134 | [0.014, 0.261] |
| Qwen 3 (8B) | 8B | 0.192 | [0.027, 0.355] |
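Agreement scores of this kind can be reproduced with an off-the-shelf implementation; below is a minimal sketch (the label lists are hypothetical and scikit-learn is not a SocraticBench dependency):
from sklearn.metrics import cohen_kappa_score
# Accept/Reject verdicts for the same samples: expert labels vs. judge LLM outputs
expert_labels = ["accept", "reject", "reject", "accept", "reject"]  # hypothetical
judge_labels = ["accept", "reject", "accept", "accept", "reject"]   # hypothetical
print(cohen_kappa_score(expert_labels, judge_labels))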
We are also releasing the experimental results of several open- and closed-source LLMs on the Socratic evaluation, along with the expert-annotated data used to calibrate the judge model. Here are the files to watch for:
├── datasets
│   ├── evaluation                          # datasets used for evaluating LLMs
│   │   ├── seeds.json                      # the seed dataset used throughout experimentation
│   │   ├── int_{max_length}_{model}.json   # dataset containing detailed interactions between teacher/student agents
│   │   └── eval_{max_length}_{model}.json  # dataset with the interaction traces + judge LLM evaluation
│   ├── human-eval                          # datasets containing expert-annotated data
│   │   ├── seed-dataset.json               # seed dataset used to produce the inputs for annotators
│   │   ├── expert-{id}.json                # dataset with inputs independently annotated by individual {id}
│   │   ├── agreement.json                  # dataset containing final annotations after inter-annotator discussion
│   │   └── merged.json                     # merge of the expert-{id} and agreement datasets
│   ├── judge-benchmark                     # datasets used for calibrating the LLM-as-a-judge
│   │   └── {model}_{size}.json             # assessment dataset for {model} with {size}
│   └── positive-traces                     # a sample of 754 successful conversations (7384 turns) using Mistral-Small 3.1 for student/teacher roles
If you are considering fine-tuning a model with SFT, you can load a set of positive traces sampled from random subjects by loading:
from datasets import Dataset
dataset = Dataset.load_from_disk("./datasets/positive-traces")
print(len(dataset))
# 754
(You'll need to install the HF datasets library, i.e., pip install datasets.)
We sampled 754 successful conversations (7384 conversation turns) and used the traces for supervised fine-tuning (SFT) of a Qwen 3 (4B) LoRA adapter. The model achieved performance comparable to Gemma 3 (22.8%; CI [18.7%, 27.5%]) on this benchmark, a surprising improvement over the base model with prompt engineering (17.2%; CI [13.4%, 21.6%]).
The model is available at eurecom-ds/qwen3-4b-socratic.
SocraticBench is an early step toward evaluating and improving Socratic teaching with LLMs. While it offers a reproducible benchmark and valuable insights, several limitations are worth noting:
- Limited human-annotated dataset: Our expert evaluation set includes 130 annotated samples, sufficient for model comparisons but too small for detailed subgroup analysis or robust generalization.
- Moderate inter-annotator agreement: Human evaluation of Socratic quality remains subjective. Our experts reached only moderate agreement ($\kappa = 0.50$), highlighting the challenge of defining "good teaching" even among humans.
- Fixed prompts and judge setup: We used a single prompting strategy for teacher LLMs and a fixed judge prompt to assess conversations. Alternative designs might yield different rankings.
- Synthetic dialogues only: The benchmark relies entirely on LLM-generated student responses. Real student interactions may be more complex and unpredictable.
These limitations are an opportunity for the community. We welcome contributions that extend SocraticBench with improved prompts, new student/teacher models, or larger expert datasets.