This repository contains the dataset, models, and evaluation code from our paper:
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation
Authors: Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, Francesco Barbieri
REALTALK builds upon our previous work LoCoMo, a synthetic dataset generated by LLMs for simulating long-term dialogue. While LoCoMo provided a controlled sandbox for exploring memory and persona behavior, REALTALK offers real-world complexity, drawn from human conversations across 21 days.
LoCoMo (LLM-generated benchmark):
- Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents
- Website: https://snap-research.github.io/locomo/
- Code: https://github.com/snap-research/LoCoMo
Long-term, open-domain dialogue is critical for building conversational agents capable of demonstrating emotional intelligence (EI) and recalling past interactions. However, most existing benchmarks rely on synthetic, LLM-generated data, which lacks the depth, messiness, and variability of real conversations.
To address this gap, we introduce REALTALK: a 21-day corpus of authentic, real-world messaging app conversations between human participants. This enables a more rigorous evaluation of dialogue models in naturalistic settings.
- A real-world long-form dialogue dataset with emotional grounding and time-span variation.
- Comparative analysis of emotional expression and persona consistency between REALTALK and LLM-generated dialogues (LoCoMo).
- Two benchmark tasks designed for long-term conversation evaluation:
  - Persona Simulation: continue a chat as a specific user based on past messages.
  - Memory Probing: answer questions that require recalling facts from earlier interactions.
Set your OpenAI API key for memory probing tasks:
export OPENAI_API_KEY=sk-...
- data/*.json: Preprocessed REALTALK conversations (for evaluation and model input).
- data/raw/*.xlsx: Raw message exports used to construct the dataset.
- data/locomo/*.json: Split version of the LOCOMO synthetic dataset.
We compare speaker-level emotional intelligence (EI) in REALTALK and LOCOMO across key attributes such as self-awareness, empathy, motivation, social skills, and self-regulation.
Each speaker is evaluated using both LLM-based and classifier-based models, generating scores for:
- Reflective frequency
- Empathy average
- Emotion and sentiment diversity
- Social grounding and intimacy
- Emotional alignment and recovery dynamics
Key Insight: Real-world conversations in REALTALK exhibit higher emotional diversity and greater speaker variance, while synthetic dialogues in LOCOMO tend to be uniform, overly empathetic, and emotionally constrained.
We split the original LOCOMO dataset into individual participant-level files (under data/locomo/) to match the format of REALTALK (under data/*.json). This ensures a fair comparison between corresponding conversation sessions across both datasets.
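A split of this kind can be sketched as follows. This is a minimal illustration, not the script used to produce data/locomo/: it assumes the combined file has already been loaded as a list of records, that each record names its two speakers under the hypothetical keys `speaker_a` and `speaker_b`, and that output files follow the `Chat_<n>_<A>_<B>.json` pattern seen in the file names in this README. The helper name `split_conversations` is also illustrative.

```python
import json
from pathlib import Path


def split_conversations(records, out_dir):
    """Write each conversation record to its own JSON file.

    Assumes every record is a dict with the hypothetical keys
    "speaker_a" and "speaker_b"; adjust these to the real schema.
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, record in enumerate(records, start=1):
        # File name mirrors the Chat_<n>_<A>_<B>.json pattern.
        name = f"Chat_{index}_{record['speaker_a']}_{record['speaker_b']}.json"
        path = out_dir / name
        with path.open("w", encoding="utf-8") as handle:
            json.dump(record, handle, ensure_ascii=False, indent=2)
        written.append(path)
    return written
```

Keeping one conversation per file means the same evaluation code can iterate over either dataset directory without special cases.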
Each conversation is analyzed using five speaker-level emotional intelligence metrics:
| Metric | Description |
|---|---|
| Self-Awareness | Speaker's ability to recognize and articulate emotions and perspectives. |
| Empathy | Average empathy score of a speaker's messages. |
| Motivation | Speaker's engagement in maintaining the conversation. |
| Social Skills | Speaker's ability to foster trust and engagement. |
| Self-Regulation | Speaker's ability to maintain emotional and sentiment stability. |
Below is an excerpt of speaker-level results from a LOCOMO chat session (Chat_1_Caroline_Melanie.json):
"reflective_frequencies": {
"Caroline": 0.18,
"Melanie": 0.09
},
"empathy_average": {
"Caroline": 3.23,
"Melanie": 3.46
},
"sentiment_stability": {
"Caroline": 0.94,
"Melanie": 0.96
},
"emotion_diversities": {
"Caroline": 1.05,
"Melanie": 0.89
}
...
All files were analyzed using the same evaluation pipeline, producing both:
- Per-turn annotations (e.g., empathy, motivation)
- Aggregate speaker-level metrics (as shown above)
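The step from per-turn annotations to speaker-level aggregates can be sketched as below. This is a minimal illustration rather than the repository's pipeline: the per-turn record layout (`speaker`, `empathy`, `emotion`) is hypothetical, the empathy average is a plain mean as the metric table describes, and Shannon entropy is assumed here as one plausible definition of emotion diversity.

```python
import math
from collections import Counter, defaultdict


def aggregate_speaker_metrics(turns):
    """Collapse per-turn annotations into speaker-level metrics.

    Each turn is assumed (hypothetically) to be a dict with keys
    "speaker", "empathy" (numeric score), and "emotion" (label).
    """
    empathy_scores = defaultdict(list)
    emotion_counts = defaultdict(Counter)
    for turn in turns:
        empathy_scores[turn["speaker"]].append(turn["empathy"])
        emotion_counts[turn["speaker"]][turn["emotion"]] += 1

    # Mean empathy score over each speaker's messages.
    empathy_average = {
        speaker: sum(scores) / len(scores)
        for speaker, scores in empathy_scores.items()
    }

    # Assumption: diversity as Shannon entropy (natural log) of emotion labels.
    emotion_diversities = {}
    for speaker, counts in emotion_counts.items():
        total = sum(counts.values())
        emotion_diversities[speaker] = sum(
            (count / total) * math.log(total / count) for count in counts.values()
        )

    return {
        "empathy_average": empathy_average,
        "emotion_diversities": emotion_diversities,
    }
```

Under this assumed definition, a speaker who always expresses the same emotion scores 0, and the score grows as labels are spread more evenly across more emotions.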
Given previous conversation history, continue the dialogue as one of the original participants.
Code and model evaluation details will be released shortly.
Assess how well a model remembers facts from earlier parts of a long conversation.
python evaluate_memory.py \
--data_path data/Chat_1_Emi_Elise.json \
--qa_model gpt-4o-mini \
--evaluate_model gpt-4o-mini
- --qa_model: The model used to generate answers to the probing questions.
- --evaluate_model: The model used to assign a GPT score.
A lexical F1 score is also computed as a rule-based measure.
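A lexical F1 of this kind is commonly computed over overlapping tokens, in the style of extractive QA evaluation. The sketch below assumes lowercasing and whitespace tokenization; the normalization used in evaluate_memory.py may differ, and `lexical_f1` is an illustrative name rather than a function from this repository.

```python
from collections import Counter


def lexical_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the prediction "she moved to Paris" with the reference "Paris" gives precision 0.25, recall 1.0, and an F1 of 0.4, which shows how verbose answers are penalized even when they contain the right fact.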
If you use this dataset or benchmark in your work, please cite:
@article{lee2025realtalk,
title={Realtalk: A 21-day real-world dataset for long-term conversation},
author={Lee, Dong-Ho and Maharana, Adyasha and Pujara, Jay and Ren, Xiang and Barbieri, Francesco},
journal={arXiv preprint arXiv:2502.13270},
year={2025}
}