REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

This repository contains the dataset, models, and evaluation code from our paper:
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

Authors: Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, Francesco Barbieri

🔁 Continuation of LoCoMo (Evaluating Very Long-Term Conversational Memory of LLM Agents)

REALTALK builds upon our previous work LoCoMo, a synthetic dataset generated by LLMs for simulating long-term dialogue. While LoCoMo provided a controlled sandbox for exploring memory and persona behavior, REALTALK offers real-world complexity, drawn from human conversations across 21 days.

📄 LoCoMo (LLM-generated benchmark):

Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents
Website: https://snap-research.github.io/locomo/
Code: https://github.com/snap-research/LoCoMo

📚 Overview

Long-term, open-domain dialogue is critical for building conversational agents capable of demonstrating emotional intelligence (EI) and recalling past interactions. However, most existing benchmarks rely on synthetic, LLM-generated data, which lacks the depth, messiness, and variability of real conversations.

To address this gap, we introduce REALTALK: a 21-day corpus of authentic, real-world messaging app conversations between human participants. This enables a more rigorous evaluation of dialogue models in naturalistic settings.

🧠 Key Contributions

A real-world long-form dialogue dataset with emotional grounding and time-span variation.
Comparative analysis of emotional expression and persona consistency between REALTALK and LLM-generated dialogues (LoCoMo).
Two benchmark tasks designed for long-term conversation evaluation:
1. Persona Simulation – continue a chat as a specific user based on past messages.
2. Memory Probing – answer questions that require recalling facts from earlier interactions.

⚙️ Setup

Set your OpenAI API key for memory probing tasks:

export OPENAI_API_KEY=sk-...
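
Because a missing key usually only surfaces once the first API request fails, it can help to check for it up front. The following is a minimal sketch, not part of the repository: the helper name `require_api_key` is hypothetical, and only the `OPENAI_API_KEY` variable name comes from the instructions above.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; the repository may handle
    a missing key differently.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running memory probing."
        )
    return key
```

Calling `require_api_key()` at the start of a script fails fast with a readable message instead of partway through an evaluation run.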

📦 Data

data/*.json: Preprocessed REALTALK conversations (for evaluation and model input).
data/raw/*.xlsx: Raw message exports used to construct the dataset.
data/locomo/*.json: The LoCoMo synthetic dataset, split into individual files to match the REALTALK format.
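
The layout above can be explored with a few lines of Python. This is a sketch under assumptions: the function name `load_conversations` is hypothetical, and nothing is assumed about the JSON schema beyond each file being valid JSON. The non-recursive glob picks up only the files directly under the given directory, so data/locomo/ has to be loaded with a separate call.

```python
import json
from pathlib import Path


def load_conversations(data_dir: str = "data") -> dict:
    """Map each JSON file name directly under data_dir to its parsed content.

    Hypothetical helper; subdirectories such as data/locomo/ are not
    traversed because the glob pattern is non-recursive.
    """
    conversations = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            conversations[path.name] = json.load(f)
    return conversations
```

For example, `load_conversations("data")` would return the preprocessed REALTALK files and `load_conversations("data/locomo")` the split LoCoMo files, each keyed by file name.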

📊 Data Analysis

🧠 Speaker-Level Emotional Intelligence (EI)

We compare speaker-level emotional intelligence (EI) in REALTALK and LoCoMo across five attributes: self-awareness, empathy, motivation, social skills, and self-regulation.

Each speaker is evaluated using both LLM-based and classifier-based models, generating scores for:

Reflective frequency
Empathy average
Emotion and sentiment diversity
Social grounding and intimacy
Emotional alignment and recovery dynamics

Key Insight: Real-world conversations in REALTALK exhibit higher emotional diversity and greater speaker variance, while synthetic dialogues in LoCoMo tend to be uniform, overly empathetic, and emotionally constrained.

🔍 Comparison Setup

We split the original LoCoMo dataset into individual participant-level files (under data/locomo/) to match the format of REALTALK (under data/*.json). Both datasets can then be run through the same analysis, which keeps the comparison between their conversation sessions fair.
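
A split of this kind might look like the sketch below. It is not the script used to produce data/locomo/. The field names `conversation`, `speaker_a`, and `speaker_b` are assumptions about the input layout, the function name `split_dataset` is hypothetical, and the `Chat_<n>_<A>_<B>.json` naming pattern is inferred from the example file name `Chat_1_Caroline_Melanie.json` used in this README.

```python
import json
from pathlib import Path


def split_dataset(samples: list, out_dir: str) -> list:
    """Write one JSON file per conversation and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, sample in enumerate(samples, start=1):
        # Assumed layout: each sample carries a "conversation" dict
        # holding the two speaker names.
        conv = sample["conversation"]
        name = f"Chat_{i}_{conv['speaker_a']}_{conv['speaker_b']}.json"
        path = out / name
        path.write_text(
            json.dumps(sample, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        written.append(path)
    return written
```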

Each conversation is analyzed using five speaker-level emotional intelligence metrics:

Metric	Description
Self-Awareness	Speaker’s ability to recognize and articulate emotions and perspectives.
Empathy	Average empathy score of a speaker’s messages.
Motivation	Speaker’s engagement in maintaining the conversation.
Social Skills	Speaker’s ability to foster trust and engagement.
Self-Regulation	Speaker’s ability to maintain emotional and sentiment stability.

📊 Example Output

Below is an excerpt of speaker-level results from a LoCoMo chat session (Chat_1_Caroline_Melanie.json):

"reflective_frequencies": {
  "Caroline": 0.18,
  "Melanie": 0.09
},
"empathy_average": {
  "Caroline": 3.23,
  "Melanie": 3.46
},
"sentiment_stability": {
  "Caroline": 0.94,
  "Melanie": 0.96
},
"emotion_diversities": {
  "Caroline": 1.05,
  "Melanie": 0.89
}
...

All files were analyzed using the same evaluation pipeline, producing both:

Per-turn annotations (e.g., empathy, motivation)
Aggregate speaker-level metrics (as shown above)
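
To illustrate how per-turn annotations roll up into speaker-level metrics, the sketch below averages per-turn empathy scores by speaker, following the table's description of Empathy as the average empathy score of a speaker's messages. The turn fields `speaker` and `empathy`, the function name, and the rounding to two decimals are assumptions for illustration, not the pipeline's actual schema.

```python
from collections import defaultdict
from statistics import mean


def empathy_average(turns: list) -> dict:
    """Average the per-turn empathy scores for each speaker.

    Each turn is assumed to be a dict with "speaker" and "empathy" keys.
    """
    scores = defaultdict(list)
    for turn in turns:
        scores[turn["speaker"]].append(turn["empathy"])
    return {speaker: round(mean(vals), 2) for speaker, vals in scores.items()}
```

The other aggregate metrics would follow the same group-by-speaker pattern with a different reduction in place of the mean.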

🧪 Tasks

🗣️ Persona Simulation (Coming soon)

Given previous conversation history, continue the dialogue as one of the original participants.
Code and model evaluation details will be released shortly.

🧠 Memory Probing

Assess how well a model remembers facts from earlier parts of a long conversation.

🔧 Run evaluation:

python evaluate_memory.py \
    --data_path data/Chat_1_Emi_Elise.json \
    --qa_model gpt-4o-mini \
    --evaluate_model gpt-4o-mini

📝 Argument details:

--qa_model: The model used to generate answers to the probing questions.
--evaluate_model: The model used to assign a GPT score.
A lexical F1 score is also computed as a rule-based measure.
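
For readers unfamiliar with lexical F1, the sketch below shows one common token-overlap formulation from extractive question answering: lowercase the text, strip punctuation and English articles, then compute the harmonic mean of token precision and recall. It is an assumption that the repository's rule-based score resembles this. The exact normalization used in evaluate_memory.py may differ, and the function names here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def lexical_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike the model-assigned GPT score, a measure of this kind is deterministic and cheap, but it penalizes correct answers that are phrased differently from the reference.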

📌 Citation

If you use this dataset or benchmark in your work, please cite:

@article{lee2025realtalk,
  title={Realtalk: A 21-day real-world dataset for long-term conversation},
  author={Lee, Dong-Ho and Maharana, Adyasha and Pujara, Jay and Ren, Xiang and Barbieri, Francesco},
  journal={arXiv preprint arXiv:2502.13270},
  year={2025}
}
