danny911kr/REALTALK

REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

This repository contains the dataset, models, and evaluation code from our paper:
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

Authors: Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, Francesco Barbieri


🔁 Continuation of LoCoMo (Evaluating Very Long-Term Conversational Memory of LLM Agents)

REALTALK builds on our previous work, LoCoMo, a synthetic dataset generated by LLMs to simulate long-term dialogue. While LoCoMo provided a controlled sandbox for exploring memory and persona behavior, REALTALK offers real-world complexity, drawn from human conversations collected over 21 days.

📄 LoCoMo (LLM-generated benchmark):


📚 Overview

Long-term, open-domain dialogue is critical for building conversational agents capable of demonstrating emotional intelligence (EI) and recalling past interactions. However, most existing benchmarks rely on synthetic, LLM-generated data, which lacks the depth, messiness, and variability of real conversations.

To address this gap, we introduce REALTALK: a 21-day corpus of authentic, real-world messaging app conversations between human participants. This enables a more rigorous evaluation of dialogue models in naturalistic settings.

🧠 Key Contributions

  • A real-world long-form dialogue dataset with emotional grounding and time-span variation.
  • Comparative analysis of emotional expression and persona consistency between REALTALK and LLM-generated dialogues (LoCoMo).
  • Two benchmark tasks designed for long-term conversation evaluation:
    1. Persona Simulation – continue a chat as a specific user based on past messages.
    2. Memory Probing – answer questions that require recalling facts from earlier interactions.

📁 Table of Contents

  1. Setup
  2. Data
  3. Data Analysis
  4. Tasks

⚙️ Setup

Set your OpenAI API key for memory probing tasks:

export OPENAI_API_KEY=sk-...
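
A missing key usually only surfaces once the first API call fails. As a minimal sketch (the helper name `require_api_key` is hypothetical and not part of this repository), a script can check the variable up front:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key from the environment, or raise with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before running the memory probing task."
        )
    return key
```

Calling `require_api_key()` at the start of a run fails fast instead of partway through an evaluation.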

📦 Data

  • data/*.json: Preprocessed REALTALK conversations (for evaluation and model input).
  • data/raw/*.xlsx: Raw message exports used to construct the dataset.
  • data/locomo/*.json: The LoCoMo synthetic dataset, split into separate files to match the REALTALK format.
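
The layout above can be read with the standard library alone. The sketch below assumes only that each file is valid JSON; the internal schema is not documented in this section, so the helper (`load_conversations`, a hypothetical name) returns the parsed content untouched:

```python
import json
from pathlib import Path


def load_conversations(data_dir="data"):
    """Map each top-level *.json file name (without extension) to its parsed content."""
    conversations = {}
    # glob("*.json") is non-recursive, so data/raw/ and data/locomo/ are skipped.
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            conversations[path.stem] = json.load(f)
    return conversations
```

Passing `data_dir="data/locomo"` would load the LoCoMo split the same way.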

📊 Data Analysis

🧠 Speaker-Level Emotional Intelligence (EI)

We compare speaker-level emotional intelligence (EI) in REALTALK and LoCoMo across five attributes: self-awareness, empathy, motivation, social skills, and self-regulation.

Each speaker is evaluated using both LLM-based and classifier-based models, generating scores for:

  • Reflective frequency
  • Empathy average
  • Emotion and sentiment diversity
  • Social grounding and intimacy
  • Emotional alignment and recovery dynamics

Key Insight: Real-world conversations in REALTALK exhibit higher emotional diversity and greater variance across speakers, while synthetic dialogues in LoCoMo tend to be uniform, overly empathetic, and emotionally constrained.

🔍 Comparison Setup

We split the original LoCoMo dataset into individual participant-level files (under data/locomo/) to match the format of REALTALK (under data/*.json). This ensures a fair comparison between corresponding conversation sessions across both datasets.

Each conversation is analyzed using five speaker-level emotional intelligence metrics:

| Metric | Description |
| --- | --- |
| Self-Awareness | Speaker’s ability to recognize and articulate emotions and perspectives. |
| Empathy | Average empathy score of a speaker’s messages. |
| Motivation | Speaker’s engagement in maintaining the conversation. |
| Social Skills | Speaker’s ability to foster trust and engagement. |
| Self-Regulation | Speaker’s ability to maintain emotional and sentiment stability. |

📊 Example Output

Below is an excerpt of speaker-level results from a LoCoMo chat session (Chat_1_Caroline_Melanie.json):

"reflective_frequencies": {
  "Caroline": 0.18,
  "Melanie": 0.09
},
"empathy_average": {
  "Caroline": 3.23,
  "Melanie": 3.46
},
"sentiment_stability": {
  "Caroline": 0.94,
  "Melanie": 0.96
},
"emotion_diversities": {
  "Caroline": 1.05,
  "Melanie": 0.89
}
...
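
The excerpt is keyed by metric first and speaker second. For side-by-side comparison it can help to pivot it into one profile per speaker. The sketch below reuses the four values shown above, wrapped in braces so they parse; the helper name `by_speaker` is an assumption for illustration, not part of the released pipeline:

```python
import json

# Values copied from the excerpt above; the enclosing braces are added so it parses.
excerpt = """
{
  "reflective_frequencies": {"Caroline": 0.18, "Melanie": 0.09},
  "empathy_average": {"Caroline": 3.23, "Melanie": 3.46},
  "sentiment_stability": {"Caroline": 0.94, "Melanie": 0.96},
  "emotion_diversities": {"Caroline": 1.05, "Melanie": 0.89}
}
"""


def by_speaker(metrics):
    """Pivot {metric: {speaker: value}} into {speaker: {metric: value}}."""
    pivot = {}
    for metric, scores in metrics.items():
        for speaker, value in scores.items():
            pivot.setdefault(speaker, {})[metric] = value
    return pivot


profiles = by_speaker(json.loads(excerpt))
for speaker, profile in profiles.items():
    print(speaker, profile)
```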

All files were analyzed using the same evaluation pipeline, producing both:

  • Per-turn annotations (e.g., empathy, motivation)
  • Aggregate speaker-level metrics (as shown above)
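
As an illustration of how per-turn annotations roll up into a speaker-level metric, the sketch below averages a per-turn empathy score for each speaker, following the metric's description as an average over a speaker's messages. The turn structure (`speaker` and `empathy` keys), the rounding, and the sample values are assumptions made for this example and are not taken from the dataset or the released pipeline:

```python
from collections import defaultdict
from statistics import mean


def empathy_average(turns):
    """Average the per-turn empathy score for each speaker."""
    scores = defaultdict(list)
    for turn in turns:
        scores[turn["speaker"]].append(turn["empathy"])
    return {speaker: round(mean(values), 2) for speaker, values in scores.items()}


# Made-up turns for illustration only; not taken from the dataset.
turns = [
    {"speaker": "A", "empathy": 3},
    {"speaker": "B", "empathy": 4},
    {"speaker": "A", "empathy": 2},
]
print(empathy_average(turns))
```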

🧪 Tasks

🗣️ Persona Simulation (Coming soon)

Given previous conversation history, continue the dialogue as one of the original participants.
Code and model evaluation details will be released shortly.


🧠 Memory Probing

Assess how well a model remembers facts from earlier parts of a long conversation.

🔧 Run evaluation:

python evaluate_memory.py \
    --data_path data/Chat_1_Emi_Elise.json \
    --qa_model gpt-4o-mini \
    --evaluate_model gpt-4o-mini
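
To evaluate every conversation rather than a single file, the command above can be generated per file. This is a sketch that assumes the script is run from the repository root and that the three flags shown are sufficient; `build_command` and `run_all` are hypothetical helpers, and `run_all` only prints the commands unless `dry_run=False`:

```python
import subprocess
import sys
from pathlib import Path


def build_command(data_path, qa_model="gpt-4o-mini", evaluate_model="gpt-4o-mini"):
    """Build the evaluate_memory.py invocation shown above for one file."""
    return [
        sys.executable, "evaluate_memory.py",
        "--data_path", str(data_path),
        "--qa_model", qa_model,
        "--evaluate_model", evaluate_model,
    ]


def run_all(data_dir="data", dry_run=True):
    """Run (or, by default, just print) the evaluation for every top-level JSON file."""
    for path in sorted(Path(data_dir).glob("*.json")):
        cmd = build_command(path)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop on the first failing evaluation.
            subprocess.run(cmd, check=True)
```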

📝 Argument details:

  • --qa_model: The model used to generate answers to the probing questions.
  • --evaluate_model: The model used to assign a GPT score to each generated answer.
    Independently of this model, a lexical F1 score is also computed as a rule-based measure.
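
For readers unfamiliar with lexical F1, the sketch below shows the common token-overlap formulation: precision and recall over tokens shared between a predicted and a reference answer. The tokenization (lowercasing, alphanumeric tokens only) is an assumption, and the exact normalization in evaluate_memory.py may differ:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Two empty answers match; one empty answer scores zero.
        return float(pred == ref)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on shared tokens, a correct paraphrase can score low, which is one reason a model-assigned score is reported alongside it.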

📌 Citation

If you use this dataset or benchmark in your work, please cite:

@article{lee2025realtalk,
  title={Realtalk: A 21-day real-world dataset for long-term conversation},
  author={Lee, Dong-Ho and Maharana, Adyasha and Pujara, Jay and Ren, Xiang and Barbieri, Francesco},
  journal={arXiv preprint arXiv:2502.13270},
  year={2025}
}
