Core Memory System Benchmark

A comprehensive evaluation framework for the CORE memory system, built on the LOCOMO benchmark. This repository lets anyone reproduce our benchmark results and compare their own memory system against Core's performance.

πŸ† Benchmark Results

Core demonstrates strong performance across all question categories in the LOCOMO evaluation:

  • Single Hop Questions: 91% accuracy
  • Multi Hop Questions: 85% accuracy
  • Open Domain: 71% accuracy
  • Temporal: 88% accuracy
  • Overall Average: 85% accuracy

[Figure: Core Benchmark Results]

A comparison with other memory systems (mem0, Memobase, Zep) shows Core's consistently high performance across all categories.

🚀 Quick Start

Prerequisites

  • Node.js 16+
  • OpenAI API key
  • Core memory system running (or your own memory system API)

Installation

git clone https://github.com/RedPlanetHQ/core-benchmark.git
cd core-benchmark
npm install

Configuration

cp .env.example .env

Edit .env with your settings:

# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Core Memory System API
API_KEY=your_core_api_key
BASE_URL=http://localhost:3033

Ingest Conversations

Before running evaluations, you need to ingest the LOCOMO conversations into your memory system:

node locomo/ingest_conversations.js

This will:

  • Load conversations from locomo/locomo10.json
  • Send conversation data to your memory system's /api/v1/ingest endpoint
  • Track ingestion progress to avoid duplicates
  • Prepare your memory system for evaluation
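
For reference, a single ingestion call might look like the sketch below. The /api/v1/ingest path comes from this repo; the request body fields (userId, message, timestamp) and the Bearer auth header are illustrative assumptions, so check locomo/ingest_conversations.js for the exact schema.

// Minimal sketch of one ingest request (field names are illustrative
// assumptions, not the confirmed payload schema).
const axios = require('axios');

async function ingestMessage(baseUrl, apiKey, userId, message) {
  const response = await axios.post(
    `${baseUrl}/api/v1/ingest`,
    {
      userId,                       // whose memory to write to
      message: message.text,        // the conversation turn being stored
      timestamp: message.timestamp, // when the turn occurred
    },
    { headers: { Authorization: `Bearer ${apiKey}` } }
  );
  return response.data;
}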

Run the Benchmark

npm run evaluate

This will evaluate your memory system against the LOCOMO dataset and output detailed performance metrics.

📊 What This Benchmark Tests

The LOCOMO benchmark evaluates memory systems across four critical dimensions:

Single Hop Questions (91% - Core)

Direct factual recall from conversation history.

Q: "What restaurant did Sarah recommend?"
Expected: "Chez Laurent"

Multi Hop Questions (85% - Core)

Complex reasoning across multiple conversation turns.

Q: "Based on the budget discussion, what's the revised timeline?"
Expected: "Pushed to Q3 due to 30% budget cut"

Open Domain Questions (71% - Core)

Questions requiring both memory and general knowledge.

Q: "What are the risks of the approach they discussed?"
Expected: "Market volatility and regulatory changes"

Temporal Questions (88% - Core)

Time-sensitive queries and chronological reasoning.

Q: "What happened after the client meeting last Tuesday?"
Expected: "Sent revised proposal on Wednesday"

🔧 How It Works

1. Memory Integration

The benchmark integrates with your memory system via standard APIs:

// Search for relevant context
const searchResults = await searchService.search(question, userId, {
  limit: 20,
  scoreThreshold: 0.7,
});

// Generate answer using retrieved context
const answer = await makeModelCall(messages, context);

2. Evaluation Process

Each question goes through:

  1. Context Retrieval: Search memory for relevant information
  2. Answer Generation: LLM generates response using context
  3. Intelligent Scoring: Compare against gold standard using LLM evaluation
  4. Performance Metrics: Calculate accuracy, retrieval rates, and error analysis
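
Put together, the per-question flow looks roughly like the sketch below. searchService.search and makeModelCall appear elsewhere in this README; evaluateAnswer and the exact return shape are illustrative assumptions (the real orchestration lives in locomo/evaluate_qa.js).

// Simplified per-question pipeline (an illustrative sketch, not the
// exact code in locomo/evaluate_qa.js).
async function evaluateQuestion(question, goldAnswer, userId) {
  // 1. Context Retrieval: search memory for relevant information
  const context = await searchService.search(question, userId, {
    limit: 20,
    scoreThreshold: 0.7,
  });

  // 2. Answer Generation: the LLM answers from the retrieved context
  const answer = await makeModelCall(
    [{ role: 'user', content: question }],
    context
  );

  // 3. Intelligent Scoring: a judge LLM compares against the gold answer
  //    (evaluateAnswer is a hypothetical helper; see services/evaluateService.js)
  const verdict = await evaluateAnswer(question, answer, goldAnswer);

  // 4. Performance Metrics: each verdict feeds the aggregate statistics
  return { question, answer, correct: verdict.correct, matchRatio: verdict.matchRatio };
}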

3. LLM-Based Evaluation

Instead of simple string matching, we use GPT-4 for intelligent evaluation:

"Generated answer contains sufficient matching content with gold standard"
→ CORRECT (Match ratio: 0.85)
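
A minimal sketch of that judge call is shown below, assuming the OpenAI Node SDK; the actual prompt wording and response parsing live in services/evaluateService.js, and the JSON verdict format here is an illustrative assumption.

// Hedged sketch of the LLM judge (prompt wording and verdict fields
// are assumptions; see services/evaluateService.js for the real logic).
async function judgeAnswer(openai, question, generated, gold) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'user',
        content:
          `Question: ${question}\n` +
          `Gold answer: ${gold}\n` +
          `Generated answer: ${generated}\n` +
          'Does the generated answer contain sufficient matching content ' +
          'with the gold standard? Reply with JSON like ' +
          '{"verdict": "CORRECT", "matchRatio": 0.85}.',
      },
    ],
  });
  return JSON.parse(completion.choices[0].message.content);
}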

πŸ—οΈ Architecture

core-benchmark/
├── locomo/                 # LOCOMO dataset and evaluation scripts
│   ├── evaluate_qa.js     # Main evaluation orchestrator
│   └── evals/             # Evaluation results
├── services/              # Core evaluation services
│   ├── qaService.js       # Question answering pipeline
│   ├── evaluateService.js # LLM-based answer evaluation
│   └── search.server.js   # Memory system integration
└── lib/                   # Utilities and model integration

🔀 Using Your Own Memory System

API Requirements

Your memory system needs these endpoints:

# Search/retrieval endpoint
POST /api/v1/search
{
  "query": "search query here"
}

# Response format (flexible - we'll adapt)
{
  "results": ["context1", "context2"],
  "episodes": ["episode1", "episode2"],
  "facts": [{"fact": "...", "validAt": "..."}]
}

Configuration Steps

  1. Update Search Integration (services/search.server.js):
async search(query, userId, options = {}) {
  const response = await this.axios.post('/api/v1/search', {
    query: query,
    // Add your system's specific parameters
    userId: userId,
    ...options
  });

  // Adapt response format for your system
  return {
    episodes: response.data.results || [],
    facts: response.data.facts || []
  };
}
  2. Set Your API Endpoint:
BASE_URL=https://your-memory-system.com
API_KEY=your_api_key
  3. Run Evaluation:
npm run evaluate

📈 Understanding Results

Output Files

  • evaluation_results.json: Complete metrics and analysis
  • evaluation_locomo*.json: Per-conversation detailed results
  • Console: Real-time progress and summary statistics

Key Metrics

  • Context Retrieval Rate: How often relevant context was found
  • QA Success Rate: Percentage of questions that received answers
  • Answer Accuracy: Percentage of correct answers (LLM-evaluated)
  • Category Performance: Breakdown by question type
  • Match Ratio: Semantic similarity scores
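
The headline numbers relate to each other as in the sketch below (illustrative; the real aggregation lives in locomo/evaluate_qa.js, and the per-question record fields are assumptions). Note that answer accuracy is computed over answered questions, which is why the sample output shows 1,025/1,205 rather than 1,025/1,247.

// How the summary metrics are derived from per-question results
// (field names on the `results` records are illustrative assumptions).
function summarizeResults(results) {
  const total = results.length;
  const withContext = results.filter((r) => r.contextFound).length;
  const answered = results.filter((r) => r.answer != null).length;
  const correct = results.filter((r) => r.correct).length;

  return {
    contextRetrievalRate: withContext / total, // context found / total
    qaSuccessRate: answered / total,           // answered / total
    answerAccuracy: correct / answered,        // correct / answered
  };
}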

Sample Output

=== CORE MEMORY SYSTEM RESULTS ===
Total questions: 1,247
Questions with retrieved context: 1,189/1,247 (95.3%)
Questions with generated answers: 1,205/1,247 (96.6%)
Correct answers: 1,025/1,205 (85.1%)

=== CATEGORY BREAKDOWN ===
Single Hop: 245/269 (91.1%)
Multi Hop: 156/184 (84.8%)
Open Domain: 287/405 (70.9%)
Temporal: 337/384 (87.8%)

🚀 Advanced Usage

Custom Dataset

Replace locomo/locomo10.json with your own conversation data:

[
  {
    "qa": [
      {
        "question": "What did Alice order for lunch?",
        "answer": "Caesar salad with chicken",
        "evidence": "Conversation excerpt supporting this answer",
        "category": 1
      }
    ]
  }
]

Batch Processing

Adjust batch size for your system's capabilities:

// In evaluate_qa.js
const batchSize = 15; // Reduce for rate-limited APIs

Custom Evaluation Prompts

Modify evaluation criteria in services/evaluateService.js:

const evaluationPrompt = `
Your custom evaluation criteria here...
Be generous with time format differences...
Focus on semantic similarity over exact matching...
`;

🤝 Contributing

We welcome contributions to improve the benchmark:

  1. New Question Categories: Add specialized evaluation criteria
  2. Memory System Integrations: Support for more memory APIs
  3. Evaluation Improvements: Enhanced scoring mechanisms
  4. Performance Optimizations: Faster evaluation pipelines

📞 Support

  • πŸ› Issues: GitHub Issues
  • πŸ“§ Questions: Create a discussion in the repository
  • πŸ“š Documentation: Check our Wiki

πŸ† Leaderboard

Want to add your memory system to our benchmark comparison? Submit your results:

  1. Run the full evaluation on the LOCOMO dataset
  2. Share your evaluation_results.json
  3. We'll add your system to the public leaderboard

Current standings:

  1. Core: 85% overall accuracy
  2. Memobase: 76% overall accuracy
  3. Zep: 75% overall accuracy
  4. mem0: 61% overall accuracy
