A comprehensive evaluation framework for CORE using the LOCOMO benchmark. This repository lets anyone reproduce our benchmark results and compare their own memory systems against Core's performance.
Core demonstrates strong performance across all question categories in the LOCOMO evaluation:
- Single Hop Questions: 91% accuracy
- Multi Hop Questions: 85% accuracy
- Open Domain: 71% accuracy
- Temporal: 88% accuracy
- Overall Average: 85% accuracy
Comparison with other memory systems (mem0, memobase, zep) shows Core's consistently high performance across all categories.
- Node.js 16+
- OpenAI API key
- Core memory system running (or your own memory system API)
```bash
git clone https://github.com/RedPlanetHQ/core-benchmark.git
cd core-benchmark
npm install
cp .env.example .env
```
Edit `.env` with your settings:
```bash
# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Core Memory System API
API_KEY=your_core_api_key
BASE_URL=http://localhost:3033
```
Before running evaluations, you need to ingest the LOCOMO conversations into your memory system:
```bash
node locomo/ingest_conversations.js
```
This will:
- Load conversations from `locomo/locomo10.json`
- Send conversation data to your memory system's `/api/v1/ingest` endpoint (see the sketch below)
- Track ingestion progress to avoid duplicates
- Prepare your memory system for evaluation
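The exact ingest payload depends on your memory system; the script handles this for Core. As a rough sketch only, assuming axios is available and assuming illustrative field names (`episodeBody`, `referenceTime`, `source`) and Bearer auth that are not taken from the documented Core API, one conversation turn might be sent like this:

```javascript
// Hypothetical sketch of a single ingest call; payload fields and the auth
// header are assumptions, not the documented Core API contract.
const axios = require("axios");

async function ingestTurn(turn) {
  await axios.post(
    `${process.env.BASE_URL}/api/v1/ingest`,
    {
      episodeBody: `${turn.speaker}: ${turn.text}`, // one conversation turn as text
      referenceTime: turn.timestamp,                // when the turn happened
      source: "locomo",
    },
    { headers: { Authorization: `Bearer ${process.env.API_KEY}` } }
  );
}
```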
```bash
npm run evaluate
```
This will evaluate your memory system against the LOCOMO dataset and output detailed performance metrics.
The LOCOMO benchmark evaluates memory systems across four critical dimensions:
- Single Hop: direct factual recall from conversation history.
  Example: Q: "What restaurant did Sarah recommend?" Expected: "Chez Laurent"
- Multi Hop: complex reasoning across multiple conversation turns.
  Example: Q: "Based on the budget discussion, what's the revised timeline?" Expected: "Pushed to Q3 due to 30% budget cut"
- Open Domain: questions requiring both memory and general knowledge.
  Example: Q: "What are the risks of the approach they discussed?" Expected: "Market volatility and regulatory changes"
- Temporal: time-sensitive queries and chronological reasoning.
  Example: Q: "What happened after the client meeting last Tuesday?" Expected: "Sent revised proposal on Wednesday"
The benchmark integrates with your memory system via standard APIs:
```javascript
// Search for relevant context
const searchResults = await searchService.search(question, userId, {
  limit: 20,
  scoreThreshold: 0.7,
});

// Generate answer using retrieved context
const answer = await makeModelCall(messages, context);
```
Each question goes through four steps (sketched in code after the list):
- Context Retrieval: Search memory for relevant information
- Answer Generation: LLM generates response using context
- Intelligent Scoring: Compare against gold standard using LLM evaluation
- Performance Metrics: Calculate accuracy, retrieval rates, and error analysis
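A minimal sketch of that per-question flow follows. `searchService.search` and `makeModelCall` appear in the repository snippets above, while `evaluateAnswer` is a stand-in for the scoring step in `services/evaluateService.js`, so the exact names and return shapes here are assumptions:

```javascript
// Rough sketch of the per-question pipeline (names and shapes are assumptions).
async function runQuestion(question, goldAnswer, userId) {
  // 1. Context Retrieval: pull candidate memories for the question
  const context = await searchService.search(question, userId, { limit: 20 });

  // 2. Answer Generation: LLM answers using only the retrieved context
  const answer = await makeModelCall(
    [{ role: "user", content: question }],
    context
  );

  // 3. Intelligent Scoring: LLM judge compares the answer to the gold answer
  const verdict = await evaluateAnswer(question, answer, goldAnswer);

  // 4. Performance Metrics: these per-question results are aggregated later
  return { question, answer, correct: verdict.isCorrect, matchRatio: verdict.matchRatio };
}
```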
Instead of simple string matching, we use GPT-4 for intelligent evaluation:
"Generated answer contains sufficient matching content with gold standard"
β CORRECT (Match ratio: 0.85)
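Conceptually, that judgment is a single LLM call that compares the generated answer with the gold answer and emits a verdict. The prompt wording, the single-argument `makeModelCall` usage, and the JSON response shape below are a hedged illustration of the idea, not the exact implementation in `services/evaluateService.js`:

```javascript
// Illustrative LLM-as-judge call; prompt wording, call signature, and the
// JSON response shape are all assumptions.
async function evaluateAnswer(question, generatedAnswer, goldAnswer) {
  const prompt = `You are grading a memory benchmark answer.
Question: ${question}
Gold answer: ${goldAnswer}
Generated answer: ${generatedAnswer}
Judge semantic similarity rather than exact string matching.
Reply with JSON only: {"isCorrect": true|false, "matchRatio": 0.0-1.0}`;

  const raw = await makeModelCall([{ role: "user", content: prompt }]);
  return JSON.parse(raw); // e.g. { isCorrect: true, matchRatio: 0.85 }
}
```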
```
core-benchmark/
├── locomo/                  # LOCOMO dataset and evaluation scripts
│   ├── evaluate_qa.js       # Main evaluation orchestrator
│   └── evals/               # Evaluation results
├── services/                # Core evaluation services
│   ├── qaService.js         # Question answering pipeline
│   ├── evaluateService.js   # LLM-based answer evaluation
│   └── search.server.js     # Memory system integration
└── lib/                     # Utilities and model integration
```
Your memory system needs these endpoints:
```
# Search/retrieval endpoint
POST /api/v1/search
{
  "query": "search query here"
}

# Response format (flexible - we'll adapt)
{
  "results": ["context1", "context2"],
  "episodes": ["episode1", "episode2"],
  "facts": [{"fact": "...", "validAt": "..."}]
}
```
- Update Search Integration (`services/search.server.js`):

```javascript
async search(query, userId, options = {}) {
  const response = await this.axios.post('/api/v1/search', {
    query: query,
    // Add your system's specific parameters
    userId: userId,
    ...options
  });

  // Adapt response format for your system
  return {
    episodes: response.data.results || [],
    facts: response.data.facts || []
  };
}
```
- Set Your API Endpoint:

```bash
BASE_URL=https://your-memory-system.com
API_KEY=your_api_key
```

- Run Evaluation:

```bash
npm run evaluate
```
- `evaluation_results.json`: Complete metrics and analysis
- `evaluation_locomo*.json`: Per-conversation detailed results
- Console: Real-time progress and summary statistics
- Context Retrieval Rate: How often relevant context was found
- QA Success Rate: Percentage of questions that received answers
- Answer Accuracy: Percentage of correct answers (LLM-evaluated)
- Category Performance: Breakdown by question type
- Match Ratio: Semantic similarity scores
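These aggregate numbers fall out of the per-question verdicts. A hedged sketch of the aggregation, where the result field names (`contextFound`, `category`, etc.) are assumptions rather than the repo's actual schema:

```javascript
// Sketch of turning per-question verdicts into the reported metrics;
// the result field names are assumptions, not the repo's schema.
function summarize(results) {
  const withContext = results.filter((r) => r.contextFound).length;
  const answered = results.filter((r) => r.answer != null).length;
  const correct = results.filter((r) => r.correct).length;

  const byCategory = {};
  for (const r of results) {
    const c = (byCategory[r.category] ??= { correct: 0, total: 0 });
    c.total += 1;
    if (r.correct) c.correct += 1;
  }

  return {
    contextRetrievalRate: withContext / results.length, // Context Retrieval Rate
    qaSuccessRate: answered / results.length,           // QA Success Rate
    answerAccuracy: correct / answered,                  // Answer Accuracy
    categoryPerformance: byCategory,                     // Category Performance
  };
}
```

The sample output below shows what these numbers look like for Core on LOCOMO.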
```
=== CORE MEMORY SYSTEM RESULTS ===
Total questions: 1,247
Questions with retrieved context: 1,189/1,247 (95.3%)
Questions with generated answers: 1,205/1,247 (96.6%)
Correct answers: 1,025/1,205 (85.1%)

=== CATEGORY BREAKDOWN ===
Single Hop: 245/269 (91.1%)
Multi Hop: 156/184 (84.8%)
Open Domain: 287/405 (70.9%)
Temporal: 337/384 (87.8%)
```
Replace `locomo/locomo10.json` with your own conversation data:
```json
[
  {
    "qa": [
      {
        "question": "What did Alice order for lunch?",
        "answer": "Caesar salad with chicken",
        "evidence": "Conversation excerpt supporting this answer",
        "category": 1
      }
    ]
  }
]
```
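Whatever data you substitute, each entry should keep the `question`, `answer`, `evidence`, and `category` fields the evaluator expects. A standalone sanity check (not part of the repo) could look like this:

```javascript
// Standalone sanity check for a custom dataset file (not part of the repo).
const fs = require("fs");

const conversations = JSON.parse(fs.readFileSync("locomo/locomo10.json", "utf8"));

conversations.forEach((conversation, i) => {
  (conversation.qa || []).forEach((qa, j) => {
    for (const field of ["question", "answer", "category"]) {
      if (qa[field] === undefined) {
        console.warn(`conversation ${i}, qa ${j}: missing "${field}"`);
      }
    }
  });
});
console.log("Dataset check complete.");
```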
Adjust batch size for your system's capabilities:
```javascript
// In evaluate_qa.js
const batchSize = 15; // Reduce for rate-limited APIs
```
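The batch size controls how many questions are evaluated concurrently; a smaller value keeps you under OpenAI or memory-system rate limits at the cost of wall-clock time. A generic sketch of that kind of batching (the actual loop in `evaluate_qa.js` may differ):

```javascript
// Generic batching helper; the actual loop in evaluate_qa.js may differ.
async function runInBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Run one batch concurrently, then move on, to respect rate limits
    results.push(...(await Promise.all(batch.map(worker))));
    console.log(`Processed ${Math.min(i + batchSize, items.length)}/${items.length}`);
  }
  return results;
}
```

For example, `await runInBatches(questions, batchSize, runQuestion)` with the per-question function sketched earlier.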
Modify evaluation criteria in `services/evaluateService.js`:
```javascript
const evaluationPrompt = `
  Your custom evaluation criteria here...
  Be generous with time format differences...
  Focus on semantic similarity over exact matching...
`;
```
We welcome contributions to improve the benchmark:
- New Question Categories: Add specialized evaluation criteria
- Memory System Integrations: Support for more memory APIs
- Evaluation Improvements: Enhanced scoring mechanisms
- Performance Optimizations: Faster evaluation pipelines
- Issues: GitHub Issues
- Questions: Create a discussion in the repository
- Documentation: Check our Wiki
Want to add your memory system to our benchmark comparison? Submit your results:
- Run the full evaluation on the LOCOMO dataset
- Share your `evaluation_results.json`
- We'll add your system to the public leaderboard
Current standings:
- Core: 85% overall accuracy
- Zep: 75% overall accuracy
- Memobase: 76% overall accuracy
- mem0: 61% overall accuracy