Memory layer for AI.
Give your AI the ability to remember.
Your AI has the memory of a goldfish. Every conversation? Fresh start. That thing you mentioned yesterday? Gone. Your preferences? Vanished into the void.
Replica fixes this. It's a memory layer that gives AI the ability to actually remember things. Not just for 5 minutes. Not just within a single chat. But across conversations, sessions, and time.
Think of it as RAM for your AI's brain. Except it doesn't forget when you close the tab.
💬 "Remember when I told you about my trip to Tokyo last month?"
→ Replica searches 10,000+ memories
→ Finds: "User visited Tokyo in March 2026"
→ Returns relevant context in 50ms
AI amnesia is real. Without memory, your AI is like that friend who asks "wait, what were we talking about?" every 30 seconds.
- Every conversation starts from scratch
- Context windows are expensive (and finite)
- RAG alone doesn't cut it - you need structured memory, not just retrieval over raw text chunks
- Facts, events, plans, preferences... they all need different handling
Replica solves this. Automatically. No prompt engineering gymnastics required.
Replica doesn't just store text. It understands conversations and extracts structured memories:
- Episodes - "User discussed Python async programming best practices"
- Events - "User has a meeting tomorrow at 3 PM"
- Foresights - "User plans to learn Rust next week"
- User Profiles - Interests, skills, preferences, goals
Simple vector search? That's so 2023. Replica uses:
- Vector Search - Semantic similarity via pgvector
- Full-Text Search - PostgreSQL's battle-tested text search
- RRF Fusion - Reciprocal Rank Fusion (fancy way of saying "best of both")
- Temporal Decay - Recent stuff matters more (just like real memory)
- MMR Reranking - Diverse results, not 10 variations of the same thing
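How the fusion and decay steps fit together is easier to see in code. This is a minimal sketch, not Replica's actual implementation: the function names, the constant `k = 60`, the 30-day half-life, and the memory IDs are all illustrative assumptions.

```python
import math


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def apply_decay(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Halve the score for every half_life_days of age (assumed decay curve)."""
    return score * math.pow(0.5, age_days / half_life_days)


vector_hits = ["m1", "m2", "m3"]    # hypothetical IDs ranked by vector search
fulltext_hits = ["m3", "m1", "m4"]  # hypothetical IDs ranked by full-text search

fused = rrf_fuse([vector_hits, fulltext_hits])
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked)
```

IDs that rank well in both lists (`m1`, `m3`) rise above IDs that appear in only one. Multiplying each fused score by `apply_decay(...)` before sorting would then favor recent memories.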
Long conversations? No problem. Replica automatically:
- Tracks token counts in real-time
- Compresses old messages when hitting limits
- Keeps recent context fresh and relevant
- Extracts important info before compression
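The behavior above can be sketched roughly as follows. This is an illustration under stated assumptions, not Replica's code: `count_tokens` is a crude whitespace estimate standing in for a real tokenizer, and `summarize` is a hypothetical callable standing in for the LLM call that would extract and condense older messages.

```python
from typing import Callable

Message = dict[str, str]


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (a real tokenizer differs)."""
    return len(text.split())


def compress_history(
    messages: list[Message],
    max_tokens: int,
    keep_recent: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace older messages with one summary message once the budget is exceeded."""
    total = sum(count_tokens(m["content"]) for m in messages)
    split = len(messages) - keep_recent
    if total <= max_tokens or split <= 0:
        return messages  # under budget, or nothing old enough to compress
    old, recent = messages[:split], messages[split:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary, *recent]


history = [
    {"role": "user", "content": "I am planning a trip to Tokyo next month"},
    {"role": "assistant", "content": "That sounds exciting"},
    {"role": "user", "content": "What should I pack"},
]
compressed = compress_history(
    history,
    max_tokens=10,
    keep_recent=1,
    summarize=lambda old: f"Summary of {len(old)} earlier messages",
)
print(compressed)
```

The most recent message survives verbatim, and everything older collapses into a single summary entry.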
Chat with your AI and watch memories being created in real-time:
Real-time streaming chat with memory context
Database explorer for debugging and inspection
| Component | Requirement |
|---|---|
| Python | ≥ 3.13 |
| PostgreSQL | 17 + pgvector |
| Package Manager | uv |
| Node Runtime | Bun (recommended) or Node.js |
| LLM / Embedding | vLLM or any OpenAI-compatible API |
docker run -d --name pgvector \
-e POSTGRES_PASSWORD=password \
-p 5432:5432 \
pgvector/pgvector:pg17
docker exec -it pgvector psql -U postgres -c "CREATE DATABASE replica;"
docker exec -it pgvector psql -U postgres -d replica -c "CREATE EXTENSION IF NOT EXISTS vector;"

uv sync
uv run alembic upgrade head

Edit config/settings.yaml with your model endpoints:
llm:
  provider: "vllm"
  base_url: "http://localhost:19000/v1"
  model: "Qwen3.5-122B-A10B-FP8"
embedding:
  provider: "vllm"
  base_url: "http://localhost:19001/v1"
  model: "Qwen3-Embedding-4B"
  dimensions: 2560

💡 Full config reference: config/settings.yaml | Detailed guide: docs/guide.md
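Before running migrations, it can help to confirm that the embedding endpoint really returns vectors of the configured size. The sketch below assumes the server exposes the standard OpenAI-compatible `/embeddings` route; the helper names are hypothetical and not part of Replica.

```python
import json
import urllib.request


def build_embedding_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /embeddings route."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embedding_dimensions(response_body: bytes) -> int:
    """Return the vector length from an OpenAI-style embeddings response."""
    body = json.loads(response_body)
    return len(body["data"][0]["embedding"])


request = build_embedding_request(
    "http://localhost:19001/v1", "Qwen3-Embedding-4B", "hello"
)
print(request.full_url)

# With the embedding server running, compare the result against the
# `dimensions` value in config/settings.yaml:
# with urllib.request.urlopen(request) as response:
#     print(embedding_dimensions(response.read()))
```

If the reported length differs from `dimensions`, the pgvector column size and the model output will not match.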
Backend API (port 8790):
uv run uvicorn replica.main:app --host 0.0.0.0 --port 8790 --reload

Frontend UI (port 8780):
cd web
bun install
bun run dev

Then visit:
| URL | Description |
|---|---|
| http://localhost:8780 | Web UI |
| http://localhost:8790/docs | Swagger API Docs |
| http://localhost:8790/health | Health Check |
1. User chats with AI
   ↓
2. Replica stores messages
   ↓
3. When conversation reaches a natural boundary...
   ↓
4. Extract structured memories:
   • Episodes (what happened)
   • Events (specific facts)
   • Foresights (future plans)
   • User profile updates
   ↓
5. Generate embeddings
   ↓
6. Store in knowledge base
   ↓
7. Next time user asks something...
   ↓
8. Hybrid search retrieves relevant memories
   ↓
9. Inject into AI context
   ↓
10. AI responds with full memory context
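Step 8 is where the MMR reranking from the feature list would apply. The following is a minimal sketch of Maximal Marginal Relevance over plain Python lists, not Replica's implementation; the function names, the default `lam = 0.5`, and the toy 2-D vectors are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: list[float],
    candidates: dict[str, list[float]],
    top_k: int,
    lam: float = 0.5,
) -> list[str]:
    """Greedily pick IDs that are relevant to the query but unlike earlier picks."""
    selected: list[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:
        best_id, best_score = remaining[0], -math.inf
        for doc_id in remaining:
            relevance = cosine(query, candidates[doc_id])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[doc_id], candidates[s]) for s in selected),
                default=0.0,
            )
            score = lam * relevance - (1.0 - lam) * redundancy
            if score > best_score:
                best_id, best_score = doc_id, score
        selected.append(best_id)
        remaining.remove(best_id)
    return selected


# Hypothetical 2-D embeddings: "a" and "b" are near-duplicates, "c" is different.
vectors = {"a": [1.0, 0.9], "b": [1.0, 0.8], "c": [0.2, 1.0]}
diverse = mmr([1.0, 1.0], vectors, top_k=2)            # balances relevance and diversity
relevant = mmr([1.0, 1.0], vectors, top_k=2, lam=1.0)  # pure relevance
print(diverse, relevant)
```

With `lam=1.0` the near-duplicate `b` is returned alongside `a`; at `lam=0.5` the redundancy penalty lets `c` take the second slot instead.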
| Type | What It Stores | Example |
|---|---|---|
| Episode | Conversation summaries | "User asked about async/await patterns in Python and discussed event loops" |
| Event | Concrete facts | "User's birthday is March 15" |
| Foresight | Future intentions | "User wants to build a web scraper next month" |
| Evergreen | Long-term facts | "User is a software engineer living in Shanghai" |
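One way to picture these four types in code is a small tagged record. The sketch below is hypothetical: the class names, field names, and enum values are assumptions (only `entry_type` and `content` appear in the search response used elsewhere in this README), so treat it as a mental model rather than the real schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EntryType(str, Enum):
    """The four memory types from the table above (string values are assumed)."""

    EPISODE = "episode"
    EVENT = "event"
    FORESIGHT = "foresight"
    EVERGREEN = "evergreen"


@dataclass
class MemoryEntry:
    user_id: str
    entry_type: EntryType
    content: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


entries = [
    MemoryEntry("alice", EntryType.EVENT, "User's birthday is March 15"),
    MemoryEntry("alice", EntryType.FORESIGHT, "User wants to build a web scraper next month"),
]

# Different types can then be handled differently, e.g. surfacing only future plans.
foresights = [e.content for e in entries if e.entry_type is EntryType.FORESIGHT]
print(foresights)
```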
import asyncio
import json

import httpx


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # Create user
        user = await client.post(
            "http://localhost:8790/v1/users",
            json={"external_id": "alice", "name": "Alice"},
        )
        user_id = user.json()["id"]

        # Create session
        session = await client.post(
            f"http://localhost:8790/v1/users/{user_id}/sessions",
            json={},
        )
        session_id = session.json()["id"]

        # Stream chat (Server-Sent Events)
        async with client.stream(
            "POST",
            f"http://localhost:8790/v1/sessions/{session_id}/chat",
            json={"content": "What did I tell you about my trip?", "use_memory": True},
        ) as response:
            async for line in response.aiter_lines():
                # Each SSE payload line is prefixed with "data: "
                if line.startswith("data: "):
                    data = json.loads(line[6:])
                    if "token" in data:
                        print(data["token"], end="", flush=True)
                    elif "context" in data:
                        print("\n\nRetrieved memories:", data["context"])

        # Batch memory extraction
        response = await client.post(
            "http://localhost:8790/v1/memories",
            json={
                "new_raw_data_list": [
                    {"role": "user", "content": "I'm planning a trip to Tokyo next month"},
                    {"role": "assistant", "content": "That sounds exciting! Have you been before?"},
                    {"role": "user", "content": "No, first time. I want to visit Shibuya and try real ramen."},
                ],
                "user_id_list": ["alice"],
            },
        )
        print(f"Extracted {response.json()['memory_count']} memories")

        # Semantic search
        results = await client.post(
            "http://localhost:8790/v1/knowledge/search",
            json={"user_id": user_id, "query": "travel plans", "top_k": 5},
        )
        for memory in results.json():
            print(f"[{memory['entry_type']}] {memory['content']} (score: {memory['score']:.2f})")


asyncio.run(main())

Frontend: React 19 web interface (:8780)
Backend: FastAPI server (:8790)
- User/Session/Message APIs
- Memory extraction & knowledge search
- Context compression & embedding generation

LLM Services
- Main LLM (:19000) - Chat completion & memory extraction
- Embedding model (:19001) - Vector generation

Storage: PostgreSQL 17 + pgvector (:5432)
Backend (Python):
# Format code
uv run ruff format
# Lint & fix
uv run ruff check --fix
# Run tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=replica

Frontend (TypeScript/React):
cd web
# One-command check & fix (lint + format + import sorting)
bun run check
# Lint only
bun run lint
# Format only
bun run format

- Complete Guide - Configuration, concepts, and usage
- API Reference - Full API documentation
- Swagger UI - Interactive API explorer
MIT License - see LICENSE for details.
Built by developers tired of explaining the same thing to AI twice