Extract, parse, and analyze chat histories from Claude, ChatGPT, and Google Gemini using a local LLM.
Convert raw exported conversations into structured, actionable insights: goal achievement rates, skill progression, learning patterns, session analytics, and cross-platform topic mapping.
This pipeline ingests AI chat exports (Claude, OpenAI/ChatGPT, Google Gemini via Takeout), normalizes them into a common format, summarizes each conversation using Qwen 3.5 27B via llama.cpp, and produces 11 analysis outputs plus an interactive HTML dashboard.
- Parse — Extract conversations from each source format (JSON/HTML)
- Normalize — Common intermediate representation
- Summarize — Local LLM (Qwen 3.5 27B) generates structured summaries
- Analyze — Cross-source analytics on 1,795+ conversations
| Output | Purpose |
|---|---|
| repeated_topics.json | Unique tools/concepts with frequency, first/last seen |
| timeline.md | Monthly progression narrative and statistics |
| unresolved.md | Open questions by domain |
| skills_progression.json | Learned → demonstrated skill tracking (fuzzy matching) |
| traversal_analysis.json | Deviation/return patterns (ADHD tangent quantification) |
| goal_completion.json | Goal achievement rates by source/domain/complexity/month |
| cross_platform_topics.json | Topics spanning multiple sources (core interests identified) |
| complexity_trajectory.json | Complexity distribution over time |
| session_patterns.json | When you work (day/hour analysis + productivity) |
| domain_flow.json | Same-day domain sequences (tangent structure) |
| orphaned_skills.json | One-time-only skills |
| dashboard.html | Interactive Chart.js dashboard combining all analyses |
- Python 3.10+
- Qwen 3.5 27B GGUF model loaded in llama.cpp server
- UV package manager (or pip)
- RTX 3090 (24 GB VRAM) or equivalent GPU, for comfortable inference timing
See docs/DATA_EXPORT_GUIDE.md for detailed instructions per platform:
- Claude Cloud — JSON export from settings
- OpenAI/ChatGPT — Bulk export via account
- Google Gemini — Google Takeout → MyActivity
# Clone repo
git clone https://github.com/your-username/chat-distillation-pipeline.git
cd chat-distillation-pipeline
# Create virtual environment (using UV)
uv venv .venv
source .venv/bin/activate
# Install dependencies
uv pip install openai beautifulsoup4 lxml
# Or with standard pip
pip install openai beautifulsoup4 lxml

# Terminal 1: Start llama.cpp server
./llama-server -m models/qwen3.5-27b-q4_k_m.gguf \
-c 32768 \
--port 8080
# Verify server is running
curl http://localhost:8080/v1/models

You should see:
{"object":"list","data":[{"id":"qwen3.5-27b","object":"model"}]}

# Step 1: Parse all source files
python spike_claude_parser.py
python spike_gemini_parser.py data/MyActivity_Work.html work
python spike_gemini_parser.py data/MyActivity_Personal.html personal
python spike_openai_parser.py
# Outputs: parsed_*.json, skipped_*.json
# Step 2: Process through Qwen LLM
python process_all.py
# Outputs: output/{claude,gemini_work,gemini_personal,openai}/*.json (1,795 files)
# Step 3: Analyze and generate dashboard
python analyze.py
# Outputs: repeated_topics.json, timeline.md, ..., dashboard.html

# Start a simple HTTP server
python -m http.server 8000
# Open browser
open http://localhost:8000/output/dashboard.html

data/conversations.json ──► spike_claude_parser.py ──► parsed_claude.json ────┐
data/MyActivity_*.html ───► spike_gemini_parser.py ──► parsed_gemini_*.json ──┤
data/openai/*.json ───────► spike_openai_parser.py ──► parsed_openai.json ────┤
                                                                              │
prompts/qwen_analyst.xml ──► process_all.py ◄─────────────────────────────────┘
Qwen 3.5 27B (llama.cpp) ──►       │
                                   ▼
                          output/{source}/*.json
                                   │
                                   ▼
                               analyze.py ──► 11 outputs
                                   │
                                   └──► dashboard.html
- Total raw conversations: 1,945
- Successfully parsed: 1,797 (92%)
- Skipped: 148 (empty or >100k chars)
- Failures: 2 (context overflow >32k tokens)
| Source | Raw | Parsed | Status |
|---|---|---|---|
| Claude | 162 | 145 | ✓ |
| Gemini Work | 506 | 474 | ✓ |
| Gemini Personal | 728 | 656 | ✓ |
| OpenAI/ChatGPT | 549 | 522 | ✓ |
Located in prompts/qwen_analyst.xml (raw text, loaded with Path.read_text()):
- Temperature: 0.1 (near-deterministic)
- Max tokens: 2,500 (balance between detail and inference speed)
- Context window: 32,768 tokens
Customize the prompt to change how conversations are summarized (e.g., focus on technical depth, emotional tone, etc.).
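The settings above map onto a single chat-completion request against the llama.cpp server. The following is a minimal sketch, not the project's actual code: it uses only the standard library rather than the `openai` package, and the function names `build_request` and `summarize`, as well as sending the prompt file as the system message, are assumptions made for illustration.

```python
import json
import urllib.request
from pathlib import Path

# llama.cpp's server exposes an OpenAI-compatible endpoint on the port used above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(system_prompt: str, conversation_text: str) -> dict:
    """Build the request body with the documented temperature and token limit."""
    return {
        "model": "qwen3.5-27b",
        "messages": [
            # Assumption: the prompt file is sent as the system message.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": conversation_text},
        ],
        "temperature": 0.1,
        "max_tokens": 2500,
    }


def summarize(conversation_text: str, prompt_path: str = "prompts/qwen_analyst.xml") -> str:
    """Send one conversation to the local server and return the raw model reply."""
    system_prompt = Path(prompt_path).read_text(encoding="utf-8")
    payload = json.dumps(build_request(system_prompt, conversation_text)).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `summarize("user: ...\nassistant: ...")` requires the server from the setup steps to be running; `build_request` can be inspected on its own without it.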
- Conversation length: up to 100,000 characters (configurable); empty conversations are skipped
- Message count: at least 1 (to keep only meaningful conversations)
- Date range: No hard filter, but easily added per-source
See the parser scripts for tuning.
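As a rough illustration of the filtering rules above, the sketch below decides whether a normalized conversation should be skipped. The names `skip_reason`, `MAX_CHARS`, and `MIN_MESSAGES` are hypothetical and may not match the constants used in the parser scripts.

```python
from typing import Optional

MAX_CHARS = 100_000   # upper length limit quoted above
MIN_MESSAGES = 1      # minimum message count quoted above


def conversation_chars(conversation: dict) -> int:
    """Total number of characters across all message texts."""
    return sum(len(m.get("text", "")) for m in conversation.get("messages", []))


def skip_reason(
    conversation: dict,
    max_chars: int = MAX_CHARS,
    min_messages: int = MIN_MESSAGES,
) -> Optional[str]:
    """Return why a conversation is skipped, or None if it should be kept."""
    messages = conversation.get("messages", [])
    if len(messages) < min_messages:
        return "too_few_messages"
    total = conversation_chars(conversation)
    if total == 0:
        return "empty"
    if total > max_chars:
        return "too_long"
    return None
```

Returning a reason string rather than a boolean makes it straightforward to write the skipped conversations, with their reasons, to a `skipped_*.json` file.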
Skills learned → Skills demonstrated matching uses Jaccard similarity ≥ 0.35 (35% word overlap) to handle LLM paraphrasing.
Adjust SKILL_MATCH_THRESHOLD in analyze.py if results are too loose or tight.
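Word-level Jaccard similarity is the size of the shared word set divided by the size of the combined word set. The sketch below shows one way to compute it; the tokenization (lowercased alphanumeric words) and the helper names `tokenize`, `jaccard`, and `skills_match` are assumptions, and analyze.py may differ in detail.

```python
import re

SKILL_MATCH_THRESHOLD = 0.35  # default quoted above


def tokenize(skill: str) -> set[str]:
    """Lowercase a skill phrase and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", skill.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def skills_match(
    learned: str,
    demonstrated: str,
    threshold: float = SKILL_MATCH_THRESHOLD,
) -> bool:
    """True when two skill phrases overlap enough to count as the same skill."""
    return jaccard(learned, demonstrated) >= threshold
```

For example, "docker compose networking" and "docker compose volumes setup" share 2 of 5 distinct words (similarity 0.4), so they match at the default threshold of 0.35 but not at 0.5. Raising the threshold therefore makes matching stricter.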
.
├── README.md # This file
├── ARCHITECTURE.md # Detailed system design
├── pyproject.toml # Dependencies
│
├── spike_claude_parser.py # Parse Claude JSON exports
├── spike_gemini_parser.py # Parse Gemini HTML exports (Takeout)
├── spike_openai_parser.py # Parse OpenAI tree-structured JSONs
├── parse_xml.py # Shared XML→dict parser
│
├── process_all.py # Main pipeline: parse → summarize → save
├── analyze.py # Cross-source analysis (11 outputs + dashboard)
│
├── data/ # Raw exports (git-ignored)
│ ├── conversations.json # Claude export
│ ├── MyActivity_Work.html # Google Takeout → MyActivity
│ ├── MyActivity_Personal.html # Google Takeout → MyActivity
│ └── openai/ # OpenAI bulk export
│
├── prompts/ # LLM prompts
│ └── qwen_analyst.xml # Summarization prompt (raw text)
│
├── output/ # Analysis results (git-ignored)
│ ├── repeated_topics.json
│ ├── timeline.md
│ ├── dashboard.html
│ └── {claude,gemini_work,gemini_personal,openai}/ # Per-conversation summaries
│
├── docs/ # Documentation
│ ├── ARCHITECTURE.md # System design details
│ ├── DATA_EXPORT_GUIDE.md # Export instructions per platform
│ └── decisions/ # Architecture decision records
│
└── .github/ # GitHub config
└── copilot-instructions.md # Context for AI coding assistants
Each source has a distinct structure:
- Claude (conversations.json): Array of objects with chat_messages[]
- Gemini (MyActivity_Work/Personal.html): Flat HTML with divs, grouped into conversations with regex
- OpenAI (conversations-*.json): Tree structure with a current_node pointer and parent links
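The OpenAI tree is the least obvious of the three to flatten. One approach is to start at `current_node` and follow parent links back to the root, which yields only the branch the user actually ended on. The sketch below assumes field names (`mapping`, `parent`, `message`, `author.role`, `content.parts`) that are not stated in this README, so verify them against your own export before relying on it; `linearize` is a hypothetical name.

```python
def linearize(conversation: dict) -> list[dict]:
    """Walk from current_node up the parent links and return messages in order."""
    mapping = conversation.get("mapping", {})  # assumed: node id -> node
    node_id = conversation.get("current_node")
    messages = []
    seen = set()  # guards against malformed exports with cyclic parent links

    while node_id is not None and node_id not in seen:
        seen.add(node_id)
        node = mapping.get(node_id)
        if node is None:
            break
        message = node.get("message")
        if message:
            role = (message.get("author") or {}).get("role")
            parts = (message.get("content") or {}).get("parts") or []
            text = "".join(p for p in parts if isinstance(p, str)).strip()
            # Keep only user/assistant turns that contain text.
            if role in ("user", "assistant") and text:
                messages.append({"role": role, "text": text})
        node_id = node.get("parent")

    messages.reverse()  # collected leaf-to-root, so flip to chronological order
    return messages
```

Branches created by regenerated or edited replies are dropped by this walk, because they are not ancestors of `current_node`.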
Parsers normalize all to:
{
"id": "unique_id",
"name": "conversation_title",
"messages": [
{"role": "user", "text": "..."},
{"role": "assistant", "text": "..."}
],
"date": "ISO date string"
}

Each conversation is sent to Qwen 3.5 27B with the prompt in prompts/qwen_analyst.xml.
Output schema (XML parsed to dict):
{
"primary_goal": "What the user was trying to accomplish",
"goal_achieved": true/false,
"summary": "Multi-sentence overview",
"traversal_path": ["Started with: X", "Deviated to: Y", "Resolved deviation: yes"],
"code_artifacts": true/false,
"domain": "category (SQL, Python, DevOps, etc.)",
"complexity": "basic|intermediate|advanced",
"tools_and_concepts": ["pandas", "BigQuery", ...],
"skills_learned": ["specific skill A", ...],
"skills_demonstrated": ["specific skill B", ...],
"unresolved": ["open question 1", ...]
}

analyze.py reads 1,795 JSON files and generates:
- repeated_topics: Term frequency with temporal tracking
- timeline.md: Month-by-month narrative
- unresolved.md: Open questions by domain
- skills_progression: Fuzzy word-overlap matching between learned→demonstrated
- traversal_analysis: "Never returned" rate quantifies ADHD patterns
- goal_completion: Rates by domain, complexity, source, time
- cross_platform: Venn diagram of topics across sources
- complexity_trajectory: Skill curve visualization
- session_patterns: When you work + productivity per time
- domain_flow: Tangential topic jumping on same day
- orphaned_skills: One-time touches never revisited
- dashboard.html: Single-page Chart.js visualization
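To make one of these analyses concrete, the goal_completion rates can be computed by grouping the per-conversation summaries on a field and dividing achieved goals by the group total. This is a sketch under the assumption that each summary dict follows the output schema shown earlier; `goal_completion_by` is a hypothetical name, not necessarily what analyze.py uses.

```python
from collections import defaultdict


def goal_completion_by(summaries: list[dict], key: str) -> dict[str, float]:
    """Fraction of conversations with goal_achieved == True, grouped by `key`."""
    totals = defaultdict(int)
    achieved = defaultdict(int)
    for summary in summaries:
        group = summary.get(key, "unknown")
        totals[group] += 1
        if summary.get("goal_achieved") is True:
            achieved[group] += 1
    return {group: achieved[group] / totals[group] for group in totals}
```

The same function covers the by-domain and by-complexity breakdowns by passing `"domain"` or `"complexity"` as the key; grouping by source or month would first require adding those fields to each summary.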
- Create spike_{source}_parser.py implementing read_raw_file(path) → list[dict]. Each dict has {"id", "name", "messages": [{"role", "text"}], "date"}.
- Add to the process_all.py INPUTS list: ("parsed_{source}.json", "{source}"),
- Test with:
  python spike_{source}_parser.py > parsed_{source}.json
  # Verify output structure
  python -c "import json; json.load(open('parsed_{source}.json'))"
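Loading the file with `json.load` only confirms that it is valid JSON. A slightly stronger check compares each conversation against the normalized format. The sketch below is one way to do that; `validate_conversation` is a hypothetical helper, and restricting roles to "user" and "assistant" is an assumption based on the example format.

```python
REQUIRED_KEYS = {"id", "name", "messages", "date"}
ALLOWED_ROLES = {"user", "assistant"}  # assumption: only these two roles occur


def validate_conversation(conversation: dict) -> list[str]:
    """Return a list of problems; an empty list means the structure looks right."""
    problems = []
    missing = REQUIRED_KEYS - set(conversation)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for index, message in enumerate(conversation.get("messages", [])):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("text"), str):
            problems.append(f"message {index}: text is not a string")
    return problems
```

Running this over every entry of parsed_{source}.json and printing the non-empty results gives a quick structural report before any conversation is sent to the LLM.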
Edit prompts/qwen_analyst.xml to change:
- Analysis focus (technical vs. emotional, depth vs. brevity)
- Output schema (add/remove fields)
- Temperature for determinism vs. creativity
Then re-run process_all.py (idempotent — overwrites output files).
- Inference time: ~30 sec/conversation (RTX 3090, q4_k_m quantization)
- Full pipeline: ~15 hours for 1,800 conversations
- Memory: 24GB VRAM (q4_k_m uses ~16GB)
- Cost: $0 (open-source, local)
For comparison, at $0.003/K tokens and ~500 tokens per conversation, OpenAI GPT-4 would cost about $0.003 × 0.5K × 1,800 ≈ $2.70 for the same run (versus $0 in API fees locally).
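The runtime and cost figures above follow from simple arithmetic, reproduced here so they can be re-estimated for a different corpus size or price. The function names are hypothetical, and the inputs are the approximate values quoted in this section.

```python
def pipeline_hours(conversations: int, seconds_each: float) -> float:
    """Total sequential inference time in hours."""
    return conversations * seconds_each / 3600


def api_cost(conversations: int, tokens_each: int, price_per_1k_tokens: float) -> float:
    """Estimated API cost in dollars for the same workload."""
    return conversations * tokens_each / 1000 * price_per_1k_tokens


# Figures quoted above: ~30 s per conversation, 1,800 conversations,
# ~500 tokens per conversation at $0.003 per 1K tokens.
hours = pipeline_hours(1_800, 30)
cost = api_cost(1_800, 500, 0.003)
print(f"{hours:.1f} hours locally, about ${cost:.2f} via the API")
```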
# Check if port 8080 is in use
lsof -i :8080
# Kill existing process
kill -9 <PID>
# Restart server
./llama-server -m models/qwen3.5-27b-q4_k_m.gguf -c 32768 --port 8080

Two conversations exceeded 32k tokens even after chunking. Increase context with:
./llama-server ... -c 65536 # Double context

Re-run parser in isolation to check for errors:
python spike_claude_parser.py 2>&1 | grep ERROR

If skill matches are too loose, adjust SKILL_MATCH_THRESHOLD in analyze.py upward (a higher threshold means stricter matching):
SKILL_MATCH_THRESHOLD = 0.5 # Stricter: requires at least 50% word overlap

MIT. Use and modify freely.
Questions or ideas? Open an issue or PR!