A-I-nsight

Extract, parse, and analyze chat histories from Claude, ChatGPT, and Google Gemini using a local LLM.

Convert raw exported conversations into structured, actionable insights: goal achievement rates, skill progression, learning patterns, session analytics, and cross-platform topic mapping.

What It Does

This pipeline ingests AI chat exports (Claude, OpenAI/ChatGPT, Google Gemini via Takeout), normalizes them into a common format, summarizes each conversation using Qwen 3.5 27B via llama.cpp, and produces 11 analysis outputs plus an interactive HTML dashboard.

Pipeline Stages

Parse — Extract conversations from each source format (JSON/HTML)
Normalize — Common intermediate representation
Summarize — Local LLM (Qwen 3.5 27B) generates structured summaries
Analyze — Cross-source analytics on 1,795+ conversations

Key Outputs

Output	Purpose
repeated_topics.json	Unique tools/concepts with frequency, first/last seen
timeline.md	Monthly progression narrative and statistics
unresolved.md	Open questions by domain
skills_progression.json	Learned → demonstrated skill tracking (fuzzy matching)
traversal_analysis.json	Deviation/return patterns (ADHD tangent quantification)
goal_completion.json	Goal achievement rates by source/domain/complexity/month
cross_platform_topics.json	Topics spanning multiple sources (core interests identified)
complexity_trajectory.json	Complexity distribution over time
session_patterns.json	When you work (day/hour analysis + productivity)
domain_flow.json	Same-day domain sequences (tangent structure)
orphaned_skills.json	One-time-only skills
dashboard.html	Interactive Chart.js dashboard combining all analyses

Quick Start

Prerequisites

Python 3.10+
Qwen 3.5 27B GGUF model loaded in llama.cpp server
UV package manager (or pip)
GPU with 24 GB VRAM (RTX 3090 or equivalent) for comfortable inference times

1. Export Your Chat Data

See docs/DATA_EXPORT_GUIDE.md for detailed instructions per platform:

Claude Cloud — JSON export from settings
OpenAI/ChatGPT — Bulk export via account
Google Gemini — Google Takeout → MyActivity

2. Set Up Environment

# Clone repo
git clone https://github.com/your-username/chat-distillation-pipeline.git
cd chat-distillation-pipeline

# Create virtual environment (using UV)
uv venv .venv
source .venv/bin/activate

# Install dependencies
uv pip install openai beautifulsoup4 lxml

# Or with standard pip
pip install openai beautifulsoup4 lxml

3. Start Qwen Server

# Terminal 1: Start llama.cpp server
./llama-server -m models/qwen3.5-27b-q4_k_m.gguf \
  -c 32768 \
  --port 8080

# Verify server is running
curl http://localhost:8080/v1/models

You should see:

{"object":"list","data":[{"id":"qwen3.5-27b","object":"model"}]}
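The same check can be scripted before a long run. The sketch below is a minimal illustration, not part of the repository: `model_ids` and `fetch_models` are hypothetical helper names, and the sample body is copied from the expected output above.

```python
import json
import urllib.request


def model_ids(payload: str) -> list[str]:
    """Return the model ids listed in a /v1/models response body."""
    data = json.loads(payload)
    return [entry["id"] for entry in data.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8080") -> list[str]:
    """Query a running server (hypothetical helper; needs the server up)."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as response:
        return model_ids(response.read().decode("utf-8"))


# Sample body copied from the expected output shown above.
sample = '{"object":"list","data":[{"id":"qwen3.5-27b","object":"model"}]}'
print(model_ids(sample))
```

An empty list from `fetch_models()` would suggest the server is up but no model is loaded.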

4. Run the Pipeline

# Step 1: Parse all source files
python spike_claude_parser.py
python spike_gemini_parser.py data/MyActivity_Work.html work
python spike_gemini_parser.py data/MyActivity_Personal.html personal
python spike_openai_parser.py

# Outputs: parsed_*.json, skipped_*.json

# Step 2: Process through Qwen LLM
python process_all.py

# Outputs: output/{claude,gemini_work,gemini_personal,openai}/*.json (1,795 files)

# Step 3: Analyze and generate dashboard
python analyze.py

# Outputs: repeated_topics.json, timeline.md, ..., dashboard.html

5. View Dashboard

# Start a simple HTTP server
python -m http.server 8000

# Open browser (macOS; on Linux use xdg-open instead of open)
open http://localhost:8000/output/dashboard.html

Architecture

data/conversations.json ─► spike_claude_parser.py ─► parsed_claude.json ─┐
data/MyActivity_*.html ──► spike_gemini_parser.py ──► parsed_gemini_*.json ├─┐
data/openai/*.json ──────► spike_openai_parser.py ──► parsed_openai.json ─┤ │
                                                                           │ │
prompts/qwen_analyst.xml ────────────────────────────────────►           │ │
                                                                 ► process_all.py ─► output/{source}/*.json
                                                        Qwen 3.5 27B (llama.cpp)
                                                                           │ │
                                                                     analyze.py ─► 11 outputs
                                                                           │ │
                                                                           └─►dashboard.html

Processing Results

Total raw conversations: 1,945
Successfully parsed: 1,797 (92%)
Skipped: 148 (empty or >100k chars)
Summarization failures: 2 (context overflow >32k tokens), leaving 1,795 summarized conversations

Source	Raw	Parsed	Status
Claude	162	145	✓
Gemini Work	506	474	✓
Gemini Personal	728	656	✓
OpenAI/ChatGPT	549	522	✓
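The totals above follow directly from the per-source table. A quick arithmetic check, using only the numbers reported in this section (the variable names are illustrative):

```python
# Per-source counts copied from the table above.
raw = {"Claude": 162, "Gemini Work": 506, "Gemini Personal": 728, "OpenAI/ChatGPT": 549}
parsed = {"Claude": 145, "Gemini Work": 474, "Gemini Personal": 656, "OpenAI/ChatGPT": 522}

total_raw = sum(raw.values())        # 1945
total_parsed = sum(parsed.values())  # 1797
skipped = total_raw - total_parsed   # 148
summarized = total_parsed - 2        # two context-overflow failures -> 1795
parse_rate = total_parsed / total_raw

print(total_raw, total_parsed, skipped, summarized, f"{parse_rate:.0%}")

# Per-source parse rates, for spotting a parser that drops more than the others.
for source in raw:
    print(f"{source}: {parsed[source] / raw[source]:.1%}")
```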

Configuration

Qwen Summarization Prompt

Located in prompts/qwen_analyst.xml (raw text, loaded with Path.read_text()):

Temperature: 0.1 (near-deterministic)
Max tokens: 2,500 (balance between detail and inference speed)
Context window: 32,768 tokens

Customize the prompt to change how conversations are summarized (e.g., focus on technical depth, emotional tone, etc.).
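To make the settings concrete, here is a minimal sketch of how a request to the local server might be assembled. It is an assumption-laden illustration rather than the code in process_all.py: `build_payload` and `summarize` are hypothetical names, sending the prompt file as a system message is a guess, and the model id is taken from the sample server response. It uses only the standard library and the server's OpenAI-compatible chat endpoint.

```python
import json
import urllib.request
from pathlib import Path

TEMPERATURE = 0.1   # value listed above
MAX_TOKENS = 2500   # value listed above


def build_payload(system_prompt: str, conversation_text: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "qwen3.5-27b",  # assumed id, as reported by /v1/models
        "temperature": TEMPERATURE,
        "max_tokens": MAX_TOKENS,
        "messages": [
            # Assumption: the prompt file is sent as the system message.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": conversation_text},
        ],
    }


def summarize(conversation_text: str,
              prompt_path: str = "prompts/qwen_analyst.xml",
              base_url: str = "http://localhost:8080") -> str:
    """Send one conversation to the local server and return the raw reply."""
    system_prompt = Path(prompt_path).read_text(encoding="utf-8")
    body = json.dumps(build_payload(system_prompt, conversation_text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

Keeping the payload builder separate from the network call makes it easy to inspect what would be sent without a running server.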

Parsing Filters

Min/Max conversation length: 0–100,000 characters (configurable)
Min/Max message count: 1+ (for meaningful conversations)
Date range: No hard filter, but easily added per-source

See the parser scripts for tuning.
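As a rough picture of what such a filter could look like, the sketch below applies the limits listed above to a normalized conversation. The names `MAX_CHARS`, `MIN_MESSAGES`, and `skip_reason` are hypothetical, and measuring length as the sum of message text lengths is an assumption; the actual parsers may count differently.

```python
from typing import Optional

MAX_CHARS = 100_000  # upper bound listed above
MIN_MESSAGES = 1     # lower bound listed above


def skip_reason(conversation: dict) -> Optional[str]:
    """Return why a conversation would be skipped, or None to keep it."""
    messages = conversation.get("messages", [])
    # Assumption: length is the total number of characters across messages.
    total_chars = sum(len(m.get("text", "")) for m in messages)
    if len(messages) < MIN_MESSAGES or total_chars == 0:
        return "empty"
    if total_chars > MAX_CHARS:
        return "too_long"
    return None


print(skip_reason({"messages": [{"role": "user", "text": "hi"}]}))
```

Returning a reason string instead of a boolean makes it straightforward to write the skipped conversations, with their reason, to a skipped_*.json file.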

Skills Matching Threshold

Skills learned → Skills demonstrated matching uses Jaccard similarity ≥ 0.35 (35% word overlap) to handle LLM paraphrasing.

Adjust SKILL_MATCH_THRESHOLD in analyze.py: raise it if matches are too loose, lower it if they are too strict.
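For reference, Jaccard similarity on word sets is the size of the intersection divided by the size of the union. The sketch below shows one way to compute it; the tokenization (lowercased alphanumeric runs) and the helper names are assumptions, and analyze.py may normalize words differently.

```python
import re

SKILL_MATCH_THRESHOLD = 0.35  # value stated above


def word_set(skill: str) -> set[str]:
    """Lowercase a skill description and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", skill.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two word sets."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def skills_match(learned: str, demonstrated: str,
                 threshold: float = SKILL_MATCH_THRESHOLD) -> bool:
    """A pair matches when its similarity is at or above the threshold."""
    return jaccard(learned, demonstrated) >= threshold


# 3 shared words (sql, window, functions) out of 6 distinct words -> 0.5
print(jaccard("writing window functions in SQL", "using SQL window functions"))
```

Because a pair matches when the score is at or above the threshold, a higher threshold yields fewer matches and a lower one yields more.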

Project Structure

.
├── README.md                          # This file
├── ARCHITECTURE.md                    # Detailed system design
├── pyproject.toml                     # Dependencies
│
├── spike_claude_parser.py             # Parse Claude JSON exports
├── spike_gemini_parser.py             # Parse Gemini HTML exports (Takeout)
├── spike_openai_parser.py             # Parse OpenAI tree-structured JSONs
├── parse_xml.py                       # Shared XML→dict parser
│
├── process_all.py                     # Main pipeline: parse → summarize → save
├── analyze.py                         # Cross-source analysis (11 outputs + dashboard)
│
├── data/                              # Raw exports (git-ignored)
│   ├── conversations.json             # Claude export
│   ├── MyActivity_Work.html           # Google Takeout → MyActivity
│   ├── MyActivity_Personal.html       # Google Takeout → MyActivity
│   └── openai/                        # OpenAI bulk export
│
├── prompts/                           # LLM prompts
│   └── qwen_analyst.xml               # Summarization prompt (raw text)
│
├── output/                            # Analysis results (git-ignored)
│   ├── repeated_topics.json
│   ├── timeline.md
│   ├── dashboard.html
│   └── {claude,gemini_work,gemini_personal,openai}/ # Per-conversation summaries
│
├── docs/                              # Documentation
│   ├── ARCHITECTURE.md                # System design details
│   ├── DATA_EXPORT_GUIDE.md           # Export instructions per platform
│   └── decisions/                     # Architecture decision records
│
└── .github/                           # GitHub config
    └── copilot-instructions.md        # Context for AI coding assistants

How It Works

1. Parsing

Each source has distinct structure:

Claude (conversations.json): Array of objects with chat_messages[]
Gemini (MyActivity_Work/Personal.html): Flat HTML of divs; entries are grouped into conversations using regular expressions
OpenAI (conversations-*.json): Tree structure with current_node pointer and parent links
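The OpenAI tree is the least obvious of the three, so here is a minimal sketch of flattening it: start at current_node and follow parent links back to the root, which keeps only the branch that was active when the export was made. The field names `mapping`, `message`, `author`, `content`, and `parts` are assumptions about the export layout, and `linearize` is a hypothetical name, not the function in spike_openai_parser.py.

```python
def linearize(conversation: dict) -> list[dict]:
    """Walk from current_node up the parent links and return messages in order."""
    mapping = conversation["mapping"]          # assumed: node id -> node
    node_id = conversation.get("current_node")
    messages = []
    while node_id is not None:
        node = mapping[node_id]
        message = node.get("message")
        if message:
            role = message["author"]["role"]
            parts = message.get("content", {}).get("parts", [])
            # Keep only plain-text parts; other part types are ignored here.
            text = "\n".join(p for p in parts if isinstance(p, str)).strip()
            if role in ("user", "assistant") and text:
                messages.append({"role": role, "text": text})
        node_id = node.get("parent")
    messages.reverse()  # collected leaf-to-root, so flip to chronological order
    return messages


# Tiny example tree: the user message has two answers; "c" is the active one.
sample = {
    "current_node": "c",
    "mapping": {
        "root": {"parent": None, "message": None},
        "a": {"parent": "root", "message": {"author": {"role": "user"}, "content": {"parts": ["Hi"]}}},
        "b": {"parent": "a", "message": {"author": {"role": "assistant"}, "content": {"parts": ["Old answer"]}}},
        "c": {"parent": "a", "message": {"author": {"role": "assistant"}, "content": {"parts": ["New answer"]}}},
    },
}
print(linearize(sample))
```

Walking upward from the leaf avoids having to decide which child to follow at each fork, since abandoned branches such as regenerated answers are never visited.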

Parsers normalize all to:

{
  "id": "unique_id",
  "name": "conversation_title",
  "messages": [
    {"role": "user", "text": "..."},
    {"role": "assistant", "text": "..."}
  ],
  "date": "ISO date string"
}
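A small validator can catch parser regressions before the slow summarization step. The sketch below checks a record against the shape shown above; `validation_errors` is a hypothetical helper, and restricting roles to "user" and "assistant" is an assumption based on the example.

```python
REQUIRED_KEYS = {"id", "name", "messages", "date"}
ALLOWED_ROLES = ("user", "assistant")  # assumption: only these two roles occur


def validation_errors(record: dict) -> list[str]:
    """Return a list of problems found in one normalized conversation."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    for index, message in enumerate(record.get("messages", [])):
        if message.get("role") not in ALLOWED_ROLES:
            errors.append(f"messages[{index}]: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("text"), str):
            errors.append(f"messages[{index}]: text must be a string")
    return errors


good = {
    "id": "unique_id",
    "name": "conversation_title",
    "messages": [{"role": "user", "text": "..."}, {"role": "assistant", "text": "..."}],
    "date": "2024-01-01",
}
print(validation_errors(good))
```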

2. Summarization (Qwen LLM)

Each conversation is sent to Qwen 3.5 27B with the prompt in prompts/qwen_analyst.xml.

Output schema (XML parsed to dict):

{
  "primary_goal": "What the user was trying to accomplish",
  "goal_achieved": true/false,
  "summary": "Multi-sentence overview",
  "traversal_path": ["Started with: X", "Deviated to: Y", "Resolved deviation: yes"],
  "code_artifacts": true/false,
  "domain": "category (SQL, Python, DevOps, etc.)",
  "complexity": "basic|intermediate|advanced",
  "tools_and_concepts": ["pandas", "BigQuery", ...],
  "skills_learned": ["specific skill A", ...],
  "skills_demonstrated": ["specific skill B", ...],
  "unresolved": ["open question 1", ...]
}
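The XML-to-dict step can be sketched with the standard library. The layout assumed here (a single root element, one child element per field, and list fields holding `<item>` children) is a guess for illustration only; the real structure is defined by prompts/qwen_analyst.xml and handled by parse_xml.py, and `xml_to_dict` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Field groupings taken from the schema above.
BOOLEAN_FIELDS = {"goal_achieved", "code_artifacts"}
LIST_FIELDS = {"traversal_path", "tools_and_concepts", "skills_learned",
               "skills_demonstrated", "unresolved"}


def xml_to_dict(xml_text: str) -> dict:
    """Convert the model's XML reply into a plain dict (assumed layout)."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for child in root:
        text = (child.text or "").strip()
        if child.tag in LIST_FIELDS:
            # Assumption: list entries are wrapped in child elements.
            result[child.tag] = [(item.text or "").strip() for item in child]
        elif child.tag in BOOLEAN_FIELDS:
            result[child.tag] = text.lower() == "true"
        else:
            result[child.tag] = text
    return result


sample = """
<analysis>
  <primary_goal>Speed up a query</primary_goal>
  <goal_achieved>true</goal_achieved>
  <complexity>intermediate</complexity>
  <tools_and_concepts><item>pandas</item><item>BigQuery</item></tools_and_concepts>
  <unresolved></unresolved>
</analysis>
"""
print(xml_to_dict(sample))
```

`ET.fromstring` raises `ParseError` on malformed output, which gives a natural place to record a failed conversation instead of aborting the whole run.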

3. Analysis

analyze.py reads 1,795 JSON files and generates:

repeated_topics: Term frequency with temporal tracking
timeline.md: Month-by-month narrative
unresolved.md: Open questions grouped by domain
skills_progression: Fuzzy word-overlap matching between learned→demonstrated
traversal_analysis: "Never returned" rate quantifies ADHD patterns
goal_completion: Rates by domain, complexity, source, time
cross_platform: Venn diagram of topics across sources
complexity_trajectory: Skill curve visualization
session_patterns: When you work + productivity per time
domain_flow: Tangential topic jumping on same day
orphaned_skills: One-time touches never revisited
dashboard.html: Single-page Chart.js visualization
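As one example of these analyses, goal completion rates reduce to grouping summaries by a field and averaging goal_achieved. The sketch below is an illustration under assumptions, not the code in analyze.py: `load_summaries` and `goal_completion_by` are hypothetical names, and the per-source directory layout is taken from the project structure.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_summaries(root: str = "output") -> list[dict]:
    """Read every per-conversation summary under output/{source}/*.json."""
    return [json.loads(path.read_text(encoding="utf-8"))
            for path in sorted(Path(root).glob("*/*.json"))]


def goal_completion_by(summaries: list[dict], key: str = "domain") -> dict:
    """Return the share of conversations with goal_achieved true, per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [achieved, total]
    for summary in summaries:
        bucket = counts[summary.get(key, "unknown")]
        bucket[0] += bool(summary.get("goal_achieved"))
        bucket[1] += 1
    return {group: achieved / total for group, (achieved, total) in counts.items()}


example = [
    {"domain": "SQL", "goal_achieved": True},
    {"domain": "SQL", "goal_achieved": False},
    {"domain": "Python", "goal_achieved": True},
]
print(goal_completion_by(example))
```

Passing `key="complexity"` groups the same data by complexity level instead of domain.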

Development

Adding a New Source

Create spike_{source}_parser.py implementing:
- read_raw_file(path) → list[dict]
- Each dict has {"id", "name", "messages": [{"role", "text"}], "date"}
Add to process_all.py INPUTS list:
```
("parsed_{source}.json", "{source}"),
```

Test with:

python spike_{source}_parser.py > parsed_{source}.json
# Verify output structure
python -c "import json; json.load(open('parsed_{source}.json'))"
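A skeleton for such a parser might look like the sketch below. The raw layout it reads (a JSON array with `uuid`, `title`, `created_at`, and `turns` of `speaker`/`content`) is entirely hypothetical and stands in for whatever the new source exports; only the returned shape and the `read_raw_file` name come from the contract above.

```python
import json
from pathlib import Path

# Hypothetical mapping from the new source's speaker labels to pipeline roles.
ROLE_MAP = {"human": "user", "bot": "assistant"}


def normalize(raw_items: list[dict]) -> list[dict]:
    """Convert raw export items (hypothetical layout) to the common format."""
    conversations = []
    for item in raw_items:
        messages = [
            {"role": ROLE_MAP[turn["speaker"]], "text": turn["content"]}
            for turn in item.get("turns", [])
            if turn.get("speaker") in ROLE_MAP  # drop system/tool turns
        ]
        conversations.append({
            "id": item["uuid"],
            "name": item.get("title", ""),
            "messages": messages,
            "date": item.get("created_at", ""),
        })
    return conversations


def read_raw_file(path: str) -> list[dict]:
    """Load a raw export file and return normalized conversations."""
    raw_items = json.loads(Path(path).read_text(encoding="utf-8"))
    return normalize(raw_items)
```

Splitting file reading from normalization keeps the conversion logic testable on in-memory data.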

Tuning Summarization

Edit prompts/qwen_analyst.xml to change:

Analysis focus (technical vs. emotional, depth vs. brevity)
Output schema (add/remove fields)
Temperature for determinism vs. creativity

Then re-run process_all.py (idempotent — overwrites output files).

Cost & Speed

Inference time: ~30 sec/conversation (RTX 3090, q4_k_m quantization)
Full pipeline: ~15 hours for 1,800 conversations
Memory: 24GB VRAM (q4_k_m uses ~16GB)
Cost: $0 (open-source, local)

For comparison, OpenAI GPT-4 at the rate assumed here ($0.003 per 1K tokens) × ~500 tokens per conversation × 1,800 conversations ≈ $2.70; the local run has no per-token cost.
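The figures in this section can be reproduced with a few lines of arithmetic, which is handy for re-estimating a run on a different corpus size or GPU. All inputs below are the numbers quoted above; the variable names are illustrative.

```python
conversations = 1_800
seconds_per_conversation = 30      # RTX 3090, q4_k_m, as quoted above

total_hours = conversations * seconds_per_conversation / 3600
print(f"Local run: {total_hours:.1f} hours")

# Hosted-API comparison, using the rate and token count assumed above.
price_per_1k_tokens = 0.003
tokens_per_conversation = 500
api_cost = price_per_1k_tokens * tokens_per_conversation / 1000 * conversations
print(f"Hosted API: ${api_cost:.2f}")
```

Both estimates scale linearly, so doubling the number of conversations doubles the run time and the hosted cost.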

Troubleshooting

Qwen Server Won't Start

# Check if port 8080 is in use
lsof -i :8080

# Kill existing process
kill -9 <PID>

# Restart server
./llama-server -m models/qwen3.5-27b-q4_k_m.gguf -c 32768 --port 8080

"Context overflow" Error

Two conversations exceeded 32k tokens even after chunking. Increase context with:

./llama-server ... -c 65536  # Double context

Parsed JSON Missing Fields

Re-run parser in isolation to check for errors:

python spike_claude_parser.py 2>&1 | grep ERROR

Skills Internalization Rate Too Low

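Lower SKILL_MATCH_THRESHOLD in analyze.py (less strict matching). A pair matches when its Jaccard similarity is at or above the threshold, so raising the value makes matching stricter:

SKILL_MATCH_THRESHOLD = 0.25  # More permissive than the 0.35 setting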

License

MIT — Use and modify freely.

References

Questions or ideas? Open an issue or PR!
