fran98m/a-i-nsight

A-I-nsight

Extract, parse, and analyze chat histories from Claude, ChatGPT, and Google Gemini using a local LLM.

Convert raw exported conversations into structured, actionable insights: goal achievement rates, skill progression, learning patterns, session analytics, and cross-platform topic mapping.

What It Does

This pipeline ingests AI chat exports (Claude, OpenAI/ChatGPT, Google Gemini via Takeout), normalizes them into a common format, summarizes each conversation using Qwen 3.5 27B via llama.cpp, and produces 11 analysis outputs plus an interactive HTML dashboard.

Pipeline Stages

  1. Parse — Extract conversations from each source format (JSON/HTML)
  2. Normalize — Common intermediate representation
  3. Summarize — Local LLM (Qwen 3.5 27B) generates structured summaries
  4. Analyze — Cross-source analytics on 1,795+ conversations

Key Outputs

| Output | Purpose |
| --- | --- |
| repeated_topics.json | Unique tools/concepts with frequency, first/last seen |
| timeline.md | Monthly progression narrative and statistics |
| unresolved.md | Open questions by domain |
| skills_progression.json | Learned → demonstrated skill tracking (fuzzy matching) |
| traversal_analysis.json | Deviation/return patterns (ADHD tangent quantification) |
| goal_completion.json | Goal achievement rates by source/domain/complexity/month |
| cross_platform_topics.json | Topics spanning multiple sources (core interests identified) |
| complexity_trajectory.json | Complexity distribution over time |
| session_patterns.json | When you work (day/hour analysis + productivity) |
| domain_flow.json | Same-day domain sequences (tangent structure) |
| orphaned_skills.json | One-time-only skills |
| dashboard.html | Interactive Chart.js dashboard combining all analyses |

Quick Start

Prerequisites

1. Export Your Chat Data

See docs/DATA_EXPORT_GUIDE.md for detailed instructions per platform.

2. Set Up Environment

# Clone repo
git clone https://github.com/fran98m/a-i-nsight.git
cd a-i-nsight

# Create virtual environment (using UV)
uv venv .venv
source .venv/bin/activate

# Install dependencies
uv pip install openai beautifulsoup4 lxml

# Or with standard pip
pip install openai beautifulsoup4 lxml

3. Start Qwen Server

# Terminal 1: Start llama.cpp server
./llama-server -m models/qwen3.5-27b-q4_k_m.gguf \
  -c 32768 \
  --port 8080

# Verify server is running
curl http://localhost:8080/v1/models

You should see:

{"object":"list","data":[{"id":"qwen3.5-27b","object":"model"}]}

4. Run the Pipeline

# Step 1: Parse all source files
python spike_claude_parser.py
python spike_gemini_parser.py data/MyActivity_Work.html work
python spike_gemini_parser.py data/MyActivity_Personal.html personal
python spike_openai_parser.py

# Outputs: parsed_*.json, skipped_*.json

# Step 2: Process through Qwen LLM
python process_all.py

# Outputs: output/{claude,gemini_work,gemini_personal,openai}/*.json (1,795 files)

# Step 3: Analyze and generate dashboard
python analyze.py

# Outputs: repeated_topics.json, timeline.md, ..., dashboard.html

5. View Dashboard

# Start a simple HTTP server
python -m http.server 8000

# Open in a browser (macOS; use xdg-open on Linux)
open http://localhost:8000/output/dashboard.html

Architecture

data/conversations.json ──► spike_claude_parser.py ──► parsed_claude.json   ──┐
data/MyActivity_*.html  ──► spike_gemini_parser.py ──► parsed_gemini_*.json ──┤
data/openai/*.json      ──► spike_openai_parser.py ──► parsed_openai.json   ──┤
prompts/qwen_analyst.xml ─────────────────────────────────────────────────────┤
                                                                              │
            process_all.py (Qwen 3.5 27B via llama.cpp) ◄─────────────────────┘
                  │
                  ▼
            output/{source}/*.json ──► analyze.py ──► 11 outputs + dashboard.html

Processing Results

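  • Total raw conversations: 1,945
  • Successfully parsed: 1,797 (92%)
  • Skipped during parsing: 148 (empty or >100k chars)
  • Summarization failures: 2 (context overflow, >32k tokens), leaving 1,795 summaries

| Source | Raw | Parsed |
| --- | --- | --- |
| Claude | 162 | 145 |
| Gemini Work | 506 | 474 |
| Gemini Personal | 728 | 656 |
| OpenAI/ChatGPT | 549 | 522 |
| Total | 1,945 | 1,797 |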

Configuration

Qwen Summarization Prompt

Located in prompts/qwen_analyst.xml (raw text, loaded with Path.read_text()):

  • Temperature: 0.1 (near-deterministic)
  • Max tokens: 2,500 (balance between detail and inference speed)
  • Context window: 32,768 tokens

Customize the prompt to change how conversations are summarized (e.g., focus on technical depth, emotional tone, etc.).
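
As a rough sketch of how these settings might be applied, the snippet below builds a chat-completion request for the llama.cpp server's OpenAI-compatible endpoint. The function names, the message layout (prompt as the system message, transcript as the user message), the placeholder API key, and the default model id are assumptions for illustration; process_all.py may be organized differently.

```python
from pathlib import Path

# Values taken from the settings listed above.
TEMPERATURE = 0.1
MAX_TOKENS = 2500
BASE_URL = "http://localhost:8080/v1"  # llama.cpp server from Quick Start


def build_request(prompt: str, conversation: dict, model: str = "qwen3.5-27b") -> dict:
    """Build keyword arguments for a chat-completion call (hypothetical layout)."""
    transcript = "\n".join(
        f"{m['role']}: {m['text']}" for m in conversation["messages"]
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": transcript},
        ],
        "temperature": TEMPERATURE,
        "max_tokens": MAX_TOKENS,
    }


def summarize(conversation: dict, prompt_path: str = "prompts/qwen_analyst.xml") -> str:
    """Send one conversation to the local server and return the raw reply."""
    from openai import OpenAI  # imported lazily so build_request works without it

    client = OpenAI(base_url=BASE_URL, api_key="not-needed")  # placeholder key
    prompt = Path(prompt_path).read_text()
    response = client.chat.completions.create(**build_request(prompt, conversation))
    return response.choices[0].message.content
```

Keeping the request construction separate from the network call makes it possible to inspect the temperature and token limit without a running server.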

Parsing Filters

  • Conversation length: up to 100,000 characters (configurable); empty conversations are skipped
  • Message count: at least 1
  • Date range: no hard filter, but one is easy to add per source

See the parser scripts for tuning.
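
The filters above could be expressed roughly as follows. This is a minimal sketch, not the parsers' actual code: the constant names and the split_by_filters helper are hypothetical, and treating a conversation with no text as "empty" is an assumption based on the skip reasons listed under Processing Results.

```python
MAX_CHARS = 100_000   # upper bound from the list above
MIN_MESSAGES = 1      # at least one message


def split_by_filters(conversations: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, skipped) according to the length and message-count limits."""
    kept, skipped = [], []
    for convo in conversations:
        messages = convo.get("messages", [])
        total_chars = sum(len(m.get("text", "")) for m in messages)
        # Assumption: a conversation with no text at all counts as empty.
        if len(messages) < MIN_MESSAGES or total_chars == 0 or total_chars > MAX_CHARS:
            skipped.append(convo)
        else:
            kept.append(convo)
    return kept, skipped
```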

Skills Matching Threshold

Skills learned → Skills demonstrated matching uses Jaccard similarity ≥ 0.35 (35% word overlap) to handle LLM paraphrasing.

Adjust SKILL_MATCH_THRESHOLD in analyze.py if results are too loose or tight.
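
To make the threshold concrete, here is a minimal sketch of word-level Jaccard matching. The tokenize, jaccard, and is_match helpers are illustrative names, and the tokenization rule (lowercased alphanumeric runs) is an assumption; analyze.py may split words differently.

```python
import re

SKILL_MATCH_THRESHOLD = 0.35  # value quoted above


def tokenize(skill: str) -> set[str]:
    """Lowercase a skill description and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", skill.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: size of the word intersection over size of the union."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_match(learned: str, demonstrated: str) -> bool:
    """True when two skill descriptions overlap enough to count as the same skill."""
    return jaccard(learned, demonstrated) >= SKILL_MATCH_THRESHOLD
```

For example, "parse JSON files" and "parse JSON exports" share 2 of 4 distinct words, giving 0.5, which clears the 0.35 threshold.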

Project Structure

.
├── README.md                          # This file
├── ARCHITECTURE.md                    # Detailed system design
├── pyproject.toml                     # Dependencies
│
├── spike_claude_parser.py             # Parse Claude JSON exports
├── spike_gemini_parser.py             # Parse Gemini HTML exports (Takeout)
├── spike_openai_parser.py             # Parse OpenAI tree-structured JSONs
├── parse_xml.py                       # Shared XML→dict parser
│
├── process_all.py                     # Main pipeline: parse → summarize → save
├── analyze.py                         # Cross-source analysis (11 outputs + dashboard)
│
├── data/                              # Raw exports (git-ignored)
│   ├── conversations.json             # Claude export
│   ├── MyActivity_Work.html           # Google Takeout → MyActivity
│   ├── MyActivity_Personal.html       # Google Takeout → MyActivity
│   └── openai/                        # OpenAI bulk export
│
├── prompts/                           # LLM prompts
│   └── qwen_analyst.xml               # Summarization prompt (raw text)
│
├── output/                            # Analysis results (git-ignored)
│   ├── repeated_topics.json
│   ├── timeline.md
│   ├── dashboard.html
│   └── {claude,gemini_work,gemini_personal,openai}/ # Per-conversation summaries
│
├── docs/                              # Documentation
│   ├── ARCHITECTURE.md                # System design details
│   ├── DATA_EXPORT_GUIDE.md           # Export instructions per platform
│   └── decisions/                     # Architecture decision records
│
└── .github/                           # GitHub config
    └── copilot-instructions.md        # Context for AI coding assistants

How It Works

1. Parsing

Each source has distinct structure:

  • Claude (conversations.json): Array of objects with chat_messages[]
  • Gemini (MyActivity_Work/Personal.html): Flat HTML with divs and regex for grouping
  • OpenAI (conversations-*.json): Tree structure with current_node pointer and parent links

Parsers normalize all to:

{
  "id": "unique_id",
  "name": "conversation_title",
  "messages": [
    {"role": "user", "text": "..."},
    {"role": "assistant", "text": "..."}
  ],
  "date": "ISO date string"
}
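
For the OpenAI export, the tree can be flattened by following parent links from current_node back to the root and reversing the result. The sketch below assumes field names (mapping, message, author.role, content.parts) that are not stated in this README; check them against your export before relying on it.

```python
def linearize(export: dict) -> list[dict]:
    """Walk parent links from current_node to the root, then reverse the path."""
    nodes = export["mapping"]             # assumed: node id -> node dict
    node_id = export.get("current_node")
    messages = []
    while node_id is not None:
        node = nodes[node_id]
        msg = node.get("message")
        if msg:
            parts = msg.get("content", {}).get("parts") or []
            text = "\n".join(p for p in parts if isinstance(p, str)).strip()
            role = msg.get("author", {}).get("role")
            # Keep only non-empty user/assistant turns.
            if text and role in ("user", "assistant"):
                messages.append({"role": role, "text": text})
        node_id = node.get("parent")
    messages.reverse()                    # root-first order
    return messages
```

Starting from current_node rather than the root means abandoned branches (edited or regenerated messages) are left out.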

2. Summarization (Qwen LLM)

Each conversation is sent to Qwen 3.5 27B with the prompt in prompts/qwen_analyst.xml.

Output schema (XML parsed to dict):

{
  "primary_goal": "What the user was trying to accomplish",
  "goal_achieved": true/false,
  "summary": "Multi-sentence overview",
  "traversal_path": ["Started with: X", "Deviated to: Y", "Resolved deviation: yes"],
  "code_artifacts": true/false,
  "domain": "category (SQL, Python, DevOps, etc.)",
  "complexity": "basic|intermediate|advanced",
  "tools_and_concepts": ["pandas", "BigQuery", ...],
  "skills_learned": ["specific skill A", ...],
  "skills_demonstrated": ["specific skill B", ...],
  "unresolved": ["open question 1", ...]
}
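
One way the XML-to-dict step could work is sketched below with the standard library. The XML layout is an assumption: a single root element, one child per field, and an <item> element per list entry. The real parse_xml.py may use different tags or handle malformed output differently.

```python
import xml.etree.ElementTree as ET

LIST_FIELDS = {"traversal_path", "tools_and_concepts", "skills_learned",
               "skills_demonstrated", "unresolved"}
BOOL_FIELDS = {"goal_achieved", "code_artifacts"}


def xml_to_dict(xml_text: str) -> dict:
    """Convert a flat XML summary into the dict schema shown above."""
    root = ET.fromstring(xml_text)
    result = {}
    for child in root:
        if child.tag in LIST_FIELDS:
            # Assumed layout: one <item> element per list entry.
            result[child.tag] = [(i.text or "").strip() for i in child.findall("item")]
        elif child.tag in BOOL_FIELDS:
            result[child.tag] = (child.text or "").strip().lower() == "true"
        else:
            result[child.tag] = (child.text or "").strip()
    return result
```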

3. Analysis

analyze.py reads 1,795 JSON files and generates:

  • repeated_topics: Term frequency with temporal tracking
  • timeline.md: Month-by-month narrative
  • skills_progression: Fuzzy word-overlap matching between learned→demonstrated
  • traversal_analysis: "Never returned" rate quantifies ADHD patterns
  • goal_completion: Rates by domain, complexity, source, time
  • cross_platform: Venn diagram of topics across sources
  • complexity_trajectory: Skill curve visualization
  • session_patterns: When you work + productivity per time
  • domain_flow: Tangential topic jumping on same day
  • orphaned_skills: One-time touches never revisited
  • dashboard.html: Single-page Chart.js visualization
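
As an illustration of one of these analyses, goal completion rates can be computed by grouping summaries on a field and counting goal_achieved. This is a sketch under the schema shown in the Summarization section; the function names and the "unknown" fallback bucket are hypothetical and not taken from analyze.py.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_summaries(output_dir: str = "output") -> list[dict]:
    """Read every per-conversation summary under output/{source}/*.json."""
    return [json.loads(p.read_text()) for p in sorted(Path(output_dir).glob("*/*.json"))]


def goal_completion_by(summaries: list[dict], key: str = "domain") -> dict[str, dict]:
    """Group summaries by `key` and compute the share with goal_achieved set."""
    counts = defaultdict(lambda: {"total": 0, "achieved": 0})
    for s in summaries:
        bucket = counts[s.get(key, "unknown")]
        bucket["total"] += 1
        bucket["achieved"] += bool(s.get("goal_achieved"))
    return {
        group: {**c, "rate": round(c["achieved"] / c["total"], 3)}
        for group, c in counts.items()
    }
```

Passing key="complexity" gives the same breakdown by complexity level.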

Development

Adding a New Source

  1. Create spike_{source}_parser.py implementing:

    • read_raw_file(path) → list[dict]
    • Each dict has {"id", "name", "messages": [{"role", "text"}], "date"}
  2. Add to process_all.py INPUTS list:

    ("parsed_{source}.json", "{source}"),
  3. Test with:

    python spike_{source}_parser.py > parsed_{source}.json
    # Verify output structure
    python -c "import json; json.load(open('parsed_{source}.json'))"
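
A new parser might look like the sketch below. The input format is entirely hypothetical (a JSON array of threads with title, created_at, and turns holding speaker/content pairs); only the read_raw_file name and the output shape come from the steps above.

```python
import json
from pathlib import Path


def read_raw_file(path: str) -> list[dict]:
    """Parse a hypothetical export: a JSON array of threads with 'turns'."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    conversations = []
    for index, thread in enumerate(raw):
        # Map the source's field names onto the common {"role", "text"} shape.
        messages = [
            {"role": turn["speaker"], "text": turn["content"]}
            for turn in thread.get("turns", [])
            if turn.get("content")
        ]
        conversations.append({
            "id": str(thread.get("id", index)),
            "name": thread.get("title", "untitled"),
            "messages": messages,
            "date": thread.get("created_at", ""),
        })
    return conversations
```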

Tuning Summarization

Edit prompts/qwen_analyst.xml to change:

  • Analysis focus (technical vs. emotional, depth vs. brevity)
  • Output schema (add/remove fields)
  • Temperature for determinism vs. creativity

Then re-run process_all.py (idempotent — overwrites output files).

Cost & Speed

  • Inference time: ~30 sec/conversation (RTX 3090, q4_k_m quantization)
  • Full pipeline: ~15 hours for 1,800 conversations
  • Memory: ~16 GB VRAM for the q4_k_m model (fits on a 24 GB card)
  • Cost: $0 (open-source, local)

For comparison, GPT-4 via the OpenAI API at $0.003 per 1K tokens × ~500 tokens per conversation × 1,800 conversations would cost about $2.70; local inference costs nothing beyond hardware and runtime.
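
The figures in this section can be reproduced with a few lines of arithmetic. The constants are the ones quoted above; the variable names are illustrative.

```python
SECONDS_PER_CONVERSATION = 30
CONVERSATIONS = 1_800
PRICE_PER_1K_TOKENS = 0.003     # USD, figure quoted above
TOKENS_PER_CONVERSATION = 500

# 30 s per conversation across 1,800 conversations, converted to hours.
total_hours = SECONDS_PER_CONVERSATION * CONVERSATIONS / 3600

# Price per token times tokens per conversation times number of conversations.
api_cost = PRICE_PER_1K_TOKENS * TOKENS_PER_CONVERSATION / 1000 * CONVERSATIONS

print(f"Local run time: {total_hours:.1f} h")   # Local run time: 15.0 h
print(f"Hosted API cost: ${api_cost:.2f}")      # Hosted API cost: $2.70
```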

Troubleshooting

Qwen Server Won't Start

# Check if port 8080 is in use
lsof -i :8080

# Kill existing process
kill -9 <PID>

# Restart server
./llama-server -m models/qwen3.5-27b-q4_k_m.gguf -c 32768 --port 8080

"Context overflow" Error

Two conversations exceeded 32k tokens even after chunking. Increase context with:

./llama-server ... -c 65536  # Double context

Parsed JSON Missing Fields

Re-run parser in isolation to check for errors:

python spike_claude_parser.py 2>&1 | grep ERROR

Skills Internalization Rate Too Low

Lower SKILL_MATCH_THRESHOLD in analyze.py (less strict matching):

SKILL_MATCH_THRESHOLD = 0.25  # More permissive than the 0.35 default

License

MIT — Use and modify freely.

Questions or ideas? Open an issue or PR!
