AI development management system using Doc-Driven Story-based workflow: automatic specification generation, task decomposition, verification, and metrics.
Designed for Claude Code with support for Codex as an additional external execution agent.
What is it? System for managing AI code development with mandatory verification and metrics.
Quick start: ./install.sh → Create REQ-001.md → /makedesign REQ-001.md → /code T-001-001 → /codereview T-001-001 → /done T-001-001
Key features:
- Story-based task tracking (one file = all tasks)
- Mandatory code verification (no "wishful thinking")
- Automatic time tracking and efficiency metrics
- Support for Claude (built-in) and Codex (external CLI)
- Problem
- Solution: Story-based Architecture with Mandatory Verification
- Installation
- Quick Start
- Model Selection for All Commands
- Workflow Diagrams
- Metrics and Analytics
- Git Workflow Rules
- When to Create Story Manually
- Troubleshooting
- TODO
- License
- Created By
AI agents for code generation suffer from chronic issues:
- Ignore requirements - do what wasn't asked
- Don't follow system prompts - forget project rules
- "Wishful thinking" - report completion without actual code verification
Typical scenario:
You: "Implement task T-001-001 from Story"
AI: ✅ Done! Here's what I did: [list of claims]
You: "Show me the code"
AI: [half not implemented or implemented incorrectly]
```
backlog/requirements/REQ-001.md (Requirement - user request)
    ↓ /makedesign
backlog/design/DSGN-001.md (Design - architecture design from AI)
    ↓
backlog/stories/STORY-001.md (Story - file with all tasks)
    ├── T-001-001: Task 1 → /code T-001-001
    ├── T-001-002: Task 2 → /code T-001-002
    └── T-001-003: Task 3 → /code T-001-003
```
Key idea:
- REQ-XXX.md = Requirement (user request, written manually)
- DSGN-XXX.md = Design (architectural design, generated by /makedesign)
- STORY-XXX.md = Story (ONE file with ALL tasks, generated by /makedesign)
- T-XXX-YYY = Task ID format (T-001-001, T-001-002, ...)
- Each task: status, acceptance criteria, dependencies, estimate
| Command | Role | What It Does |
|---|---|---|
| `/makedesign [claude\|codex] REQ-001.md` | Architect | Analyzes requirement → creates DSGN-001.md (design) + STORY-001.md (tasks) via Claude (default) or Codex CLI (alternative architecture) |
| `/code [claude\|codex] TASK-ID` | Executor | Implements code via Claude (default, built-in) or Codex CLI (external utility) + tracks time ⏱️ |
| `/codereview [claude\|codex] TASK-ID` | Critic | Checks ALL criteria via Claude (default, built-in) or Codex CLI (external utility) → shows code → bug report on failure + tracks time ⏱️ |
| `/fix [claude\|codex] TASK-ID` | Fixer | Reads bug report → fixes via Claude (default, surgical edits) or Codex CLI (regeneration) + tracks time ⏱️ |
| `/test TASK-ID` | Tester 🚧 | In development: testing → marks acceptance criteria [x] on successful check + tracks time ⏱️ |
| `/done TASK-ID` | Finalizer | Git workflow (commit → merge to master) → updates Story → calculates metrics 📊. Note: code checks are done by git hooks, not /done |
| `/bug [claude\|codex] "description"` | Bug Parser 🪲 | Parses user complaint via Claude (default) or Codex CLI → creates HOTFIX-XXX.md with criteria → suggests /code HOTFIX-XXX |
| `/report [STORY-ID]` | Analyst | Generates reports with Mermaid diagrams 📊 (Gantt, Pie, Bar, Line charts) from time tracking |
```bash
git clone <repo-url> smart-agent-claude
cd smart-agent-claude
./install.sh
```

What happens:

Installation creates a file hierarchy in ~/.claude/ - commands work globally in any project.
```
~/.claude/
├── commands/                  # 8 slash commands (work everywhere)
│   ├── makedesign.md
│   ├── code.md
│   ├── codereview.md
│   ├── fix.md
│   ├── done.md
│   ├── bug.md
│   ├── report.md
│   └── test.md
└── templates/                 # Document templates
    ├── story-template.md
    ├── design-template.md
    └── requirements-template.md
```
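If you want to sanity-check the result, listing the installed directories should show the eight command files and three templates described above (a quick manual check, not part of install.sh):

```bash
ls ~/.claude/commands ~/.claude/templates
```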
```bash
cd /path/to/your-project

# Optional: create coding rules for the agent
touch CLAUDE.md
```

Note: The backlog/ structure is created automatically by commands on first use. No need to create manually!
Project structure:
```
your-project/
├── backlog/
│   ├── requirements/        # Freeform requirements (written manually)
│   │   └── REQ-001.md
│   ├── design/              # Design documents (generated by /makedesign)
│   │   └── DSGN-001.md
│   ├── stories/             # Stories with tasks (generated by /makedesign)
│   │   └── STORY-001.md     # T-001-001, T-001-002, T-001-003...
│   ├── hotfix/              # Hotfix tasks (created by /bug)
│   │   └── HOTFIX-001.md    # Bugfixes from user complaints
│   └── issues/              # Bug reports (created by /codereview)
│       └── T-001-001-issues.md
├── CLAUDE[/AGENTS].md       # Project rules (optional)
└── src/                     # Your code
```
1. Create requirement (freeform text in REQ-001.md):
Email validation for registration using regex
2. Generate design and tasks:
```
/makedesign REQ-001.md
```

Creates:

- DSGN-001.md - architecture design
- STORY-001.md - task list (T-001-001, T-001-002, etc.)

Numbers are automatic: REQ-001 → DSGN-001 → STORY-001 → T-001-XXX
3. Implement task:
```
/code T-001-001
```

Creates feature branch → writes code → provides evidence of changes
4. Verify implementation:
```
/codereview T-001-001
```

Checks acceptance criteria → shows real code proof → verdict: PASSED/FAILED
5. Fix if needed:
```
/fix T-001-001          # If /codereview found issues
/codereview T-001-001   # Re-check
```

6. Finalize:

```
/done T-001-001
```

Commits → merges to master → calculates metrics (time, efficiency, overshoot)
User complaint → Structured fix:
/bug "Buttons don't click!" # Parses complaint β creates HOTFIX-001.md
/code HOTFIX-001 # Implements fix
/codereview HOTFIX-001 # Verifies fix
/fix HOTFIX-001 # If needed
/done HOTFIX-001 # FinalizesAll commands support model selection (Claude vs Codex):
| Command | Default | With Codex | With Claude |
|---|---|---|---|
| /makedesign | `/makedesign REQ-001.md` | `/makedesign codex REQ-001.md` | `/makedesign claude REQ-001.md` |
| /code | `/code TASK-ID` | `/code codex TASK-ID` | `/code claude TASK-ID` |
| /codereview | `/codereview TASK-ID` | `/codereview codex TASK-ID` | `/codereview claude TASK-ID` |
| /fix | `/fix TASK-ID` | `/fix codex TASK-ID` | `/fix claude TASK-ID` |
| /bug | `/bug "description"` | `/bug codex "description"` | `/bug claude "description"` |
| /done | `/done TASK-ID` (no model selection) | - | - |
| /report | `/report [STORY-ID\|trends]` (no model selection) | - | - |
| Aspect | Claude (default) | Codex (optional) |
|---|---|---|
| Speed | ⚡ Fast (built-in) | Slower (external CLI) |
| Edits | 🎯 Surgical, precise changes | Can regenerate completely |
| Best for | Quick tasks, fixes, reviews | Alternative architecture, complex refactoring |
| Autonomy | Uses Read/Write/Edit tools | Full workspace access via CLI |
| Logs | No separate logs | Activity logs in .claude/codex/ |
| When to use | Default for all commands | When Claude fails after 1-2 iterations |
```
# Fast and efficient for most tasks
> /code TASK-001          # Claude writes code
> /codereview TASK-001    # Claude checks
# ✅ PASSED or ❌ FAILED

# If FAILED:
> /fix TASK-001           # Claude surgically fixes
> /codereview TASK-001
# ✅ PASSED

> /done TASK-001
```

```
# 1. First attempt: Claude writes code
> /code TASK-001
> /codereview TASK-001
# ❌ FAILED: Found 3 Issues

# 2. First iteration: Claude (surgical edits)
> /fix TASK-001
> /codereview TASK-001
# ❌ FAILED: 1 Issue remains

# 3. Second iteration: Codex (regeneration)
> /fix codex TASK-001
> /codereview TASK-001
# ✅ PASSED

> /done TASK-001
```

Model is recorded in bug report and Story:
```
## Fix Iterations (backlog/issues/TASK-001-issues.md)

### Iteration #1 (Model: claude)
- Started: 2025-10-07 10:00
- Finished: 2025-10-07 10:15
- Duration: 15m
- Issues fixed: #1, #2

### Iteration #2 (Model: codex)
- Started: 2025-10-07 10:30
- Finished: 2025-10-07 10:45
- Duration: 15m
- Issues fixed: #3
```

Story file (Time Tracking):
```
- **Time Tracking:**
  - /code (codex): 10:00 → 10:45 (45m)
  - /codereview (claude): 10:45 → 10:50 (5m) ❌ FAILED
  - /fix (iter 1, claude): 11:00 → 11:15 (15m)
  - /codereview (claude): 11:15 → 11:18 (3m) ❌ FAILED
  - /fix (iter 2, codex): 11:30 → 11:45 (15m)
  - /codereview (claude): 11:45 → 11:48 (3m) ✅ PASSED
  - /done: 11:48 → 11:50 (2m)
```

/done breakdown:
```
- **Breakdown:**
  - /code (codex): 45m (52%)
  - /codereview (claude): 11m (13%)
  - /fix (claude + codex): 30m (35%)
  - /done: 2m (2%)
- **Models used:** codex (2x), claude (4x)
- **Iterations:** 2 (2 fix cycles)
```

Main workflow:

```
REQ → /makedesign → STORY → /code → /codereview → /done ✅
                                        ↓ (if FAILED)
                                 /fix → /codereview
```

Hotfix workflow:

```
User complaint → /bug → HOTFIX-XXX.md → /code → /codereview → /fix (if needed) → /done ✅
```
All commands automatically track work time via bash!
- Each command (`/code`, `/codereview`, `/fix`, `/done`) records:
  - Start time (automatic)
  - End time (automatic)
  - Elapsed time (calculated via bash)
- Recorded in Story:

```
- **Time Tracking:**
  - /code: 2025-10-06 10:00 → 10:45 (45m)
  - /codereview: 2025-10-06 10:45 → 10:52 (7m) ✅ PASSED
  - /fix: 2025-10-06 11:00 → 11:30 (30m)
  - /codereview: 2025-10-06 11:30 → 11:35 (5m) ✅ PASSED
  - /done: 2025-10-06 11:35 → 11:40 (5m)
- **Actual:** 1h 32m ⚠️ (+54% overshoot)
- **Efficiency:** 65%
- **Iterations:** 2 (1 fix cycle)
```
- /done calculates final metrics:
  - Total Actual: sum of all stages (automatically via bash)
  - Overshoot: `((actual - estimate) / estimate) * 100`
  - Efficiency: `(estimate / actual) * 100`
  - Iterations: number of review passes (counts FAILED → /fix cycles)
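The commands handle this bookkeeping themselves; conceptually the bash side is just timestamps and integer math. A minimal sketch of the idea (illustrative only — the variable names, example timestamps, and the 1h estimate are assumptions, not taken from the actual command prompts):

```bash
#!/usr/bin/env bash
# Sketch of the time-tracking and metrics math; the real commands record
# timestamps (e.g. via `date +%s`) at the start and end of each stage.

start=1759744800                       # example start timestamp (epoch seconds)
end=$(( start + 92 * 60 ))             # example end timestamp, 92 minutes later

actual_min=$(( (end - start) / 60 ))   # elapsed time in minutes
estimate_min=60                        # hypothetical Story estimate (1h)

overshoot=$(( (actual_min - estimate_min) * 100 / estimate_min ))  # ((actual - estimate) / estimate) * 100
efficiency=$(( estimate_min * 100 / actual_min ))                  # (estimate / actual) * 100

echo "Actual: ${actual_min}m | Overshoot: +${overshoot}% | Efficiency: ${efficiency}%"
```

With a 1h estimate and 92 minutes actually spent, this prints roughly +53% overshoot and 65% efficiency, in line with the Story excerpt above.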
Comparison:
| Aspect | /report | /report STORY-ID | /report trends |
|---|---|---|---|
| Purpose | Project overview | Single Story details | Improvement dynamics |
| Scope | All Stories (aggregated) | One Story (detailed) | All tasks (over time) |
| Grouping | By Story | By tasks within Story | By weeks/months + types |
| Question | "How are Stories going?" | "How is this Story?" | "Am I improving?" |
| Focus | Which Stories problematic | Which tasks problematic | Which task types difficult |
| Time | Current state snapshot | One Story snapshot | Change trend |
| Diagrams | Pie (time by Story), Bar (efficiency by Story) | Gantt (task timeline), Pie (stages), Burndown | Line (learning curve), Bar (type patterns) |
| Output file | ./backlog/report-common.md | ./backlog/report-STORY-ID.md | ./backlog/report-trends.md |
In simple terms:
- /report → "Where are we now" (all Stories overview)
- /report STORY-ID → "What about this Story" (task details)
- /report trends → "Where are we heading" (improvement over time)
What the metrics are used for:

- **Baseline for AI performance:**
  - Estimates = human time baseline
  - Actual = real AI agent time
  - Compare AI vs human speed
- **Finding bottlenecks:**
  - Which task types always overshoot
  - Which stage takes most time (code/review/fix)
  - Where better estimates are needed
- **Learning curve:**
  - Is efficiency improving over time
  - Are estimates becoming more accurate
  - Fewer fix cycles after learning
- **Planning:**
  - Time forecast for new Stories
  - Buffers for risky tasks (refactoring +60%)
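For the planning point, a buffer is plain arithmetic on top of the baseline estimate. A tiny worked example (the 2-hour estimate is hypothetical; +60% is the refactoring buffer mentioned above):

```bash
estimate_min=120                          # hypothetical baseline estimate: 2h
buffered=$(( estimate_min * 160 / 100 ))  # +60% buffer for a risky refactoring task
echo "Plan ${buffered}m (~3h 12m) instead of ${estimate_min}m"
```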
Git workflow rules — avoid these commands (notes on alternatives in the comments):

```bash
git checkout        # use git switch
git restore         # manage changes explicitly
git stash           # commit, don't hide
git reset --hard    # irreversible!
```

Branch workflow:

```bash
# /code creates feature branch: feature/T-001-001-validate-email
# Work in branch → commit changes
# /done merges to master with --no-ff (preserves history)
```

Use /makedesign when:

- Large features (5+ tasks)
- Complex dependencies
- Need architectural design
Create a Story manually when:

- Small features (2-3 tasks)
- Simple bugfixes
- Quick prototypes
```
# Copy template
cp ~/.claude/templates/story-template.md backlog/stories/STORY-099.md

# Edit
vim backlog/stories/STORY-099.md

# Use
> /code T-099-001
> /codereview T-099-001
> /done T-099-001
```

TODO: /test command

Goal: Automatically mark acceptance criteria checkboxes based on test results.
Planned:
- Auto-detect test framework (pytest, jest, cargo test, etc.)
- Match test results to acceptance criteria
- Mark `- [ ]` as `- [x]` for passing tests
- Generate test coverage report
Status: Stub ready, outputs "in development"
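One possible shape for the framework auto-detection, sketched as a guess at how the stub might evolve (the marker files and their priority are assumptions, not the actual /test implementation):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: pick a test runner based on marker files in the project root.
detect_test_framework() {
  if [ -f pytest.ini ] || [ -f pyproject.toml ] || [ -f setup.py ]; then
    echo "pytest"
  elif [ -f package.json ]; then
    echo "jest"          # assumption: a JS project here uses jest
  elif [ -f Cargo.toml ]; then
    echo "cargo test"
  else
    echo "unknown"
  fi
}

framework=$(detect_test_framework)
echo "Detected test framework: ${framework}"
# Planned next steps: run the detected framework, match results to acceptance
# criteria, and flip "- [ ]" to "- [x]" in the Story for criteria whose tests pass.
```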
MIT
© Artel Team