What is this? A tool that draws a map of your microservices — showing who talks to whom, how often, and whether anything looks broken. Think of it as Google Maps for your backend architecture.
It connects to Jaeger (or a local trace file), builds a directed call graph, flags services that are slower than usual, and exports timing data for root-cause analysis.
You can get up and running in under 30 seconds. Pick your path:
cd JaegerViz
pip install -e .
Check it's working:
topology-map --help
# You should see: render, export, lag-windows
This spins up a graph with 11 services, anomaly detection, everything, with zero setup:
bash demo.sh
That's it. Your browser opens with a fully interactive topology map. In ~5 seconds you get:
✓ 300 traces loaded
✓ 11 services discovered
✓ 11 call relationships mapped
✓ Anomaly scores computed (some cartservice spans are 5-15× slower)
✓ Interactive graph rendered with dark theme
What you're looking at:
- Orange dots = your services (bigger = more connections)
- Curved orange arrows = calls between them (thicker = busier edges)
- Click any node → side panel with that service's full stats
- Click any edge → latency distribution, lag window
- Drag to rearrange, scroll to zoom, Ctrl+F to search
The demo graph represents this system:
graph TD
F[frontend] -->|236 calls| CART[cartservice]
F -->|184 calls| PROD[productcatalogservice]
F -->|124 calls| REC[recommendationservice]
F -->|89 calls| CHECK[checkoutservice]
CART -->|172 calls| REDIS[redis-cart]
PROD -->|184 calls| AD[adservice]
CHECK -->|89 calls| SHIP[shippingservice]
CHECK -->|61 calls| PAY[paymentservice]
CHECK -->|46 calls| EMAIL[emailservice]
CHECK -->|26 calls| CURR[currencyservice]
SHIP -->|89 calls| EMAIL
Prefer to run commands yourself? Generate the sample data, then render:
# 1. Create the sample trace file
python3 generate_sample_traces.py
# 2. Build and view the graph
topology-map render --from-file sample_traces.json --highlight-anomalies
open topology.html
Same output as Path A; you're just calling the steps manually. This is useful when you want to explore different commands on the same data.
Already have Jaeger collecting traces? Point the tool at it:
topology-map render --jaeger-url http://localhost:16686 --lookback 1h --highlight-anomalies
If you have a Jaeger Docker container but no traces, generate some first:
# Option 1 — Use the sample file as a data source (simplest)
topology-map render --from-file sample_traces.json --highlight-anomalies
# Option 2 — Send real traces into your Jaeger, then fetch them back
# First, Jaeger needs OTLP gRPC (port 4317) or Thrift UDP (port 6831) exposed
# Then use an OpenTelemetry trace generator to populate it
Replace the URL, lookback window, and service name with your own values. Your Jaeger needs to be reachable at the given URL and have traces within the lookback window.
# ── Show the full interactive graph ──────────────────────
topology-map render --from-file sample_traces.json --highlight-anomalies
# ── Zoom into one service's neighbourhood ────────────────
topology-map render --from-file sample_traces.json --service cartservice --hops 2
# ── Save as a static image (needs: brew install graphviz) ─
topology-map render --from-file sample_traces.json --format png --output topology.png
# ── Export data for other tools ──────────────────────────
topology-map export --from-file sample_traces.json --format json # → graph.json
topology-map export --from-file sample_traces.json --format dot # → graph.dot
# ── Compute correlation time windows ─────────────────────
topology-map lag-windows --from-file sample_traces.json # → lag_windows.json
# ── If you have a real Jaeger running ────────────────────
topology-map render --jaeger-url http://localhost:16686 --lookback 1h --highlight-anomalies
topology-map render --from-file sample_traces.json --highlight-anomalies
What happens step by step:
flowchart LR
A["📄 sample_traces.json<br/>300 traces"] --> B["🔄 Parse traces<br/>find parent-child links"]
B --> C["🧠 Build DiGraph<br/>nodes=services, edges=calls"]
C --> D["🔍 Anomaly check<br/>per-service P95, flag >2× slower"]
D --> E["🎨 Render HTML<br/>orange nodes, yellow text"]
E --> F["🌐 topology.html<br/>open in browser!"]
All the render options:
| Flag | Default | What it does |
|---|---|---|
| --from-file PATH | (none) | Load traces from a local JSON file (offline mode) |
| --jaeger-url URL | http://localhost:16686 | Point at a real Jaeger instance |
| --lookback DURATION | 1h | How far back to look (1h, 30m, 15m) |
| --service NAME | (none) | Zoom into one service's area |
| --hops N | 1 | How many steps out from that service |
| --operation NAME | (none) | Only show traces for a specific operation |
| --format TYPE | html | html, png, or svg |
| --output PATH | auto | Where to save the file |
| --highlight-anomalies | off | Turn on the anomaly detector |
| --limit N | 100 | Max traces per page from Jaeger |
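The --lookback values in the table (1h, 30m, 15m) follow a number-plus-unit pattern. Here is a minimal sketch of how such a string might be parsed, assuming only the suffixes s, m, h, and d are accepted. The name parse_lookback is hypothetical and is not taken from utils/timing.py, whose actual behaviour may differ.

```python
import re
from datetime import timedelta

# Hypothetical helper for illustration; the project's own parser may differ.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_lookback(text: str) -> timedelta:
    """Turn a string such as '1h' or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised lookback duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


if __name__ == "__main__":
    for value in ("1h", "30m", "15m"):
        print(value, "->", int(parse_lookback(value).total_seconds()), "seconds")
```

Rejecting anything that does not match the pattern keeps a typo such as "1hr" from silently turning into an unexpected window.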
When you have 100+ services, you don't want all of them on screen at once. Think of hops like "how many friends-of-friends to include":
graph LR
subgraph "hops=0 (just cartservice)"
C0[cartservice]
end
subgraph "hops=1 (+ direct neighbours)"
C1[cartservice] --> F1[frontend]
C1 --> R1[redis-cart]
end
subgraph "hops=2 (+ next ring out)"
C2[cartservice] --> F2[frontend]
C2 --> R2[redis-cart]
F2 --> P2[productcatalogservice]
F2 --> CH2[checkoutservice]
F2 --> RC2[recommendationservice]
end
A hop = following one arrow in either direction. Here's how it counts:
cartservice ← hop 0 (your starting point)
│
├── frontend ← hop 1 (directly connected ✓)
│ │
│ ├── checkoutservice ← hop 2 (friend of a friend ✓)
│ │ │
│ │ └── shippingservice ← hop 3 (too far away ✗)
│ │
│ └── productcatalogservice ← hop 2 (friend of a friend ✓)
│ │
│ └── adservice ← hop 3 (too far away ✗)
│
└── redis-cart ← hop 1 (directly connected ✓)
Running --service cartservice --hops 2 keeps 6 services and removes 5 — only the ones within 2 steps of cartservice survive.
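The hop counting described above can be sketched as a breadth-first search that ignores edge direction. This is an illustrative reconstruction using the demo graph's edges, not the tool's own filter; the function name neighbourhood is hypothetical.

```python
from collections import deque

# Call edges from the demo graph, written as (caller, callee).
EDGES = [
    ("frontend", "cartservice"),
    ("frontend", "productcatalogservice"),
    ("frontend", "recommendationservice"),
    ("frontend", "checkoutservice"),
    ("cartservice", "redis-cart"),
    ("productcatalogservice", "adservice"),
    ("checkoutservice", "shippingservice"),
    ("checkoutservice", "paymentservice"),
    ("checkoutservice", "emailservice"),
    ("checkoutservice", "currencyservice"),
    ("shippingservice", "emailservice"),
]


def neighbourhood(edges, start, hops):
    """Map each service within `hops` steps of `start` to its hop distance."""
    # Build an undirected adjacency map: a hop may follow an arrow either way.
    adjacent = {}
    for caller, callee in edges:
        adjacent.setdefault(caller, set()).add(callee)
        adjacent.setdefault(callee, set()).add(caller)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == hops:
            continue  # at the hop limit, so do not expand further
        for neighbour in adjacent.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


if __name__ == "__main__":
    kept = neighbourhood(EDGES, "cartservice", hops=2)
    for service, hop in sorted(kept.items(), key=lambda item: item[1]):
        print(f"hop {hop}: {service}")
```

With hops=2 the search keeps the six services listed above and never reaches shippingservice or adservice, which sit three hops away.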
topology-map export --from-file sample_traces.json --format json
This writes graph.json, a machine-readable version of the graph that other tools can consume:
{
"nodes": [
{"id": "frontend"},
{"id": "cartservice"}
],
"links": [
{
"source": "frontend",
"target": "cartservice",
"weight": 236,
"avg_duration_ms": 45.2,
"p50_duration_ms": 44.1,
"p95_duration_ms": 52.3,
"p99_duration_ms": 58.7
}
]
}
This is the format that spectrum-based causal analysis tools consume. --format dot gives you a Graphviz file instead.
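As an illustration of consuming the export, the sketch below loads a node-link document shaped like the sample and ranks edges by P99 latency. The field names come from the sample; the helper slowest_edges is hypothetical, and the JSON is embedded as a string so the example runs without a graph.json on disk.

```python
import json

# A copy of the sample export, embedded so the example is self-contained.
GRAPH_JSON = """
{
  "nodes": [{"id": "frontend"}, {"id": "cartservice"}],
  "links": [
    {"source": "frontend", "target": "cartservice", "weight": 236,
     "avg_duration_ms": 45.2, "p50_duration_ms": 44.1,
     "p95_duration_ms": 52.3, "p99_duration_ms": 58.7}
  ]
}
"""


def slowest_edges(graph, top=5):
    """Return (source, target, p99_ms) tuples, slowest edge first."""
    ranked = sorted(
        graph["links"],
        key=lambda link: link["p99_duration_ms"],
        reverse=True,
    )
    return [
        (link["source"], link["target"], link["p99_duration_ms"])
        for link in ranked[:top]
    ]


if __name__ == "__main__":
    graph = json.loads(GRAPH_JSON)  # for a real file: json.load(open("graph.json"))
    for source, target, p99 in slowest_edges(graph):
        print(f"{source} -> {target}: P99 = {p99} ms")
```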
topology-map lag-windows --from-file sample_traces.json
Writes lag_windows.json. Here's what you get and why:
flowchart TD
subgraph sync["🟢 Sync calls (fast RPCs)"]
S1["frontend → cartservice<br/>P99 = 58ms"] -->|"58/1000/60 × 10 = 0.01"| S2["clamped to 1.0 min"]
S3["cartservice → redis-cart<br/>P99 = 6ms"] -->|"6/1000/60 × 10 = 0.001"| S4["clamped to 1.0 min"]
end
subgraph async["🟠 Async calls (queues/email)"]
A1["shippingservice → email<br/>P99 = 120,000ms"] -->|"120/60 × 10 = 20"| A2["23.9 min"]
end
sync --> LAG["lag_windows.json → cross-correlation<br/>causal analysis pipeline"]
async --> LAG
Why the numbers differ:
| Edge type | Real P99 | Formula | Lag window |
|---|---|---|---|
| Sync RPC (200ms) | 58 ms | (0.058/60)×10 = 0.01 | 1.0 min (bumped to minimum) |
| Cache lookup | 6 ms | (0.006/60)×10 = 0.001 | 1.0 min (bumped to minimum) |
| Async message | 120,000 ms | (120/60)×10 = 20 | 23.9 min |
| Batch email | 200,000 ms | (200/60)×10 = 33.3 | 30.0 min (capped at max) |
Why ×10? Because during incidents, latencies can jump 5-10× normal. The safety multiplier makes sure the window is wide enough to still catch the correlation. We clamp it between 1 and 30 minutes so it's always reasonable.
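The formula in the table can be written as a short function. This is a minimal sketch of the arithmetic as described (P99 in milliseconds, converted to minutes, multiplied by 10, clamped to 1 to 30 minutes); the function and parameter names are hypothetical and the tool's own lag_windows.py may round or structure things differently.

```python
def lag_window_minutes(p99_ms, multiplier=10.0, minimum=1.0, maximum=30.0):
    """Convert an edge's P99 latency (ms) into a clamped lag window (minutes)."""
    raw_minutes = p99_ms / 1000.0 / 60.0 * multiplier
    return max(minimum, min(maximum, raw_minutes))


if __name__ == "__main__":
    # 58 ms and 6 ms are the sync examples from the table; 200,000 ms is the
    # batch email row. 90,000 ms is a made-up value that lands inside the range.
    for p99 in (58, 6, 90_000, 200_000):
        print(f"P99 = {p99} ms -> {lag_window_minutes(p99)} min")
```

Any fast RPC ends up at the 1-minute floor, so the multiplier only changes the result for edges whose P99 exceeds about 6 seconds.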
┌────────────────────────────────────────────────────────────────┐
│ TOPOLOGY MAPPER | dependency graph NODES: 11 EDGES: 11 │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ 🔍 Search... │ │ │ │
│ └──────────────┘ │ Interactive graph │ │
│ │ (drag · zoom · click) │ │
│ │ │ │
│ │ │ │
│ └──────────────────────────┘ │
│ │
│ CLICK a node → side panel · DRAG to rearrange · SCROLL │
└────────────────────────────────────────────────────────────────┘
Side panel (slides in from the right when you click a node or edge):
┌─ SERVICE DETAILS ─────────────────── [✕] ─┐
│ │
│ STATUS │
│ ┌──────────────────┐ │
│ │ HEALTHY │ ← colour badge │
│ └──────────────────┘ │
│ │
│ METRICS │
│ Total Spans ··············· 472 │
│ Anomalous ··············· 3 │
│ Score ··············· 0.01 │
│ P95 Baseline ··············· 52ms │
│ │
│ CONNECTIONS │
│ Degree ··············· 4 │
│ In / Out ··············· 2 / 2 │
│ │
│ LATENCY PROFILE │
│ P95 Inbound ··············· 58ms │
│ P95 Outbound ··············· 6ms │
│ │
│ CALLED BY │
│ ← frontend │
│ │
│ CALLS TO │
│ redis-cart → │
└────────────────────────────────────────────┘
Keyboard shortcuts:
| Key | Does |
|---|---|
| Ctrl+F | Jump to the search box |
| Esc | Close side panel, clear everything |
| Ctrl+0 | Fit the whole graph on screen |
| Double-click a node | Zoom right into it |
Here's how data flows from a trace source all the way to the browser:
flowchart TD
SRC["📡 Trace Source<br/>Jaeger API OR sample_traces.json"] --> FETCH
subgraph FETCH["Fetcher (Component 1.1)"]
F1["Paginate API, rate-limit 100ms"]
F2["Parse JSON → Span/Trace objects"]
F3["Normalise parentSpanID: '' → None"]
F4["Convert µs timestamps"]
end
FETCH --> BUILD
subgraph BUILD["Graph Builder (Component 1.2)"]
B1["For each trace: build span_map"]
B2["Match child → parent span"]
B3["Record edge: parent.svc → child.svc"]
B4["Store: weight, durations, avg, P50, P95, P99"]
end
BUILD --> ANOM
BUILD --> LAG
subgraph ANOM["Anomaly Highlighter (1.3)"]
A1["Per service: collect all durations"]
A2["Compute P95 baseline"]
A3["Count spans > 2× P95"]
A4["Score → healthy | degraded | critical"]
end
subgraph LAG["Lag Window Computer (1.4)"]
L1["Per edge: P99 of durations"]
L2["Convert ms → minutes × 10"]
L3["Clamp between 1.0 and 30.0"]
end
ANOM --> RENDER
LAG --> LAG_FILE["lag_windows.json<br/>(→ causal analysis)"]
subgraph RENDER["Renderer (Component 1.5)"]
R1["Custom vis.js HTML"]
R2["Orange nodes + yellow text"]
R3["Side panel, search, tooltips"]
R4["Dark theme, glow shadows"]
end
RENDER --> OUT["topology.html<br/>Open in browser!"]
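The Graph Builder steps in the diagram (build a span map, match child to parent, record an edge from the parent's service to the child's) can be sketched as follows. The tuple layout for spans is a simplification invented for this example. Recording the child span's duration on the edge and skipping calls that stay inside one service are both assumptions rather than confirmed behaviour of builder.py.

```python
from collections import defaultdict


def build_edges(traces):
    """Return {(parent_service, child_service): [child durations in ms]}.

    Each trace is a list of (span_id, parent_id, service, duration_ms) tuples.
    """
    edges = defaultdict(list)
    for spans in traces:
        # span_map gives O(1) lookup of a span's service by its ID.
        span_map = {span_id: service for span_id, _, service, _ in spans}
        for _, parent_id, service, duration_ms in spans:
            parent_service = span_map.get(parent_id)
            if parent_service is None:
                continue  # root span, or the parent is missing from this trace
            if parent_service == service:
                continue  # assumption: internal spans do not create an edge
            edges[(parent_service, service)].append(duration_ms)
    return dict(edges)


if __name__ == "__main__":
    trace = [
        ("a", None, "frontend", 60.0),
        ("b", "a", "cartservice", 45.0),
        ("c", "b", "redis-cart", 5.0),
        ("d", "zzz", "adservice", 9.0),  # orphan: parent not in the trace
    ]
    for (source, target), durations in build_edges([trace]).items():
        print(f"{source} -> {target}: weight={len(durations)}, durations={durations}")
```

The edge weight is the number of recorded durations, and the percentile statistics can be computed from the same list afterwards.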
graph LR
subgraph Core["Core Logic"]
direction TB
MODELS["models/<br/>span.py + trace.py<br/>@dataclass contracts"]
FETCHER["fetcher/<br/>jaeger_client.py<br/>HTTP → Span/Trace"]
GRAPH["graph/<br/>builder.py<br/>anomalies.py<br/>lag_windows.py"]
end
subgraph Output["Output Layer"]
direction TB
RENDERER["renderer/<br/>interactive.py<br/>static.py"]
EXPORT["export/<br/>json_exporter.py<br/>dot_exporter.py"]
end
subgraph Interface["User Interface"]
CLI["cli/main.py<br/>Click: render, export,<br/>lag-windows"]
end
MODELS --> FETCHER --> GRAPH
GRAPH --> RENDERER
GRAPH --> EXPORT
CLI --> FETCHER --> GRAPH
GRAPH --> RENDERER --> HTML["topology.html"]
GRAPH --> EXPORT --> JSON["graph.json<br/>graph.dot"]
classDiagram
class Span {
+str trace_id
+str span_id
+str? parent_id
+str service_name
+str operation_name
+int start_time_micros
+int duration_micros
+bool is_error
+dict tags
+float duration_ms
+float duration_s
+bool is_root
Normalises: "0","" → None
}
class Trace {
+str trace_id
+list~Span~ spans
+str? root_service
+int num_spans
+bool is_simple
+dict~str,Span~ span_map
O(1) parent lookup
}
class DiGraph {
+set~str~ nodes
+dict edges
Per edge: weight, durations,
avg_ms, p50_ms, p95_ms, p99_ms,
min_ms, max_ms
}
class LagWindows {
+dict edge → minutes
1.0 min (sync RPCs)
23.9 min (async queues)
}
Span "*" -- "1" Trace : contained in
Trace "*" -- "1" DiGraph : builds
DiGraph "1" -- "1" LagWindows : computes
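The Span and Trace contracts in the class diagram might look roughly like this as dataclasses. The field names follow the diagram; the defaults, the use of properties for the derived values, and the __post_init__ normalisation are assumptions of this sketch, and only a subset of the Trace members is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service_name: str
    operation_name: str
    start_time_micros: int
    duration_micros: int
    is_error: bool = False
    tags: dict = field(default_factory=dict)

    def __post_init__(self):
        # Normalise the "no parent" markers named in the diagram to None.
        if self.parent_id in ("", "0"):
            self.parent_id = None

    @property
    def duration_ms(self) -> float:
        return self.duration_micros / 1000.0

    @property
    def duration_s(self) -> float:
        return self.duration_micros / 1_000_000.0

    @property
    def is_root(self) -> bool:
        return self.parent_id is None


@dataclass
class Trace:
    trace_id: str
    spans: List[Span]

    @property
    def span_map(self) -> Dict[str, Span]:
        """Index spans by ID so a parent can be found in O(1)."""
        return {span.span_id: span for span in self.spans}

    @property
    def root_service(self) -> Optional[str]:
        for span in self.spans:
            if span.is_root:
                return span.service_name
        return None


if __name__ == "__main__":
    root = Span("t1", "a", "", "frontend", "GET /cart", 0, 60_000)
    child = Span("t1", "b", "a", "cartservice", "GetCart", 1_000, 45_200)
    trace = Trace("t1", [root, child])
    print(trace.root_service, child.duration_ms, trace.span_map["a"].is_root)
```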
flowchart TD
START["For each service in the graph"] --> GATHER["Gather every span duration<br/>involving this service<br/>(all edges in + out)"]
GATHER --> BASELINE["Find the P95 — the speed<br/>that 95% of requests are<br/>faster than"]
BASELINE --> THRESHOLD["Double it: 2 × P95<br/>Anything slower than this<br/>is 'anomalous'"]
THRESHOLD --> COUNT["Count: how many spans<br/>crossed this line?"]
COUNT --> SCORE["Score = anomalous ÷ total"]
SCORE --> CHECK{"What's the score?"}
CHECK -->|"< 0.05 (less than 5%)"| HEALTHY["🟢 HEALTHY<br/>Everything looks normal"]
CHECK -->|"0.05 to 0.15"| DEGRADED["🟡 DEGRADED<br/>Something's a bit off"]
CHECK -->|"> 0.15 (more than 15%)"| CRITICAL["🔴 CRITICAL<br/>This service needs attention"]
HEALTHY --> EXAMPLE
DEGRADED --> EXAMPLE
CRITICAL --> EXAMPLE
EXAMPLE["Example: cartservice<br/>408 spans, P95=52ms<br/>threshold=104ms<br/>3 spans exceed → 3/408 = 0.007<br/>→ HEALTHY ✓"]
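The scoring flow in the diagram can be sketched in a few functions. This is an illustration rather than the code in anomalies.py: the percentile helper is a pure-Python stand-in using linear interpolation (the default method of numpy.percentile), and all function names are hypothetical. The thresholds follow the diagram: below 0.05 is healthy, 0.05 to 0.15 is degraded, above 0.15 is critical.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def anomaly_score(durations_ms, baseline_p95=None):
    """Fraction of spans slower than twice the P95 baseline."""
    if baseline_p95 is None:
        baseline_p95 = percentile(durations_ms, 95)
    threshold = 2 * baseline_p95
    anomalous = sum(1 for duration in durations_ms if duration > threshold)
    return anomalous / len(durations_ms)


def classify(score):
    """Map a score to the three status levels from the diagram."""
    if score < 0.05:
        return "healthy"
    if score <= 0.15:
        return "degraded"
    return "critical"


if __name__ == "__main__":
    # Synthetic data shaped like the cartservice example: 408 spans, 3 very slow.
    durations = [40.0] * 405 + [300.0] * 3
    score = anomaly_score(durations)
    print(f"score={score:.3f} status={classify(score)}")
```

The optional baseline_p95 argument is an addition of this sketch; it allows a recent window to be scored against a baseline taken from an earlier one.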
flowchart LR
EDGE["For each edge<br/>(source → target)"] --> P99["Step 1: Get P99<br/>numpy.percentile(durations, 99)"]
P99 --> MINUTES["Step 2: To minutes<br/>P99_ms ÷ 1000 ÷ 60"]
MINUTES --> SAFETY["Step 3: Safety margin<br/>× 10 (incidents make<br/>things 5-10× slower)"]
SAFETY --> CLAMP["Step 4: Keep it reasonable<br/>clamp between 1 and 30 min"]
CLAMP --> SYNC["Sync RPC: 58ms → 1.0 min"]
CLAMP --> ASYNC["Async msg: 120s → 23.9 min"]
Why this matters for causal analysis: When an upstream service fails, the downstream symptom might appear minutes later. The lag window tells the causal analysis engine exactly how many minutes back to search when cross-correlating upstream and downstream anomalies to find the root cause.
JaegerViz/
│
├── src/
│ ├── models/
│ │ ├── span.py ← A single span (one step in a request chain)
│ │ └── trace.py ← A full trace (all the steps in one request)
│ │
│ ├── fetcher/
│ │ └── jaeger_client.py ← Talks to Jaeger, fetches and parses traces
│ │
│ ├── graph/
│ │ ├── builder.py ← Turns traces into a NetworkX directed graph
│ │ ├── anomalies.py ← Finds services that are slower than usual
│ │ └── lag_windows.py ← Computes per-edge timing windows
│ │
│ ├── renderer/
│ │ ├── interactive.py ← Builds the full HTML/JS interactive graph
│ │ └── static.py ← Builds PNG/SVG images via graphviz
│ │
│ ├── export/
│ │ ├── json_exporter.py ← Writes graph as node-link JSON
│ │ └── dot_exporter.py ← Writes graph as DOT (graphviz format)
│ │
│ ├── cli/
│ │ └── main.py ← The command-line interface (Click)
│ │
│ └── utils/
│ └── timing.py ← Timestamp helpers, lookback parser
│
├── tests/ ← 45 tests across 7 files
├── generate_sample_traces.py ← Creates fake trace data for testing
├── sample_traces.json ← The generated fake trace file
├── pyproject.toml ← Package config + entry point
├── requirements.txt ← Python dependencies
└── README.md ← You are here
# Run everything
pytest tests/ -v
# Run one file
pytest tests/test_graph_builder.py -v
# See full error details when something fails
pytest tests/ -v --tb=long
| Test file | What it checks |
|---|---|
| test_models.py | Span/Trace creation, parent normalisation, µs→ms conversion |
| test_jaeger_client.py | Trace JSON parsing, pagination, service name extraction |
| test_graph_builder.py | DiGraph building, parent-child matching, edge stats, subgraph filtering |
| test_anomalies.py | P95 baseline, 2× threshold, healthy/degraded/critical classification |
| test_lag_windows.py | P99→lag conversion, clamping, sync vs async edges, JSON export |
| test_export.py | JSON node-link format, DOT output, durations stripping |
| test_renderer.py | HTML generation, anomaly-aware rendering (graphviz tests skip if no dot) |
- Fetches traces from Jaeger API with pagination and rate limiting
- Loads traces from local JSON file (offline mode — no Jaeger needed)
- Builds accurate directed graph with per-edge call counts and raw durations
- Computes avg, P50, P95, P99 for every edge
- Anomaly detection: per-service P95 baseline, 2× threshold, three status levels
- Interactive HTML: dark theme, click-to-inspect side panel, search, keyboard shortcuts
- Per-edge lag windows: P99 × 10 safety factor, clamped [1, 30] minutes
- CLI with render, export, lag-windows, all self-documenting via --help
- JSON export (node-link) for downstream spectrum analysis · DOT export for graphviz
- Neighborhood filter: --service + --hops to zoom into one area
- Handles edge cases: missing parents, empty responses, single-span traces
- Works with any Jaeger instance (not hardcoded to one demo)
- 42 passing tests across all components