Add OpenTelemetry observability to AWX #16462

Draft
chrismeyersfsu wants to merge 2 commits into ansible:devel from chrismeyersfsu:telemetry

Conversation

@chrismeyersfsu chrismeyersfsu commented May 28, 2026

Member
SUMMARY

Adds comprehensive OpenTelemetry instrumentation to AWX for distributed tracing and observability.

Code Instrumentation:

  • Dispatcher: Task execution spans with function names, UUIDs, correlation IDs, exception tracking
  • Management commands: Observability for dispatcherd, cache_clear, callback_receiver, rsyslog_configurer, ws_heartbeat, wsrelay
  • ASGI/WSGI entry points: Request tracing

Infrastructure:

  • Add Tempo trace storage backend to docker-compose
  • Configure OTEL collector to route logs→Loki, traces→Tempo, metrics→Prometheus
  • Add Grafana Tempo datasource with traces-to-logs integration
  • Update Loki config for OTLP ingestion and structured metadata

Dependencies:

  • Upgrade ansible-base for observability utilities
  • Add opentelemetry-* tracing packages
ISSUE TYPE

New or Enhanced Feature

COMPONENT NAME
  • API
  • Other (Dispatcher, Management Commands, Observability Stack)
STEPS TO REPRODUCE AND EXTRA INFO

Usage:

OTEL=1 LOKI=1 TEMPO=1 GRAFANA=1 make docker-compose

Verification:

  1. Start AWX with observability enabled
  2. Trigger dispatcher tasks (visit UI, wait for periodic tasks)
  3. Query Tempo: curl -s "http://localhost:3200/api/search?tags=service.name=aap-controller-dispatcher" | jq
  4. View in Grafana at http://localhost:3001
    • Explore → Tempo datasource
    • Search for service: aap-controller-dispatcher
    • Verify span names = task function names
    • Verify attributes: task.uuid, task.name, task.module, correlation_id
    • Click trace→logs link to see correlated logs
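The Tempo query in step 3 can also be issued from Python. This is a minimal sketch, assuming Tempo's search endpoint is reachable at http://localhost:3200 as in the steps above and returns JSON; the names `build_tempo_search_url` and `search_traces` are hypothetical and not part of this PR.

```python
import json
import urllib.parse
import urllib.request

# Assumed from the verification steps above; adjust for other environments.
TEMPO_URL = "http://localhost:3200"


def build_tempo_search_url(service_name: str, base_url: str = TEMPO_URL) -> str:
    """Build a URL-encoded equivalent of the curl query in step 3."""
    query = urllib.parse.urlencode({"tags": f"service.name={service_name}"})
    return f"{base_url}/api/search?{query}"


def search_traces(service_name: str, timeout: float = 5.0) -> dict:
    """Fetch search results for one service and parse them as JSON."""
    url = build_tempo_search_url(service_name)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the docker-compose stack to be running with TEMPO=1.
    print(json.dumps(search_traces("aap-controller-dispatcher"), indent=2))
```

The URL builder is kept separate from the network call so it can be checked without a running Tempo instance.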

Before:

  • No dispatcher tracing
  • No trace correlation with logs
  • Limited observability into async task execution

After:

  • Rich dispatcher spans with function names as routes
  • Full trace→log correlation via service.name and trace/span IDs
  • Exception tracking in spans
  • Metrics exported to Prometheus

Instrument AWX components with OpenTelemetry tracing:

Code Changes:
- Add telemetry to dispatcher task execution (dispatcherd.py, task.py)
  - Span name = task function name (e.g., 'apply_cluster_membership_policies')
  - Attributes: task.uuid, task.name, task.module, correlation_id
  - Handle lambda broker tasks with normalized span names
  - Capture exceptions with span.record_exception()

- Add observability to management commands
  - dispatcherd, run_cache_clear, run_callback_receiver
  - run_rsyslog_configurer, run_ws_heartbeat, run_wsrelay
  - Service names: aap-controller-{component}

- Add observability to ASGI/WSGI entry points
  - Service name: aap-controller-uwsgi

Infrastructure:
- Add TEMPO env var to Makefile
- Add enable_otel/enable_loki/enable_tempo flags to ansible
- Add Tempo container to docker-compose
  - Service mesh network for OTEL→Loki/Tempo communication
  - Volume mounts for configs and data storage

- Configure OTEL collector
  - Export logs→Loki, traces→Tempo, metrics→Prometheus
  - File exporter with compression for backup

- Configure Loki
  - Enable structured metadata (required for OTLP)
  - Increase rate limits for high-volume logging
  - 3y retention period

- Add Grafana Tempo datasource
  - Traces-to-logs integration
  - Map trace attributes to Loki labels
  - Enable trace/span ID filtering

Dependencies:
- Upgrade ansible-base for observability utilities
- Add opentelemetry-* packages for tracing

Usage: OTEL=1 LOKI=1 TEMPO=1 GRAFANA=1 make docker-compose
Signed-off-by: Chris Meyers <chris.meyers.fsu@gmail.com>

Co-authored-by: Claude (Anthropic) <claude@anthropic.com>
@github-actions github-actions Bot added the component:api and dependencies (Pull requests that update a dependency file) labels May 28, 2026
coderabbitai Bot commented May 28, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3cc7b034-67d6-40e8-9b94-75140d789975

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR integrates OpenTelemetry distributed tracing into AWX. It upgrades OpenTelemetry dependencies to 1.39.0+, adds observability initialization to all application entry points (ASGI, WSGI, management commands), instruments task dispatch workers and task execution with span wrapping and error tracking, and provisions development-time observability services (Grafana, Loki, Tempo, OTel Collector) via Docker Compose with updated configurations.

Changes

OpenTelemetry Observability Integration

| Layer | File(s) | Summary |
| --- | --- | --- |
| OpenTelemetry Dependencies and Requirements | requirements/requirements.in, requirements/requirements.txt, requirements/requirements_git.txt | OpenTelemetry API and SDK bumped to >=1.39.0; full dependency lock regenerated with upgraded async/networking, cloud SDKs, Django stack, and observability tooling; django-ansible-base git requirement adds observability and resource-registry extras. |
| Service Entry Points: Observability Initialization | awx/asgi.py, awx/wsgi.py, awx/main/management/commands/* | ASGI (Daphne), WSGI (uWSGI), and six management commands (dispatcherd, cache-clear, callback-receiver, rsyslog-configurer, ws-heartbeat, wsrelay) import and invoke setup_observability() with service-specific names at startup. |
| Task Dispatch Worker Process: Observability and Span Wrapping | awx/main/dispatch/worker/dispatcherd.py | AWXTaskWorker.on_start() initializes observability in worker process (post-fork); run_callable() overridden to wrap task execution in OpenTelemetry spans with task metadata, correlation ID, span attributes, and structured error handling (status, exception recording, re-raise). |
| Task Execution: Span Instrumentation at Task Level | awx/main/dispatch/worker/task.py | run_callable() wrapped in OpenTelemetry span named after task function; span records task metadata, correlation ID, message delay, execution status, and exception details; successful execution marked OK, exceptions marked ERROR with metadata. |
| Development Environment: Observability Services and Configuration | Makefile, tools/docker-compose/ansible/roles/sources/defaults/main.yml, tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2, tools/loki/local-config.yaml, tools/otel/otel-collector-config.yaml, tools/grafana/datasources/tempo_source.yml | Makefile adds TEMPO toggle; Ansible defaults add enable_otel/enable_loki/enable_tempo flags; Docker Compose template conditionally deploys Grafana (new image/network/auth), OpenTelemetry Collector (updated image/command/mounts), Loki, and Tempo with inter-service dependencies; Loki migrated to schema v13 (tsdb storage) with structured metadata support for OTLP; OTel Collector routes logs to Loki and traces to Tempo; Grafana Tempo datasource provisioned with trace-to-logs mapping and node graph visualization. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Add OpenTelemetry observability to AWX' accurately summarizes the main objective of the pull request: comprehensive OpenTelemetry instrumentation and observability infrastructure additions across AWX services. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@chrismeyersfsu chrismeyersfsu marked this pull request as draft May 28, 2026 18:57

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awx/main/dispatch/worker/dispatcherd.py`:
- Around line 79-80: The trace currently attaches the raw exception string via
span.set_attribute("task.error_message", str(e)) which may leak secrets; update
the error handling in dispatcherd (where span.set_attribute("task.error_type",
type(e).__name__) and span.set_attribute("task.error_message", str(e)) are set)
to stop exporting the raw message: keep or set only the error type attribute,
and replace the raw text attachment with span.record_exception(e) (or, if you
must include text, apply a sanitization/redaction function before setting an
attribute) so traces contain error metadata without leaking sensitive content.

In `@awx/main/dispatch/worker/task.py`:
- Around line 87-88: The code currently attaches raw exception text via
span.set_attribute("task.error_message", str(e)) which can leak secrets/PII;
instead remove the raw message attribute and rely on span.record_exception(e) to
record the exception and keep span.set_attribute("task.error_type",
type(e).__name__) for structured type info, or if a message attribute is
required, set a strictly redacted/sanitized message (e.g., "redacted" or a small
safe summary extracted by a sanitize_exception function) before calling
span.set_attribute; update the error-handling block around span.set_attribute
and span.record_exception to implement this change (referencing span,
task.error_message, task.error_type, record_exception, and the exception
variable e).

In `@tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2`:
- Line 169: Replace the four observability images that currently use the :latest
tag with fixed, versioned tags or digests to make deployments reproducible:
update docker.io/grafana/grafana:latest,
ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest,
docker.io/grafana/loki:latest, and docker.io/grafana/tempo:latest to specific
semantic-version tags (or content digests) and commit those values into the
docker-compose.yml.j2 template so Grafana, OpenTelemetry Collector, Loki, and
Tempo use pinned versions.
- Around line 184-192: The Grafana service currently renders a depends_on key
even when no backends are enabled; update the docker-compose Jinja template to
conditionally render the entire depends_on block only when at least one of
enable_prometheus, enable_loki, or enable_tempo is true (i.e., wrap the
depends_on: and its list items with a single if that checks those flags) and
ensure the list still emits prometheus/loki/tempo entries using the existing
checks (enable_prometheus, enable_loki, enable_tempo). Additionally, replace any
image references using :latest for the grafana, opentelemetry-collector-contrib,
loki, and tempo images with pinned version tags or digests so those symbols
(grafana image, opentelemetry-collector-contrib image, loki image, tempo image)
use deterministic, non-:latest identifiers.

In `@tools/grafana/datasources/tempo_source.yml`:
- Line 12: The Tempo datasource references jsonData.tracesToLogs.datasourceUid:
'P8E80F9AEF21F6940' (and dashboards also reference that UID) but the Loki
datasource provisioning lacks an explicit uid, so trace-to-logs will fail; open
the Loki datasource provisioning file where type: loki is defined and add uid:
'P8E80F9AEF21F6940' at the top level of that datasource entry so the Loki
datasource uid matches Tempo’s tracesToLogs.datasourceUid and the dashboard
references.

In `@tools/loki/local-config.yaml`:
- Line 9: Update the misleading comments in the Loki config: change the note
about match_max_concurrent to state that frontend_worker.match_max_concurrent is
still supported (remove "not supported in newer Loki versions") so the comment
next to frontend_worker.match_max_concurrent reflects support; and move or
update the comment for split_queries_by_interval to indicate that setting can be
set to 0 to disable query splitting but it belongs under limits_config (not
under query_range), so adjust the comment near split_queries_by_interval to
reference placement incompatibility rather than claiming "0" is unsupported.

In `@tools/otel/otel-collector-config.yaml`:
- Line 26: The Loki and Tempo exporter blocks currently set insecure: true which
disables TLS verification; update the otel collector config by changing
insecure: true to insecure: false for production and make this value
configurable (e.g., via an environment variable or config templating) so the
loki and tempo exporter sections can use a boolean flag (referencing the loki
and tempo exporter blocks and the insecure field) to switch between development
(true) and production (false) deployments; ensure docs/helm/manifest overrides
reflect the new env/config var.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 20606100-fafc-48d1-a189-23cc0ecca31c

📥 Commits

Reviewing files that changed from the base of the PR and between 9b922f7 and 7f82fe4.

📒 Files selected for processing (19)
  • Makefile
  • awx/asgi.py
  • awx/main/dispatch/worker/dispatcherd.py
  • awx/main/dispatch/worker/task.py
  • awx/main/management/commands/dispatcherd.py
  • awx/main/management/commands/run_cache_clear.py
  • awx/main/management/commands/run_callback_receiver.py
  • awx/main/management/commands/run_rsyslog_configurer.py
  • awx/main/management/commands/run_ws_heartbeat.py
  • awx/main/management/commands/run_wsrelay.py
  • awx/wsgi.py
  • requirements/requirements.in
  • requirements/requirements.txt
  • requirements/requirements_git.txt
  • tools/docker-compose/ansible/roles/sources/defaults/main.yml
  • tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
  • tools/grafana/datasources/tempo_source.yml
  • tools/loki/local-config.yaml
  • tools/otel/otel-collector-config.yaml

Comment on lines +79 to +80
span.set_attribute("task.error_type", type(e).__name__)
span.set_attribute("task.error_message", str(e))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid exporting raw exception messages to tracing attributes.

task.error_message = str(e) can leak sensitive values into traces. Keep error type and rely on record_exception(e) (or sanitize/redact before attaching text).

Suggested fix
-                span.set_attribute("task.error_type", type(e).__name__)
-                span.set_attribute("task.error_message", str(e))
+                span.set_attribute("task.error_type", type(e).__name__)
                 span.record_exception(e)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
span.set_attribute("task.error_type", type(e).__name__)
span.set_attribute("task.error_message", str(e))
span.set_attribute("task.error_type", type(e).__name__)
span.record_exception(e)
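The suggestion above can be sketched as a small helper. This is an illustrative sketch, not the PR's code: `record_task_error` is a hypothetical name, and `FakeSpan` is a stand-in written only for this example so the helper can be exercised without the OpenTelemetry SDK. The helper relies only on the `set_attribute` and `record_exception` methods that the review comment already references.

```python
class FakeSpan:
    """Stand-in for a span; records calls so the helper can be checked.

    Not part of OpenTelemetry. It mirrors only the two span methods
    used in the review comment above.
    """

    def __init__(self):
        self.attributes = {}
        self.exceptions = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def record_exception(self, exc):
        self.exceptions.append(exc)


def record_task_error(span, exc: BaseException) -> None:
    """Attach error metadata to a span without a raw-message attribute."""
    # Structured type info only; str(exc) is deliberately not attached.
    span.set_attribute("task.error_type", type(exc).__name__)
    span.record_exception(exc)


span = FakeSpan()
record_task_error(span, ValueError("password=hunter2"))
print(span.attributes)
```

One caveat worth checking before relying on this: in the OpenTelemetry Python SDK, `record_exception` stores the exception message and stack trace on a span event, so raw text may still reach the backend through that path even when the `task.error_message` attribute is dropped.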

Comment on lines +87 to +88
span.set_attribute("task.error_type", type(e).__name__)
span.set_attribute("task.error_message", str(e))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not attach raw exception text to span attributes.

Publishing str(e) to task.error_message risks secret/PII exposure in telemetry backends. Keep structured type info and recorded exception, or apply strict redaction.

Suggested fix
             span.set_status(Status(StatusCode.ERROR, str(e)))
             span.set_attribute("task.error_type", type(e).__name__)
-            span.set_attribute("task.error_message", str(e))
             span.record_exception(e)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
span.set_attribute("task.error_type", type(e).__name__)
span.set_attribute("task.error_message", str(e))
span.set_attribute("task.error_type", type(e).__name__)
span.record_exception(e)

 {% if enable_grafana|bool %}
 grafana:
-image: mirror.gcr.io/grafana/grafana-enterprise:12.3.4
+image: docker.io/grafana/grafana:latest

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"

echo "== Check file exists =="
ls -la "$FILE"

echo
echo "== Show lines around 160-180 =="
nl -ba "$FILE" | sed -n '150,190p'

echo
echo "== Show lines mentioned by the review comment (243, 268, 283) =="
for L in 243 268 283; do
  echo "--- line $L ---"
  nl -ba "$FILE" | sed -n "${L},${L}p"
done

echo
echo "== Find all occurrences of ':latest' in the template =="
rg -n ":[Ll]atest\b" "$FILE" || true

echo
echo "== Find all occurrences of grafana/loki/tempo/otel in the template =="
rg -n "(grafana|loki|tempo|otel|opentelemetry|collector)" "$FILE" || true

Repository: ansible/awx

Length of output: 299


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"

echo "== Show lines around 160-180 =="
awk -v start=150 -v end=190 'NR>=start && NR<=end {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,120p'

echo
echo "== Show lines mentioned by the review comment (243, 268, 283) =="
for L in 243 268 283; do
  awk -v n="$L" 'NR==n {printf "%d:%s\n", NR, $0}' "$FILE"
done

echo
echo "== Find all occurrences of ':latest' in the template =="
rg -n ":[Ll]atest\b" "$FILE" || true

echo
echo "== Find all occurrences of grafana/loki/tempo/otel in the template =="
rg -n "(grafana|loki|tempo|otel|opentelemetry|collector)" "$FILE" || true

Repository: ansible/awx

Length of output: 3833


Pin observability container image tags instead of using :latest

tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 uses :latest for the observability stack, making local/dev deployments non-reproducible:

  • Grafana: docker.io/grafana/grafana:latest (169)
  • OpenTelemetry Collector: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest (243)
  • Loki: docker.io/grafana/loki:latest (268)
  • Tempo: docker.io/grafana/tempo:latest (283)

Pin these to specific version tags (or digests) for stable, predictable behavior.
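A check like this could be automated. The following is a minimal sketch, assuming image references appear on their own `image:` lines as in the template above; `find_unpinned_images` is a hypothetical helper, not something in the AWX repository.

```python
import re

# Matches "image: <ref>" where <ref> has no whitespace. Templated values
# such as "{{ name }}:{{ tag }}" contain spaces and are skipped.
IMAGE_LINE = re.compile(r"^\s*image:\s*(\S+)\s*$")


def find_unpinned_images(template_text: str) -> list[tuple[int, str]]:
    """Return (line number, image) pairs that use :latest or no tag."""
    unpinned = []
    for lineno, line in enumerate(template_text.splitlines(), start=1):
        match = IMAGE_LINE.match(line)
        if not match:
            continue
        image = match.group(1)
        # Digest-pinned references (name@sha256:...) count as pinned.
        if "@" in image:
            continue
        # The tag follows the first ':' in the final path component, so a
        # registry port such as localhost:5000 is not mistaken for a tag.
        name = image.rsplit("/", 1)[-1]
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag in ("", "latest"):
            unpinned.append((lineno, image))
    return unpinned


if __name__ == "__main__":
    path = "tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
    with open(path, encoding="utf-8") as handle:
        for lineno, image in find_unpinned_images(handle.read()):
            print(f"{lineno}: {image}")
```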

Comment on lines +184 to +192
{% if enable_prometheus|bool %}
- prometheus
{% endif %}
{% if enable_loki|bool %}
- loki
{% endif %}
{% if enable_tempo|bool %}
- tempo
{% endif %}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"

echo "== File exists =="
test -f "$FILE" && echo "Found: $FILE"

echo
echo "== Inspect depends_on block around requested lines =="
sed -n '150,230p' "$FILE" | nl -ba

echo
echo "== Search for :latest in this template =="
rg -n --hidden --no-ignore -S ":latest" "$FILE" || true

echo
echo "== Search for grafana service section to see exact depends_on rendering =="
rg -n "^\s*grafana\s*:" -S "$FILE" || true
rg -n "depends_on" "$FILE" | head -n 50

echo
echo "== Show full depends_on block(s) for context =="
# Print chunks containing depends_on and a bit after/before
rg -n "depends_on" "$FILE" | while IFS=: read -r ln _; do
  start=$((ln-20)); end=$((ln+40));
  sed -n "${start},${end}p" "$FILE" | nl -ba | sed "s/^/$FILE (around depends_on at line $ln): /" | head -n 120
done

Repository: ansible/awx

Length of output: 272


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"

echo "== Inspect depends_on block around lines 150-230 (numbered via cat -n) =="
sed -n '150,230p' "$FILE" | cat -n

echo
echo "== Search for :latest in this template =="
rg -n --hidden --no-ignore -S ":latest" "$FILE" || true

echo
echo "== Show grafana service header (if present) =="
rg -n "grafana\s*:" -S "$FILE" || true

echo
echo "== Show depends_on occurrences =="
rg -n "depends_on" "$FILE" || true

echo
echo "== Print depends_on blocks with a bit of surrounding context =="
# For each depends_on line, print ~20 lines around it (bounded)
rg -n "depends_on" "$FILE" | head -n 50 | while IFS=: read -r ln _; do
  start=$((ln-20)); end=$((ln+60));
  (( start<1 )) && start=1
  echo "---- $FILE around depends_on at line $ln ----"
  sed -n "${start},${end}p" "$FILE" | cat -n
done

Repository: ansible/awx

Length of output: 17841


Guard Grafana depends_on and pin observability images

  • Grafana currently renders depends_on: unconditionally; if prometheus/loki/tempo are all disabled, the key renders with no entries (effectively depends_on: null), which can break Compose parsing. Wrap depends_on itself with the same condition as the list items.
Suggested fix
-    depends_on:
-{% if enable_prometheus|bool %}
+{% if enable_prometheus|bool or enable_loki|bool or enable_tempo|bool %}
+    depends_on:
+{% if enable_prometheus|bool %}
       - prometheus
 {% endif %}
 {% if enable_loki|bool %}
       - loki
 {% endif %}
 {% if enable_tempo|bool %}
       - tempo
 {% endif %}
+{% endif %}
  • Replace :latest tags for grafana, opentelemetry-collector-contrib, loki, and tempo with pinned versions (or digests) to restore deterministic builds and tighter supply-chain control.
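The guard logic in the suggested fix can be expressed outside Jinja to make the intended behavior explicit. This is only an illustration in Python, with a hypothetical function name and an assumed indentation; the actual change belongs in the template.

```python
def render_depends_on(enable_prometheus: bool, enable_loki: bool, enable_tempo: bool) -> str:
    """Emit a depends_on block only when at least one backend is enabled."""
    backends = [
        name
        for name, enabled in (
            ("prometheus", enable_prometheus),
            ("loki", enable_loki),
            ("tempo", enable_tempo),
        )
        if enabled
    ]
    if not backends:
        # Omit the key entirely rather than rendering "depends_on:" with no items.
        return ""
    lines = ["    depends_on:"] + [f"      - {name}" for name in backends]
    return "\n".join(lines) + "\n"


print(render_depends_on(True, False, True), end="")
```

With all three flags off the function returns an empty string, which corresponds to the outer `{% if ... %}` in the suggested fix skipping the whole block.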

jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: 'P8E80F9AEF21F6940'

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check Loki datasource provisioning configuration for UID

# Find Loki datasource provisioning files
fd -e yml -e yaml . tools/grafana/datasources/ -x cat {}

# Look for Loki datasource UID in provisioning configs
rg -n "uid.*P8E80F9AEF21F6940" tools/grafana/

Repository: ansible/awx

Length of output: 1633


Set an explicit Loki datasource uid to match Tempo’s tracesToLogs.datasourceUid.

tools/grafana/datasources/tempo_source.yml uses jsonData.tracesToLogs.datasourceUid: 'P8E80F9AEF21F6940' (and tools/grafana/dashboards/services_dashboard.json also references that UID), but the Loki datasource provisioning under tools/grafana/datasources/ does not define any uid: field—so the referenced UID is unlikely to exist and trace-to-logs will break. Add uid: 'P8E80F9AEF21F6940' to the Loki datasource provisioning so it matches Tempo (and the dashboard).
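The mismatch described above is the kind of thing a small consistency check can catch. The sketch below assumes the provisioning files have already been parsed into a list of datasource dictionaries (for example with a YAML loader); `find_dangling_trace_to_logs` is a hypothetical helper, and the sample entries are illustrative rather than copied from the repository.

```python
def find_dangling_trace_to_logs(datasources: list[dict]) -> list[str]:
    """List datasources whose tracesToLogs target UID is not provisioned."""
    known_uids = {ds["uid"] for ds in datasources if ds.get("uid")}
    dangling = []
    for ds in datasources:
        json_data = ds.get("jsonData") or {}
        target = (json_data.get("tracesToLogs") or {}).get("datasourceUid")
        if target and target not in known_uids:
            dangling.append(f"{ds.get('name', '<unnamed>')} -> {target}")
    return dangling


# Illustrative entries: Tempo points at a UID that Loki never declares.
tempo = {
    "name": "Tempo",
    "type": "tempo",
    "jsonData": {"tracesToLogs": {"datasourceUid": "P8E80F9AEF21F6940"}},
}
loki_without_uid = {"name": "Loki", "type": "loki"}
print(find_dangling_trace_to_logs([tempo, loki_without_uid]))
```

Adding `"uid": "P8E80F9AEF21F6940"` to the Loki entry makes the returned list empty, which matches the fix the review asks for.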

 frontend_worker:
-match_max_concurrent: true
+# match_max_concurrent: true # Removed - not supported in newer Loki versions

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check Loki version and validate unsupported config settings

# Extract Loki image version from docker-compose
rg -n "loki.*image:" tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2

# Check Loki documentation for these settings
echo "Verify the following in Loki documentation for the version found above:"
echo "1. frontend_worker.match_max_concurrent support"
echo "2. query_range.split_queries_by_interval: 0 support"

Repository: ansible/awx

Length of output: 227


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Find Loki image tags in docker-compose template =="
rg -n --no-heading "loki.*image:" tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 || true

echo
echo "== Show commented settings in tools/loki/local-config.yaml =="
if [ -f tools/loki/local-config.yaml ]; then
  rg -n --no-heading "match_max_concurrent|split_queries_by_interval" tools/loki/local-config.yaml || true
  echo
  echo "== Show surrounding lines for context (lines 1-120) =="
  nl -ba tools/loki/local-config.yaml | sed -n '1,120p'
else
  echo "Missing tools/loki/local-config.yaml"
fi

Repository: ansible/awx

Length of output: 494


🌐 Web query:

Loki configuration frontend_worker.match_max_concurrent removed unsupported version

💡 Result:

The configuration setting frontend_worker.match_max_concurrent was not removed from Loki [1][2]; rather, it is a functional setting that superseded the older frontend_worker.parallelism configuration [1][2]. If you are encountering an error indicating an unsupported version or a failure to parse the configuration, it is likely due to one of the following reasons:

  1. Deprecated or Removed Configuration: While match_max_concurrent remains, other configuration parameters in the frontend_worker or querier blocks may have been removed or moved in newer versions of Loki (e.g., v3.0+) [2][3][4]. Check your Loki logs for specific "field not found" errors, which will identify exactly which key is no longer supported [3][4].
  2. Configuration Syntax/Structure Changes: Loki often requires configuration fields to be moved to different parent blocks (e.g., moving settings under limits_config or changing how storage_config is defined) during major version upgrades [5][3].
  3. Version Compatibility: If you are upgrading across major versions (e.g., v2.x to v3.x), review the official upgrade documentation [2]. Settings like querier.worker-parallelism have been removed entirely, and their functionality is now handled by querier.max-concurrent [2].

Recommended Action:

  • Examine the exact error message in your Loki logs to identify the specific unsupported field [3][4].
  • Consult the official Grafana Loki upgrade guide for your specific version jump to see which keys were deprecated or relocated [2].
  • Ensure your configuration file structure matches the requirements for your currently installed Loki version, as automated installs or default configs can sometimes lag behind the latest binary changes [3][4].

Citations:


🌐 Web query:

Loki configuration query_range.split_queries_by_interval support value 0

💡 Result:

Loki’s split_queries_by_interval supports the value 0 as a valid “disable splitting” setting. When the effective interval is 0, Loki’s query-range splitting middleware skips splitting and forwards the request unchanged (interval == 0 => return next.Do(ctx, r)). [1]

In older documentation/comments for this setting, “0 disables it” is stated alongside guidance to use a positive duration (e.g., in multiples of 24h) to avoid downloading/processing the same chunks across splits. [2]

Note: as of Loki v2.5.0, split_queries_by_interval can only be configured under limits_config (not query_range), and the default became 30m rather than 0; Loki fails to start if split_queries_by_interval is left under query_range. [3]

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Find Loki image tags in repo (templates) =="
rg -n --no-heading "loki.*image:" . || true

echo
echo "== Show tools/loki/local-config.yaml with line numbers (1-120) =="
wc -l tools/loki/local-config.yaml || true
cat -n tools/loki/local-config.yaml | sed -n '1,140p'

echo
echo "== Show exact blocks around match_max_concurrent (line ~9) =="
sed -n '1,25p' tools/loki/local-config.yaml

echo
echo "== Show exact blocks around split_queries_by_interval (lines ~45-110) =="
sed -n '40,110p' tools/loki/local-config.yaml

Repository: ansible/awx

Length of output: 6453


Correct the Loki “unsupported” compatibility notes in tools/loki/local-config.yaml

  • Line 9: The note “match_max_concurrent … not supported in newer Loki versions” is misleading—frontend_worker.match_max_concurrent is still a supported Loki setting.
  • Line 84: split_queries_by_interval: 0 is a valid way to disable query splitting, but the setting is expected under limits_config (not query_range); the comment should reflect the placement incompatibility rather than claiming 0 is unsupported.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loki/local-config.yaml` at line 9, Update the misleading comments in
the Loki config: change the note about match_max_concurrent to state that
frontend_worker.match_max_concurrent is still supported (remove "not supported
in newer Loki versions") so the comment next to
frontend_worker.match_max_concurrent reflects support; and move or update the
comment for split_queries_by_interval to indicate that setting can be set to 0
to disable query splitting but it belongs under limits_config (not under
query_range), so adjust the comment near split_queries_by_interval to reference
placement incompatibility rather than claiming "0" is unsupported.
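
The placement rule described in this finding can be illustrated with a small check. This is only a sketch under stated assumptions: the function name `check_split_interval_placement` and the two sample dictionaries are hypothetical, and the dictionaries stand in for an already-parsed Loki config rather than the actual contents of tools/loki/local-config.yaml.

```python
def check_split_interval_placement(config: dict) -> list:
    """Return notes about where split_queries_by_interval is set.

    `config` is assumed to be a Loki config already parsed into a dict.
    """
    notes = []
    query_range = config.get("query_range") or {}
    limits = config.get("limits_config") or {}

    # Per the review finding, the key is expected under limits_config.
    if "split_queries_by_interval" in query_range:
        notes.append(
            "split_queries_by_interval is under query_range; "
            "move it to limits_config"
        )

    # A value of 0 is valid: it disables query splitting.
    if limits.get("split_queries_by_interval") in (0, "0", "0s"):
        notes.append("query splitting is disabled (interval is 0)")

    return notes


# Hypothetical sample configs for illustration only.
misplaced = {"query_range": {"split_queries_by_interval": 0}}
corrected = {"limits_config": {"split_queries_by_interval": 0}}

print(check_split_interval_placement(misplaced))
print(check_split_interval_placement(corrected))
```

The check treats the two concerns separately: a misplaced key is reported as a placement problem, while a value of 0 under limits_config is reported only as information, since it is a supported way to turn splitting off.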

otlphttp/loki:
  endpoint: http://loki:3100/otlp
  tls:
    insecure: true

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Security: TLS disabled on exporters (acceptable for development only).

Both the Loki and Tempo exporters set insecure: true, which disables TLS on the exporter connection, so telemetry is sent unencrypted. Skipping only certificate verification is a separate setting, insecure_skip_verify. This is acceptable for local development but must not be used in production environments.

Also applies to: 31-31

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/otel/otel-collector-config.yaml` at line 26, The Loki and Tempo
exporter blocks currently set insecure: true, which disables TLS on the exporter
connection; update the otel collector config by changing insecure: true to
insecure: false for production and make this value configurable (e.g., via an
environment variable or config templating) so the loki and tempo exporter
sections can use a boolean flag (referencing the loki and tempo exporter blocks
and the insecure field) to switch between development (true) and production
(false) deployments; ensure docs/helm/manifest overrides reflect the new
env/config var.
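
One way to picture the configurable flag suggested above is a helper that derives the exporter's `tls` block from the environment and defaults to the secure setting. This is a minimal sketch, not the PR's implementation: the environment variable name `OTEL_EXPORTER_INSECURE` and both function names are assumptions made for illustration.

```python
import os

# Strings treated as "true" when read from the environment.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def exporter_tls_settings(env_var: str = "OTEL_EXPORTER_INSECURE") -> dict:
    """Build the `tls` mapping for an exporter.

    The env var name is hypothetical. When it is unset, the result
    defaults to insecure=False, i.e. the production-safe value.
    """
    raw = os.environ.get(env_var, "false")
    return {"insecure": raw.strip().lower() in _TRUE_VALUES}


def loki_exporter(endpoint: str = "http://loki:3100/otlp") -> dict:
    """Return an exporter block shaped like the otlphttp/loki entry."""
    return {"endpoint": endpoint, "tls": exporter_tls_settings()}


print(loki_exporter())
```

Defaulting to `insecure: false` means a development setup has to opt in explicitly, so a production deployment that forgets to set the variable still fails safe.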

When GRAFANA=1 but PROMETHEUS, LOKI, and TEMPO are all 0, the template generated:
  grafana:
    depends_on:

An empty depends_on is invalid. Add a conditional wrapper around the depends_on block.

Co-authored-by: Claude (Anthropic) <claude@anthropic.com>
Signed-off-by: Chris Meyers <chris.meyers.fsu@gmail.com>
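
The fix described in this commit, emitting depends_on only when at least one dependency is enabled, can be sketched outside the template as follows. This is an illustration under assumptions, not the template change itself: the function `grafana_service` and its parameters are hypothetical and simply mirror the PROMETHEUS, LOKI, and TEMPO flags.

```python
def grafana_service(prometheus: bool, loki: bool, tempo: bool) -> dict:
    """Build a compose-style service mapping for Grafana.

    Other service keys are omitted; only the depends_on logic is shown.
    """
    service = {}

    # Collect only the backends that are actually enabled.
    flags = (("prometheus", prometheus), ("loki", loki), ("tempo", tempo))
    deps = [name for name, enabled in flags if enabled]

    # Emit depends_on only when non-empty; an empty key would be invalid.
    if deps:
        service["depends_on"] = deps

    return service


print(grafana_service(False, False, False))
print(grafana_service(True, False, True))
```

With all three flags off, the key is absent rather than present and empty, which is the behavior the conditional wrapper in the template is meant to produce.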

Labels

component:api dependencies Pull requests that update a dependency file
