Add OpenTelemetry observability to AWX by chrismeyersfsu · Pull Request #16462 · ansible/awx

chrismeyersfsu · 2026-05-28T18:47:42Z

SUMMARY

Adds comprehensive OpenTelemetry instrumentation to AWX for distributed tracing and observability.

Code Instrumentation:

Dispatcher: Task execution spans with function names, UUIDs, correlation IDs, exception tracking
Management commands: Observability for dispatcherd, cache_clear, callback_receiver, rsyslog_configurer, ws_heartbeat, wsrelay
ASGI/WSGI entry points: Request tracing

Infrastructure:

Add Tempo trace storage backend to docker-compose
Configure OTEL collector to route logs→Loki, traces→Tempo, metrics→Prometheus
Add Grafana Tempo datasource with traces-to-logs integration
Update Loki config for OTLP ingestion and structured metadata

Dependencies:

Upgrade ansible-base for observability utilities
Add opentelemetry-* tracing packages

ISSUE TYPE

New or Enhanced Feature

COMPONENT NAME

API
Other (Dispatcher, Management Commands, Observability Stack)

STEPS TO REPRODUCE AND EXTRA INFO

Usage:

OTEL=1 LOKI=1 TEMPO=1 GRAFANA=1 make docker-compose

Verification:

Start AWX with observability enabled
Trigger dispatcher tasks (visit UI, wait for periodic tasks)
Query Tempo: curl -s "http://localhost:3200/api/search?tags=service.name=aap-controller-dispatcher" | jq
View in Grafana at http://localhost:3001
- Explore → Tempo datasource
- Search for service: aap-controller-dispatcher
- Verify span names = task function names
- Verify attributes: task.uuid, task.name, task.module, correlation_id
- Click trace→logs link to see correlated logs
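The Tempo query in the verification steps can also be scripted from Python. The sketch below uses only the standard library. It assumes Tempo's search API is on the port shown above and that the JSON response has the usual shape: a `traces` list whose entries carry `rootTraceName`. The `limit` value is an arbitrary choice.

```python
import json
import urllib.parse
import urllib.request

TEMPO_URL = "http://localhost:3200"  # Tempo HTTP port from the verification steps


def tempo_search_url(service_name: str, base_url: str = TEMPO_URL, limit: int = 20) -> str:
    """Build a Tempo /api/search URL filtered on the service.name attribute."""
    # urlencode escapes the inner '=' as %3D; it is decoded like any query parameter.
    query = urllib.parse.urlencode({"tags": f"service.name={service_name}", "limit": limit})
    return f"{base_url}/api/search?{query}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def root_span_names(search_response: dict) -> list:
    """Return the root span name of each trace in a Tempo search response."""
    return [trace.get("rootTraceName", "") for trace in search_response.get("traces", [])]
```

Against a running stack, `root_span_names(fetch_json(tempo_search_url("aap-controller-dispatcher")))` should list task function names such as `apply_cluster_membership_policies`, provided the dispatcher span is the trace root.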

Before:

No dispatcher tracing
No trace correlation with logs
Limited observability into async task execution

After:

Rich dispatcher spans named after the task function (playing the role an HTTP route plays for request spans)
Full trace→log correlation via service.name and trace/span IDs
Exception tracking in spans
Metrics exported to Prometheus

Instrument AWX components with OpenTelemetry tracing:

Code Changes:
- Add telemetry to dispatcher task execution (dispatcherd.py, task.py)
  - Span name = task function name (e.g., 'apply_cluster_membership_policies')
  - Attributes: task.uuid, task.name, task.module, correlation_id
  - Handle lambda broker tasks with normalized span names
  - Capture exceptions with span.record_exception()
- Add observability to management commands
  - dispatcherd, run_cache_clear, run_callback_receiver
  - run_rsyslog_configurer, run_ws_heartbeat, run_wsrelay
  - Service names: aap-controller-{component}
- Add observability to ASGI/WSGI entry points
  - Service name: aap-controller-uwsgi

Infrastructure:
- Add TEMPO env var to Makefile
- Add enable_otel/enable_loki/enable_tempo flags to ansible
- Add Tempo container to docker-compose
  - Service mesh network for OTEL→Loki/Tempo communication
  - Volume mounts for configs and data storage
- Configure OTEL collector
  - Export logs→Loki, traces→Tempo, metrics→Prometheus
  - File exporter with compression for backup
- Configure Loki
  - Enable structured metadata (required for OTLP)
  - Increase rate limits for high-volume logging
  - 3y retention period
- Add Grafana Tempo datasource
  - Traces-to-logs integration
  - Map trace attributes to Loki labels
  - Enable trace/span ID filtering

Dependencies:
- Upgrade ansible-base for observability utilities
- Add opentelemetry-* packages for tracing

Usage: OTEL=1 LOKI=1 TEMPO=1 GRAFANA=1 make docker-compose

Signed-off-by: Chris Meyers <chris.meyers.fsu@gmail.com>
Co-authored-by: Claude (Anthropic) <claude@anthropic.com>
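The span-naming rule above can be sketched as follows. The PR does not show how lambda tasks are normalized, so `task_span_name` and `task_span_attributes` are hypothetical helpers. The underlying issue is that every lambda has the `__name__` `<lambda>`. Falling back to a module-qualified name keeps span names stable and low-cardinality.

```python
from typing import Callable, Optional


def task_span_name(func: Callable) -> str:
    """Hypothetical helper: span name for a dispatcher task callable."""
    name = getattr(func, "__name__", None) or type(func).__name__
    if name == "<lambda>":
        # All lambdas share '<lambda>'; qualify by module so the name is
        # stable and unlikely to collide with a real task function name.
        module = getattr(func, "__module__", None) or "unknown"
        return f"{module}.lambda"
    return name


def task_span_attributes(func: Callable, uuid: str, correlation_id: Optional[str] = None) -> dict:
    """Hypothetical helper: the attribute set listed in the PR description."""
    attrs = {
        "task.uuid": uuid,
        "task.name": task_span_name(func),
        "task.module": getattr(func, "__module__", None) or "unknown",
    }
    if correlation_id:  # omit the attribute entirely when there is no correlation ID
        attrs["correlation_id"] = correlation_id
    return attrs
```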

coderabbitai · 2026-05-28T18:48:01Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3cc7b034-67d6-40e8-9b94-75140d789975


📝 Walkthrough


This PR integrates OpenTelemetry distributed tracing into AWX. It upgrades OpenTelemetry dependencies to 1.39.0+, adds observability initialization to all application entry points (ASGI, WSGI, management commands), instruments task dispatch workers and task execution with span wrapping and error tracking, and provisions development-time observability services (Grafana, Loki, Tempo, OTel Collector) via Docker Compose with updated configurations.

Changes

OpenTelemetry Observability Integration

Layer / File(s)	Summary
OpenTelemetry Dependencies and Requirements `requirements/requirements.in`, `requirements/requirements.txt`, `requirements/requirements_git.txt`	OpenTelemetry API and SDK bumped to >=1.39.0; full dependency lock regenerated with upgraded async/networking, cloud SDKs, Django stack, and observability tooling; django-ansible-base git requirement adds observability and resource-registry extras.
Service Entry Points — Observability Initialization `awx/asgi.py`, `awx/wsgi.py`, `awx/main/management/commands/*`	ASGI (Daphne), WSGI (uWSGI), and six management commands (dispatcherd, cache-clear, callback-receiver, rsyslog-configurer, ws-heartbeat, wsrelay) import and invoke `setup_observability()` with service-specific names at startup.
Task Dispatch Worker Process — Observability and Span Wrapping `awx/main/dispatch/worker/dispatcherd.py`	`AWXTaskWorker.on_start()` initializes observability in worker process (post-fork); `run_callable()` overridden to wrap task execution in OpenTelemetry spans with task metadata, correlation ID, span attributes, and structured error handling (status, exception recording, re-raise).
Task Execution — Span Instrumentation at Task Level `awx/main/dispatch/worker/task.py`	`run_callable()` wrapped in OpenTelemetry span named after task function; span records task metadata, correlation ID, message delay, execution status, and exception details; successful execution marked OK, exceptions marked ERROR with metadata.
Development Environment — Observability Services and Configuration `Makefile`, `tools/docker-compose/ansible/roles/sources/defaults/main.yml`, `tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2`, `tools/loki/local-config.yaml`, `tools/otel/otel-collector-config.yaml`, `tools/grafana/datasources/tempo_source.yml`	Makefile adds TEMPO toggle; Ansible defaults add enable_otel/enable_loki/enable_tempo flags; Docker Compose template conditionally deploys Grafana (new image/network/auth), OpenTelemetry Collector (updated image/command/mounts), Loki, and Tempo with inter-service dependencies; Loki migrated to schema v13 (tsdb storage) with structured metadata support for OTLP; OTel Collector routes logs to Loki and traces to Tempo; Grafana Tempo datasource provisioned with trace-to-logs mapping and node graph visualization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add OpenTelemetry observability to AWX' accurately summarizes the main objective of the pull request—comprehensive OpenTelemetry instrumentation and observability infrastructure additions across AWX services.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awx/main/dispatch/worker/dispatcherd.py`:
- Around line 79-80: The trace currently attaches the raw exception string via
span.set_attribute("task.error_message", str(e)) which may leak secrets; update
the error handling in dispatcherd (where span.set_attribute("task.error_type",
type(e).__name__) and span.set_attribute("task.error_message", str(e)) are set)
to stop exporting the raw message: keep or set only the error type attribute,
and replace the raw text attachment with span.record_exception(e) (or, if you
must include text, apply a sanitization/redaction function before setting an
attribute) so traces contain error metadata without leaking sensitive content.

In `@awx/main/dispatch/worker/task.py`:
- Around line 87-88: The code currently attaches raw exception text via
span.set_attribute("task.error_message", str(e)) which can leak secrets/PII;
instead remove the raw message attribute and rely on span.record_exception(e) to
record the exception and keep span.set_attribute("task.error_type",
type(e).__name__) for structured type info, or if a message attribute is
required, set a strictly redacted/sanitized message (e.g., "redacted" or a small
safe summary extracted by a sanitize_exception function) before calling
span.set_attribute; update the error-handling block around span.set_attribute
and span.record_exception to implement this change (referencing span,
task.error_message, task.error_type, record_exception, and the exception
variable e).

In `@tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2`:
- Line 169: Replace the four observability images that currently use the :latest
tag with fixed, versioned tags or digests to make deployments reproducible:
update docker.io/grafana/grafana:latest,
ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest,
docker.io/grafana/loki:latest, and docker.io/grafana/tempo:latest to specific
semantic-version tags (or content digests) and commit those values into the
docker-compose.yml.j2 template so Grafana, OpenTelemetry Collector, Loki, and
Tempo use pinned versions.
- Around line 184-192: The Grafana service currently renders a depends_on key
even when no backends are enabled; update the docker-compose Jinja template to
conditionally render the entire depends_on block only when at least one of
enable_prometheus, enable_loki, or enable_tempo is true (i.e., wrap the
depends_on: and its list items with a single if that checks those flags) and
ensure the list still emits prometheus/loki/tempo entries using the existing
checks (enable_prometheus, enable_loki, enable_tempo). Additionally, replace any
image references using :latest for the grafana, opentelemetry-collector-contrib,
loki, and tempo images with pinned version tags or digests so those symbols
(grafana image, opentelemetry-collector-contrib image, loki image, tempo image)
use deterministic, non-:latest identifiers.

In `@tools/grafana/datasources/tempo_source.yml`:
- Line 12: The Tempo datasource references jsonData.tracesToLogs.datasourceUid:
'P8E80F9AEF21F6940' (and dashboards also reference that UID) but the Loki
datasource provisioning lacks an explicit uid, so trace-to-logs will fail; open
the Loki datasource provisioning file where type: loki is defined and add uid:
'P8E80F9AEF21F6940' at the top level of that datasource entry so the Loki
datasource uid matches Tempo’s tracesToLogs.datasourceUid and the dashboard
references.

In `@tools/loki/local-config.yaml`:
- Line 9: Update the misleading comments in the Loki config: change the note
about match_max_concurrent to state that frontend_worker.match_max_concurrent is
still supported (remove "not supported in newer Loki versions") so the comment
next to frontend_worker.match_max_concurrent reflects support; and move or
update the comment for split_queries_by_interval to indicate that setting can be
set to 0 to disable query splitting but it belongs under limits_config (not
under query_range), so adjust the comment near split_queries_by_interval to
reference placement incompatibility rather than claiming "0" is unsupported.

In `@tools/otel/otel-collector-config.yaml`:
- Line 26: The Loki and Tempo exporter blocks currently set insecure: true which
disables TLS verification; update the otel collector config by changing
insecure: true to insecure: false for production and make this value
configurable (e.g., via an environment variable or config templating) so the
loki and tempo exporter sections can use a boolean flag (referencing the loki
and tempo exporter blocks and the insecure field) to switch between development
(true) and production (false) deployments; ensure docs/helm/manifest overrides
reflect the new env/config var.
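Below is one way to make the flag configurable, sketched under stated assumptions. `AWX_OTEL_EXPORTER_INSECURE` is a hypothetical variable name, and the helper only computes the value that would be templated into the exporters' `tls` block. The collector can also read environment variables directly in its YAML through `${env:NAME}` expansion. Note that, per the collector's `configtls` client settings, `insecure: true` turns off client transport security altogether. A separate setting, `insecure_skip_verify`, keeps TLS but skips certificate verification.

```python
import os
from typing import Mapping, Optional

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def exporter_tls_settings(
    env_var: str = "AWX_OTEL_EXPORTER_INSECURE",  # hypothetical variable name
    default: bool = False,  # secure unless explicitly overridden
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: the `tls` block to template into the loki/tempo exporters."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var)
    if raw is None or raw.strip() == "":
        insecure = default
    elif raw.strip().lower() in _TRUE:
        insecure = True
    elif raw.strip().lower() in _FALSE:
        insecure = False
    else:
        # Fail loudly rather than silently picking a TLS mode.
        raise ValueError(f"{env_var}={raw!r} is not a recognized boolean")
    return {"tls": {"insecure": insecure}}
```

In this sketch, the docker-compose dev stack would set the variable to `true`, and other deployments would keep the secure default.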


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 20606100-fafc-48d1-a189-23cc0ecca31c

📥 Commits

Reviewing files that changed from the base of the PR and between 9b922f7 and 7f82fe4.

📒 Files selected for processing (19)

Makefile
awx/asgi.py
awx/main/dispatch/worker/dispatcherd.py
awx/main/dispatch/worker/task.py
awx/main/management/commands/dispatcherd.py
awx/main/management/commands/run_cache_clear.py
awx/main/management/commands/run_callback_receiver.py
awx/main/management/commands/run_rsyslog_configurer.py
awx/main/management/commands/run_ws_heartbeat.py
awx/main/management/commands/run_wsrelay.py
awx/wsgi.py
requirements/requirements.in
requirements/requirements.txt
requirements/requirements_git.txt
tools/docker-compose/ansible/roles/sources/defaults/main.yml
tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
tools/grafana/datasources/tempo_source.yml
tools/loki/local-config.yaml
tools/otel/otel-collector-config.yaml

coderabbitai · 2026-05-28T18:57:21Z

+                span.set_attribute("task.error_type", type(e).__name__)
+                span.set_attribute("task.error_message", str(e))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid exporting raw exception messages to tracing attributes.

task.error_message = str(e) can leak sensitive values into traces. Keep error type and rely on record_exception(e) (or sanitize/redact before attaching text).

Suggested fix

-                span.set_attribute("task.error_type", type(e).__name__)
-                span.set_attribute("task.error_message", str(e))
+                span.set_attribute("task.error_type", type(e).__name__)
+                span.record_exception(e)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-span.set_attribute("task.error_type", type(e).__name__)
-span.set_attribute("task.error_message", str(e))
+span.set_attribute("task.error_type", type(e).__name__)
+span.record_exception(e)
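Consider the sanitization route mentioned above. In the OpenTelemetry Python SDK, `record_exception(e)` itself adds an `exception` event whose `exception.message` attribute is `str(e)` and whose `exception.stacktrace` attribute ends with the same message. Swapping the attribute for `record_exception` alone therefore relocates the raw text rather than removing it. The SDK merges the optional `attributes` argument over those defaults. A redacted summary can be passed that way, for example `span.record_exception(e, attributes={"exception.message": sanitize_exception(e)})`. `sanitize_exception` and its patterns are a hypothetical, minimal sketch:

```python
import re

# Hypothetical redaction rules; extend for the secret formats AWX actually sees.
_REDACTIONS = [
    # key=value / key: value pairs with a sensitive-looking key
    (re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"), r"\1\2[REDACTED]"),
    # "Bearer <token>" credentials
    (re.compile(r"(?i)\b(bearer\s+)\S+"), r"\1[REDACTED]"),
    # user:password@ credentials embedded in URLs
    (re.compile(r"(://[^/\s:@]+:)[^@\s]+@"), r"\1[REDACTED]@"),
]


def sanitize_exception(exc: BaseException, max_len: int = 200) -> str:
    """Hypothetical helper: a short, redacted summary of an exception message."""
    text = str(exc)
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    if len(text) > max_len:
        text = text[:max_len] + "..."  # cap attribute size
    return text
```

Pattern-based redaction is best-effort. If a leak is unacceptable, exporting only the error type is the safer default.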


coderabbitai · 2026-05-28T18:57:21Z

+            span.set_attribute("task.error_type", type(e).__name__)
+            span.set_attribute("task.error_message", str(e))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not attach raw exception text to span attributes.

Publishing str(e) to task.error_message risks secret/PII exposure in telemetry backends. Keep structured type info and recorded exception, or apply strict redaction.

Suggested fix

             span.set_status(Status(StatusCode.ERROR, str(e)))
             span.set_attribute("task.error_type", type(e).__name__)
-            span.set_attribute("task.error_message", str(e))
             span.record_exception(e)

📝 Committable suggestion


Suggested change

-span.set_attribute("task.error_type", type(e).__name__)
-span.set_attribute("task.error_message", str(e))
+span.set_attribute("task.error_type", type(e).__name__)
+span.record_exception(e)


coderabbitai · 2026-05-28T18:57:21Z

 {% if enable_grafana|bool %}
  grafana:
-    image: mirror.gcr.io/grafana/grafana-enterprise:12.3.4
+    image: docker.io/grafana/grafana:latest


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== Check file exists =="
ls -la "$FILE"
echo
echo "== Show lines around 160-180 =="
nl -ba "$FILE" | sed -n '150,190p'
echo
echo "== Show lines mentioned by the review comment (243, 268, 283) =="
for L in 243 268 283; do
  echo "--- line $L ---"
  nl -ba "$FILE" | sed -n "${L},${L}p"
done
echo
echo "== Find all occurrences of ':latest' in the template =="
rg -n ":[Ll]atest\b" "$FILE" || true
echo
echo "== Find all occurrences of grafana/loki/tempo/otel in the template =="
rg -n "(grafana|loki|tempo|otel|opentelemetry|collector)" "$FILE" || true

Repository: ansible/awx

Length of output: 299

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== Show lines around 160-180 =="
awk -v start=150 -v end=190 'NR>=start && NR<=end {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,120p'
echo
echo "== Show lines mentioned by the review comment (243, 268, 283) =="
for L in 243 268 283; do
  awk -v n="$L" 'NR==n {printf "%d:%s\n", NR, $0}' "$FILE"
done
echo
echo "== Find all occurrences of ':latest' in the template =="
rg -n ":[Ll]atest\b" "$FILE" || true
echo
echo "== Find all occurrences of grafana/loki/tempo/otel in the template =="
rg -n "(grafana|loki|tempo|otel|opentelemetry|collector)" "$FILE" || true

Repository: ansible/awx

Length of output: 3833

Pin observability container image tags instead of using :latest

tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 uses :latest for the observability stack, making local/dev deployments non-reproducible:

Grafana: docker.io/grafana/grafana:latest (169)

OpenTelemetry Collector: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest (243)

Loki: docker.io/grafana/loki:latest (268)

Tempo: docker.io/grafana/tempo:latest (283)

Pin these to specific version tags (or digests) for stable, predictable behavior.
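A small standard-library check can catch regressions to unpinned tags. It is a line-based heuristic, not a YAML or Jinja parser. It skips templated `image:` lines and digest-pinned references, and reports `:latest` or untagged images together with their line numbers.

```python
import re

# An `image:` value, optionally quoted; stops at whitespace, quotes, or a comment.
_IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^'\"\s#]+)")


def unpinned_images(template_text: str) -> list:
    """Return (line_number, image) pairs for :latest or untagged image references."""
    findings = []
    for lineno, line in enumerate(template_text.splitlines(), start=1):
        if "{{" in line:
            continue  # templated reference; cannot be resolved statically
        match = _IMAGE_LINE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if "@sha256:" in ref:
            continue  # pinned by digest
        # The tag follows ':' in the final path segment, so a registry port
        # such as localhost:5000/... is not mistaken for a tag.
        last_segment = ref.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else None
        if tag is None or tag == "latest":
            findings.append((lineno, ref))
    return findings
```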


coderabbitai · 2026-05-28T18:57:21Z

+{% if enable_prometheus|bool %}
      - prometheus
+{% endif %}
+{% if enable_loki|bool %}
+      - loki
+{% endif %}
+{% if enable_tempo|bool %}
+      - tempo
+{% endif %}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== File exists =="
test -f "$FILE" && echo "Found: $FILE"
echo
echo "== Inspect depends_on block around requested lines =="
sed -n '150,230p' "$FILE" | nl -ba
echo
echo "== Search for :latest in this template =="
rg -n --hidden --no-ignore -S ":latest" "$FILE" || true
echo
echo "== Search for grafana service section to see exact depends_on rendering =="
rg -n "^\s*grafana\s*:" -S "$FILE" || true
rg -n "depends_on" "$FILE" | head -n 50
echo
echo "== Show full depends_on block(s) for context =="
# Print chunks containing depends_on and a bit after/before
rg -n "depends_on" "$FILE" | while IFS=: read -r ln _; do
  start=$((ln-20)); end=$((ln+40))
  sed -n "${start},${end}p" "$FILE" | nl -ba | sed "s/^/$FILE (around depends_on at line $ln): /" | head -n 120
done

Repository: ansible/awx

Length of output: 272

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== Inspect depends_on block around lines 150-230 (numbered via cat -n) =="
sed -n '150,230p' "$FILE" | cat -n
echo
echo "== Search for :latest in this template =="
rg -n --hidden --no-ignore -S ":latest" "$FILE" || true
echo
echo "== Show grafana service header (if present) =="
rg -n "grafana\s*:" -S "$FILE" || true
echo
echo "== Show depends_on occurrences =="
rg -n "depends_on" "$FILE" || true
echo
echo "== Print depends_on blocks with a bit of surrounding context =="
# For each depends_on line, print ~20 lines around it (bounded)
rg -n "depends_on" "$FILE" | head -n 50 | while IFS=: read -r ln _; do
  start=$((ln-20)); end=$((ln+60)); (( start<1 )) && start=1
  echo "---- $FILE around depends_on at line $ln ----"
  sed -n "${start},${end}p" "$FILE" | cat -n
done

Repository: ansible/awx

Length of output: 17841

Guard Grafana depends_on and pin observability images

Grafana currently renders depends_on: unconditionally; if prometheus/loki/tempo are all disabled, the key renders with no entries (effectively depends_on: null), which can break Compose parsing. Wrap depends_on itself with the same condition as the list items.

Suggested fix

-    depends_on:
-{% if enable_prometheus|bool %}
+{% if enable_prometheus|bool or enable_loki|bool or enable_tempo|bool %}
+    depends_on:
+{% if enable_prometheus|bool %}
       - prometheus
 {% endif %}
 {% if enable_loki|bool %}
       - loki
 {% endif %}
 {% if enable_tempo|bool %}
       - tempo
 {% endif %}
+{% endif %}

Replace :latest tags for grafana, opentelemetry-collector-contrib, loki, and tempo with pinned versions (or digests) to restore deterministic builds and tighter supply-chain control.
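The guard's effect can be checked with a plain-Python mirror of the template logic. This is not Jinja, and `render_depends_on` and `render_depends_on_unguarded` are hypothetical helpers for illustration. With every backend disabled, the unguarded version leaves a bare `depends_on:` key, which YAML reads as null. The guarded version emits nothing.

```python
_BACKENDS = ("prometheus", "loki", "tempo")


def _enabled(flags: dict) -> list:
    """Backends whose enable_<name> flag is true, in template order."""
    return [name for name in _BACKENDS if flags.get(f"enable_{name}", False)]


def render_depends_on_unguarded(flags: dict) -> str:
    """Mirrors the current template: the key is always emitted."""
    lines = ["    depends_on:"] + [f"      - {name}" for name in _enabled(flags)]
    return "\n".join(lines) + "\n"


def render_depends_on(flags: dict) -> str:
    """Mirrors the suggested fix: emit the key only when a backend is enabled."""
    names = _enabled(flags)
    if not names:
        return ""  # no bare `depends_on:` (which YAML would read as null)
    lines = ["    depends_on:"] + [f"      - {name}" for name in names]
    return "\n".join(lines) + "\n"
```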


coderabbitai · 2026-05-28T18:57:21Z

+    jsonData:
+      httpMethod: GET
+      tracesToLogs:
+        datasourceUid: 'P8E80F9AEF21F6940'


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check Loki datasource provisioning configuration for UID
# Find Loki datasource provisioning files
fd -e yml -e yaml . tools/grafana/datasources/ -x cat {}
# Look for Loki datasource UID in provisioning configs
rg -n "uid.*P8E80F9AEF21F6940" tools/grafana/

Repository: ansible/awx

Length of output: 1633

Set an explicit Loki datasource uid to match Tempo’s tracesToLogs.datasourceUid.

tools/grafana/datasources/tempo_source.yml uses jsonData.tracesToLogs.datasourceUid: 'P8E80F9AEF21F6940' (and tools/grafana/dashboards/services_dashboard.json also references that UID), but the Loki datasource provisioning under tools/grafana/datasources/ does not define any uid: field—so the referenced UID is unlikely to exist and trace-to-logs will break. Add uid: 'P8E80F9AEF21F6940' to the Loki datasource provisioning so it matches Tempo (and the dashboard).
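A quick consistency check catches this class of bug. It assumes the provisioning files have already been parsed into dicts, for example with PyYAML's `yaml.safe_load`, and follow Grafana's provisioning layout: a `datasources` list whose entries have `uid` and `jsonData`. It reports every trace-to-logs `datasourceUid` that no provisioned datasource declares. Newer Grafana versions use the key `tracesToLogsV2`, so both keys are checked.

```python
def dangling_datasource_refs(provisioning_docs: list) -> list:
    """Return (datasource name, uid) for trace-to-logs links to unprovisioned uids."""
    datasources = [ds for doc in provisioning_docs for ds in doc.get("datasources", [])]
    known_uids = {ds["uid"] for ds in datasources if ds.get("uid")}
    dangling = []
    for ds in datasources:
        json_data = ds.get("jsonData") or {}
        # Older provisioning uses tracesToLogs; newer Grafana uses tracesToLogsV2.
        for key in ("tracesToLogs", "tracesToLogsV2"):
            ref = (json_data.get(key) or {}).get("datasourceUid")
            if ref and ref not in known_uids:
                dangling.append((ds.get("name", "<unnamed>"), ref))
    return dangling
```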


coderabbitai · 2026-05-28T18:57:21Z


 frontend_worker:
-  match_max_concurrent: true
+  # match_max_concurrent: true  # Removed - not supported in newer Loki versions


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check Loki version and validate unsupported config settings
# Extract Loki image version from docker-compose
rg -n "loki.*image:" tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
# Check Loki documentation for these settings
echo "Verify the following in Loki documentation for the version found above:"
echo "1. frontend_worker.match_max_concurrent support"
echo "2. query_range.split_queries_by_interval: 0 support"

Repository: ansible/awx

Length of output: 227

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== Find Loki image tags in docker-compose template =="
rg -n --no-heading "loki.*image:" tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 || true
echo
echo "== Show commented settings in tools/loki/local-config.yaml =="
if [ -f tools/loki/local-config.yaml ]; then
  rg -n --no-heading "match_max_concurrent|split_queries_by_interval" tools/loki/local-config.yaml || true
  echo
  echo "== Show surrounding lines for context (lines 1-120) =="
  nl -ba tools/loki/local-config.yaml | sed -n '1,120p'
else
  echo "Missing tools/loki/local-config.yaml"
fi

Repository: ansible/awx

Length of output: 494

🌐 Web query:

Loki configuration frontend_worker.match_max_concurrent removed unsupported version

💡 Result:

The configuration setting frontend_worker.match_max_concurrent was not removed from Loki [1][2]; rather, it is a functional setting that superseded the older frontend_worker.parallelism configuration [1][2]. If you are encountering an error indicating an unsupported version or a failure to parse the configuration, it is likely due to one of the following reasons:

1. Deprecated or Removed Configuration: While match_max_concurrent remains, other configuration parameters in the frontend_worker or querier blocks may have been removed or moved in newer versions of Loki (e.g., v3.0+) [2][3][4]. Check your Loki logs for specific "field not found" errors, which will identify exactly which key is no longer supported [3][4].
2. Configuration Syntax/Structure Changes: Loki often requires configuration fields to be moved to different parent blocks (e.g., moving settings under limits_config or changing how storage_config is defined) during major version upgrades [5][3].
3. Version Compatibility: If you are upgrading across major versions (e.g., v2.x to v3.x), review the official upgrade documentation [2]. Settings like querier.worker-parallelism have been removed entirely, and their functionality is now handled by querier.max-concurrent [2].

Recommended Action:
- Examine the exact error message in your Loki logs to identify the specific unsupported field [3][4].
- Consult the official Grafana Loki upgrade guide for your specific version jump to see which keys were deprecated or relocated [2].
- Ensure your configuration file structure matches the requirements for your currently installed Loki version, as automated installs or default configs can sometimes lag behind the latest binary changes [3][4].

Citations:

1: operator: Replace frontend_worker parallelism with match_max_concurrent grafana/loki#5693

2: https://grafana.com/docs/loki/latest/setup/upgrade/

3: Broken configuration file when installing via apt on Ubuntu 22.04 grafana/loki#17736

4: [Bug] Loki service fails to start after installation on aarch64 grafana/loki#15421

5: Unrecognized storage client after schema upgrade grafana/loki#13141

🌐 Web query:

Loki configuration query_range.split_queries_by_interval support value 0

💡 Result:

Loki’s split_queries_by_interval supports the value 0 as a valid “disable splitting” setting. When the effective interval is 0, Loki’s query-range splitting middleware skips splitting and forwards the request unchanged (interval == 0 => return next.Do(ctx, r)). [1] In older documentation/comments for this setting, “0 disables it” is stated alongside guidance to use a positive duration (e.g., in multiples of 24h) to avoid downloading/processing the same chunks across splits. [2] Note: as of Loki v2.5.0, split_queries_by_interval can only be configured under limits_config (not query_range), and the default became 30m rather than 0; Loki fails to start if split_queries_by_interval is left under query_range. [3]

Citations:

1: https://github.com/grafana/loki/blob/06ce74b7/pkg/querier/queryrange/split_by_interval.go

2: loki failing when querying huge data grafana/loki#7084

3: https://grafana.com/docs/loki/latest/release-notes/v2-5/
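The splitting behavior described in this result can be modeled in a few lines. This is a simplified Python sketch, not Loki's Go implementation: it assumes sub-range boundaries are aligned to multiples of the interval since the Unix epoch, and `split_by_interval` is a hypothetical name. The point it illustrates is the early return: an interval of 0 yields the original range untouched.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Range = Tuple[datetime, datetime]


def split_by_interval(start: datetime, end: datetime, interval: timedelta) -> List[Range]:
    """Split [start, end) into sub-ranges whose boundaries fall on multiples of interval.

    A zero interval disables splitting: the original range is returned unchanged,
    mirroring the `interval == 0` early return described above.
    """
    if interval == timedelta(0):
        return [(start, end)]
    epoch = datetime(1970, 1, 1, tzinfo=start.tzinfo)
    ranges: List[Range] = []
    cur = start
    while cur < end:
        # Next boundary aligned to a whole multiple of `interval` since the epoch.
        next_boundary = epoch + ((cur - epoch) // interval + 1) * interval
        sub_end = min(next_boundary, end)
        ranges.append((cur, sub_end))
        cur = sub_end
    return ranges


if __name__ == "__main__":
    start = datetime(2026, 5, 28, 0, 10, tzinfo=timezone.utc)
    end = datetime(2026, 5, 28, 1, 20, tzinfo=timezone.utc)
    print(split_by_interval(start, end, timedelta(0)))  # one range, unsplit
    for lo, hi in split_by_interval(start, end, timedelta(minutes=30)):
        print(lo.time(), "->", hi.time())
```

With the 30m default introduced in v2.5.0, a 70-minute query becomes three sub-queries (00:10 to 00:30, 00:30 to 01:00, 01:00 to 01:20); setting the value to 0 in limits_config keeps it as one.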

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Find Loki image tags in repo (templates) =="
rg -n --no-heading "loki.*image:" . || true
echo

echo "== Show tools/loki/local-config.yaml with line numbers (1-120) =="
wc -l tools/loki/local-config.yaml || true
cat -n tools/loki/local-config.yaml | sed -n '1,140p'
echo

echo "== Show exact blocks around match_max_concurrent (line ~9) =="
sed -n '1,25p' tools/loki/local-config.yaml
echo

echo "== Show exact blocks around split_queries_by_interval (lines ~45-110) =="
sed -n '40,110p' tools/loki/local-config.yaml

Repository: ansible/awx

Length of output: 6453

Correct the Loki “unsupported” compatibility notes in tools/loki/local-config.yaml

Line 9: The note “match_max_concurrent … not supported in newer Loki versions” is misleading; frontend_worker.match_max_concurrent is still a supported Loki setting.

Line 84: split_queries_by_interval: 0 is a valid way to disable query splitting, but the setting is expected under limits_config (not query_range); the comment should reflect the placement incompatibility rather than claiming 0 is unsupported.
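The placement problem in this finding can be checked mechanically. A minimal sketch, assuming the Loki config has already been parsed into a Python dict (for example by a YAML loader); `misplaced_limits_keys` and `relocate_limits_keys` are hypothetical helpers, not part of Loki or AWX, and `LIMITS_ONLY_KEYS` lists only the key discussed here. The sample dict is a hypothetical excerpt, not the actual contents of tools/loki/local-config.yaml.

```python
import copy
from typing import Any, Dict, List

# Keys that, per the v2.5.0 note above, belong under limits_config.
# Only the key discussed in this review is listed; extend as needed.
LIMITS_ONLY_KEYS = {"split_queries_by_interval"}


def misplaced_limits_keys(config: Dict[str, Any]) -> List[str]:
    """Return keys under query_range that Loki expects under limits_config."""
    query_range = config.get("query_range") or {}
    return sorted(key for key in query_range if key in LIMITS_ONLY_KEYS)


def relocate_limits_keys(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with misplaced keys moved into limits_config.

    Values are carried over unchanged, so a value of 0 (disable splitting)
    stays 0. If limits_config already sets the key, that value is kept.
    """
    fixed = copy.deepcopy(config)
    moved = misplaced_limits_keys(fixed)
    if not moved:
        return fixed
    limits = fixed.get("limits_config") or {}
    for key in moved:
        value = fixed["query_range"].pop(key)
        limits.setdefault(key, value)
    fixed["limits_config"] = limits
    return fixed


if __name__ == "__main__":
    # Hypothetical excerpt, not the real tools/loki/local-config.yaml.
    loaded = {
        "query_range": {"split_queries_by_interval": 0, "align_queries_with_step": True},
        "limits_config": {"reject_old_samples": True},
    }
    print(misplaced_limits_keys(loaded))
    print(relocate_limits_keys(loaded)["limits_config"])
```

Keeping the value as-is is deliberate: the finding is that 0 is valid but misplaced, so the fix is a move, not a new value.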

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loki/local-config.yaml` at line 9, update the misleading comments in the Loki config. Change the note about match_max_concurrent to state that frontend_worker.match_max_concurrent is still supported (remove "not supported in newer Loki versions"), so the comment next to frontend_worker.match_max_concurrent reflects support. Then move or update the comment for split_queries_by_interval to indicate that the setting can be set to 0 to disable query splitting but belongs under limits_config (not under query_range), so that the comment near split_queries_by_interval references the placement incompatibility rather than claiming "0" is unsupported.

coderabbitai · 2026-05-28T18:57:21Z

+  otlphttp/loki:
+    endpoint: http://loki:3100/otlp
    tls:
      insecure: true


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Security: TLS verification disabled (acceptable for development only).

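Both the Loki and Tempo exporters set tls.insecure: true, which disables transport security (TLS) on those connections; skipping only certificate verification is the separate insecure_skip_verify option. This is acceptable for local development but must not be used in production environments.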

Also applies to: 31-31

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/otel/otel-collector-config.yaml` at line 26, the Loki and Tempo exporter blocks currently set insecure: true, which disables TLS for those exporters. Update the OTel collector config by changing insecure: true to insecure: false for production, and make this value configurable (e.g., via an environment variable or config templating) so the Loki and Tempo exporter sections can use a boolean flag (referencing the loki and tempo exporter blocks and the insecure field) to switch between development (true) and production (false) deployments. Ensure docs/helm/manifest overrides reflect the new env/config var.
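One way to make the flag configurable, sketched in Python for illustration: default to TLS on and relax it only when an environment variable says so. The variable name `AWX_OTEL_TLS_INSECURE` and the helpers `env_flag`, `otlphttp_exporter` and `build_exporters` are hypothetical; nothing in the PR defines them. In the repo this decision would more likely live in the docker-compose/ansible template, and the collector can also expand `${env:NAME}` references in its own YAML, which may be the simpler route. Only the Loki endpoint from the diff above is used; the Tempo exporter would follow the same pattern.

```python
import os
from typing import Any, Dict

# Hypothetical variable name; nothing in the PR defines it.
INSECURE_ENV = "AWX_OTEL_TLS_INSECURE"
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean from the environment; an unset variable yields `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def otlphttp_exporter(endpoint: str) -> Dict[str, Any]:
    """Build one otlphttp exporter block with TLS on unless explicitly relaxed."""
    return {"endpoint": endpoint, "tls": {"insecure": env_flag(INSECURE_ENV, default=False)}}


def build_exporters(endpoints: Dict[str, str]) -> Dict[str, Any]:
    """Map exporter names (e.g. "otlphttp/loki") to their config blocks."""
    return {name: otlphttp_exporter(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    # Development compose would export AWX_OTEL_TLS_INSECURE=1; production leaves it unset.
    print(build_exporters({"otlphttp/loki": "http://loki:3100/otlp"}))
```

Defaulting to false means a forgotten variable fails safe. Note that flipping the flag alone does not encrypt anything: production endpoints would also need https:// URLs instead of the development http://loki:3100/otlp.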

When GRAFANA=1 but PROMETHEUS=LOKI=TEMPO=0, the template generated:

    grafana:
      depends_on:

An empty depends_on is invalid, so add a conditional wrapper around the depends_on block.

Co-authored-by: Claude (Anthropic) <claude@anthropic.com>
Signed-off-by: Chris Meyers <chris.meyers.fsu@gmail.com>
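The fix itself lives in the docker-compose template; this Python sketch only mirrors the decision the conditional wrapper encodes. It assumes the compose service names are prometheus, loki and tempo, matching the env flags, and omits every other grafana service key; `grafana_service` is a hypothetical name.

```python
from typing import Any, Dict, List


def grafana_service(prometheus: bool, loki: bool, tempo: bool) -> Dict[str, Any]:
    """Return the grafana compose service body for the given backend flags.

    depends_on is emitted only when at least one backend is enabled, which is
    the condition the template fix adds; an empty depends_on key is invalid.
    """
    service: Dict[str, Any] = {}
    enabled = {"prometheus": prometheus, "loki": loki, "tempo": tempo}
    deps: List[str] = [name for name, on in enabled.items() if on]
    if deps:
        service["depends_on"] = deps
    return service


if __name__ == "__main__":
    print(grafana_service(prometheus=False, loki=False, tempo=False))  # {}
    print(grafana_service(prometheus=False, loki=True, tempo=True))  # {'depends_on': ['loki', 'tempo']}
```

The key design point is omitting the key entirely rather than emitting it with an empty value, which is exactly the GRAFANA=1, PROMETHEUS=LOKI=TEMPO=0 case from the commit message.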

github-actions Bot added the component:api and dependencies labels May 28, 2026

chrismeyersfsu marked this pull request as draft May 28, 2026 18:57

coderabbitai Bot reviewed May 28, 2026


chrismeyersfsu force-pushed the telemetry branch from 32ec97b to 7f82fe4 Compare May 28, 2026 19:12


Add OpenTelemetry observability to AWX#16462

chrismeyersfsu wants to merge 2 commits into ansible:devel from chrismeyersfsu:telemetry

chrismeyersfsu commented May 28, 2026

coderabbitai Bot commented May 28, 2026

Review skipped

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

coderabbitai Bot left a comment

coderabbitai Bot May 28, 2026


		span.set_attribute("task.error_type", type(e).__name__)
		span.set_attribute("task.error_message", str(e))
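The two set_attribute lines above record the failing exception's type and message on the task span. A minimal sketch of how that handling might be factored, assuming only that the span exposes set_attribute and record_exception (both exist on opentelemetry.trace.Span). `record_task_error`, `run_task` and `RecordingSpan` are hypothetical names; `RecordingSpan` merely stands in for a real span so the example runs without the SDK.

```python
from typing import Any, Callable, Dict, List, Protocol


class SpanLike(Protocol):
    """The two OpenTelemetry Span methods this helper relies on."""

    def set_attribute(self, key: str, value: Any) -> None: ...

    def record_exception(self, exception: BaseException) -> None: ...


def record_task_error(span: SpanLike, exc: BaseException) -> None:
    """Attach the failing exception to the task span."""
    span.record_exception(exc)  # on a real Span this adds an exception event
    span.set_attribute("task.error_type", type(exc).__name__)
    span.set_attribute("task.error_message", str(exc))


def run_task(span: SpanLike, func: Callable[..., Any], *args: Any) -> Any:
    """Run func, annotating the span on failure and re-raising for the caller."""
    try:
        return func(*args)
    except Exception as exc:
        record_task_error(span, exc)
        raise  # the dispatcher still needs to see the failure


class RecordingSpan:
    """In-memory stand-in so the sketch runs without the OpenTelemetry SDK."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}
        self.exceptions: List[BaseException] = []

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value

    def record_exception(self, exception: BaseException) -> None:
        self.exceptions.append(exception)
```

One caveat if the real code opens its span with tracer.start_as_current_span: that context manager defaults to record_exception=True and set_status_on_exception=True, so an exception escaping the with block is already recorded, and an explicit record_exception call may add a second exception event. The task.error_* attributes are still useful as flat, searchable fields.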
