Add OpenTelemetry observability to AWX #16462
Instrument AWX components with OpenTelemetry tracing:
Code Changes:
- Add telemetry to dispatcher task execution (dispatcherd.py, task.py)
  - Span name = task function name (e.g., 'apply_cluster_membership_policies')
  - Attributes: task.uuid, task.name, task.module, correlation_id
  - Handle lambda broker tasks with normalized span names
  - Capture exceptions with span.record_exception()
- Add observability to management commands
  - dispatcherd, run_cache_clear, run_callback_receiver
  - run_rsyslog_configurer, run_ws_heartbeat, run_wsrelay
  - Service names: aap-controller-{component}
- Add observability to ASGI/WSGI entry points
  - Service name: aap-controller-uwsgi
Infrastructure:
- Add TEMPO env var to Makefile
- Add enable_otel/enable_loki/enable_tempo flags to ansible
- Add Tempo container to docker-compose
  - Service mesh network for OTEL→Loki/Tempo communication
  - Volume mounts for configs and data storage
- Configure OTEL collector
  - Export logs→Loki, traces→Tempo, metrics→Prometheus
  - File exporter with compression for backup
- Configure Loki
  - Enable structured metadata (required for OTLP)
  - Increase rate limits for high-volume logging
  - 3y retention period
- Add Grafana Tempo datasource
  - Traces-to-logs integration
  - Map trace attributes to Loki labels
  - Enable trace/span ID filtering
Dependencies:
- Upgrade ansible-base for observability utilities
- Add opentelemetry-* packages for tracing
Usage: OTEL=1 LOKI=1 TEMPO=1 GRAFANA=1 make docker-compose
Signed-off-by: Chris Meyers <chris.meyers.fsu@gmail.com>
Co-authored-by: Claude (Anthropic) <claude@anthropic.com>
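A minimal sketch of how the span name and attributes listed in the commit message above might be derived from a task callable. The helper names `span_name_for` and `span_attributes` and the `lambda_task` placeholder are hypothetical; the normalization this PR actually applies to lambda broker tasks is not shown in this conversation.

```python
import functools


def _unwrap(task):
    """Strip functools.partial wrappers to reach the underlying callable."""
    while isinstance(task, functools.partial):
        task = task.func
    return task


def span_name_for(task) -> str:
    """Use the task function name as the span name (hypothetical helper)."""
    func = _unwrap(task)
    name = getattr(func, "__name__", type(func).__name__)
    # Lambdas all share the name "<lambda>", so map them to one stable
    # placeholder. "lambda_task" is an assumed value, not the PR's.
    return "lambda_task" if name == "<lambda>" else name


def span_attributes(task, task_uuid, correlation_id=None) -> dict:
    """Build the attribute set named in the commit message."""
    func = _unwrap(task)
    attrs = {
        "task.uuid": str(task_uuid),
        "task.name": span_name_for(func),
        "task.module": getattr(func, "__module__", "unknown"),
    }
    if correlation_id is not None:
        attrs["correlation_id"] = str(correlation_id)
    return attrs
```

Keeping name derivation in one place means every lambda maps to the same span name, which avoids an unbounded set of span names in the tracing backend.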
📝 Walkthrough
This PR integrates OpenTelemetry distributed tracing into AWX. It upgrades OpenTelemetry dependencies to 1.39.0+, adds observability initialization to all application entry points (ASGI, WSGI, management commands), instruments task dispatch workers and task execution with span wrapping and error tracking, and provisions development-time observability services (Grafana, Loki, Tempo, OTel Collector) via Docker Compose with updated configurations.
Changes
OpenTelemetry Observability Integration
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 20606100-fafc-48d1-a189-23cc0ecca31c
📒 Files selected for processing (19)
- Makefile
- awx/asgi.py
- awx/main/dispatch/worker/dispatcherd.py
- awx/main/dispatch/worker/task.py
- awx/main/management/commands/dispatcherd.py
- awx/main/management/commands/run_cache_clear.py
- awx/main/management/commands/run_callback_receiver.py
- awx/main/management/commands/run_rsyslog_configurer.py
- awx/main/management/commands/run_ws_heartbeat.py
- awx/main/management/commands/run_wsrelay.py
- awx/wsgi.py
- requirements/requirements.in
- requirements/requirements.txt
- requirements/requirements_git.txt
- tools/docker-compose/ansible/roles/sources/defaults/main.yml
- tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
- tools/grafana/datasources/tempo_source.yml
- tools/loki/local-config.yaml
- tools/otel/otel-collector-config.yaml
span.set_attribute("task.error_type", type(e).__name__)
span.set_attribute("task.error_message", str(e))
Avoid exporting raw exception messages to tracing attributes.
task.error_message = str(e) can leak sensitive values into traces. Keep error type and rely on record_exception(e) (or sanitize/redact before attaching text).
Suggested fix
- span.set_attribute("task.error_type", type(e).__name__)
- span.set_attribute("task.error_message", str(e))
+ span.set_attribute("task.error_type", type(e).__name__)
  span.record_exception(e)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@awx/main/dispatch/worker/dispatcherd.py` around lines 79 - 80, The trace
currently attaches the raw exception string via
span.set_attribute("task.error_message", str(e)) which may leak secrets; update
the error handling in dispatcherd (where span.set_attribute("task.error_type",
type(e).__name__) and span.set_attribute("task.error_message", str(e)) are set)
to stop exporting the raw message: keep or set only the error type attribute,
and replace the raw text attachment with span.record_exception(e) (or, if you
must include text, apply a sanitization/redaction function before setting an
attribute) so traces contain error metadata without leaking sensitive content.
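If a message attribute is still wanted, the sanitization option mentioned in this comment could look roughly like the sketch below. The function name `sanitize_exception` follows the review's own wording, but the patterns, the `[REDACTED]` marker, and the length cap are assumptions; a real deployment would need a pattern list matched to the secrets AWX actually handles.

```python
import re

# Assumed patterns: "key=value" style secrets and credentials embedded in URLs.
_KEY_VALUE = re.compile(
    r"\b(password|passwd|token|secret|api[_-]?key)(\s*[=:]\s*)\S+",
    re.IGNORECASE,
)
_URL_CREDENTIALS = re.compile(r"(://[^/\s:@]+:)[^@\s]+(@)")


def sanitize_exception(exc: BaseException, max_len: int = 200) -> str:
    """Return a redacted, length-limited version of str(exc)."""
    text = str(exc)
    # Keep the key name and separator, drop the value.
    text = _KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    # Keep the user name in "scheme://user:pass@host", drop the password.
    text = _URL_CREDENTIALS.sub(r"\1[REDACTED]\2", text)
    # Truncate so a large payload cannot end up in a span attribute.
    return text[:max_len]
```

Pattern-based redaction is best effort only, so dropping the raw message attribute entirely, as the suggested fix does, remains the safer default.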
span.set_attribute("task.error_type", type(e).__name__)
span.set_attribute("task.error_message", str(e))
Do not attach raw exception text to span attributes.
Publishing str(e) to task.error_message risks secret/PII exposure in telemetry backends. Keep structured type info and recorded exception, or apply strict redaction.
Suggested fix
span.set_status(Status(StatusCode.ERROR, str(e)))
span.set_attribute("task.error_type", type(e).__name__)
- span.set_attribute("task.error_message", str(e))
  span.record_exception(e)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@awx/main/dispatch/worker/task.py` around lines 87 - 88, The code currently
attaches raw exception text via span.set_attribute("task.error_message", str(e))
which can leak secrets/PII; instead remove the raw message attribute and rely on
span.record_exception(e) to record the exception and keep
span.set_attribute("task.error_type", type(e).__name__) for structured type
info, or if a message attribute is required, set a strictly redacted/sanitized
message (e.g., "redacted" or a small safe summary extracted by a
sanitize_exception function) before calling span.set_attribute; update the
error-handling block around span.set_attribute and span.record_exception to
implement this change (referencing span, task.error_message, task.error_type,
record_exception, and the exception variable e).
{% if enable_grafana|bool %}
grafana:
image: mirror.gcr.io/grafana/grafana-enterprise:12.3.4
image: docker.io/grafana/grafana:latest
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== Check file exists =="
ls -la "$FILE"
echo
echo "== Show lines around 160-180 =="
nl -ba "$FILE" | sed -n '150,190p'
echo
echo "== Show lines mentioned by the review comment (243, 268, 283) =="
for L in 243 268 283; do
  echo "--- line $L ---"
  nl -ba "$FILE" | sed -n "${L},${L}p"
done
echo
echo "== Find all occurrences of ':latest' in the template =="
rg -n ":[Ll]atest\b" "$FILE" || true
echo
echo "== Find all occurrences of grafana/loki/tempo/otel in the template =="
rg -n "(grafana|loki|tempo|otel|opentelemetry|collector)" "$FILE" || true
Repository: ansible/awx
Length of output: 299
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== Show lines around 160-180 =="
awk -v start=150 -v end=190 'NR>=start && NR<=end {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,120p'
echo
echo "== Show lines mentioned by the review comment (243, 268, 283) =="
for L in 243 268 283; do
  awk -v n="$L" 'NR==n {printf "%d:%s\n", NR, $0}' "$FILE"
done
echo
echo "== Find all occurrences of ':latest' in the template =="
rg -n ":[Ll]atest\b" "$FILE" || true
echo
echo "== Find all occurrences of grafana/loki/tempo/otel in the template =="
rg -n "(grafana|loki|tempo|otel|opentelemetry|collector)" "$FILE" || true
Repository: ansible/awx
Length of output: 3833
Pin observability container image tags instead of using :latest
tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 uses :latest for the observability stack, making local/dev deployments non-reproducible:
- Grafana: docker.io/grafana/grafana:latest (line 169)
- OpenTelemetry Collector: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest (line 243)
- Loki: docker.io/grafana/loki:latest (line 268)
- Tempo: docker.io/grafana/tempo:latest (line 283)
Pin these to specific version tags (or digests) for stable, predictable behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2`
at line 169, Replace the four observability images that currently use the
:latest tag with fixed, versioned tags or digests to make deployments
reproducible: update docker.io/grafana/grafana:latest,
ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest,
docker.io/grafana/loki:latest, and docker.io/grafana/tempo:latest to specific
semantic-version tags (or content digests) and commit those values into the
docker-compose.yml.j2 template so Grafana, OpenTelemetry Collector, Loki, and
Tempo use pinned versions.
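The `rg ":latest"` check used in the analysis above can also be written as a small script that flags untagged images as well. This is a sketch only: `find_unpinned_images` is a hypothetical helper and the simple `image:` line match is an assumption about how the template is laid out. Jinja-templated image values that contain spaces are deliberately skipped rather than guessed at.

```python
import re

# Matches "image: <reference>" where the reference has no whitespace.
_IMAGE_LINE = re.compile(r"^\s*image:\s*(\S+)\s*$")


def find_unpinned_images(text: str) -> list[tuple[int, str]]:
    """Return (line number, image) pairs that use :latest or no tag at all."""
    unpinned = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = _IMAGE_LINE.match(line)
        if not match:
            continue
        image = match.group(1).strip("'\"")
        if "@sha256:" in image:
            continue  # pinned by content digest
        # Only the last path component can carry a tag; earlier colons
        # belong to a registry host:port.
        last_component = image.rsplit("/", 1)[-1]
        tag = last_component.split(":", 1)[1] if ":" in last_component else None
        if tag is None or tag == "latest":
            unpinned.append((lineno, image))
    return unpinned
```

Run against the rendered compose file, a non-empty result could be used to fail a lint step before the stack is started.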
{% if enable_prometheus|bool %}
- prometheus
{% endif %}
{% if enable_loki|bool %}
- loki
{% endif %}
{% if enable_tempo|bool %}
- tempo
{% endif %}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== File exists =="
test -f "$FILE" && echo "Found: $FILE"
echo
echo "== Inspect depends_on block around requested lines =="
sed -n '150,230p' "$FILE" | nl -ba
echo
echo "== Search for :latest in this template =="
rg -n --hidden --no-ignore -S ":latest" "$FILE" || true
echo
echo "== Search for grafana service section to see exact depends_on rendering =="
rg -n "^\s*grafana\s*:" -S "$FILE" || true
rg -n "depends_on" "$FILE" | head -n 50
echo
echo "== Show full depends_on block(s) for context =="
# Print chunks containing depends_on and a bit after/before
rg -n "depends_on" "$FILE" | while IFS=: read -r ln _; do
  start=$((ln-20)); end=$((ln+40));
  sed -n "${start},${end}p" "$FILE" | nl -ba | sed "s/^/$FILE (around depends_on at line $ln): /" | head -n 120
done
Repository: ansible/awx
Length of output: 272
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2"
echo "== Inspect depends_on block around lines 150-230 (numbered via cat -n) =="
sed -n '150,230p' "$FILE" | cat -n
echo
echo "== Search for :latest in this template =="
rg -n --hidden --no-ignore -S ":latest" "$FILE" || true
echo
echo "== Show grafana service header (if present) =="
rg -n "grafana\s*:" -S "$FILE" || true
echo
echo "== Show depends_on occurrences =="
rg -n "depends_on" "$FILE" || true
echo
echo "== Print depends_on blocks with a bit of surrounding context =="
# For each depends_on line, print ~20 lines around it (bounded)
rg -n "depends_on" "$FILE" | head -n 50 | while IFS=: read -r ln _; do
  start=$((ln-20)); end=$((ln+60));
  (( start<1 )) && start=1
  echo "---- $FILE around depends_on at line $ln ----"
  sed -n "${start},${end}p" "$FILE" | cat -n
done
Repository: ansible/awx
Length of output: 17841
Guard Grafana depends_on and pin observability images
- Grafana currently renders `depends_on:` unconditionally; if prometheus/loki/tempo are all disabled, the key renders with no entries (effectively `depends_on: null`), which can break Compose parsing. Wrap `depends_on` itself with the same condition as the list items.
Suggested fix
- depends_on:
-{% if enable_prometheus|bool %}
+{% if enable_prometheus|bool or enable_loki|bool or enable_tempo|bool %}
+ depends_on:
+{% if enable_prometheus|bool %}
- prometheus
{% endif %}
{% if enable_loki|bool %}
- loki
{% endif %}
{% if enable_tempo|bool %}
- tempo
{% endif %}
+{% endif %}
- Replace `:latest` tags for `grafana`, `opentelemetry-collector-contrib`, `loki`, and `tempo` with pinned versions (or digests) to restore deterministic builds and tighter supply-chain control.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2`
around lines 184 - 192, The Grafana service currently renders a depends_on key
even when no backends are enabled; update the docker-compose Jinja template to
conditionally render the entire depends_on block only when at least one of
enable_prometheus, enable_loki, or enable_tempo is true (i.e., wrap the
depends_on: and its list items with a single if that checks those flags) and
ensure the list still emits prometheus/loki/tempo entries using the existing
checks (enable_prometheus, enable_loki, enable_tempo). Additionally, replace any
image references using :latest for the grafana, opentelemetry-collector-contrib,
loki, and tempo images with pinned version tags or digests so those symbols
(grafana image, opentelemetry-collector-contrib image, loki image, tempo image)
use deterministic, non-:latest identifiers.
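The guard described above comes down to one rule: emit `depends_on` only when the dependency list is non-empty. The sketch below shows that rule in plain Python rather than Jinja so it can be checked in isolation; `grafana_service` is a hypothetical function and the returned dict stands in for the rendered YAML mapping.

```python
def grafana_service(enable_prometheus=False, enable_loki=False, enable_tempo=False) -> dict:
    """Build the Grafana service mapping, omitting an empty depends_on."""
    service = {}
    # Same per-backend checks as the template's inner {% if %} blocks.
    backends = [
        name
        for name, enabled in (
            ("prometheus", enable_prometheus),
            ("loki", enable_loki),
            ("tempo", enable_tempo),
        )
        if enabled
    ]
    # Equivalent of the outer {% if ... or ... or ... %} wrapper: the key
    # is absent, not null, when no backend is enabled.
    if backends:
        service["depends_on"] = backends
    return service
```

With all three flags off the key is simply missing, which is the shape Compose accepts.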
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: 'P8E80F9AEF21F6940'
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check Loki datasource provisioning configuration for UID
# Find Loki datasource provisioning files
fd -e yml -e yaml . tools/grafana/datasources/ -x cat {}
# Look for Loki datasource UID in provisioning configs
rg -n "uid.*P8E80F9AEF21F6940" tools/grafana/
Repository: ansible/awx
Length of output: 1633
Set an explicit Loki datasource uid to match Tempo’s tracesToLogs.datasourceUid.
tools/grafana/datasources/tempo_source.yml uses jsonData.tracesToLogs.datasourceUid: 'P8E80F9AEF21F6940' (and tools/grafana/dashboards/services_dashboard.json also references that UID), but the Loki datasource provisioning under tools/grafana/datasources/ does not define any uid: field—so the referenced UID is unlikely to exist and trace-to-logs will break. Add uid: 'P8E80F9AEF21F6940' to the Loki datasource provisioning so it matches Tempo (and the dashboard).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/grafana/datasources/tempo_source.yml` at line 12, The Tempo datasource
references jsonData.tracesToLogs.datasourceUid: 'P8E80F9AEF21F6940' (and
dashboards also reference that UID) but the Loki datasource provisioning lacks
an explicit uid, so trace-to-logs will fail; open the Loki datasource
provisioning file where type: loki is defined and add uid: 'P8E80F9AEF21F6940'
at the top level of that datasource entry so the Loki datasource uid matches
Tempo’s tracesToLogs.datasourceUid and the dashboard references.
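The mismatch described in this comment can be caught mechanically. The sketch below assumes the provisioning files have already been parsed into a list of datasource dicts by a YAML loader of your choice; `unresolved_trace_to_logs_uids` is a hypothetical helper, not part of Grafana or AWX.

```python
def unresolved_trace_to_logs_uids(datasources: list[dict]) -> list[str]:
    """Return tracesToLogs.datasourceUid values that no datasource defines."""
    defined = {ds["uid"] for ds in datasources if ds.get("uid")}
    missing = []
    for ds in datasources:
        # jsonData or tracesToLogs may be absent or null in the YAML.
        json_data = ds.get("jsonData") or {}
        traces_to_logs = json_data.get("tracesToLogs") or {}
        ref = traces_to_logs.get("datasourceUid")
        if ref is not None and ref not in defined:
            missing.append(ref)
    return missing
```

An empty result means every trace-to-logs link points at a datasource with an explicitly provisioned `uid`.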
frontend_worker:
match_max_concurrent: true
# match_max_concurrent: true # Removed - not supported in newer Loki versions
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check Loki version and validate unsupported config settings
# Extract Loki image version from docker-compose
rg -n "loki.*image:" tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
# Check Loki documentation for these settings
echo "Verify the following in Loki documentation for the version found above:"
echo "1. frontend_worker.match_max_concurrent support"
echo "2. query_range.split_queries_by_interval: 0 support"
Repository: ansible/awx
Length of output: 227
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find Loki image tags in docker-compose template =="
rg -n --no-heading "loki.*image:" tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 || true
echo
echo "== Show commented settings in tools/loki/local-config.yaml =="
if [ -f tools/loki/local-config.yaml ]; then
  rg -n --no-heading "match_max_concurrent|split_queries_by_interval" tools/loki/local-config.yaml || true
  echo
  echo "== Show surrounding lines for context (lines 1-120) =="
  nl -ba tools/loki/local-config.yaml | sed -n '1,120p'
else
  echo "Missing tools/loki/local-config.yaml"
fi
Repository: ansible/awx
Length of output: 494
🌐 Web query:
Loki configuration frontend_worker.match_max_concurrent removed unsupported version
💡 Result:
The configuration setting frontend_worker.match_max_concurrent was not removed from Loki [1][2]; rather, it is a functional setting that superseded the older frontend_worker.parallelism configuration [1][2]. If you are encountering an error indicating an unsupported version or a failure to parse the configuration, it is likely due to one of the following reasons:
1. Deprecated or Removed Configuration: While match_max_concurrent remains, other configuration parameters in the frontend_worker or querier blocks may have been removed or moved in newer versions of Loki (e.g., v3.0+) [2][3][4]. Check your Loki logs for specific "field not found" errors, which will identify exactly which key is no longer supported [3][4].
2. Configuration Syntax/Structure Changes: Loki often requires configuration fields to be moved to different parent blocks (e.g., moving settings under limits_config or changing how storage_config is defined) during major version upgrades [5][3].
3. Version Compatibility: If you are upgrading across major versions (e.g., v2.x to v3.x), review the official upgrade documentation [2]. Settings like querier.worker-parallelism have been removed entirely, and their functionality is now handled by querier.max-concurrent [2].
Recommended Action:
- Examine the exact error message in your Loki logs to identify the specific unsupported field [3][4].
- Consult the official Grafana Loki upgrade guide for your specific version jump to see which keys were deprecated or relocated [2].
- Ensure your configuration file structure matches the requirements for your currently installed Loki version, as automated installs or default configs can sometimes lag behind the latest binary changes [3][4].
Citations:
- 1: operator: Replace frontend_worker parallelism with match_max_concurrent grafana/loki#5693
- 2: https://grafana.com/docs/loki/latest/setup/upgrade/
- 3: Broken configuration file when installing via apt on Ubuntu 22.04 grafana/loki#17736
- 4: [Bug] Loki service fails to start after installation on aarch64 grafana/loki#15421
- 5: Unrecognized storage client after schema upgrade grafana/loki#13141
🌐 Web query:
Loki configuration query_range.split_queries_by_interval support value 0
💡 Result:
Loki’s split_queries_by_interval supports the value 0 as a valid “disable splitting” setting. When the effective interval is 0, Loki’s query-range splitting middleware skips splitting and forwards the request unchanged (interval == 0 => return next.Do(ctx, r)). [1] In older documentation/comments for this setting, “0 disables it” is stated alongside guidance to use a positive duration (e.g., in multiples of 24h) to avoid downloading/processing the same chunks across splits. [2] Note: as of Loki v2.5.0, split_queries_by_interval can only be configured under limits_config (not query_range), and the default became 30m rather than 0; Loki fails to start if split_queries_by_interval is left under query_range. [3]
Citations:
- 1: https://github.com/grafana/loki/blob/06ce74b7/pkg/querier/queryrange/split_by_interval.go
- 2: loki failing when querying huge data grafana/loki#7084
- 3: https://grafana.com/docs/loki/latest/release-notes/v2-5/
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find Loki image tags in repo (templates) =="
rg -n --no-heading "loki.*image:" . || true
echo
echo "== Show tools/loki/local-config.yaml with line numbers (1-120) =="
wc -l tools/loki/local-config.yaml || true
cat -n tools/loki/local-config.yaml | sed -n '1,140p'
echo
echo "== Show exact blocks around match_max_concurrent (line ~9) =="
sed -n '1,25p' tools/loki/local-config.yaml
echo
echo "== Show exact blocks around split_queries_by_interval (lines ~45-110) =="
sed -n '40,110p' tools/loki/local-config.yaml
Repository: ansible/awx
Length of output: 6453
Correct the Loki “unsupported” compatibility notes in tools/loki/local-config.yaml
- Line 9: The note “match_max_concurrent … not supported in newer Loki versions” is misleading: `frontend_worker.match_max_concurrent` is still a supported Loki setting.
- Line 84: `split_queries_by_interval: 0` is a valid way to disable query splitting, but the setting is expected under `limits_config` (not `query_range`); the comment should reflect the placement incompatibility rather than claiming `0` is unsupported.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loki/local-config.yaml` at line 9, Update the misleading comments in
the Loki config: change the note about match_max_concurrent to state that
frontend_worker.match_max_concurrent is still supported (remove "not supported
in newer Loki versions") so the comment next to
frontend_worker.match_max_concurrent reflects support; and move or update the
comment for split_queries_by_interval to indicate that setting can be set to 0
to disable query splitting but it belongs under limits_config (not under
query_range), so adjust the comment near split_queries_by_interval to reference
placement incompatibility rather than claiming "0" is unsupported.
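The placement rule cited in this comment (the setting belongs under `limits_config`, not `query_range`) can be expressed as a small check over an already-parsed Loki config. This is a sketch under that assumption; `loki_config_problems` is a hypothetical helper and covers only this one rule.

```python
def loki_config_problems(config: dict) -> list[str]:
    """Report a misplaced split_queries_by_interval in a parsed Loki config."""
    problems = []
    # A block may be present but null in the YAML, hence the "or {}".
    query_range = config.get("query_range") or {}
    if "split_queries_by_interval" in query_range:
        problems.append(
            "split_queries_by_interval is set under query_range; "
            "move it to limits_config (0 there disables splitting)"
        )
    return problems
```

A value of `0` under `limits_config` produces no finding, which matches the review's point that the value itself is valid.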
otlphttp/loki:
endpoint: http://loki:3100/otlp
tls:
insecure: true
Security: TLS verification disabled (acceptable for development only).
Both Loki and Tempo exporters have insecure: true, which disables TLS certificate verification. This is acceptable for local development but must not be used in production environments.
Also applies to: 31-31
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/otel/otel-collector-config.yaml` at line 26, The Loki and Tempo
exporter blocks currently set insecure: true which disables TLS verification;
update the otel collector config by changing insecure: true to insecure: false
for production and make this value configurable (e.g., via an environment
variable or config templating) so the loki and tempo exporter sections can use a
boolean flag (referencing the loki and tempo exporter blocks and the insecure
field) to switch between development (true) and production (false) deployments;
ensure docs/helm/manifest overrides reflect the new env/config var.
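One way to make the `insecure` flag configurable, as this prompt asks, is to derive it from an environment variable when the collector config is templated. The sketch below is illustrative only: the variable name `OTEL_EXPORTER_TLS_INSECURE` and both helper functions are hypothetical, and the default is deliberately the secure value.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def tls_insecure_from_env(environ=None, var="OTEL_EXPORTER_TLS_INSECURE") -> bool:
    """Read the insecure flag from the environment; unset means secure."""
    environ = os.environ if environ is None else environ
    raw = environ.get(var)
    if raw is None:
        return False
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Refuse to guess: a typo should not silently disable TLS checks.
    raise ValueError(f"{var} must be a boolean-like value, got {raw!r}")


def exporter_config(endpoint: str, environ=None) -> dict:
    """Build one exporter block with the tls.insecure field filled in."""
    return {"endpoint": endpoint, "tls": {"insecure": tls_insecure_from_env(environ)}}
```

A development compose file would then set the variable to `true` explicitly, while any environment that omits it gets `insecure: false`.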
When GRAFANA=1 but PROMETHEUS=LOKI=TEMPO=0, template generated:
grafana:
depends_on:
Empty depends_on invalid. Add conditional wrapper around depends_on block.
Co-authored-by: Claude (Anthropic) <claude@anthropic.com>
Signed-off-by: Chris Meyers <chris.meyers.fsu@gmail.com>
SUMMARY
Adds comprehensive OpenTelemetry instrumentation to AWX for distributed tracing and observability.
Code Instrumentation:
Infrastructure:
Dependencies:
ISSUE TYPE
New or Enhanced Feature
COMPONENT NAME
STEPS TO REPRODUCE AND EXTRA INFO
Usage:
Verification:
curl -s "http://localhost:3200/api/search?tags=service.name=aap-controller-dispatcher" | jq
aap-controller-dispatcher
Before:
After:
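The Tempo search request from the verification step can also be issued from Python using only the standard library. This is a sketch: the helper names are hypothetical, the base URL is the one used in the curl command, and the assumption that the response JSON carries a `traces` list should be checked against the Tempo version in use.

```python
import json
import urllib.parse
import urllib.request


def tempo_search_url(base_url: str, service_name: str) -> str:
    """Build the /api/search URL filtered by service.name."""
    # urlencode percent-encodes the "=" inside the tags value.
    query = urllib.parse.urlencode({"tags": f"service.name={service_name}"})
    return f"{base_url.rstrip('/')}/api/search?{query}"


def search_traces(base_url: str, service_name: str, timeout: float = 5.0) -> list:
    """Fetch matching traces; returns an empty list if none are reported."""
    url = tempo_search_url(base_url, service_name)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Assumed response shape: {"traces": [...]}.
    return payload.get("traces", [])
```

For example, `search_traces("http://localhost:3200", "aap-controller-dispatcher")` should return a non-empty list once the dispatcher has exported at least one span.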