Troubleshooting Istio Ambient
- Install and setup issues
  - Scenario: Ztunnel is not capturing my traffic
  - Scenario: pod fails to run with Failed to create pod sandbox
  - Scenario: Ztunnel fails with failed to bind to address [::1]:15053: Cannot assign requested address
  - Scenario: Ztunnel fails with failed to bind to address [::1]:15053: Address family not supported
- Ztunnel Traffic Issues
  - Understanding logs
  - Scenario: Traffic timeout with Ztunnel
  - Scenario: Readiness probes fail with Ztunnel
  - Scenario: traffic fails with timed out waiting for workload from xds
  - Scenario: traffic fails with unknown source
  - Scenario: traffic fails with no healthy upstream
  - Scenario: traffic fails with http status: ...
  - Scenario: traffic fails with connection closed due to connection drain
  - Scenario: ztunnel logs HBONE ping timeout/error and ping timeout
  - Scenario: ztunnel is not sending egress traffic to waypoints
  - Scenario: traffic from sidecars/gateways fails with upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end
  - Scenario: traffic fails with Connection timed out
- Waypoint issues
- Common information
This document provides troubleshooting steps for common problems users have encountered when using ambient mode. Many of these failure scenarios are former bugs that have been fixed; where applicable, the section will indicate which version the issue is fixed in.
At the time of writing, this document covers Istio 1.23+. In the future, items impacting only older versions will be removed. It is strongly encouraged to run at least Istio 1.24, which is when ambient mesh went "GA". Regardless, reproducing the issue on the latest version is always a good first diagnostic step.
Before doing anything else, please make sure you read and follow:
- the latest Platform Requirements
- the latest Platform-Specific Prerequisites guide for your provider and CNI.
Failure to follow these guidelines will result in issues.
Scenario: Ztunnel is not capturing my traffic

Follow these steps to troubleshoot Ztunnel not capturing traffic.
First, check the pod for the ambient.istio.io/redirection annotation. This indicates whether istio-cni enabled redirection.
$ kubectl get pods shell-5b7cf9f6c4-npqgz -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ambient.istio.io/redirection: enabled

If the annotation is missing: the pod was not enrolled in the mesh.
- Check the logs of the istio-cni-node pod on the same node as the pod for errors. Errors during enablement may be blocking the pod from getting traffic from Ztunnel.
- Check the logs of the istio-cni-node pod on the same node to verify it has ambient enabled. The pod should log AmbientEnabled: true during startup. If this is false, ensure you properly installed Istio with --set profile=ambient.
- Check that the pod is actually configured to have ambient enabled (see the example commands after this list). The criteria are as follows:
  - The pod OR namespace must have the istio.io/dataplane-mode=ambient label set
  - The pod must not have the sidecar.istio.io/status annotation set (which is added automatically when a sidecar is injected)
  - The pod must not have istio.io/dataplane-mode=none set
  - The pod must not have spec.hostNetwork=true
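A quick way to check these criteria from the command line (the pod name shell-5b7cf9f6c4-npqgz and the default namespace are taken from the examples in this document; substitute your own):

# Check the dataplane-mode label on the namespace and on the pod
$ kubectl get namespace default -L istio.io/dataplane-mode
$ kubectl get pod shell-5b7cf9f6c4-npqgz -L istio.io/dataplane-mode
# Check for a sidecar injection annotation (no output means no sidecar)
$ kubectl get pod shell-5b7cf9f6c4-npqgz -o yaml | grep sidecar.istio.io/status
# Check for hostNetwork (no output, or false, is required)
$ kubectl get pod shell-5b7cf9f6c4-npqgz -o yaml | grep hostNetwork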
If the annotation is present: this means Istio claims it enabled redirection for the pod, but apparently it isn't working.
1. Check the iptables rules in the pod. Run a debug shell (see Common information at the end of this document) and run iptables-save. You should see something like below:
# iptables-save
# Generated by iptables-save v1.8.10 on Wed Sep 25 22:06:16 2024
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_PRERT - [0:0]
-A PREROUTING -j ISTIO_PRERT
-A OUTPUT -j ISTIO_OUTPUT
-A ISTIO_OUTPUT -d 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_OUTPUT -p tcp -m mark --mark 0x111/0xfff -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -p tcp -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15001
-A ISTIO_PRERT -s 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_PRERT ! -d 127.0.0.1/32 -p tcp -m tcp ! --dport 15008 -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15006
The exact contents may vary, but if there is anything relating to Istio here, it means iptables rules are installed.
2. Check if ztunnel is running within the pod network. This can be done with netstat -ntl. You should see listeners on a few Istio ports (15001, 15006, etc):
# netstat -ntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 127.0.0.1:15053 0.0.0.0:* LISTEN
tcp6 0 0 ::1:15053 :::* LISTEN
tcp6 0 0 :::15001 :::* LISTEN
tcp6 0 0 :::15006 :::* LISTEN
tcp6 0 0 :::15008 :::* LISTEN
3. Check the logs of Ztunnel. When sending traffic, you should see logs like info access connection complete .... Note that these are logged when connections are closed, not when they are opened, so you may not see logs for your application if it uses long-lived connections.
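To find and read the Ztunnel instance serving your pod (assuming the default installation, where Ztunnel runs as a DaemonSet in istio-system with the app=ztunnel label; the node name and pod names below are illustrative):

$ kubectl get pod shell-5b7cf9f6c4-npqgz -o jsonpath='{.spec.nodeName}'
node-1
$ kubectl -n istio-system get pods -l app=ztunnel -o wide | grep node-1
ztunnel-cqg6c   1/1   Running   0   1h   10.244.0.5   node-1
$ kubectl -n istio-system logs ztunnel-cqg6c | grep "access connection"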
Scenario: pod fails to run with Failed to create pod sandbox

For pods in the mesh, Istio will run a CNI plugin during the pod 'sandbox' creation. This configures the networking rules. This may intermittently fail, in which case Kubernetes will automatically retry.
This can fail for a few reasons:
- no ztunnel connection: this indicates that the CNI plugin is not connected to Ztunnel. Ensure Ztunnel is running on the same node and is healthy.
- failed to add IP ... to ipset istio-inpod-probes: exist: this indicates Istio attempted to add the workload's IP to an ipset where it already exists. This can be caused by a race condition in the Kubernetes IP allocation, in which case a retry can resolve the issue. On Istio 1.22.3 and older, there was a bug causing this to not recover; please upgrade if so. Other occurrences of this may be a bug.
Scenario: Ztunnel fails with failed to bind to address [::1]:15053: Cannot assign requested address

This is fixed in Istio 1.23.1+; please upgrade. See issue.
Scenario: Ztunnel fails with failed to bind to address [::1]:15053: Address family not supported

This indicates your kernel does not support IPv6.
IPv6 support can be turned off by setting IPV6_ENABLED=false on Ztunnel.
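One way to do this, assuming the default installation where Ztunnel runs as the ztunnel DaemonSet in istio-system (the same mechanism as the log-level example at the end of this document); alternatively, set the equivalent value through your install method so it persists across upgrades:

$ kubectl -n istio-system set env ds/ztunnel IPV6_ENABLED=false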
Ztunnel Traffic Issues

Understanding logs

When troubleshooting traffic issues, the first step should always be to analyze the access logs in Ztunnel. Note that there may be two Ztunnel pods involved in a request (the source and the destination), so it's useful to look at both sides.
Access logs are emitted by default on each connection completion. Connection opening logs are available at debug level (see how to set log level).
An example log looks like:
2024-09-25T22:08:30.213996Z info access connection complete src.addr=10.244.0.33:50676 src.workload="shell-5b7cf9f6c4-7hfkc" src.namespace="default" src.identity="spiffe://cluster.local/ns/default/sa/default" dst.addr=10.244.0.29:15008 dst.hbone_addr=10.96.99.218:80 dst.service="echo.default.svc.cluster.local" dst.workload="waypoint-66f44865c4-l7btm" dst.namespace="default" dst.identity="spiffe://cluster.local/ns/default/sa/waypoint" direction="outbound" bytes_sent=67 bytes_recv=518 duration="2ms"
- The src/dst addr, workload, namespace, and identity represent the information about the source and destination of the traffic. Not all information will be available for all traffic:
  - identity will only be set when mTLS is used.
  - dst.namespace and dst.workload will not be present when traffic is sent to an unknown destination (passthrough traffic).
- dst.service represents the destination service, if the call was to a service. This is not always the case, as an application can reach a Pod directly.
- dst.hbone_addr is set when using mTLS. In this case, hbone_addr represents the target of the traffic, while dst.addr represents the actual address we connected to (for the tunnel).
- bytes_sent and bytes_recv indicate how many bytes were transferred during the connection.
- duration indicates how long the connection was open.
- error, if present, indicates the connection had an error, and why.
In the above log, you can see that while the dst.service is echo, the dst.workload (and dst.addr) are for waypoint-....
This implies the traffic was sent to a waypoint proxy.
Scenario: Traffic timeout with Ztunnel

Traffic is blocked, and Ztunnel logs errors like the below:
error access connection complete direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="io error: deadline has elapsed"
error access connection complete direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="connection timed out, maybe a NetworkPolicy is blocking HBONE port 15008: deadline has elapsed"
- For the connection timed out error, this means the connection could not be established. This may be due to networking issues reaching the destination. A very common cause (hence the log) is a NetworkPolicy or other firewall rule blocking port 15008. Istio mTLS traffic is tunneled over port 15008, so this must be allowed (both on ingress and egress); see the example policy after this list.
- For the more generic errors like io error: deadline has elapsed, the root causes are generally the same as above. However, if traffic works without ambient, it is unlikely to be a typical firewall rule, as the traffic should be sent identically to when ambient is not enabled. This likely indicates an incompatibility with your Kubernetes setup.
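As a reference when auditing your own policies, here is a minimal sketch of a NetworkPolicy that allows inbound HBONE traffic to all pods in a namespace (the name and empty podSelector are illustrative; remember that the client side's egress rules must also permit port 15008):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-hbone-inbound
  namespace: default
spec:
  podSelector: {}            # all pods in the namespace
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP
      port: 15008            # HBONE tunnel port used for ambient mTLS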
Scenario: Readiness probes fail with Ztunnel

After enabling ambient mode, pod readiness probes fail. For example, you may see something like below:
Warning Unhealthy 92s (x6 over 4m2s) kubelet Readiness probe failed: Get "http://1.1.1.1:8080/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Ambient mode is intended to not capture or impact any readiness probe traffic.
It does this by applying a SNAT rule on the host to rewrite any traffic from the kubelet as coming from 169.254.7.127, then skipping redirection for any traffic matching this pattern.
Readiness probe failures that start when enabling ambient typically indicate an environmental issue with this traffic rewrite (a quick check is shown after the list below).
For instance:
- Cilium with bpf.masquerade=true breaks this (platform prerequisites guide, issue)
- Calico, before 3.29, with bpfEnabled set, breaks this (issue)
- AWS Security Groups may block this traffic (issue)
- Some CNIs, such as Cilium, apply NetworkPolicies to this traffic (see here for workarounds).
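One way to confirm the probe-exemption rules are in place is to look for the 169.254.7.127 ACCEPT rules (shown in the iptables output earlier) from a privileged debug shell in the affected pod (see Common information at the end of this document):

$ iptables-save | grep 169.254.7.127
-A ISTIO_OUTPUT -d 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_PRERT -s 169.254.7.127/32 -p tcp -m tcp -j ACCEPT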
Scenario: traffic fails with timed out waiting for workload from xds

When traffic is sent from a pod, Ztunnel must first get information about the pod from Istiod (over the XDS protocol). If it fails to do so within 5s, it will reject the connection with this error.
Istiod is generally expected to return this information substantially sooner than 5s.
If this error happens intermittently, it may indicate that it is not doing so.
This could be caused by Istiod being overloaded, or by modifications that increase PILOT_DEBOUNCE_AFTER (which can slow down updates).
If the issue happens persistently, it is likely a bug; please file an issue.
Warning: prior to Istio 1.24, there were a few bugs that could trigger this error unrelated to timing issues.
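To check whether Ztunnel has (eventually) learned about the source workload, you can dump its view of workloads with the same command used later in this document (the pod name and output are illustrative):

$ istioctl zc workloads | grep shell-5b7cf9f6c4
default  shell-5b7cf9f6c4-npqgz  10.244.0.33  node  None  HBONE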
Scenario: traffic fails with unknown source

This indicates Ztunnel was unable to identify the source of traffic. In Istio 1.23, Ztunnel would attempt to map the source IP of traffic to a known workload. If the workload has multiple network interfaces, this may prevent Ztunnel from making this association.
Istio 1.24+ does not require this mapping.
Scenario: traffic fails with no healthy upstream

This indicates traffic to a Service had no applicable backends.
We can see how Ztunnel views the Service's health:
$ istioctl zc services
NAMESPACE SERVICE NAME SERVICE VIP WAYPOINT ENDPOINTS
default     echo           10.96.99.1    None       3/4

This indicates there are 4 endpoints for the service, but 1 was unhealthy.
Next we can look at how Kubernetes views the service:
$ kubectl get endpointslices
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
echo-v76p9   IPv4          8080    10.244.0.20,10.244.0.36 + 1 more...   7h50m

Here we also see 3 endpoints.
If Kubernetes shows zero healthy endpoints, it indicates there is not an issue in the Istio setup, but rather that the service is actually unhealthy. Check to ensure its labels select the expected workloads, and that those pods are marked as "ready".
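For example, using the echo service from above (the selector output shown is illustrative):

# See which labels the Service selects on
$ kubectl get svc echo -o jsonpath='{.spec.selector}'
{"app":"echo"}
# Confirm the selected pods exist and are Ready
$ kubectl get pods -l app=echo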
If this is seen for the kubernetes service, this may be fixed in Istio 1.23+ and Istio 1.22.3+.
If this is seen for hostNetwork pods, or other scenarios where multiple workloads have the same IP address, this may be fixed in Istio 1.24+.
Scenario: traffic fails with http status: ...

Ztunnel acts as a TCP proxy and does not parse users' HTTP traffic at all, so it may be confusing that Ztunnel reports an HTTP error.
This is the result of the tunneling protocol ("HBONE") ztunnel uses, which is over HTTP CONNECT. An error like this indicates ztunnel was able to establish an HBONE connection, but the stream was rejected.
When communicating with another Ztunnel, this may be caused by various issues:
- 400 Bad Request: the request was entirely invalid; this may indicate a bug.
- 401 Unauthorized: the request was rejected by AuthorizationPolicy rules.
- 503 Service Unavailable: the destination is not available.
When communicating with a waypoint proxy (Envoy), there is a wider range of response codes possible. 401 for AuthorizationPolicy rejection and 503 as a general catch-all are common.
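If you suspect a 401, a reasonable first step is to review the AuthorizationPolicy objects that could apply to the destination, for example:

$ kubectl get authorizationpolicies -A
$ kubectl get authorizationpolicies -n default -o yaml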
Scenario: traffic fails with connection closed due to connection drain

When Ztunnel shuts down an instance of a proxy, it will close any outstanding connections.
This will be preceded by a log like inpod::statemanager pod delete request, shutting down proxy for the pod.
This can happen:
- If the Pod is actually deleted. In this case, the connections are generally already closed, though.
- If Ztunnel itself is shutting down.
- If the pod was un-enrolled from ambient mode.
See this blog post for more information.
Scenario: ztunnel logs HBONE ping timeout/error and ping timeout

(Fixed in Istio 1.23.1+)
These logs can be ignored. See issue for details.
Scenario: ztunnel is not sending egress traffic to waypoints

Consider a ServiceEntry like:
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: example.com
  labels:
    istio.io/use-waypoint: my-waypoint
spec:
  hosts:
  - example.com
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS

Unlike a typical Service, this will not necessarily have the two components needed for traffic capture to work:
- It will not have a stable Service IP address known to Istio (example.com may have many, changing IPs).
- We do not have DNS set up to return such a stable IP address, even if one did exist.
Istio has two features to resolve these:
- values.pilot.env.PILOT_ENABLE_IP_AUTOALLOCATE=true (default in Istio 1.25+) enables a controller that will allocate an IP address for the ServiceEntry and write it into the object. You can view it in the ServiceEntry itself:

  status:
    addresses:
    - host: example.com
      value: 240.240.0.3
    - host: example.com
      value: 2001:2::3

- values.cni.ambient.dnsCapture=true (default in Istio 1.25+) will enable Ztunnel to handle DNS, which allows it to respond with the above IP addresses in response to a query for example.com. Note you will need to restart workloads after changing this setting.
Together, these enable egress traffic to traverse a waypoint. To troubleshoot this (example commands follow this list):
- Ensure the ServiceEntry has an IP address in the status.
- Check the pod is getting this IP address in DNS lookups.
- Check whether this IP shows up as the destination IP address in Ztunnel.
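A minimal sketch of those checks, using the example.com ServiceEntry above and a client pod named shell-5b7cf9f6c4-npqgz (both illustrative; your client image needs a DNS tool such as nslookup):

# 1. Does the ServiceEntry have auto-allocated addresses in its status?
$ kubectl get serviceentry example.com -o jsonpath='{.status.addresses}'
# 2. Does the pod resolve example.com to one of those addresses?
$ kubectl exec shell-5b7cf9f6c4-npqgz -- nslookup example.com
# 3. Does Ztunnel see that address as the destination? (240.240.0.3 is the allocated address from the status above)
$ kubectl -n istio-system logs ds/ztunnel | grep 240.240.0.3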
Scenario: traffic from sidecars/gateways fails with upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end
(Fixed in Istio 1.25)
This is caused by having a DestinationRule with TLS mode ISTIO_MUTUAL configured.
Prior to Istio 1.25, a bug prevented this configuration from working.
If you see this error, please upgrade or remove the DestinationRule configuration.
Note that Istio will automatically use mTLS when possible, even without any DestinationRule configuration.
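To locate DestinationRules that set this mode across the cluster, a simple (if blunt) check is:

$ kubectl get destinationrules -A -o yaml | grep -n -B 15 "mode: ISTIO_MUTUAL"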
Scenario: traffic fails with Connection timed out

Ztunnel traffic may show an error with the following symptoms: bytes_sent=5804 bytes_recv=14441 duration="5194336ms" error="io error: Connection timed out (os error 110)".
Note the bytes_sent and bytes_recv are non-zero, and the connection is long-lived.
The error message here is slightly misleading: it is not the connection handshake that has timed out, but some other aspect of the connection.
There are a few areas where this can occur:
- TCP Keepalive timeouts. As of Istio 1.24, Ztunnel will enable keepalives on connections by default. If these fail repeatedly, the connection will be terminated with this error.
- TCP retransmission timeouts. The kernel will attempt to retransmit TCP messages, and eventually timeout and close the connection.
There are two useful tools for understanding these timeouts:
- ss -nto: the -o flag enables a timers column. You may see output like so:

  State       Local Address:Port    Peer Address:Port
  ESTAB       10.0.0.2:39130        10.0.0.3:8080     timer:(keepalive,15sec,0)
  ESTAB       10.0.0.2:39140        10.0.0.3:8080
  FIN-WAIT-1  10.0.0.2:52554        10.0.0.4:8080     timer:(on,1min38sec,12)

  The keepalive timer is from TCP keepalives added by Istio to the connection. 15sec means the next keepalive will be sent in 15 seconds, and the 0 indicates we are on attempt 0 (which means we got a response to our most recent keepalive probe). Notably, there are two connections: one from the application to ztunnel, and one from ztunnel to the destination. These both appear the same in the output. In the above example, the application does not utilize keepalives, so only 1 of the 2 connections has a keepalive timer.
  The on timer for the FIN-WAIT-1 connection is a TCP retransmission timer.
- nstat gives detailed counters of low-level networking events from the kernel. By default, the command will emit stats between the current and previous call, but this can be configured with -a (show everything) and -r (don't reset for the next call). This tool can be useful for understanding kernel-level TCP errors that are otherwise not particularly visible.
A common cause of the Connection timed out error is when the connection is not gracefully terminated, and attempts to do so are blocked by the server.
This is described in the Kernel Documentation. Reproducing this in an ambient mesh can produce the following counters:
$ nstat
#kernel
IpInReceives 2 0.0
IpInDelivers 2 0.0
IpOutRequests 2 0.0
IpOutTransmits 2 0.0
TcpInSegs 2 0.0
TcpOutSegs 2 0.0
TcpOutRsts 1 0.0
TcpExtTCPTimeouts 1 0.0
TcpExtTCPAbortOnTimeout 1 0.0
TcpExtTCPOrigDataSent 1 0.0
Along with the ztunnel error bytes_sent=5804 bytes_recv=14441 duration="5194336ms" error="io error: Connection timed out (os error 110)".
Note it is specifically the TcpExtTCPAbortOnTimeout event that ultimately triggers an ETIMEDOUT to be returned on the connection.
Waypoint issues

First, we will want to see some signs that indicate traffic is traversing a waypoint:
- Requests sent to the waypoint will generally go through Envoy's HTTP processing, which will mutate the request. For example, by default headers will be translated to lowercase and a few Envoy headers are injected:
x-envoy-upstream-service-time: 2
server: istio-envoy
x-envoy-decorator-operation: echo.default.svc.cluster.local:80/*
Note this is not always the case, as traffic may be treated as TCP; a quick header check is shown after this list.
- Waypoint access logs, if enabled, will log each request. See here to enable access logs.
- Ztunnel access logs, if enabled, will log each request. See here for an example log to a waypoint.
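As a rough check of the first point, you can look for the Envoy-injected response headers from a client pod (the shell deployment and echo service are illustrative, and the exact headers may vary):

$ kubectl exec deploy/shell -- curl -s -o /dev/null -D - http://echo/
HTTP/1.1 200 OK
server: istio-envoy
x-envoy-upstream-service-time: 2
x-envoy-decorator-operation: echo.default.svc.cluster.local:80/*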
Traffic can be sent to a service or directly to a workload. While sending to a service is typical, see the ztunnel access logs to identify the type of traffic. Similarly, a waypoint can be associated with a service, a workload, or both. Mismatches between these can cause the waypoint to not be utilized.
Tip
Cilium with bpf-lb-sock requires bpf-lb-sock-hostns-only to be set, or all traffic will be incorrectly treated as direct-to-workload traffic. (issue).
Next, we can check if Ztunnel is configured to send to a waypoint:
$ istioctl zc services
NAMESPACE SERVICE NAME SERVICE VIP WAYPOINT ENDPOINTS
default echo 10.96.0.1 waypoint 1/1
default no-waypoint 10.96.0.2 None 1/1
$ istioctl zc workloads
NAMESPACE POD NAME ADDRESS NODE WAYPOINT PROTOCOL
default echo-79dcbf57cc-l2cdp 10.244.0.1 node None HBONE
default     product-59896bc9f7-kp4lb  10.244.0.2  node  waypoint  HBONE

This indicates the echo Service and the product-59896bc9f7-kp4lb Pod are bound to the waypoint.
If Ztunnel is configured to use the waypoint for the destination but traffic isn't going to the waypoint, it is likely traffic is actually going to the wrong destination.
Check the ztunnel access logs to verify the destination service/workload and ensure it matches.
If None is found, Ztunnel isn't programmed to use the waypoint.
1. Check the status on the object. This should give an indication of whether it was attached to the waypoint or not. (Note: this is available in 1.24+, and currently only on Service and ServiceEntry.)
$ kubectl get svc echo -oyaml
status:
  conditions:
  - lastTransitionTime: "2024-09-25T19:28:16Z"
    message: Successfully attached to waypoint default/waypoint
    reason: WaypointAccepted
    status: "True"
    type: istio.io/WaypointBound

2. Check what resources have been configured to use a waypoint:
$ kubectl get namespaces -L istio.io/use-waypoint
NAME STATUS AGE USE-WAYPOINT
namespace/default Active 1h waypoint
namespace/istio-system Active 1h

You will want to look at namespaces in all cases, services and serviceentries for service cases, and pods and workloadentries for workload cases.
This label must be set to associate a resource with a waypoint.
3. If the label is present, this may be caused by the waypoint being missing or unhealthy. Check the Gateway objects and ensure the waypoint is deployed.
$ kubectl get gateways.gateway.networking.k8s.io
NAME CLASS ADDRESS PROGRAMMED AGE
waypoint   istio-waypoint            False        17s

The above shows an example of a waypoint that is deployed, but is not healthy. A waypoint will not be enabled until it becomes healthy at least once. If it is not healthy, check the status for more information.
If the Gateway isn't present at all, deploy one!
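For example, a minimal sketch using istioctl (the namespace and waypoint name are illustrative):

# Deploy a waypoint named "waypoint" into the default namespace
$ istioctl waypoint apply -n default --name waypoint
# Associate the namespace with it
$ kubectl label namespace default istio.io/use-waypoint=waypoint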
When deploying a waypoint, it is fully enabled for usage in the mesh without needing to enable ztunnel for it.
This is done by setting the istio.io/dataplane-mode: none label on the pods, which is automatically handled for you.
However, if you explicitly override the waypoint to istio.io/dataplane-mode: ambient it will attempt to add ztunnel to the waypoint pod, meaning there will be two components trying to handle mesh communications, which will conflict with each other.
When deploying a waypoint, you should not set istio.io/dataplane-mode.
Common information

Most pods have low privileges and few debug tools available.
For some diagnostics, it's helpful to run an ephemeral container with elevated privileges and utilities.
The istio/base image can be used for this, along with kubectl debug --profile sysadmin.
For example:
$ kubectl debug --image istio/base --profile sysadmin --attach -t -i shell-5b7cf9f6c4-npqgz

To view the current log level, run:
$ istioctl zc log ztunnel-cqg6c
ztunnel-cqg6c.istio-system:
current log level is info

To set the log level:
$ istioctl zc log ztunnel-cqg6c --level=info,access=debug
ztunnel-cqg6c.istio-system:
current log level is hickory_server::server::server_future=off,access=debug,info

To set the log level at Ztunnel pod startup, configure the environment variable:
$ kubectl -n istio-system set env ds/ztunnel RUST_LOG=info