Description
What happened?
A combination of software releases in the Beam dependency chain has surfaced a failure mode that might cause pipelines to get stuck without explanation. The issue affects Apache Beam 2.55.0 and 2.55.1, but may also affect other SDKs when the pipeline runtime environment has google-api-core version 2.17.0 or above AND a grpcio version in the range 1.59.0<=grpcio<=1.62.1.
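The condition above can be checked programmatically. The following is a minimal sketch, not part of Beam: the helper names (parse_version, is_affected, installed_environment_is_affected) are hypothetical, and the version parsing is a simplification that only reads the leading numeric components, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata


def parse_version(version):
    """Return the leading (major, minor, patch) numbers of a version string.

    Simplification: suffixes such as "rc1" or ".post1" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_affected(api_core_version, grpcio_version):
    """Apply the condition from this issue:
    google-api-core >= 2.17.0 AND 1.59.0 <= grpcio <= 1.62.1.
    """
    api_core = parse_version(api_core_version)
    grpcio = parse_version(grpcio_version)
    return api_core >= (2, 17, 0) and (1, 59, 0) <= grpcio <= (1, 62, 1)


def installed_environment_is_affected():
    """Check the packages installed in the current Python environment."""
    try:
        return is_affected(
            metadata.version("google-api-core"),
            metadata.version("grpcio"),
        )
    except metadata.PackageNotFoundError:
        # One of the two packages is missing, so the combination cannot occur.
        return False


if __name__ == "__main__":
    print("affected combination installed:", installed_environment_is_affected())
```

Note that this inspects the environment where the script runs. For the result to be meaningful it would need to run in the pipeline runtime environment (for example, inside the worker container), not only on the machine that submits the job.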
Symptoms
Beam pipelines might get stuck. Dataflow jobs might show errors such as: "Unable to retrieve status info from SDK harness", "There are 10 consecutive failures obtaining SDK worker status info", "SDK worker appears to be permanently unresponsive. Aborting the SDK."
Mitigation
Upgrade to Apache Beam 2.56.0 or above once available. Until then, install any one of the following dependency combinations in the Beam pipeline runtime environment:
- upgrade grpcio to version 1.62.2 or above, OR
- downgrade grpcio and grpcio-status to 1.58.0 or below, OR
- downgrade google-api-core to version 2.16.2 or below.
You can define dependencies in the pipeline runtime environment using the --requirements_file pipeline option or other options outlined in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/.
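As an illustration, the three mitigation options listed above can be written out as requirements-file pins. This is a sketch under assumptions: the dictionary keys, the function name write_requirements, and the file name are hypothetical, and the pins simply restate the version bounds given in this issue. The returned flag is meant to be appended to the pipeline's command-line arguments; that step is not exercised here.

```python
from pathlib import Path

# Pins restating the mitigation options from this issue (choose one).
# The keys are hypothetical labels, not Beam options.
MITIGATION_PINS = {
    "upgrade_grpcio": ["grpcio>=1.62.2"],
    "downgrade_grpcio": ["grpcio<=1.58.0", "grpcio-status<=1.58.0"],
    "downgrade_api_core": ["google-api-core<=2.16.2"],
}


def write_requirements(path, mitigation):
    """Write the pins for one mitigation to `path`.

    Returns the pipeline flag that points at the written file.
    """
    pins = MITIGATION_PINS[mitigation]
    Path(path).write_text("\n".join(pins) + "\n")
    return f"--requirements_file={path}"


if __name__ == "__main__":
    flag = write_requirements("requirements.txt", "upgrade_grpcio")
    # Append `flag` to the arguments used to launch the pipeline.
    print(flag)
```

If the pipeline already uses a requirements file, adding the chosen pin lines to that file is likely simpler than generating a new one.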
Users of Apache Beam 2.55.0 might be able to avoid the issue by downgrading to apache-beam==2.54.0, since the default containers for its runtime environment have a set of dependencies that does not trigger the bug.
Root cause
The issue was caused by a regression in grpcio==1.59.0 (grpc/grpc#36265), which has now been fixed in grpcio==1.62.2 and above. The regression triggered the failure mode when used with google-api-core==2.17.0 and above.
Description updated: 2023-04-23.
Original description:
An update of the Python grpcio dependency to version 1.62.1 caused Dataflow jobs to stall, with excessive waits for responses in the gRPC multi-threaded rendezvous, probably somewhere in the SDK worker. An upstream issue exists here: grpc/grpc#36256
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner