[Bug]: grpcio versions from 1.59.0 to 1.62.1 can cause Beam Python pipelines to get stuck #30867

@DerRidda

What happened?

A combination of software releases in the Beam dependency chain has surfaced a failure mode that might cause pipelines to get stuck without explanation. The issue affects Apache Beam 2.55.0 and 2.55.1, but may also affect other SDKs when the pipeline runtime environment has google-api-core version 2.17.0 or above AND a grpcio version in the range 1.59.0<=grpcio<=1.62.1.
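The version condition above can be checked mechanically. The following is a minimal sketch, not part of Beam: the helper names `parse_version`, `is_affected`, and `installed_environment_is_affected` are hypothetical, and the parser is deliberately simplistic (it compares only the numeric major.minor.patch parts, so a pre-release such as `1.59.0rc1` is treated like `1.59.0`).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.62.1' into (1, 62, 1). Non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(grpcio_version, api_core_version):
    """True when both conditions from the issue description hold."""
    grpcio_in_range = (1, 59, 0) <= parse_version(grpcio_version) <= (1, 62, 1)
    api_core_new_enough = parse_version(api_core_version) >= (2, 17, 0)
    return grpcio_in_range and api_core_new_enough


def installed_environment_is_affected():
    """Check the packages installed in the current interpreter."""
    try:
        return is_affected(version("grpcio"), version("google-api-core"))
    except PackageNotFoundError:
        # One of the two packages is missing, so the combination cannot occur.
        return False
```

Note that this inspects the environment where it runs. The condition that matters is the pipeline runtime environment (for example the SDK worker container), which may differ from the machine that submits the job.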

Symptoms

Beam pipelines might get stuck. Dataflow jobs might have errors like:

  • Unable to retrieve status info from SDK harness
  • There are 10 consecutive failures obtaining SDK worker status info, SDK worker appears to be permanently unresponsive. Aborting the SDK.

Mitigation

Upgrade to Apache Beam 2.56.0 or above once available. Until then, install any one of the following dependency combinations in the Beam pipeline runtime environment:

  • upgrade grpcio to version 1.62.2 or above, OR
  • downgrade grpcio and grpcio-status to 1.58.0 or below, OR
  • downgrade google-api-core to version 2.16.2 or below

You can define dependencies in the pipeline runtime environment using a --requirements_file pipeline option or other options outlined in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/.
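As an illustration of the `--requirements_file` route, here is a minimal sketch that writes a requirements file containing the first mitigation (grpcio 1.62.2 or above) and builds the matching pipeline argument. The helper names and the file name `requirements.txt` are assumptions made for this example; only the `--requirements_file` option and the version pin come from the description above.

```python
import tempfile
from pathlib import Path


def write_requirements_file(directory, pins):
    """Write one requirement specifier per line and return the file path."""
    path = Path(directory) / "requirements.txt"  # hypothetical file name
    path.write_text("\n".join(pins) + "\n", encoding="utf-8")
    return path


def requirements_args(path):
    """Build the command-line argument that points the pipeline at the file."""
    return [f"--requirements_file={path}"]


# Demo in a throwaway directory: pin grpcio to a fixed release.
with tempfile.TemporaryDirectory() as workdir:
    req = write_requirements_file(workdir, ["grpcio>=1.62.2"])
    print(req.read_text(encoding="utf-8"), end="")
    print(requirements_args(req))
```

The returned list would be appended to the arguments passed when launching the pipeline. For the downgrade mitigations, the pins would instead constrain grpcio and grpcio-status together, or google-api-core, as listed above.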

Users of Apache Beam 2.55.0 might be able to avoid the issue by downgrading to apache-beam==2.54.0, since its default containers for the runtime environment have a set of dependencies that does not trigger the bug.

Root cause

The issue was caused by a regression in grpcio==1.59.0 (grpc/grpc#36265), which has now been fixed in grpcio==1.62.2 and above. The regression triggers the failure mode when grpcio is used with google-api-core==2.17.0 and above.

Description updated: 2024-04-23.

Original description:

Updating the Python grpcio dependency to version 1.62.1 caused Dataflow jobs to stall, with excessive waits for responses in the gRPC multi-threaded rendezvous, probably somewhere in the SDK worker. An upstream issue exists here: grpc/grpc#36256

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
