bitoku
Contributor

@bitoku bitoku commented Jun 13, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Because c.stopTimeoutChan is drained only until the stop loop starts sending SIGKILL, the following scenario can deadlock.

  1. StopContainer(1)
    1. SetAsStopping
    2. StopLoopForContainer
    3. killContainer
  2. StopContainer(2)
    1. SetAsStopping => false
    2. WaitOnStopTimeout
      c.stopLock.Lock()
      c.stopTimeoutChan <- timeout (len(c.stopTimeoutChan) == 1)
      c.stopLock.Unlock()
  3. StopContainer(n = 3~11)
    1. SetAsStopping => false
    2. WaitOnStopTimeout
      c.stopLock.Lock()
      c.stopTimeoutChan <- timeout (len(c.stopTimeoutChan) == n - 1)
      c.stopLock.Unlock()
  4. StopContainer(12~)
    1. SetAsStopping => false
    2. WaitOnStopTimeout
      c.stopLock.Lock()
      c.stopTimeoutChan <- timeout (BLOCKED!!)
  5. After the process was killed (back to 1)
    1. StopLoopForContainer done
    2. KillExecPIDs
    3. c.stopLock.Lock() (DEADLOCK!!)
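
The scenario above comes down to a blocking channel send made while holding a mutex, after the only reader has stopped draining the channel. The sketch below shows one way to avoid that pattern; it is not the code from this PR. The `stopper` type, its fields, and the `requestTimeout` and `beginKillLoop` methods are hypothetical names, and the buffer capacity of 10 is an assumption inferred from the scenario (calls 2 through 11 fill the buffer, call 12 blocks).

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical stand-in for the container's stop state.
type stopper struct {
	mu            sync.Mutex
	timeoutChan   chan int64 // buffered; drained only before the kill loop starts
	killLoopBegun bool       // true once nothing reads timeoutChan anymore
}

func newStopper(capacity int) *stopper {
	return &stopper{timeoutChan: make(chan int64, capacity)}
}

// beginKillLoop records that the reader has stopped draining timeoutChan.
func (s *stopper) beginKillLoop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.killLoopBegun = true
}

// requestTimeout tries to hand a new timeout to the stop loop and reports
// whether it was queued. It never blocks while holding mu: the send is
// skipped once the kill loop has begun, and dropped if the buffer is full.
func (s *stopper) requestTimeout(timeout int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.killLoopBegun {
		return false
	}
	select {
	case s.timeoutChan <- timeout:
		return true
	default:
		return false
	}
}

func main() {
	s := newStopper(10) // assumed capacity, see the note above
	queued := 0
	for i := 0; i < 12; i++ {
		if s.requestTimeout(30) {
			queued++
		}
	}
	fmt.Println("queued before kill loop:", queued)

	s.beginKillLoop()
	fmt.Println("queued after kill loop:", s.requestTimeout(30))
}
```

With a plain `s.timeoutChan <- timeout` in place of the `select`, the eleventh call in `main` would block while holding `mu`, which is the state step 4 describes.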

Which issue(s) this PR fixes:

https://issues.redhat.com/browse/OCPBUGS-55485

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug where a pod can't be terminated when its process stays in uninterruptible sleep for a while.

@bitoku bitoku requested a review from mrunalp as a code owner June 13, 2025 06:18
@openshift-ci openshift-ci bot added the release-note, dco-signoff: yes, and kind/bug labels Jun 13, 2025
@openshift-ci openshift-ci bot requested review from klihub and littlejawa June 13, 2025 06:19
codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.

Project coverage is 66.91%. Comparing base (3dce7d8) to head (1e751b4).
Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9256      +/-   ##
==========================================
- Coverage   67.05%   66.91%   -0.14%     
==========================================
  Files         198      198              
  Lines       27176    27189      +13     
==========================================
- Hits        18222    18193      -29     
- Misses       7449     7495      +46     
+ Partials     1505     1501       -4     

// because it is not controlled by the timeout anymore.
stopTimeoutChan chan int64
stopWatchers []chan struct{}
stopKillLoop bool
Member

nit: maybe stopKillLoopBegun or something? making clear it's part of the 'stop' operation and not an instruction whether to stop the kill loop

Contributor Author

Thanks! done

Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
@bitoku bitoku changed the title from "Fix deadlock when stopping uninterruptible container" to "OCPBUGS-55485: Fix deadlock when stopping uninterruptible container" Jun 19, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Jun 19, 2025
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-55485, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku
Contributor Author

bitoku commented Jun 19, 2025

/jira refresh

@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-55485, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@bitoku
Contributor Author

bitoku commented Jun 19, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Jun 19, 2025
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-55485, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lyman9966

Contributor

openshift-ci bot commented Jun 19, 2025

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lyman9966.

Note that only cri-o members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@haircommander
Member

/override ci/prow/ci-e2e-evented-pleg
/approve

LGTM, @cri-o/cri-o-maintainers PTAL

Contributor

openshift-ci bot commented Jun 20, 2025

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-e2e-evented-pleg

@openshift-ci openshift-ci bot added the approved label Jun 20, 2025
Member

@sohankunkerkar sohankunkerkar left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 20, 2025
Contributor

openshift-ci bot commented Jun 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, haircommander, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [haircommander,sohankunkerkar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bitoku
Contributor Author

bitoku commented Jun 24, 2025

/retest

@bitoku
Contributor Author

bitoku commented Jun 24, 2025

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit 43ed0a1 into cri-o:main Jun 24, 2025
86 of 89 checks passed
@openshift-ci-robot

@bitoku: Jira Issue OCPBUGS-55485: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-55485 has been moved to the MODIFIED state.

@bitoku
Contributor Author

bitoku commented Jul 4, 2025

/cherry-pick release-1.33

@bitoku bitoku deleted the fix-deadlock branch July 4, 2025 07:02
@openshift-cherrypick-robot

@bitoku: new pull request created: #9320

@bitoku
Contributor Author

bitoku commented Jul 8, 2025

/cherry-pick release-1.32

@bitoku
Contributor Author

bitoku commented Jul 8, 2025

/cherry-pick release-1.31

@openshift-cherrypick-robot

@bitoku: #9256 failed to apply on top of branch "release-1.32":

Applying: fix deadlock when the container is in uninterruptible sleep
Using index info to reconstruct a base tree...
M	internal/oci/container.go
M	internal/oci/runtime_oci.go
Falling back to patching base and 3-way merge...
Auto-merging internal/oci/runtime_oci.go
CONFLICT (content): Merge conflict in internal/oci/runtime_oci.go
Auto-merging internal/oci/container.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix deadlock when the container is in uninterruptible sleep

@bitoku
Contributor Author

bitoku commented Jul 8, 2025

/cherry-pick release-1.30

@openshift-cherrypick-robot

@bitoku: #9256 failed to apply on top of branch "release-1.31":

Applying: fix deadlock when the container is in uninterruptible sleep
Using index info to reconstruct a base tree...
M	internal/oci/container.go
M	internal/oci/runtime_oci.go
Falling back to patching base and 3-way merge...
Auto-merging internal/oci/runtime_oci.go
CONFLICT (content): Merge conflict in internal/oci/runtime_oci.go
Auto-merging internal/oci/container.go
CONFLICT (content): Merge conflict in internal/oci/container.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix deadlock when the container is in uninterruptible sleep

@openshift-cherrypick-robot

@bitoku: #9256 failed to apply on top of branch "release-1.30":

Applying: fix deadlock when the container is in uninterruptible sleep
Using index info to reconstruct a base tree...
M	internal/oci/container.go
M	internal/oci/runtime_oci.go
Falling back to patching base and 3-way merge...
Auto-merging internal/oci/runtime_oci.go
CONFLICT (content): Merge conflict in internal/oci/runtime_oci.go
Auto-merging internal/oci/container.go
CONFLICT (content): Merge conflict in internal/oci/container.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix deadlock when the container is in uninterruptible sleep
