Skip to content

Conversation

@Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Aug 14, 2025

Summary

Changes

Test Files Migrated

  • AttemptSysMsgRedeliverySpec
  • ClientDowningNodeThatIsUnreachableSpec
  • ClusterDeathWatchSpec
  • ConvergenceSpec
  • LeaderDowningAllOtherNodesSpec
  • LeaderDowningNodeThatIsUnreachableSpec
  • SingletonClusterSpec
  • SplitBrainResolverDowningSpec
  • SplitBrainSpec
  • SurviveNetworkInstabilitySpec
  • UnreachableNodeJoinsAgainSpec

Migration Pattern Applied

  1. Added using System.Threading.Tasks; import
  2. Converted test methods from public void to public async Task
  3. Replaced EnterBarrier() with await EnterBarrierAsync()
  4. Replaced TestConductor.X().Wait() with await TestConductor.XAsync()
  5. Replaced RunOn() with await RunOnAsync() for async operations
  6. Replaced Within() with await WithinAsync() for async timing constraints

Test Plan

  • All files compile successfully
  • No build warnings or errors
  • Multi-node tests pass in CI

Related

@Aaronontheweb Aaronontheweb force-pushed the claude-wt-MNTR1 branch 2 times, most recently from 7dfe6fa to 0ab9059 Compare August 14, 2025 02:51
- Convert all test methods to async Task
- Replace TestConductor blocking calls with async versions
- Replace EnterBarrier with EnterBarrierAsync
- Use RunOnAsync for async operations
- Use WithinAsync for async timing constraints
- Add using System.Threading.Tasks
- Update migration tracking document
- Fix all non-awaited RunOn calls inside async contexts
- Remove unnecessary async/await patterns with Task.CompletedTask

NOTE: UnreachableNodeJoinsAgainSpec has a known issue where the victim
node is shut down mid-test and cannot participate in async coordination.
This test may require special handling.

This migration prevents thread pool starvation and improves test reliability
in CI environments.

Specs migrated:
- AttemptSysMsgRedeliverySpec
- ClientDowningNodeThatIsUnreachableSpec
- ClusterDeathWatchSpec
- ConvergenceSpec
- LeaderDowningAllOtherNodesSpec
- LeaderDowningNodeThatIsUnreachableSpec
- SingletonClusterSpec
- SplitBrainResolverDowningSpec
- SplitBrainSpec
- SurviveNetworkInstabilitySpec
- UnreachableNodeJoinsAgainSpec (with known issue)
This test has a unique pattern where the victim node is completely shut down
and restarted mid-test. This pattern is incompatible with async coordination
as the victim node cannot participate in async barriers after shutdown.

Keeping this test synchronous to maintain its functionality.
@Aaronontheweb
Copy link
Member Author

Update: UnreachableNodeJoinsAgainSpec Reverted

After thorough testing, I've reverted UnreachableNodeJoinsAgainSpec back to its synchronous implementation. This test has a unique pattern where:

  1. The "victim" node (fourth) is completely shut down mid-test via TestConductor.Shutdown()
  2. The victim then restarts itself with a new ActorSystem on the same port
  3. Other nodes continue with barriers that the victim cannot participate in

This pattern is fundamentally incompatible with async coordination because:

  • After shutdown, the victim node cannot participate in EnterBarrierAsync() calls
  • The test relies on the victim executing its recovery logic synchronously while other nodes wait
  • Attempting to make this async causes timeout failures as the victim tries to enter barriers it's no longer connected to

Current Status:

  • ✅ 10 out of 11 tests successfully migrated to async
  • ⚠️ UnreachableNodeJoinsAgainSpec remains synchronous (by design)

All other tests in this batch have been successfully migrated and are working correctly with the async patterns.

I've saved a detailed debugging analysis for future reference with all the attempted solutions and why they failed.

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Arkatufus Arkatufus enabled auto-merge (squash) August 14, 2025 18:31
@Aaronontheweb
Copy link
Member Author

The DistributedPubSubRestartSpecs, which are historically racy and unmodified in this PR, are what failed.

@Aaronontheweb Aaronontheweb disabled auto-merge August 14, 2025 20:49
@Aaronontheweb Aaronontheweb merged commit 31adf92 into akkadotnet:dev Aug 14, 2025
9 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants