Tags: NSXBet/batcher
Tags
fix: prevent goroutine leak in shutdown test
Replace `select{}` (blocks forever) with a channel that t.Cleanup
closes, so the processor goroutine terminates when the test ends.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: resolve goroutine leaks during batcher shutdown This commit fixes critical goroutine leak issues that occurred when Close() was called on an active batcher, particularly during consumer crash recovery scenarios or rapid start/stop cycles. Root Causes: 1. Improper channel draining: The startProcessing() goroutine would exit without draining the batchesChan, leaving internal rill pipeline goroutines blocked trying to send remaining batches. 2. Broken timeout in Join(): The timeout logic was creating a new timer on each loop iteration, effectively resetting the timeout continuously, preventing it from ever expiring. 3. Unbounded Close() timeout: The calculated timeout could be very long or even indefinite, causing Close() to block for extended periods. Fixes: - startProcessing(): Added proper channel draining in deferred cleanup to ensure all rill pipeline goroutines can exit cleanly. Added channel close detection and ensured correct cleanup order (close input → drain output → close errors). - Join(): Fixed timeout logic to use a deadline instead of recreating timers, ensuring timeouts actually work as expected. - Close(): Added maximum timeout of 10 seconds to prevent indefinite blocking, minimum timeout of 100ms for cleanup, and 50ms grace period for deferred cleanup to complete. Testing: - Added comprehensive goroutine leak test suite (8 new tests) covering: * Shutdown with failing processors * Shutdown with pending messages * Shutdown during active processing * Rapid start/stop cycles * Multiple concurrent Close() calls * Timeout behavior verification - All 26 tests pass (18 existing + 8 new) - No race conditions detected with -race flag - No performance regression (~430 ns/op maintained) - Full backward compatibility maintained Impact: All goroutines now clean up properly within 2-5 seconds of Close() being called, regardless of processor state, pending messages, or timing conditions. This resolves goroutine leaks observed in production integrations with reliable-redis-queues during crash recovery tests. Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>