Flaky test: TestLeaderBalancedNodeAdded times out after 10m #1019

@mattisonchao

Summary

TestLeaderBalancedNodeAdded in tests/balancer intermittently hangs and exceeds the 10-minute test timeout, causing the whole tests/balancer package to fail. This is a recurrence of #936, which was previously closed.

Failing CI run

Failure excerpt

panic: test timed out after 10m0s
    running tests:
        TestLeaderBalancedNodeAdded (9m14s)

goroutine 1 [chan receive, 9 minutes]:
testing.(*T).Run(0xc0003be248, ...)

The goroutine dump shows multiple streamReader[...].handleServerMessageOnce goroutines parked in gRPC RecvMsg for ~9 minutes (at oxiad/dataserver/assignment/stream_reader.go:70). This suggests the test setup reached a state where the shard-assignment streams stopped progressing while the test kept waiting.

Repro

Not reproducible locally so far; appears only under CI load. Retrying the job typically succeeds.

Suspected area

tests/balancer/leader_balancer_test.go: the test's assert.Eventually/wait loop for balanced leader distribution appears able to hang when an assignment stream is stuck, so the test only ends when the 10-minute timeout fires.

Related
