fix(inkless:systest): fix sigstop and slow consumer giving false negatives #659

Open
giuseppelillo wants to merge 2 commits into main from glillo/fix-switch-systests

@giuseppelillo giuseppelillo commented Jun 17, 2026

Commit 1: avoid false negatives due to slow consumers

Commit 2: actually send SIGSTOP to the broker and verify that it really stops

@giuseppelillo giuseppelillo changed the title from "fix(inkless:systest): fix sigstop and" to "fix(inkless:systest): fix sigstop and slow consumer giving false negatives" Jun 17, 2026

The test passed KafkaService.java_class_name() (regex kafka\.Kafka)
to Trogdor's ProcessStopFaultSpec, but Trogdor's worker matches
the target JVM by literal substring against jcmd -l.
The escaped form never matched the real kafka.Kafka line,
so SIGSTOP/SIGCONT were sent to zero pids and the leader was never
actually frozen — the scenario passed without testing anything.
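
The mismatch is easy to reproduce in isolation. The sketch below models the worker's matching as a plain substring test, as described above; it is not Trogdor's actual code. The sample line, its pid and path, and the `matching_pids` helper are all made up for illustration.

```python
import re
from typing import List

# Hypothetical line in the style of `jcmd -l` output: "<pid> <main class> <args>".
JCMD_LINE = "12345 kafka.Kafka /opt/kafka/config/server.properties"

REGEX_FORM = r"kafka\.Kafka"   # escaped form, as the test originally passed it
LITERAL_FORM = "kafka.Kafka"   # literal main-class name


def matching_pids(jcmd_output: str, needle: str) -> List[int]:
    """Return the pids of lines that contain `needle` as a literal substring."""
    pids = []
    for line in jcmd_output.splitlines():
        if needle in line:
            pids.append(int(line.split()[0]))
    return pids


print(matching_pids(JCMD_LINE, REGEX_FORM))    # [] : the backslash is compared literally
print(matching_pids(JCMD_LINE, LITERAL_FORM))  # [12345]
print(bool(re.search(REGEX_FORM, JCMD_LINE)))  # True : the escaped form only works as a regex
```

An empty pid list is not an error from the signal sender's point of view, which is why the fault could be a silent no-op.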

Fix by passing the literal main-class name (kafka.Kafka) so the
signal reaches the broker, and verify the fault actually took
effect: assert the broker JVM reaches ps state T (stopped) during
the pause and returns to running after SIGCONT, so any future
no-op fails loudly instead of silently exercising nothing.
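
A minimal sketch of this kind of verification follows. It assumes `ps -o state= -p <pid>` is available and run locally; the real system test would run the command on the broker node through its own helpers. All function names here are hypothetical and are not the ones added by this PR.

```python
import subprocess
import time
from typing import Callable


def ps_state(pid: int) -> str:
    """Return the ps state code for pid (e.g. 'S', 'R', 'T'), or '' if it is gone."""
    result = subprocess.run(
        ["ps", "-o", "state=", "-p", str(pid)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def is_stopped(state: str) -> bool:
    """True when the state code reports a job-control stop (leading 'T')."""
    return state.startswith("T")


def wait_until(condition: Callable[[], bool], timeout_sec: float,
               backoff_sec: float = 0.5) -> bool:
    """Poll `condition` until it holds or `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)


def assert_stopped(pid: int, timeout_sec: float = 30.0) -> None:
    """Fail loudly if the process never reaches state T during the pause."""
    if not wait_until(lambda: is_stopped(ps_state(pid)), timeout_sec):
        raise AssertionError(
            "pid %d never reached state T; SIGSTOP was likely a no-op" % pid)


def assert_resumed(pid: int, timeout_sec: float = 30.0) -> None:
    """Fail if the process is gone or still stopped after SIGCONT."""
    def running() -> bool:
        state = ps_state(pid)
        return state != "" and not is_stopped(state)

    if not wait_until(running, timeout_sec):
        raise AssertionError("pid %d did not resume after SIGCONT" % pid)
```

Calling `assert_stopped(pid)` while the fault is active and `assert_resumed(pid)` after it ends turns a fault that reaches zero pids into a test failure.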

Copilot AI left a comment

Pull request overview

This PR hardens the inkless classic→diskless topic switch system test to reduce false negatives by (1) making the “consume exact count” path resilient to temporarily slow diskless fetch tails and (2) ensuring the SIGSTOP-based leader fault injection actually stops (and resumes) the broker JVM.

Changes:

  • Increase/parameterize console-consumer idle timeout for exact-count reads and adjust completion logic in _consume_all_from_beginning.
  • Fix Trogdor SIGSTOP targeting by using a literal jcmd -l match string for the broker process.
  • Add verification helpers to assert the broker actually enters stopped (ps state T) and later resumes after SIGCONT in the sigstop scenario.

Comment on lines +435 to +438

# For an exact-count read, keep the consumer alive across a slow diskless
# tail; a short idle timeout is fine when the caller only wants a minimum.
consumer_idle_ms = (self.CONSUME_COMPLETION_IDLE_SEC * 1000
                    if wait_for_completion else 30000)

Comment on lines +460 to 466

# Done as soon as every expected record has been delivered.
if len(consumer.messages_consumed[1]) >= expected_count:
    return True
# The consumer drained and exited on its own short of the expected
# count: stop waiting so the caller sees the shortfall (genuine data
# loss) instead of blocking until timeout_sec.
return consumer_seen_alive[0] and not is_alive
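
The completion check in the second hunk can be read as a small pure function. The sketch below restates it under that assumption. The name `consume_finished` and its flat parameters are hypothetical; the real code reads these values from the consumer object and a closure.

```python
def consume_finished(consumed_count: int, expected_count: int,
                     seen_alive: bool, is_alive: bool) -> bool:
    """Decide whether the wait loop should stop polling the consumer."""
    # Done as soon as every expected record has been delivered.
    if consumed_count >= expected_count:
        return True
    # The consumer started, then exited short of the expected count:
    # stop waiting so the caller can report the shortfall.
    return seen_alive and not is_alive


# Still running and short of the target: keep waiting.
print(consume_finished(90, 100, seen_alive=True, is_alive=True))    # False
# Exited short of the target: stop, and let the caller flag the shortfall.
print(consume_finished(90, 100, seen_alive=True, is_alive=False))   # True
# Not started yet: a dead-looking consumer is not treated as finished.
print(consume_finished(0, 100, seen_alive=False, is_alive=False))   # False
```

A `True` result only means the wait is over. The caller still has to compare the consumed count with the expected count to tell success from data loss.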