-
Notifications
You must be signed in to change notification settings - Fork 95
fix(conductor): increase retries in case ingress restarts #2270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Increased retry backoff max delay from 10s to 5 minutes for fetching sequencer commits and validators to handle ingress/sequencer restarts better.
- Increased
max_delayfrom 10s to 300s (5 minutes) infetch_commit_with_retryandfetch_validators_with_retry - Added debug log when non-retryable errors are encountered (but with incorrect syntax)
- Uses exponential backoff starting at 100ms, doubling until reaching the 5-minute cap
- Only retries on HTTP/Timeout errors; other errors break immediately
Confidence Score: 3/5
- This PR has a syntax error that will prevent logging from working correctly
- The retry logic change is sensible and addresses a real operational issue. However, there's a syntax error in the new debug log that needs to be fixed. Additionally, the hard-coded 300s value should ideally be configurable per the custom instructions, though this is a style concern rather than a blocker
- Pay close attention to
crates/astria-conductor/src/celestia/verify.rs- fix the debug log syntax error before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| crates/astria-conductor/src/celestia/verify.rs | 3/5 | Increased retry max delay from 10s to 5m and added debug log with incorrect field syntax |
Sequence Diagram
sequenceDiagram
participant Conductor
participant CometBftRetryStrategy
participant SequencerClient
participant Sequencer
Conductor->>SequencerClient: fetch_commit_with_retry(height)
activate SequencerClient
loop Retry with exponential backoff (max 5 min)
SequencerClient->>Sequencer: commit(height)
alt Retryable Error (HTTP/Timeout)
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Double delay (exponential backoff)
Note over CometBftRetryStrategy: Max delay capped at 300s (5min)<br/>Previously 10s
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Delay
else Non-retryable Error
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Log debug message
Note over CometBftRetryStrategy: New debug log added
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Break
SequencerClient-->>Conductor: Return Error
else Success
Sequencer-->>SequencerClient: Commit Response
SequencerClient-->>Conductor: Return Commit
end
end
deactivate SequencerClient
Conductor->>SequencerClient: fetch_validators_with_retry(height)
activate SequencerClient
loop Retry with exponential backoff (max 5 min)
SequencerClient->>Sequencer: validators(height)
alt Retryable Error (HTTP/Timeout)
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Double delay (exponential backoff)
Note over CometBftRetryStrategy: Max delay capped at 300s (5min)<br/>Previously 10s
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Delay
else Non-retryable Error
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Log debug message
Note over CometBftRetryStrategy: New debug log added
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Break
SequencerClient-->>Conductor: Return Error
else Success
Sequencer-->>SequencerClient: Validators Response
SequencerClient-->>Conductor: Return Validators
end
end
deactivate SequencerClient
1 file reviewed, 3 comments
| let retry_config = RetryFutureConfig::new(u32::MAX) | ||
| .custom_backoff(CometBftRetryStrategy::new(Duration::from_millis(100))) | ||
| .max_delay(Duration::from_secs(10)) | ||
| .max_delay(Duration::from_secs(300)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have XXX: This should probably be configurable comments
Context Used: Rule from dashboard - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/astria-conductor/src/celestia/verify.rs
Line: 298:298
Comment:
**style:** hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have `XXX: This should probably be configurable` comments
**Context Used:** Rule from `dashboard` - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... ([source](https://app.greptile.com/review/custom-context?memory=3a7f89a9-1179-4b2d-b733-e3acbebacdde))
How can I resolve this? If you propose a fix, please make it concise.| let retry_config = RetryFutureConfig::new(u32::MAX) | ||
| .custom_backoff(CometBftRetryStrategy::new(Duration::from_millis(100))) | ||
| .max_delay(Duration::from_secs(10)) | ||
| .max_delay(Duration::from_secs(300)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have XXX: This should probably be configurable comments
Context Used: Rule from dashboard - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/astria-conductor/src/celestia/verify.rs
Line: 333:333
Comment:
**style:** hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have `XXX: This should probably be configurable` comments
**Context Used:** Rule from `dashboard` - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... ([source](https://app.greptile.com/review/custom-context?memory=3a7f89a9-1179-4b2d-b733-e3acbebacdde))
How can I resolve this? If you propose a fix, please make it concise.
Summary
Gives conductor a larger backup when attempting to fetch a block commit and validator set, and adds a debug log so we can see if getting errors which are not retried through logs.
Background
When a sequencer restarts or ingress has a bounce, we tend to get lagging firm commits because we can't fetch the sequencer information and conductor drops the block from its queue. This change increases the amount of time it will go, noting the retries here are on an exponential backoff.
Changes