fix Redis LiveStore #827
Conversation
Walkthrough

Refactors tree signing session coordination from global state to per-round maps with per-round cancellable contexts and RWMutex protection; adds context-aware polling watchers and lifecycle controls to the confirmation sessions store, introducing cancellable context management, safer channel signaling, and guarded resets.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant TreeMgr as TreeSigningManager
    participant Round as Per-Round State
    participant Redis
    participant Watcher
    participant ConfMgr as ConfirmationStore
    participant ConfWatcher
    Client->>TreeMgr: StartRound(roundId)
    activate TreeMgr
    TreeMgr->>Round: create ctxs[roundId], nonceChs[roundId], sigsChs[roundId]
    TreeMgr->>Watcher: start watchNoncesCollected(ctxs[roundId])
    TreeMgr->>Watcher: start watchSigsCollected(ctxs[roundId])
    deactivate TreeMgr
    par Per-round watchers
        loop poll until threshold or ctx cancel
            Watcher->>Redis: GET nonce/sig state
            Redis-->>Watcher: current counts
            alt threshold met
                Watcher->>Round: send on nonceChs/sigsChs (safe send)
                Watcher-->Watcher: return
            else ctx cancelled
                Watcher-->Watcher: exit
            end
        end
    end
    Client->>ConfMgr: NewConfirmationStore()
    activate ConfMgr
    ConfMgr->>ConfWatcher: start watchSessionCompletion(ctx)
    deactivate ConfMgr
    loop confirmation polling
        ConfWatcher->>Redis: poll session state
        Redis-->>ConfWatcher: state
        alt completed
            ConfWatcher->>ConfMgr: safe send on sessionCompleteCh (once)
        else ctx cancelled
            ConfWatcher-->ConfWatcher: exit
        end
    end
```
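For orientation, here is a minimal Go sketch of the per-round state the walkthrough describes. The field names (`lock`, `nonceChs`, `sigsChs`, `ctxs`, `rdb`, `pollInterval`) are taken from the review comments below; the package name, client type, and exact layout are assumptions, not the PR's actual code.

```go
package redislivestore // package name assumed

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// Sketch only: per-round coordination state as described in the walkthrough.
// The real struct in tree_signing_session.go may differ in layout and types.
type treeSigningSessionsStore struct {
	rdb          *redis.Client // Redis client used by the watchers (type assumed)
	pollInterval time.Duration // polling cadence for the watcher goroutines

	lock     sync.RWMutex
	nonceChs map[string]chan struct{}      // roundId -> signaled once all nonces are collected
	sigsChs  map[string]chan struct{}      // roundId -> signaled once all signatures are collected
	ctxs     map[string]context.CancelFunc // roundId -> cancels that round's watcher goroutines
}
```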
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
🧹 Nitpick comments (4)
internal/infrastructure/live-store/redis/tree_signing_session.go (4)
26-33: Per‑round maps + RWMutex look correct, but rely on strict `New`/`Delete` discipline

The move to `lock` + `{nonce,sigs}Chs`/`ctxs` maps is sound and removes the previous global‑state coupling; map access is properly guarded with `RWMutex`. However, this design assumes:

- Every `New(roundId, …)` is eventually paired with `Delete(roundId)`; otherwise the maps (and cancel funcs/channels) grow without bound.
- `New` is not called twice for the same `roundId` without an intervening `Delete`; currently a second `New` overwrites `ctxs[roundId]` and the channels without cancelling/closing the previous ones, which is a leak and can leave stale watchers around.

Consider at least:

- Documenting these invariants at the interface level, and/or
- Adding a defensive check in `New` that logs (or even rejects) if a session already exists for `roundId` before overwriting (see the sketch below).

Also applies to: 38-41
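A minimal sketch of such a defensive check, written as a helper that `New` could call first. It assumes the `ctxs`/`nonceChs`/`sigsChs` maps and `lock` shown in this file, and whether to log-and-replace (as here) or reject outright is left open; the helper name is hypothetical.

```go
// Sketch only: tear down any previous session for roundId before New installs
// fresh state, so stale watchers and channels are not silently leaked.
func (s *treeSigningSessionsStore) teardownIfExists(roundId string) {
	s.lock.Lock()
	defer s.lock.Unlock()

	cancelPrev, ok := s.ctxs[roundId]
	if !ok {
		return
	}
	log.Warnf("New called twice for roundId %s; tearing down previous session", roundId)
	cancelPrev()               // stop the stale watcher goroutines
	close(s.nonceChs[roundId]) // unblock any waiters still parked on the old channels
	close(s.sigsChs[roundId])
	delete(s.ctxs, roundId)
	delete(s.nonceChs, roundId)
	delete(s.sigsChs, roundId)
}
```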
45-49: Lock scope in `New` is broader than needed and includes Redis I/O

`New` takes `s.lock.Lock()` around:

- Redis writes (`HSet`), and
- Per‑round map setup + watcher goroutine spawning.

Only the map mutations and the hand‑off of per‑round channels/ctxs actually require the mutex; holding the lock across `HSet` can unnecessarily serialize creation of independent sessions and block `Delete`/`NoncesCollected`/`SignaturesCollected` for other rounds if Redis is slow. You could narrow the critical section roughly like:

```diff
 func (s *treeSigningSessionsStore) New(
 	roundId string, uniqueSignersPubKeys map[string]struct{},
 ) *ports.MusigSigningSession {
-	s.lock.Lock()
-	defer s.lock.Unlock()
-
-	ctx := context.Background()
+	ctx := context.Background()

 	metaKey := fmt.Sprintf(treeSessMetaKeyFmt, roundId)
 	cosignersBytes, _ := json.Marshal(uniqueSignersPubKeys)
 	meta := map[string]interface{}{
 		"Cosigners":   cosignersBytes,
 		"NbCosigners": len(uniqueSignersPubKeys) + 1, // operator included
 	}
 	s.rdb.HSet(ctx, metaKey, meta)

-	watchCtx, cancel := context.WithCancel(context.Background())
-	s.ctxs[roundId] = cancel
-	s.nonceChs[roundId] = make(chan struct{})
-	s.sigsChs[roundId] = make(chan struct{})
-
-	go s.watchNoncesCollected(watchCtx, roundId)
-	go s.watchSigsCollected(watchCtx, roundId)
+	watchCtx, cancel := context.WithCancel(context.Background())
+	s.lock.Lock()
+	s.ctxs[roundId] = cancel
+	s.nonceChs[roundId] = make(chan struct{})
+	s.sigsChs[roundId] = make(chan struct{})
+	s.lock.Unlock()
+
+	go s.watchNoncesCollected(watchCtx, roundId)
+	go s.watchSigsCollected(watchCtx, roundId)
```

This keeps correctness while avoiding holding the lock during Redis I/O.

Also applies to: 60-66
127-149: Delete semantics are safe, but round cleanup depends entirely on callers

The `Delete` implementation looks race‑safe:

- You take the write lock while cancelling the ctx and closing/removing per‑round channels, so map access is well‑synchronized.
- Closing channels under the lock and accessing them via `RLock` in watchers avoids concurrent map writes.

Two behavioral points to be aware of:

- If `Delete(roundId)` is never called for a completed/aborted round, the per‑round ctx cancel func and channels remain in the maps indefinitely. Over time this is a memory/leak risk on a long‑lived node.
- If `Delete` is called before the watcher has signaled, it will:
  - Cancel the watcher via ctx, causing it to return without signaling, and
  - Close the per‑round channel; any goroutine blocked on `<-NoncesCollected(roundId)` / `<-SignaturesCollected(roundId)` will then unblock due to close rather than an explicit send.

Both behaviors may be acceptable, but they are observable. I’d recommend:

- Explicitly documenting that `Delete` both tears down the watcher and “unblocks waiters” via channel close, and
- Ensuring call‑sites always pair `New` with `Delete` once the session is done (see the caller sketch below).
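A caller-side sketch of the recommended pairing. The interface below is a simplified stand-in for `ports.TreeSigningSessionsStore` (method signatures trimmed), and `runRound` plus the timeout value are hypothetical, not the actual service.go code.

```go
package example

import (
	"fmt"
	"time"
)

// Simplified stand-in for the methods discussed in this review; the real
// interface is ports.TreeSigningSessionsStore in internal/core/ports/live_store.go.
type signingSessions interface {
	Delete(roundId string)
	NoncesCollected(roundId string) <-chan struct{}
	SignaturesCollected(roundId string) <-chan struct{}
}

// runRound is hypothetical: New(roundId, ...) is assumed to have been called
// just before, and Delete is deferred so the per-round ctx, channels, and
// watcher goroutines never outlive the round, even on early returns.
func runRound(store signingSessions, roundId string) error {
	defer store.Delete(roundId)

	select {
	case <-store.NoncesCollected(roundId):
		// all nonces collected; proceed to nonce aggregation and signing
	case <-time.After(30 * time.Second): // illustrative timeout
		return fmt.Errorf("round %s: timed out waiting for nonces", roundId)
	}
	return nil
}
```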
186-231: Watcher loops busy‑poll Redis and ignore some errors; use `pollInterval` and improve logging

Both `watchNoncesCollected` and `watchSigsCollected`:

- Run a tight `for` loop with no sleep/backoff; on “not ready yet” they just `continue`, effectively hammering Redis and burning CPU.
- Treat `HGetAll` errors the same as “missing meta” (`len(meta) == 0`) and just spin without logging or backoff.
- Log parse errors and recovered panics without including `roundId`, which makes multi‑round debugging harder.
- Duplicate almost identical logic between the two functions.

Given `pollInterval` already exists, you can introduce inexpensive backoff and better error logging with a small change. For example, in `watchNoncesCollected`:

```diff
 func (s *treeSigningSessionsStore) watchNoncesCollected(ctx context.Context, roundId string) {
 	metaKey := fmt.Sprintf(treeSessMetaKeyFmt, roundId)
 	noncesKey := fmt.Sprintf(treeSessNoncesKeyFmt, roundId)

 	for {
 		select {
 		case <-ctx.Done():
 			return
 		default:
-			meta, err := s.rdb.HGetAll(ctx, metaKey).Result()
-			if err != nil || len(meta) == 0 {
-				continue
-			}
+			meta, err := s.rdb.HGetAll(ctx, metaKey).Result()
+			if err != nil {
+				log.Warnf("watchNoncesCollected: failed to fetch meta for round %s: %v", roundId, err)
+				time.Sleep(s.pollInterval)
+				continue
+			}
+			if len(meta) == 0 {
+				time.Sleep(s.pollInterval)
+				continue
+			}

 			nbCosigners := 0
 			if _, err := fmt.Sscanf(meta["NbCosigners"], "%d", &nbCosigners); err != nil {
-				log.Warnf("watchNoncesCollected:failed to parse NbCosigners: %v", err)
+				log.Warnf("watchNoncesCollected: failed to parse NbCosigners for round %s: %v", roundId, err)
+				time.Sleep(s.pollInterval)
 				continue
 			}

 			noncesMap, _ := s.rdb.HGetAll(ctx, noncesKey).Result()
 			if len(noncesMap) == nbCosigners-1 {
 				// unchanged send logic...
 				return
 			}
+			time.Sleep(s.pollInterval)
 		}
 	}
 }
```

And mirror the same changes (including `roundId` in log messages and a `time.Sleep(s.pollInterval)` on “not ready / error” paths) in `watchSigsCollected`. Optionally, you could also factor the common watcher logic into a shared helper that takes the key suffix and log prefix, to avoid duplication (a sketch follows below).

Also applies to: 233-278
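A hedged sketch of that shared helper, reusing the key-format constant and fields shown in the diff above (`treeSessMetaKeyFmt`, `rdb`, `pollInterval`); the helper name and parameters (`watchThresholdReached`, `keyFmt`, `expected`, `notify`) are illustrative, not the PR's actual code, and the file's existing imports are assumed.

```go
// Sketch only: a shared polling helper both watchers could delegate to,
// parameterized by the per-round data key and the threshold computation.
func (s *treeSigningSessionsStore) watchThresholdReached(
	ctx context.Context,
	roundId string,
	logPrefix string,
	keyFmt string, // e.g. treeSessNoncesKeyFmt (or its signatures counterpart)
	expected func(nbCosigners int) int, // how many entries count as "done"
	notify func(), // performs the safe send on the per-round channel
) {
	metaKey := fmt.Sprintf(treeSessMetaKeyFmt, roundId)
	dataKey := fmt.Sprintf(keyFmt, roundId)

	ticker := time.NewTicker(s.pollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			meta, err := s.rdb.HGetAll(ctx, metaKey).Result()
			if err != nil || len(meta) == 0 {
				if err != nil {
					log.Warnf("%s: failed to fetch meta for round %s: %v", logPrefix, roundId, err)
				}
				continue
			}
			nbCosigners := 0
			if _, err := fmt.Sscanf(meta["NbCosigners"], "%d", &nbCosigners); err != nil {
				log.Warnf("%s: failed to parse NbCosigners for round %s: %v", logPrefix, roundId, err)
				continue
			}
			entries, _ := s.rdb.HGetAll(ctx, dataKey).Result()
			if len(entries) == expected(nbCosigners) {
				notify()
				return
			}
		}
	}
}
```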
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/infrastructure/live-store/redis/tree_signing_session.go (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/infrastructure/live-store/redis/tree_signing_session.go (1)
internal/core/ports/live_store.go (2)
TreeSigningSessionsStore (70-80), MusigSigningSession (104-110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: unit tests
- GitHub Check: integration tests
- GitHub Check: Build and Scan
```diff
 func (s *treeSigningSessionsStore) NoncesCollected(roundId string) <-chan struct{} {
-	return s.nonceCh
+	s.lock.RLock()
+	defer s.lock.RUnlock()
+	return s.nonceChs[roundId]
 }

 func (s *treeSigningSessionsStore) SignaturesCollected(roundId string) <-chan struct{} {
-	return s.sigsCh
+	s.lock.RLock()
+	defer s.lock.RUnlock()
+	return s.sigsChs[roundId]
 }
```
🧩 Analysis chain
NoncesCollected / SignaturesCollected can return nil channels
Both access the maps under RLock and return the channel value directly:
```go
s.lock.RLock()
defer s.lock.RUnlock()
return s.nonceChs[roundId]
```

If `roundId` is unknown (no `New` yet, or already `Delete`d), this returns a nil channel. Any caller doing a bare `<-store.NoncesCollected(id)` / `<-store.SignaturesCollected(id)` will then block forever, which is a change from the previous “single global channel” behavior.
At minimum, consider:
```diff
 func (s *treeSigningSessionsStore) NoncesCollected(roundId string) <-chan struct{} {
-	s.lock.RLock()
-	defer s.lock.RUnlock()
-	return s.nonceChs[roundId]
+	s.lock.RLock()
+	ch, ok := s.nonceChs[roundId]
+	s.lock.RUnlock()
+	if !ok {
+		log.Warnf("NoncesCollected called for unknown roundId %s", roundId)
+	}
+	return ch
 }
```

(and the same for `SignaturesCollected`).
Please also verify that all call‑sites only call these methods after a successful New(roundId, …), and that they handle the possibility of nil appropriately.
I'll verify the review comment by examining the code implementation and all call sites. Let me gather the necessary information.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the file and examine its structure
fd -t f "tree_signing_session.go" -x cat {}
```

Length of output: 11190
🏁 Script executed:

```shell
#!/bin/bash
# Find all call sites of NoncesCollected and SignaturesCollected
rg -n "NoncesCollected|SignaturesCollected" --type go -B2 -A2
```

Length of output: 5938

🏁 Script executed:

```shell
#!/bin/bash
# Search for the New() and Delete() methods to understand lifecycle
rg -n "func.*New|func.*Delete" "internal/infrastructure/live-store/redis/tree_signing_session.go" -A10
```

Length of output: 1204

🏁 Script executed:

```shell
#!/bin/bash
# Check how channels are used in select statements or other patterns
rg -n "<-.*NoncesCollected|<-.*SignaturesCollected" --type go -B2 -A2
```

Length of output: 983

🏁 Script executed:

```shell
#!/bin/bash
# Look for nil channel handling patterns in the codebase
rg -n "if.*nil" "internal/infrastructure/live-store/redis/tree_signing_session.go" -B2 -A2
```

Length of output: 2684
I need to examine the context around the call sites in service.go to verify whether New() is guaranteed to be called first.
Now let me examine the exact context of how these methods are called in service.go:
Let me examine the actual service.go code context around those call sites:
🏁 Script executed:

```shell
#!/bin/bash
# Get the context around the NoncesCollected call in service.go
sed -n '2500,2530p' internal/core/application/service.go
```

Length of output: 1155

🏁 Script executed:

```shell
#!/bin/bash
# Get the context around the SignaturesCollected call in service.go
sed -n '2580,2610p' internal/core/application/service.go
```

Length of output: 1279

🏁 Script executed:

```shell
#!/bin/bash
# Check if there's a Get() call before NoncesCollected to verify session exists
sed -n '2500,2530p' internal/core/application/service.go | grep -A5 -B5 "Get\|New"
```

Length of output: 851

🏁 Script executed:

```shell
#!/bin/bash
# Look for the entire select block for nonces
sed -n '2480,2550p' internal/core/application/service.go
```

Length of output: 2341

🏁 Script executed:

```shell
#!/bin/bash
# Look for the entire select block for sigs
sed -n '2550,2620p' internal/core/application/service.go
```

Length of output: 2294
Add nil checks to NoncesCollected and SignaturesCollected with explicit error logging
The nil channel issue is confirmed: when roundId is unknown, both methods return a nil channel from the map. In the Redis implementation, this causes the select statement to ignore the channel case and fire the timeout instead, resulting in a misleading "signing session timed out" error rather than an explicit "session not found" message. While the timeout protection prevents actual hangs, the confusing error message masks the real problem.
Call sites in service.go (lines 2486–2519 and 2486–2592) do call New(roundId, ...) before accessing these methods in normal flow, so the happy path is protected. However, the defensive approach suggested in the review is sound: adding explicit nil checks and logging improves clarity and error diagnostics.
Additionally, note that the in-memory implementation (internal/infrastructure/live-store/inmemory/tree_signing_session.go lines 119–125) lacks locks entirely on these methods, creating a race condition when reading from the maps—a more serious issue than the nil return problem.
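To make the mechanism described above concrete, here is a small self-contained Go sketch (hypothetical, not the service.go code): a receive from a nil channel blocks forever, so in a select the timeout case fires instead.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A nil channel: receives block forever, so in a select it is simply
	// never chosen and another ready case (here, the timeout) wins.
	var nonces chan struct{} // what NoncesCollected returns for an unknown roundId

	select {
	case <-nonces:
		fmt.Println("nonces collected") // never reached for a nil channel
	case <-time.After(100 * time.Millisecond): // illustrative timeout
		fmt.Println("signing session timed out") // the misleading error surfaces here
	}
}
```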
application/service.go always calls New(roundId) before getting the channels
@louisinger, you're absolutely right that application/service.go always calls New(roundId) before accessing the channels in the current implementation, so the happy path is protected.
The suggestion for nil checks was more about defensive programming—making the code more resilient to future changes or unexpected call patterns, and providing clearer error messages if something goes wrong (e.g., "session not found" vs. a timeout). However, since you have control over the call flow and it's working correctly, the defensive checks are optional.
Feel free to keep the implementation as-is if you're confident about the call ordering! 👍
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/infrastructure/live-store/redis/confirmation_sessions.go (1)
191-212: Consider simplifying the signaling logic.

The nested selects and panic recovery add complexity:

- The outer select (line 192) and inner select (line 205) both check `ctx.Done()`, which is redundant protection against a TOCTOU issue that the current lock pattern doesn't fully prevent anyway.
- The panic recovery (lines 197-204) guards against nil channel sends, but the nil check at line 190 already covers this.

While this defensive coding isn't harmful, consider simplifying to improve maintainability:

```go
if ch != nil {
	chOnce.Do(func() {
		select {
		case ch <- struct{}{}:
		case <-ctx.Done():
		}
	})
}
return
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/infrastructure/live-store/redis/confirmation_sessions.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/infrastructure/live-store/redis/confirmation_sessions.go (1)
internal/core/ports/live_store.go (1)
ConfirmationSessionsStore (61-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build and Scan
- GitHub Check: unit tests
- GitHub Check: integration tests
🔇 Additional comments (4)
internal/infrastructure/live-store/redis/confirmation_sessions.go (4)
49-63: Constructor setup looks good. The cancellable context and watcher initialization follow good practices for lifecycle management.
171-175: LGTM! Proper use of read lock to safely return the channel.
184-185: Use `context.Background()` for Redis operations.

The Redis `Get()` calls use the watcher's cancellation context `ctx`. If `Reset()` cancels this context while a Get operation is in progress, the operation could fail prematurely with a context cancellation error, even though the Redis operation itself is still valid. Redis operations should use a separate context (e.g., `context.Background()` or a context with timeout) rather than the watcher's lifecycle context (see the sketch below).

This is addressed in the diff in the previous comment.
⛔ Skipped due to learnings
Learnt from: louisinger
Repo: arkade-os/arkd PR: 686
File: internal/core/application/fraud.go:47-61
Timestamp: 2025-08-28T08:21:01.170Z
Learning: In reactToFraud function in internal/core/application/fraud.go, the goroutine that waits for confirmation and schedules checkpoint sweep should use context.Background() instead of the request context, as this is intentional design to decouple the checkpoint sweep scheduling from the request lifetime.
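A minimal sketch of the decoupled-context pattern, assuming go-redis/v9, the `confirmationNumIntentsKey` / `confirmationNumConfirmedKey` keys quoted later in this review, and the store's `rdb` field; the receiver type name, helper name, and timeout are illustrative.

```go
// Sketch only: read the counters under a short, independent context so a
// watcher cancellation (Reset) cannot abort an otherwise valid Redis read.
// Assumes the file's existing imports and fields.
func (s *confirmationSessionsStore) readCounts() (numIntents, numConfirmed int, err error) {
	readCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second) // timeout is illustrative
	defer cancel()

	numIntents, err = s.rdb.Get(readCtx, confirmationNumIntentsKey).Int()
	if err != nil {
		return 0, 0, err
	}
	numConfirmed, err = s.rdb.Get(readCtx, confirmationNumConfirmedKey).Int()
	if err != nil {
		return 0, 0, err
	}
	return numIntents, numConfirmed, nil
}
```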
142-163: Line number mismatch in original review; however, the race condition is valid but likely benign.

The review references signaling logic at "lines 191-212" but the actual `watchSessionCompletion` spans lines 177-208. The concern itself is correct: the old watcher can pass the `ctx.Done()` check (line 181) and hold a stale reference to `sessionCompleteCh` (read at line 188 under lock, then released at line 189). Between lock release and the nested select at lines 205-207, `Reset()` can cancel the context and create a new channel. However, the old watcher's send to the old channel is likely safe because:

- Anyone calling `SessionCompleted()` before `Reset()` gets the old channel reference and should receive its completion signal
- New callers of `SessionCompleted()` after `Reset()` get the new channel
- The multiple `ctx.Done()` checks (lines 192, 206) provide layered protection

The vulnerability remains: if the old watcher hasn't exited by the time the new session processes confirmations, stale Redis state (from before reset) could cause the old watcher to emit a completion signal while the new session is active. Verify this scenario won't cause issues in your usage pattern; specifically, confirm that old channel listeners from before `Reset()` won't interfere with the new session's logic.
Actionable comments posted: 0
♻️ Duplicate comments (1)
internal/infrastructure/live-store/redis/confirmation_sessions.go (1)
177-219: Watcher loop now uses `pollInterval`; consider minor context and error‑handling tweaks.

The reworked `watchSessionCompletion` correctly:

- Uses a `time.NewTicker(s.pollInterval)` instead of busy-waiting,
- Is cancelable via `ctx.Done()`,
- Uses `RWMutex` to safely snapshot `sessionCompleteCh`,
- Guards the notification with `sync.Once` and panic recovery.

A few potential improvements:

1. Error handling on Redis reads. Both `Get` calls ignore errors:

```go
numIntents, _ := s.rdb.Get(ctx, confirmationNumIntentsKey).Int()
numConfirmed, _ := s.rdb.Get(ctx, confirmationNumConfirmedKey).Int()
```

This can mask Redis connectivity or serialization issues and silently stall completion detection. Even a lightweight `if err != nil { log.Warnf(...); continue }` would improve observability.

```shell
# Check for other places where Redis errors are dropped in this package.
rg -n "_, _ := s\.rdb\." internal/infrastructure/live-store/redis -C2
```

2. Context choice for Redis `Get` calls. You’re currently using the watcher’s cancelable `ctx` for the `Get` operations. A previous review suggested using `context.Background()` for these reads so they aren’t aborted by watcher cancellation, relying only on the select around the loop for termination. Whether that’s preferable depends on your desired shutdown behavior (fast cancellation vs. completing the last read). Because this mirrors past feedback, treat it as an optional design choice rather than a bug. (In go-redis/v9, what are the semantics of using a canceled context vs. context.Background() for simple GET operations in long-running polling loops? Are there recommended patterns for shutdown behavior in such watchers?)

3. Docs vs implementation for completion signaling. The header comment talks about “closing the local `sessionCompleteCh` channel when the session is complete”, but the implementation sends a single `struct{}{}` instead of closing. If multiple goroutines may wait on `SessionCompleted()`, closing is a broadcast whereas a single send only wakes one waiter (see the sketch below). Either is valid, but it would be good to align the comment with the actual behavior or vice versa, depending on how call sites are written.

Overall, the loop changes themselves are a solid improvement over the previous busy-waiting implementation.
Based on learnings
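To make the broadcast-vs-single-send distinction concrete, a small self-contained sketch (hypothetical, not the store's code): closing a channel wakes every waiter, while one send wakes exactly one.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	done := make(chan struct{})
	var wg sync.WaitGroup

	// Three waiters, e.g. multiple goroutines blocked on SessionCompleted().
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-done // a receive on a closed channel returns immediately
			fmt.Printf("waiter %d woke up\n", id)
		}(i)
	}

	// close(done) is a broadcast: all three waiters unblock.
	// A single `done <- struct{}{}` would instead wake exactly one waiter
	// and leave the other two blocked.
	close(done)
	wg.Wait()
}
```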
🧹 Nitpick comments (2)
internal/infrastructure/live-store/redis/confirmation_sessions.go (2)
39-47: New fields look appropriate; `ctx` field currently unused outside Reset/New.

`lock`, `cancel`, and `pollInterval` make sense for coordinating the watcher and guarding `sessionCompleteCh`. The stored `ctx` field, however, is not read anywhere in this file; if it is not used by other packages via unsafe reflection/type assertions, you could drop it to reduce state surface area.
142-163: Race condition is confirmed, but callers have defensive timeout handling; refactoring suggestions remain valid but not critical.

The race condition you identified is real: if `Reset()` is called while `watchSessionCompletion()` is in the nested select (lines 208–212), the `ctx.Done()` case can win before the signal send, causing the completion to go unnoticed. However, examination of the callers shows defensive handling: both the test (lines 315–318) and the application (service.go ~2240) use select statements with timeout fallbacks. If the completion signal is missed, the timeout fires and execution continues, making this a timing inefficiency rather than a correctness failure.

The suggested refactors (delaying `cancel()`, using generation IDs in Redis, sketched below) are sound improvements that eliminate the race window, but the current implementation is safe in practice due to these defensive timeouts.
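A hedged, standalone sketch of the generation-ID idea: a counter bumped on every reset lets a watcher started before `Reset` detect that it is stale and refrain from signaling the new session. All names and the trimmed struct are hypothetical; the real store has more fields and its watcher polls Redis.

```go
package example

import (
	"context"
	"sync"
)

// Sketch only: trimmed store with a generation counter guarding the signal.
type confirmationSessions struct {
	lock              sync.RWMutex
	generation        uint64
	sessionCompleteCh chan struct{}
	cancel            context.CancelFunc
}

func (s *confirmationSessions) Reset() {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.cancel != nil {
		s.cancel() // stop the previous watcher
	}
	s.generation++                            // invalidates any in-flight completion signal
	s.sessionCompleteCh = make(chan struct{}) // fresh channel for the new session
	ctx, cancel := context.WithCancel(context.Background())
	s.cancel = cancel
	go s.watch(ctx, s.generation, s.sessionCompleteCh)
}

// watch would poll Redis; once it observes completion it re-checks the
// generation so a Reset that raced with the poll suppresses the stale signal.
func (s *confirmationSessions) watch(ctx context.Context, gen uint64, ch chan struct{}) {
	// ... polling elided in this sketch ...
	s.lock.RLock()
	stale := gen != s.generation
	s.lock.RUnlock()
	if stale {
		return
	}
	select {
	case ch <- struct{}{}:
	case <-ctx.Done():
	}
}
```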
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/infrastructure/live-store/redis/confirmation_sessions.go (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-28T08:21:01.170Z
Learnt from: louisinger
Repo: arkade-os/arkd PR: 686
File: internal/core/application/fraud.go:47-61
Timestamp: 2025-08-28T08:21:01.170Z
Learning: In reactToFraud function in internal/core/application/fraud.go, the goroutine that waits for confirmation and schedules checkpoint sweep should use context.Background() instead of the request context, as this is intentional design to decouple the checkpoint sweep scheduling from the request lifetime.
Applied to files:
internal/infrastructure/live-store/redis/confirmation_sessions.go
🧬 Code graph analysis (1)
internal/infrastructure/live-store/redis/confirmation_sessions.go (1)
internal/core/ports/live_store.go (1)
ConfirmationSessionsStore (61-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration tests
- GitHub Check: unit tests
- GitHub Check: Build and Scan
🔇 Additional comments (2)
internal/infrastructure/live-store/redis/confirmation_sessions.go (2)
49-61: Constructor wiring and watcher startup look sound.

Creating a store-level cancellable context from `context.Background()` and starting `watchSessionCompletion(ctx)` here matches the intended decoupling from request lifetimes, and a 100ms default `pollInterval` is a reasonable compromise between latency and load. No functional issues spotted in this segment.
171-175: RWMutex around `SessionCompleted()` accessor is appropriate.

Using `RLock`/`RUnlock` to guard reads of `sessionCompleteCh` ensures that callers won’t observe a torn update while `Reset()` is replacing the channel under a write lock. The returned type is `<-chan struct{}`, so external code cannot close or send on it, which keeps ownership clear.
@altafan @sekulicd please review