Summary
RetryEnd.Register reports success after the underlying connection has been re-established, but the re-registration packets do not always reach the server side's RemoteRegistration delegate. The result is a silent desync: the client believes its RPC is registered, the server has the connection alive, but Servicebound.GetServiceByRPC returns ErrRecordNotFound for the same RPC — until the client process is restarted.
We hit this in production (one geminio service end inside a long-running Go server, talking to singchia/frontier v1.2.3-rc.2). After ~3 days of uptime, a single TCP-level disconnect occurred. The retry layer reconnected and logged "retry client offline and retry succeed", and from that point on frontier rejected all new edges with "service not online" for 2 hours and 29 minutes, until the user ran docker restart on the client.
Environment
- geminio: v1.1.1
- frontier: v1.2.3-rc.2 (using service.NewService(dialer, ...), which goes through newRetryServiceEnd → client.NewRetryEndWithDialer)
- Go: 1.24
- OS: linux/amd64 (manager) + linux/amd64 (frontier), in two separate containers on the same host
Symptom
The orphan probe on our side gave a perfectly clean signal of when the server-side RPC routing table contained the registered RPC. Before the disconnect: server returns record not found (RPC reached our handler, key not in DB). After the supposed-successful reconnect: server returns ErrServiceNotOnline (frontier's translation of apis.ErrRecordNotFound from Servicebound.GetServiceByRPC).
2026-05-01 00:00:04 record not found ← handler reachable
2026-05-01 08:58:25 service not online ← handler gone (disconnect)
... 2h 29m ...
2026-05-01 11:27:40 record not found ← handler back (process restart)
The transition times align exactly with frontier's service_onoff events. So:
- 08:58:24 server logs "service offline, serviceID: X"
- 08:58:27 server logs "service online, serviceID: Y" (new ID, the retry-end re-dialed)
- 08:58:28 client logs "retry client offline and retry succeed"
- frontier sees the new connection alive for the whole window after that, but its repo.GetServiceRPC(...) does not have any RPCs for service Y
The server side RemoteRegistration delegate (frontier/pkg/frontier/servicebound/service_manager.go:167) only logs at klog.V(2), so we can't directly see whether it was called for service Y. But the orphan probe makes it conclusive that effectively no RPC was registered for service Y on the server.
Suspected mechanism
client/end_retry.go reinit() flow (paraphrased):
re.retry.Lock()
defer re.retry.Unlock()
if cur != old { return nil }
old.Close()
time.Sleep(3 * time.Second)
new, err := re.getEnd()
if err != nil { return err }
atomic.StorePointer(&re.end, unsafe.Pointer(new)) // ← state is "new connection installed"
time.Sleep(1 * time.Second)
// re-register memorized RPCs
for method, rpc := range re.rpcs {
err = re.register(context.TODO(), method, rpc, false)
if err != nil { return err }
}
re.register ultimately reaches application/rpc.go Register:
sm.writeInCh <- pkt
select {
case event := <-sync.C():
if event.Error != nil { return event.Error }
case <-ctx.Done():
...
}
The sync.C() channel is signalled when a packet with the matching ID is received back. As far as we can tell, this is the protocol-level ack for the REGISTER packet — i.e. "the underlying connection has accepted the packet and the peer has acknowledged it". It does not wait for the server-side RemoteRegistration delegate to actually run and update the server's RPC routing repo.
If the server-side RemoteRegistration callback fails to fire (or fires but repo.CreateServiceRPC fails) for any reason, the client's Register still returns nil. There is no second-channel verification.
We don't have a deterministic reproducer yet, but the production timeline (single disconnect → reconnect → succeed log → 2h 29m of broken auth) is consistent with a one-time loss on the registration path. Possible triggers we suspect:
- The 3 s + 1 s sleep windows in reinit may overlap with peer-side mux/dialogue setup races, and a REGISTER packet sent right after the sleeps lands on a connection where the server has not yet wired up the delegate path.
- Server-side RemoteRegistration is on klog.V(2) only, and the call is spawned via the mux layer; an early connection close on the server side could discard pending packets after the protocol ack but before the delegate runs.
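To make the suspected failure mode concrete, here is a toy model of it. This is only a sketch under our assumptions: toyServer, handleRegister, lookup and dropDelegate are hypothetical names invented for illustration and are not geminio or frontier code. It shows how an ack that is decoupled from the routing-table write lets the client see success while lookups fail.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotOnline mimics the error the caller sees when no route exists.
var errNotOnline = errors.New("service not online")

// toyServer is a hypothetical stand-in for the server side (not frontier code).
type toyServer struct {
	routes       map[string]bool // RPC routing table, keyed by method name
	dropDelegate bool            // simulate the delegate never running after the ack
}

// handleRegister acks as soon as the packet is accepted. The routing-table
// write (the "delegate" step) is separate and may be lost.
func (s *toyServer) handleRegister(method string) error {
	if !s.dropDelegate {
		s.routes[method] = true
	}
	return nil // the protocol-level ack is sent either way
}

// lookup fails when the method never made it into the routing table.
func (s *toyServer) lookup(method string) error {
	if !s.routes[method] {
		return errNotOnline
	}
	return nil
}

func main() {
	s := &toyServer{routes: map[string]bool{}, dropDelegate: true}
	fmt.Println("register error:", s.handleRegister("auth")) // register error: <nil>
	fmt.Println("lookup error:", s.lookup("auth"))           // lookup error: service not online
}
```

In this model the client has no way to tell the two cases apart from the return value of the register call alone, which matches what we observed.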
Suggested fixes (any one would close the silent failure)
1. Server-side ack ordering: have the server only send the REGISTER ack after the delegate's RemoteRegistration callback has returned (and the repo write has succeeded). This is the most invasive but the most correct, because it makes End.Register truly end-to-end.
2. Client-side end-to-end self-test: add an optional verifyRegister mode where, after End.Register returns success, the retry layer triggers a no-op self-call against the just-registered method (e.g. via the local stream layer using a sentinel argument that the server special-cases). On ErrServiceNotOnline, force a reconnect.
3. Client-side periodic re-register: lowest risk, just expose a knob on RetryEndOptions to re-send all memorized registers every N seconds. We're already doing this in our own application code as a workaround, but it logically belongs in the retry layer because the retry layer is the one that owns the "connection just came back up" event.
4. Expose EndReOnline more reliably / document it: the existing delegate.EndReOnline callback fires at the right moment for application-level recovery, but it's not surfaced via service.Service in singchia/frontier's service_end.go, so applications can't easily hook into it without dropping down to geminio.End. A first-class OnReOnline(func()) option on the retry-end would make it routine for callers to add their own re-verification.
Workaround we deployed
Application-level periodic re-register every 30 s, plus an explicit re-register on every retry "succeed". Brings the worst-case bad-state window from 2.5 hours to ≤30 seconds. Postmortem with full timeline is at https://github.com/liaisonio/liaison-cloud/blob/main/spec/incidents/INC-20260501-frontier-rpc-stale-after-reconnect.md (private).
What I can do next
Happy to send a PR for option 3 (the periodic re-register knob) or 4 (the OnReOnline hook) if you'd accept either of those — both are local changes and don't touch the wire protocol. Option 1 needs your call as it's a protocol change.