bug: RetryEnd.Register reports success after reconnect but server-side RPC routing is not updated (silent desync)

### Summary

`RetryEnd.Register` reports success after the underlying connection has been re-established, but the re-registration packets do not always reach the server side's `RemoteRegistration` delegate. The result is a silent desync: the client believes its RPC is registered, the server has the connection alive, but `Servicebound.GetServiceByRPC` returns `ErrRecordNotFound` for the same RPC — until the client process is restarted.

We hit this in production (one geminio service end inside a long-running Go server, talking to `singchia/frontier` v1.2.3-rc.2). After ~3 days of uptime, a single TCP-level disconnect occurred. The retry layer reconnected and logged `retry client offline and retry succeed`, but from that point on frontier rejected all new edges with `service not online` for **2 hours and 29 minutes**, until the client container was restarted with `docker restart`.

### Environment

- geminio: `v1.1.1`
- frontier: `v1.2.3-rc.2` (using `service.NewService(dialer, ...)` which goes through `newRetryServiceEnd` → `client.NewRetryEndWithDialer`)
- Go: 1.24
- OS: linux/amd64 (manager) + linux/amd64 (frontier), in two separate containers on the same host

### Symptom

The orphan probe on our side gave a clean signal of whether the server-side RPC routing table contained the registered RPC at any given moment. Before the disconnect, the probe gets `record not found` (the RPC reached our handler, and the key was simply not in our DB). After the supposedly successful reconnect, it gets `ErrServiceNotOnline` (frontier's translation of `apis.ErrRecordNotFound` from `Servicebound.GetServiceByRPC`).

```
2026-05-01 00:00:04   record not found       ← handler reachable
2026-05-01 08:58:25   service not online     ← handler gone (disconnect)
... 2h 29m ...
2026-05-01 11:27:40   record not found       ← handler back (process restart)
```

The transition times align exactly with frontier's `service_onoff` events. So:

- 08:58:24 server logs `service offline, serviceID: X`
- 08:58:27 server logs `service online, serviceID: Y` (new ID, the retry-end re-dialed)
- 08:58:28 client logs `retry client offline and retry succeed`
- frontier sees the new connection alive the whole window after that, **but its `repo.GetServiceRPC(...)` does not have any RPCs for service Y**

The server-side `RemoteRegistration` delegate (`frontier/pkg/frontier/servicebound/service_manager.go:167`) only logs at `klog.V(2)`, so we can't directly see whether it was called for service Y. The orphan probe nevertheless makes it conclusive that effectively *no RPC was registered for service Y on the server*.
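For reference, the way we read the probe results can be sketched roughly as below. This is a minimal illustration, not our production code: `ProbeState` and `classifyProbe` are hypothetical names, and the only inputs assumed are the two error strings that appear in the log excerpt above.

```go
package main

import (
	"fmt"
	"strings"
)

// ProbeState is what a single orphan-probe result tells us about the
// server-side RPC routing table. Hypothetical type, for illustration only.
type ProbeState int

const (
	HandlerReachable ProbeState = iota // the RPC was routed to our handler
	HandlerGone                        // frontier had no route for the RPC
	Unknown                            // anything else: draw no conclusion
)

// classifyProbe maps the error text seen by the probe onto a ProbeState.
// The two matched strings are the ones from the log excerpt in this issue.
func classifyProbe(errText string) ProbeState {
	switch {
	case strings.Contains(errText, "record not found"):
		// Our own handler ran and did not find the key in the DB.
		return HandlerReachable
	case strings.Contains(errText, "service not online"):
		// frontier could not find a service for the RPC.
		return HandlerGone
	default:
		return Unknown
	}
}

func main() {
	for _, e := range []string{"record not found", "service not online", "i/o timeout"} {
		fmt.Printf("%-20s -> state %d\n", e, classifyProbe(e))
	}
}
```

The point is only that the two error texts come from different layers, so the probe distinguishes "routed to our handler" from "no route on the server" without any server-side logging.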

### Suspected mechanism

`client/end_retry.go` `reinit()` flow (paraphrased):

```go
re.retry.Lock()
defer re.retry.Unlock()
if cur != old { return nil }
old.Close()
time.Sleep(3 * time.Second)
new, err := re.getEnd()
if err != nil { return err }
atomic.StorePointer(&re.end, unsafe.Pointer(new))   // ← state is "new connection installed"
time.Sleep(1 * time.Second)
// re-register memorized RPCs
for method, rpc := range re.rpcs {
    err = re.register(context.TODO(), method, rpc, false)
    if err != nil { return err }
}
```

`re.register` ultimately reaches `application/rpc.go` `Register`:

```go
sm.writeInCh <- pkt
select {
case event := <-sync.C():
    if event.Error != nil { return event.Error }
case <-ctx.Done():
    ...
}
```

The `sync.C()` channel is signalled when a packet with the matching ID is received back. As far as we can tell, this is the **protocol-level ack** for the REGISTER packet — i.e. "the underlying connection has accepted the packet and the peer has acknowledged it". It does **not** wait for the server-side `RemoteRegistration` delegate to actually run and update the server's RPC routing repo.

If the server-side `RemoteRegistration` callback fails to fire (or fires but `repo.CreateServiceRPC` fails) for any reason, the client's `Register` still returns `nil`. There is no second-channel verification.
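To make the suspected failure mode concrete, here is a toy model of "ack first, delegate second". It is not geminio or frontier code; `toyServer`, `handleRegister`, `call`, and the `dropDelegate` flag are all invented for illustration, and the model assumes our reading of the ack semantics is right.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotOnline = errors.New("service not online")

// toyServer models a peer that acknowledges REGISTER on receipt and updates
// its routing table in a separate delegate step. NOT geminio/frontier code.
type toyServer struct {
	routes       map[string]bool
	dropDelegate bool // simulate the delegate never firing (or its repo write failing)
}

// handleRegister returns what the client sees as the REGISTER result.
// The ack does not depend on whether the delegate step actually ran.
func (s *toyServer) handleRegister(method string) error {
	if !s.dropDelegate {
		s.routes[method] = true // delegate step: update routing table
	}
	return nil // protocol-level ack: always success in this model
}

// call models a later lookup of the method in the routing table.
func (s *toyServer) call(method string) error {
	if !s.routes[method] {
		return errNotOnline
	}
	return nil
}

func main() {
	s := &toyServer{routes: map[string]bool{}, dropDelegate: true}
	fmt.Println("register:", s.handleRegister("auth")) // client believes it is registered
	fmt.Println("call:    ", s.call("auth"))           // server has no route
}
```

In this model the client has no way to tell the two cases apart from the return value of the register call alone, which matches what we observed.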

We don't have a deterministic reproducer yet, but the production timeline (single disconnect → reconnect → `succeed` log → 2h 29m of `service not online` rejections) is consistent with a one-time loss on the registration path. Possible triggers we suspect:

1. The 3 s + 1 s sleep windows in `reinit` may overlap with peer-side mux/dialogue setup races, and a REGISTER packet sent right after the sleeps lands on a connection where the server has not yet wired up the delegate path.
2. Server-side `RemoteRegistration` is on `klog.V(2)` only, and the call is spawned via the mux layer; an early connection close on the server side could discard pending packets after the protocol ack but before the delegate runs.

### Suggested fixes (any one would close the silent failure)

1. **Server-side ack ordering**: have the server only send the REGISTER ack *after* the delegate's `RemoteRegistration` callback has returned (and the repo write has succeeded). This is the most invasive but the most correct — it makes `End.Register` truly end-to-end.

2. **Client-side end-to-end self-test**: add an optional `verifyRegister` mode where, after `End.Register` returns success, the retry layer triggers a no-op self-call against the just-registered method (e.g. via the local stream layer using a sentinel argument that the server special-cases). On `ErrServiceNotOnline`, force a reconnect.

3. **Client-side periodic re-register**: lowest risk, just expose a knob on `RetryEndOptions` to re-send all memorized registers every N seconds. We're already doing this in our own application code as a workaround, but it logically belongs in the retry layer because the retry layer is the one that owns the "connection just came back up" event.

4. **Expose `EndReOnline` more reliably / document it**: the existing `delegate.EndReOnline` callback fires at the right moment for application-level recovery, but it's not surfaced via `service.Service` in `singchia/frontier`'s `service_end.go`, so applications can't easily hook into it without dropping down to `geminio.End`. A first-class `OnReOnline(func())` option on the retry-end would make it routine for callers to add their own re-verification.
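A rough sketch of what option 2 could look like, to show the control flow rather than propose an API. Everything here is hypothetical: the `hooks` struct, `registerVerified`, and the `simulate` helper do not exist in geminio, and the probe is assumed to return `nil` whenever the call reaches the handler.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errServiceNotOnline = errors.New("service not online")

// hooks are hypothetical callbacks the retry layer would supply.
type hooks struct {
	register  func(ctx context.Context, method string) error // sends REGISTER, waits for ack
	probe     func(ctx context.Context, method string) error // no-op self-call through the server
	reconnect func(ctx context.Context) error                // tears down and re-dials
}

// registerVerified registers method, then probes it end to end. If the probe
// reports the service as not online, it reconnects and tries again.
func registerVerified(ctx context.Context, h hooks, method string, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := h.register(ctx, method); err != nil {
			return fmt.Errorf("register %q: %w", method, err)
		}
		err := h.probe(ctx, method)
		if err == nil {
			return nil // the server routed the call: registration is real
		}
		if !errors.Is(err, errServiceNotOnline) {
			return fmt.Errorf("probe %q: %w", method, err)
		}
		if err := h.reconnect(ctx); err != nil {
			return fmt.Errorf("reconnect: %w", err)
		}
	}
	return fmt.Errorf("register %q: not routed after %d attempts: %w",
		method, maxAttempts, errServiceNotOnline)
}

// simulate models a server that silently loses the first `lost` registrations.
func simulate(lost int) (registers, reconnects int, err error) {
	routed := false
	h := hooks{
		register: func(context.Context, string) error {
			registers++
			routed = registers > lost
			return nil // the ack always looks fine
		},
		probe: func(context.Context, string) error {
			if !routed {
				return errServiceNotOnline
			}
			return nil
		},
		reconnect: func(context.Context) error { reconnects++; return nil },
	}
	err = registerVerified(context.Background(), h, "auth", 3)
	return registers, reconnects, err
}

func main() {
	fmt.Println(simulate(1))
}
```

The design question this leaves open is what the probe should be. A sentinel argument special-cased by the server is one possibility; any call that is guaranteed to traverse the server's routing lookup would do.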

### Workaround we deployed

Application-level periodic re-register every 30 s, plus an explicit re-register on every retry "succeed". Brings the worst-case bad-state window from 2.5 hours to ≤30 seconds. Postmortem with full timeline is at https://github.com/liaisonio/liaison-cloud/blob/main/spec/incidents/INC-20260501-frontier-rpc-stale-after-reconnect.md (private).
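The shape of the workaround is roughly the loop below. It is a simplified sketch, not our actual code: `reRegisterLoop`, the `reconnected` channel, and the `registerAll` callback are hypothetical names, and it assumes that re-registering an already-registered method is harmless. In our application `registerAll` wraps the service's register calls and the interval is 30 s; the demo uses a short interval so that it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// reRegisterLoop calls registerAll once per interval, and immediately whenever
// a value arrives on reconnected. It returns when ctx is cancelled.
func reRegisterLoop(ctx context.Context, interval time.Duration,
	reconnected <-chan struct{}, registerAll func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C: // periodic refresh bounds the bad-state window
		case <-reconnected: // explicit refresh right after a retry "succeed"
		}
		if err := registerAll(ctx); err != nil {
			fmt.Println("re-register failed:", err)
		}
	}
}

// demo runs the loop briefly with one reconnect event and returns how many
// times registerAll was invoked.
func demo() int64 {
	ctx, cancel := context.WithCancel(context.Background())
	reconnected := make(chan struct{}, 1)
	var calls atomic.Int64
	done := make(chan struct{})
	go func() {
		defer close(done)
		reRegisterLoop(ctx, 20*time.Millisecond, reconnected,
			func(context.Context) error { calls.Add(1); return nil })
	}()
	reconnected <- struct{}{} // simulate the retry layer coming back online
	time.Sleep(100 * time.Millisecond)
	cancel()
	<-done
	return calls.Load()
}

func main() {
	fmt.Println("registerAll calls:", demo())
}
```

With this loop the worst-case stale window is bounded by the ticker interval, independent of whether the reconnect event itself was observed.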

### What I can do next

Happy to send a PR for option 3 (the periodic re-register knob) or 4 (the `OnReOnline` hook) if you'd accept either of those — both are local changes and don't touch the wire protocol. Option 1 needs your call as it's a protocol change.

