Skip to content

Handle reconnects while waiting for responses#203

Closed
jlouis wants to merge 1 commit into
inaka:masterfrom
tjek:jl/fix/reconnecting-while-receiving
Closed

Handle reconnects while waiting for responses#203
jlouis wants to merge 1 commit into
inaka:masterfrom
tjek:jl/fix/reconnecting-while-receiving

Conversation

@jlouis

@jlouis jlouis commented Sep 7, 2017

Copy link
Copy Markdown

This is a bit of a contended PR, but I'll describe it anyway

Symptom

  • Closing the lid of your development machine, bike home and try to use apns again, you get an error like the following: exception exit: timeout in function wpool_pool:call_available_worker/3
  • The error also happens if you run on AWS and you have a NAT-box between you and the internet.

Analysis

While apns_connection is waiting for an apns_response message, chatterbox realizes the underlying HTTP/2 is severed. So it tells the apns_connection process. This process tells the rest of the world that we are currently reconnecting, but the receive clause never realizes this.

It is not wrong to time out. But the current state of the connection is actually known to the underlying system. It just doesn't tell us right away and we have to go around the worker pool and timeout/reconnect each worker before we are back in operation. This incurs latency which isn't necessary.

Current fix

Detect reconnects, and resend the message into the mailbox. It is not a perfect fix and requires some maintenance in the future, see the following section:

Discussion

We probably need a better solution than this over time. The key is that pushing notifications over HTTP/2 ought to be asynchronous. We want a system which can push many messages over the same HTTP/2 connection, pipeline them aggressively and wait for responses. This suggests the system should be more async than it currently is. The other right thing to do here is to lift the apns_response message up so it sits in the same event loop as {reconnecting, _} and {connection_up, _}. That way, you don't have to handle it specially.

We may get some ideas from Loic's gun project here. When he awaits a response, he also handles the gun_down situation (and adds a monitor on top to boot, which is good style anyway). Pushes are always async and he reifies synchronous messages by going straight into the await loop, much like apns is currently doing. But it might make sense to streamline the event types apns_down, apns_up and so on.

We also need to understand the fabric between chatterbox and apns. We currently run a worker pool and keep N connections to APNS. But inside a single HTTP/2 stream, we can run many connections as well. My experience from RabbitMQ is that I can easily push and receive 50k RPC messages over rabbit per second on a single connection. So perhaps having a worker pool ends up being overkill: keep a single HTTP/2 connection and fill it up with streams?

If a reconnect happens while we are waiting for a response, we will
time out, even though we know exactly what is wrong in that case.

Detect the reconnecting message, redeliver it to the mailbox and
respond back with an immediate message.
@ferigis

ferigis commented Sep 7, 2017

Copy link
Copy Markdown
Member

This is a great topic!! thanks for helping us with Apns4erl. I am going to take a look carefully to your discussion. We started this apns4erl v2 using gun but we couldn't deploy to hex because gun is not there. I like the way gun handle the connections too. Let me take a look and we could find a fix together! :)

@jlouis

jlouis commented Sep 7, 2017

Copy link
Copy Markdown
Author

We don't necessarily need to use gun, just take its way of operating into account. Especially its gun:await/A calls, which are a nice way to handle the async/sync behavior of callers.

I've deployed the above for now to our infrastructure, and I can get back a reconnecting message whenever a worker in our pool is reconnecting. I just wait a short while and try again, eventually tricking every worker in the pool to reconnect. I.e., it works, but I'm not convinced it is the right solution at all.

@ferigis

ferigis commented Dec 5, 2018

Copy link
Copy Markdown
Member

I think we can close this since we moved from chatterbox to gun

@ferigis ferigis closed this Dec 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants