Add support for streaming API #36
Comments
I believe there is a misunderstanding somewhere here. Let me see if I understand the problem statement correctly: is the problem that a client using the gateway will only receive bytes from the upstream LLM API once the LLM API has fully sent an HTTP response? If that is the problem statement, then that is not the case. As long as Envoy has prior knowledge that the LLM API is capable of handling HTTP/2, Envoy will stream response bytes back to the client, provided none of the installed filters holds response bytes until completion, which as of right now the gateway filter does not. On the other side, the gateway is currently not streaming request bytes up to the LLM API, given that AFAIK the local routing heuristics need the full request body before making a routing decision. Is that not the case?
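For what it's worth, the pass-through behavior is easy to observe from the client side with a raw HTTP client that iterates over the response body as it arrives. This is only a sketch; the gateway address and request shape below are assumptions for illustration, not the project's actual defaults.

# Sketch: check whether response bytes arrive incrementally through the gateway.
# The listener address and payload are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:10000/v1/chat/completions",  # assumed gateway listener
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say this is a test"}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the response body
)
for chunk in resp.iter_content(chunk_size=None):
    # If streaming works end to end, chunks print as they are generated,
    # not all at once after the completion finishes.
    print(chunk.decode("utf-8", errors="replace"), end="", flush=True)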
Yes. However, it's not an HTTP/2 or HTTP/1 issue; it's how OpenAI chose to implement streaming. LLMs generate text one token at a time (taking into account all the tokens generated so far). Without streaming, OpenAI (and all other SDKs and models) waits for all of the text to be generated before sending the response back to the client. With streaming enabled, the LLM sends a response chunk for every single token. For example, take a look at this code (from this link):

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
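For context on what those chunks look like on the wire: with stream=True the OpenAI API delivers server-sent events, one data: line of JSON per chunk, terminated by a data: [DONE] sentinel. The sketch below parses that stream by hand (the auth handling is illustrative); this is roughly the work the gateway would take on if it parsed chunks itself rather than passing the bytes through.

# Sketch: hand-parsing the SSE chunk stream emitted when stream=True.
import json
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say this is a test"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip keep-alives and blank lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # sentinel marking the end of the stream
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)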
Alright, this makes more sense now. In that case my next question to you is: should the gateway handle the parsing of the streaming chunks, or should the gateway stream the HTTP response bytes to the Katanemo client and let the client parse the streamed tokens using a well-maintained library?
To start a streaming response the developer has to configure the client with stream=True.
Yes, the developer will be aware of the kind of response they are receiving, since they would've configured the client with stream=True.
I think the simplest implementation would be to stream the response from the LLM back to the developer. The client would understand and parse the response.
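As a sketch of that pass-through option: if the gateway forwards the chunk bytes unchanged, the developer can keep using the OpenAI SDK to do the parsing and simply point it at the gateway. The base_url below is an assumed gateway address, not documented project configuration.

from openai import OpenAI

# Point the SDK at the gateway instead of api.openai.com; the gateway proxies
# the request upstream and streams the chunk bytes back unchanged.
client = OpenAI(base_url="http://localhost:10000/v1")  # assumed gateway address

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    # The SDK parses each SSE chunk into a ChatCompletionChunk object.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")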
Needs verification
By default, for all requests made to LLMs (whether API-based or open-source), the entire completion response is generated before the response is sent to the client. This creates a bad user experience.
Enter the streaming API. With streaming, the client receives updates as the model generates output tokens. This improves the user experience greatly but adds significant load on the network.