
Add support for streaming API #36

Open
adilhafeez opened this issue Aug 21, 2024 · 5 comments

Comments

adilhafeez (Contributor) commented Aug 21, 2024

By default, for requests made to LLMs (whether API-based or open-source), the entire completion response is generated before anything is sent back to the client. This creates a poor user experience.

Enter the streaming API. With streaming, the client receives incremental updates as the model generates output tokens. This greatly improves the user experience, but it adds significant load on the network.
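For context, here is a minimal sketch of the default, non-streaming call with the OpenAI Python SDK (the model name and prompt are just illustrative); it returns only after the whole completion has been generated. The streaming counterpart appears later in this thread.

from openai import OpenAI

client = OpenAI()

# Without stream=True, this call blocks until the entire completion
# has been generated and returns it in a single response.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.choices[0].message.content)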

junr03 (Collaborator) commented Aug 27, 2024

I believe there is a misunderstanding somewhere here.

Let me see if I understand the problem statement correctly: is the problem that a client using the gateway will only receive bytes from the upstream LLM API once the LLM API has fully sent an HTTP response?

If that is the problem statement, then that is not the case. As long as Envoy has prior knowledge that the LLM API is capable of handling HTTP/2, Envoy will stream response bytes back to the client. That is, as long as none of the installed filters holds response bytes until completion -- which, as of right now, the gateway filter does not.

On the other side: the request API is currently not streaming bytes up to the LLM API, given that AFAIK the local routing heuristics need the full request body before making a routing decision. Is that not the case?

adilhafeez (Contributor, Author) commented:

is the problem that a client using the gateway will only receive bytes from the upstream LLM API once the LLM API has fully sent an HTTP response

Yes

However, it's not an HTTP/2 vs. HTTP/1 issue; it's how OpenAI chose to implement streaming. LLMs generate text one token at a time (taking into account all of the tokens generated so far). Without streaming, OpenAI (and other SDKs and model servers) waits for all of the text to be generated before sending the response back to the client.

With streaming enabled, the LLM sends a response chunk for every single token.

For example, take a look at this code (from this link):

from openai import OpenAI

client = OpenAI()

# stream=True asks the API to send back partial deltas as the model
# generates tokens, instead of one complete response at the end.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    # content is None on chunks that only carry a role change or
    # signal the end of the stream.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
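Under the hood, the stream=True response is delivered as server-sent events: a series of "data: {...}" lines, one JSON chunk per event, terminated by "data: [DONE]". A rough sketch of reading that wire format directly, without the SDK (assumes the requests library and an OPENAI_API_KEY environment variable):

import json
import os

import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say this is a test"}],
        "stream": True,
    },
    stream=True,  # keep the connection open and read bytes as they arrive
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # sentinel that ends the stream
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="")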

junr03 (Collaborator) commented Aug 30, 2024

Alright, this makes more sense now. It looks like:
a) The request has to authorize the server to stream.
b) The client has to be able to parse server events in order to correctly stream response tokens.

In that case, my next question to you is: should the gateway handle the parsing of the streaming chunks, or should it stream HTTP response bytes to the Katanemo client and let the client parse the streamed tokens using a well-maintained library?

adilhafeez (Contributor, Author) commented Sep 3, 2024

a) The request has to authorize the server to stream.

To start streaming responses, the developer has to configure the client with stream=true. When streaming is set, Envoy will expect the LLM to send a response chunk for every token it generates, which will be forwarded back to the developer. This continues until the LLM has generated all of its tokens or has hit a rate limit.
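For illustration, the end-of-stream condition is visible in the chunks themselves: finish_reason is None on intermediate chunks and is set on the final one (e.g. "stop" when generation completes, or "length" when a token limit is hit). A small sketch building on the earlier example:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    choice = chunk.choices[0]
    if choice.delta.content is not None:
        print(choice.delta.content, end="")
    # The last chunk carries the reason the stream ended.
    if choice.finish_reason is not None:
        print(f"\n[stream ended: {choice.finish_reason}]")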

b) The client has to be able to parse server events in order to correctly stream response tokens.

Yes, the developer will be aware of the kind of response they are receiving, since they would have configured the client with stream=true.

In that case, my next question to you is: should the gateway handle the parsing of the streaming chunks, or should it stream HTTP response bytes to the Katanemo client and let the client parse the streamed tokens using a well-maintained library?

I think the simplest implementation would be to stream the response from the LLM back to the developer unchanged. The client would understand and parse the response.
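Purely to illustrate the pass-through idea (this is not how the Envoy-based gateway is implemented), here is a minimal Flask relay that forwards the client's request to an upstream LLM endpoint and streams the response bytes back untouched. The upstream URL, route, and header handling are assumptions made for the sketch:

import os

import requests
from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)

# Assumed upstream for the sketch; a real gateway would pick the target
# based on its routing decision.
UPSTREAM = "https://api.openai.com/v1/chat/completions"

@app.post("/v1/chat/completions")
def relay():
    upstream = requests.post(
        UPSTREAM,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=request.get_json(),  # full request body is available before forwarding
        stream=True,
    )
    # Relay response bytes as they arrive; no parsing of the SSE chunks --
    # the client (OpenAI SDK or similar) does that on its end.
    return Response(
        stream_with_context(upstream.iter_content(chunk_size=None)),
        status=upstream.status_code,
        content_type=upstream.headers.get("Content-Type", "text/event-stream"),
    )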

adilhafeez (Contributor, Author) commented:

Needs verification
