Add support for streaming API #36
Comments
I believe there is a misunderstanding somewhere here. Let me see if I understand the problem statement correctly: is the problem that a client using the gateway will only receive bytes from the upstream LLM API once the LLM API has fully sent an HTTP response? If that is the problem statement, then that is not the case. As long as Envoy has prior knowledge that the LLM API is capable of handling HTTP/2, Envoy will stream response bytes back to the client, provided none of the installed filters holds response bytes until completion, which as of right now the gateway filter does not. On the other side, the gateway is currently not streaming request bytes up to the LLM API, given that AFAIK the local routing heuristics need the full request body before making a routing decision. Is that not the case?
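For what it's worth, the pass-through behavior is easy to observe from the client side with a raw HTTP client that iterates over the response body as it arrives. This is only a sketch; the gateway address and request shape below are assumptions for illustration, not the project's actual defaults.

# Sketch: check whether response bytes arrive incrementally through the gateway.
# The listener address and payload are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:10000/v1/chat/completions",  # assumed gateway listener
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say this is a test"}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the response body
)
for chunk in resp.iter_content(chunk_size=None):
    # If streaming works end to end, chunks print as they are generated,
    # not all at once after the completion finishes.
    print(chunk.decode("utf-8", errors="replace"), end="", flush=True)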
Yes. However, it's not an HTTP/2 or HTTP/1 issue; it's how OpenAI chose to implement streaming. LLMs generate text one token at a time (taking into account all the tokens generated so far). Without streaming, OpenAI (and all other SDKs and models) waits for all of the text to be generated before sending the response back to the client. With streaming enabled, the LLM sends a response chunk for every single token. For example, take a look at this code (from this link):

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
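For context on what those chunks look like on the wire: with stream=True the OpenAI API delivers server-sent events, one data: line of JSON per chunk, terminated by a data: [DONE] sentinel. The sketch below parses that stream by hand (the auth handling is illustrative); this is roughly the work the gateway would take on if it parsed chunks itself rather than passing the bytes through.

# Sketch: hand-parsing the SSE chunk stream emitted when stream=True.
import json
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say this is a test"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip keep-alives and blank lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # sentinel marking the end of the stream
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)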
Alright, this makes more sense now. In that case my next question to you is: should the gateway handle the parsing of the streaming chunks, or should the gateway stream the HTTP response bytes to the Katanemo client and let the client parse the streamed tokens using a well-maintained library?
To start a streaming response the developer has to configure the client with stream=True.
Yes, the developer will be aware of the kind of response they are receiving, since they would've configured the client with stream=True.
I think the simplest implementation would be to stream the response from the LLM back to the developer. The client would understand and parse the response.
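As a sketch of that pass-through option: if the gateway forwards the chunk bytes unchanged, the developer can keep using the OpenAI SDK to do the parsing and simply point it at the gateway. The base_url below is an assumed gateway address, not documented project configuration.

from openai import OpenAI

# Point the SDK at the gateway instead of api.openai.com; the gateway proxies
# the request upstream and streams the chunk bytes back unchanged.
client = OpenAI(base_url="http://localhost:10000/v1")  # assumed gateway address

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    # The SDK parses each SSE chunk into a ChatCompletionChunk object.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")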
Needs verification
By default, for all requests made to LLMs (whether API-based or open-source), the entire completion response is generated before the response is sent to the client. This creates a bad user experience.
Enter the streaming API. With streaming, the client receives updates as the model generates output tokens. This improves the user experience greatly but adds significant load on the network.