Parse input_audio message content #47
Merged
Audio arrives in a chat message the same way images do: as a typed content part. OpenAI carries it inline as base64 under an input_audio part with a format field. Add the InputAudio type to ContentPart and AudioRefs on Message to decode the clips out, alongside HasAudio for a cheap check. Text flattening still ignores non-text parts, so audio rides along on the message for a transcription stage to read rather than being dropped or leaking into the prompt text. A payload that does not decode is reported as an error.
What
Audio reaches a chat message the same way images do: as a typed content part.
The OpenAI format carries it inline as base64 under an input_audio part with a
format field, rather than by URL.

InputAudio is added to ContentPart. Message.AudioRefs decodes each clip's
base64 into bytes paired with its format. Message.HasAudio is the cheap check
for whether a message carries audio.

Text flattening continues to ignore non-text parts, so audio rides along on the
message for a transcription stage to read instead of being dropped or leaking
into the prompt text.
Scope
This is the api parsing layer, mirroring the earlier image content change. Pure
Go and fully unit tested: base64 decoding with the format preserved, rejection
of a malformed payload, skipping empty or missing audio, the HasAudio check, and
that TextContent still ignores audio parts. Decoding the audio container and
running speech-to-text land later; this makes the audio bytes available instead
of discarding them.
Test
go test ./... green. go vet ./... clean.