A peer-to-peer protocol for voice assistants (basically JSONL + PCM audio)
{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
<data_length bytes (optional)>
<payload_length bytes (optional)>
Used in Rhasspy and Home Assistant for communication with voice services.
- Voice satellites
  - Satellite for Home Assistant
- Audio input/output
- Wake word detection
- Speech-to-text
- Text-to-speech
- Intent handling
- A JSON object header as a single line ending with `\n` (UTF-8, required)
  - `type` - event type (string, required)
  - `data` - event data (object, optional)
  - `data_length` - bytes of additional data (int, optional)
  - `payload_length` - bytes of binary payload (int, optional)
- Additional data (UTF-8, optional)
  - JSON object with additional event-specific data
  - Merged on top of header `data`
  - Exactly `data_length` bytes long
  - Immediately follows the header's `\n`
- Payload
  - Typically PCM audio, but can be any binary data
  - Exactly `payload_length` bytes long
  - Immediately follows the additional data, or the header's `\n` if there is no additional data
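As a concrete illustration, this framing can be read and written with only the Python standard library. The following is a minimal sketch, not an official API; the helper names `write_event` and `read_event` are made up here:

```python
import json
from typing import BinaryIO, Optional, Tuple


def write_event(
    stream: BinaryIO,
    event_type: str,
    data: Optional[dict] = None,
    payload: bytes = b"",
) -> None:
    """Write one event: JSON header line, optional additional data, optional payload."""
    extra = json.dumps(data or {}).encode("utf-8")
    header = {
        "type": event_type,
        "data_length": len(extra),
        "payload_length": len(payload),
    }
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    stream.write(extra)    # exactly data_length bytes
    stream.write(payload)  # exactly payload_length bytes
    stream.flush()


def read_event(stream: BinaryIO) -> Tuple[dict, bytes]:
    """Read one event; returns (header with merged data, payload bytes)."""
    header = json.loads(stream.readline())
    if header.get("data_length"):
        # Additional data is merged on top of the header's "data" object
        extra = json.loads(stream.read(header["data_length"]))
        header.setdefault("data", {}).update(extra)
    payload_length = header.get("payload_length") or 0
    payload = stream.read(payload_length) if payload_length else b""
    return header, payload
```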
Available events with `type` and fields.
Send raw audio and indicate begin/end of audio streams.
- `audio-chunk` - chunk of raw PCM audio
  - `rate` - sample rate in hertz (int, required)
  - `width` - sample width in bytes (int, required)
  - `channels` - number of channels (int, required)
  - `timestamp` - timestamp of audio chunk in milliseconds (int, optional)
  - Payload is raw PCM audio samples
- `audio-start` - start of an audio stream
  - `rate` - sample rate in hertz (int, required)
  - `width` - sample width in bytes (int, required)
  - `channels` - number of channels (int, required)
  - `timestamp` - timestamp in milliseconds (int, optional)
- `audio-stop` - end of an audio stream
  - `timestamp` - timestamp in milliseconds (int, optional)
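For example, a one-second buffer of 16 kHz, 16-bit mono PCM could be framed into `audio-chunk` events like this. This sketch encodes the raw protocol bytes directly; the helper name and the 1024-sample chunk size are arbitrary choices, not part of the protocol:

```python
import json


def audio_chunk_event(pcm: bytes, rate: int = 16000, width: int = 2, channels: int = 1) -> bytes:
    """Encode one audio-chunk event as raw protocol bytes (illustrative helper)."""
    data = json.dumps({"rate": rate, "width": width, "channels": channels}).encode("utf-8")
    header = json.dumps(
        {"type": "audio-chunk", "data_length": len(data), "payload_length": len(pcm)}
    ).encode("utf-8")
    return header + b"\n" + data + pcm


# One second of 16 kHz, 16-bit mono silence, split into 1024-sample chunks
# (1024 samples * 2 bytes per sample = 2048 bytes per payload):
silence = bytes(16000 * 2)
frames = [audio_chunk_event(silence[i : i + 2048]) for i in range(0, len(silence), 2048)]
```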
Describe available services.
- `describe` - request for available voice services
- `info` - response describing available voice services
  - `asr` - list of speech recognition services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `tts` - list of text-to-speech services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `speakers` - list of speakers (optional)
        - `name` - unique name of speaker (required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `wake` - list of wake word detection services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `handle` - list of intent handling services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `intent` - list of intent recognition services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `satellite` - information about the voice satellite (optional)
    - `area` - name of area where the satellite is located (string, optional)
    - `has_vad` - true if the end of voice commands will be detected locally (bool, optional)
    - `active_wake_words` - list of wake words that are actively being listened for (list of string, optional)
    - `max_active_wake_words` - maximum number of local wake words that can run simultaneously (int, optional)
    - `supports_trigger` - true if the satellite supports remotely-triggered pipelines (bool, optional)
  - `mic` - list of audio input services (optional)
    - `mic_format` - audio input format (required)
      - `rate` - sample rate in hertz (int, required)
      - `width` - sample width in bytes (int, required)
      - `channels` - number of channels (int, required)
  - `snd` - list of audio output services (optional)
    - `snd_format` - audio output format (required)
      - `rate` - sample rate in hertz (int, required)
      - `width` - sample width in bytes (int, required)
      - `channels` - number of channels (int, required)
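For a sense of the shape, an `info` response from a speech-to-text service might carry data like the following. The model name, attribution, and values are made up for illustration:

```python
# Hypothetical data object for an info event from an ASR service.
info_data = {
    "asr": [
        {
            "models": [
                {
                    "name": "example-en-small",    # made-up model name
                    "languages": ["en"],
                    "attribution": {
                        "name": "Example Author",  # made-up attribution
                        "url": "https://example.com",
                    },
                    "installed": True,
                    "description": "Small English model",
                    "version": "1.0.0",
                }
            ]
        }
    ]
}
```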
Transcribe audio into text.
- `transcribe` - request to transcribe an audio stream
  - `name` - name of model to use (string, optional)
  - `language` - language of spoken audio (string, optional)
  - `context` - context from previous interactions (object, optional)
- `transcript` - response with transcription
  - `text` - text transcription of spoken audio (string, required)
  - `context` - context for next interaction (object, optional)
Synthesize audio from text.
- `synthesize` - request to generate audio from text
  - `text` - text to speak (string, required)
  - `voice` - use a specific voice (optional)
    - `name` - name of voice (string, optional)
    - `language` - language of voice (string, optional)
    - `speaker` - speaker of voice (string, optional)
Detect wake words in an audio stream.
- `detect` - request detection of specific wake word(s)
  - `names` - wake word names to detect (list of string, optional)
- `detection` - response when detection occurs
  - `name` - name of wake word that was detected (string, optional)
  - `timestamp` - timestamp of audio chunk in milliseconds when detection occurred (int, optional)
- `not-detected` - response when audio stream ends without a detection
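A detect/detection exchange might carry data like this (the wake word name and timestamp are made up):

```python
# Client asks the service to listen for a specific wake word:
detect = {"type": "detect", "data": {"names": ["ok_example"]}}  # made-up name

# Service responds once the wake word is heard in the audio stream:
detection = {"type": "detection", "data": {"name": "ok_example", "timestamp": 1520}}

# Sent instead if the audio stream ends without a detection:
not_detected = {"type": "not-detected"}
```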
Detects speech and silence in an audio stream.
- `voice-started` - user has started speaking
  - `timestamp` - timestamp of audio chunk when speaking started, in milliseconds (int, optional)
- `voice-stopped` - user has stopped speaking
  - `timestamp` - timestamp of audio chunk when speaking stopped, in milliseconds (int, optional)
Recognizes intents from text.
- `recognize` - request to recognize an intent from text
  - `text` - text to recognize (string, required)
  - `context` - context from previous interactions (object, optional)
- `intent` - response with recognized intent
  - `name` - name of intent (string, required)
  - `entities` - list of entities (optional)
    - `name` - name of entity (string, required)
    - `value` - value of entity (any, optional)
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
- `not-recognized` - response indicating no intent was recognized
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
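For example, recognizing the text "turn on the kitchen light" might produce an `intent` event like this (the intent and entity names are made up):

```python
recognize = {"type": "recognize", "data": {"text": "turn on the kitchen light"}}

# Hypothetical successful response:
intent = {
    "type": "intent",
    "data": {
        "name": "TurnOnLight",  # made-up intent name
        "entities": [{"name": "area", "value": "kitchen"}],
        "text": "Turned on the kitchen light.",
    },
}
```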
Handle structured intents or text directly.
- `handled` - response when intent was successfully handled
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
- `not-handled` - response when intent was not handled
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
Play audio stream.
- `played` - response when audio finishes playing
Control of one or more remote voice satellites connected to a central server.
- `run-satellite` - informs satellite that the server is ready to run pipelines
- `pause-satellite` - informs satellite that the server is no longer ready to run pipelines
- `satellite-connected` - satellite has connected to the server
- `satellite-disconnected` - satellite has been disconnected from the server
- `streaming-started` - satellite has started streaming audio to the server
- `streaming-stopped` - satellite has stopped streaming audio to the server
Pipelines are run on the server, but can be triggered remotely from the server as well.
- `run-pipeline` - runs a pipeline on the server or asks the satellite to run it when possible
  - `start_stage` - pipeline stage to start at (string, required)
  - `end_stage` - pipeline stage to end at (string, required)
  - `wake_word_name` - name of detected wake word that started this pipeline (string, optional)
    - From client only
  - `wake_word_names` - names of wake words to listen for (list of string, optional)
    - From server only; `start_stage` must be "wake"
  - `announce_text` - text to speak on the satellite
    - From server only; `start_stage` must be "tts"
  - `restart_on_end` - true if the server should re-run the pipeline after it ends (bool, default is false)
    - Only used for always-on streaming satellites
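For instance, a server might trigger a full pipeline on a satellite that advertises `supports_trigger` with an event like this (the wake word name is made up):

```python
# Server-to-satellite trigger: run the whole pipeline, starting with wake
# word detection and ending with text-to-speech (illustrative values).
run_pipeline = {
    "type": "run-pipeline",
    "data": {
        "start_stage": "wake",
        "end_stage": "tts",
        "wake_word_names": ["ok_example"],  # made-up wake word
        "restart_on_end": True,             # always-on streaming satellite
    },
}
```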
Set and manage timers.

- `timer-started` - a new timer has started
  - `id` - unique id of timer (string, required)
  - `total_seconds` - number of seconds the timer should run for (int, required)
  - `name` - user-provided name for timer (string, optional)
  - `start_hours` - hours the timer should run for, as spoken by user (int, optional)
  - `start_minutes` - minutes the timer should run for, as spoken by user (int, optional)
  - `start_seconds` - seconds the timer should run for, as spoken by user (int, optional)
  - `command` - optional command that the server will execute when the timer finishes
    - `text` - text of command to execute (string, required)
    - `language` - language of the command (string, optional)
- `timer-updated` - timer has been paused/resumed or time has been added/removed
  - `id` - unique id of timer (string, required)
  - `is_active` - true if timer is running, false if paused (bool, required)
  - `total_seconds` - number of seconds the timer should now run for (int, required)
- `timer-cancelled` - timer was cancelled
  - `id` - unique id of timer (string, required)
- `timer-finished` - timer finished without being cancelled
  - `id` - unique id of timer (string, required)
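As an example, "set a timer for five minutes" might arrive as the following event (the id and command are made up):

```python
timer_started = {
    "type": "timer-started",
    "data": {
        "id": "timer-abc123",   # made-up unique id
        "total_seconds": 300,
        "start_minutes": 5,     # "five minutes" as spoken by the user
        "command": {
            "text": "announce timer finished",  # made-up server command
            "language": "en",
        },
    },
}
```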
- → is an event from client to server
- ← is an event from server to client
Service description:

- → `describe` (required)
- ← `info` (required)
Speech-to-text:

- → `transcribe` event with `name` of model to use or `language` (optional)
- → `audio-start` (required)
- → `audio-chunk` (required)
  - Send audio chunks until silence is detected
- → `audio-stop` (required)
- ← `transcript`
  - Contains text transcription of spoken audio
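A minimal client sketch of this flow over a TCP socket. The host, port, input file, and helper names are assumptions for illustration, not an official API:

```python
import json
import socket
import wave


def send(f, event_type, data=None, payload=b""):
    extra = json.dumps(data or {}).encode("utf-8")
    header = {"type": event_type, "data_length": len(extra), "payload_length": len(payload)}
    f.write(json.dumps(header).encode("utf-8") + b"\n" + extra + payload)
    f.flush()


def recv(f):
    header = json.loads(f.readline())
    if header.get("data_length"):
        header.setdefault("data", {}).update(json.loads(f.read(header["data_length"])))
    if header.get("payload_length"):
        f.read(header["payload_length"])  # drain payload (unused here)
    return header


with socket.create_connection(("localhost", 10300)) as sock:  # assumed host/port
    f = sock.makefile("rwb")
    with wave.open("command.wav", "rb") as wav:               # assumed input file
        fmt = {"rate": wav.getframerate(), "width": wav.getsampwidth(),
               "channels": wav.getnchannels()}
        send(f, "transcribe", {"language": "en"})
        send(f, "audio-start", fmt)
        while chunk := wav.readframes(1024):
            send(f, "audio-chunk", fmt, payload=chunk)
        send(f, "audio-stop")
    while (event := recv(f))["type"] != "transcript":
        pass  # skip any other events until the transcript arrives
    print(event["data"]["text"])
```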
Text-to-speech:

- → `synthesize` event with `text` (required)
- ← `audio-start`
- ← `audio-chunk`
  - One or more audio chunks
- ← `audio-stop`
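In the other direction, a sketch that requests synthesis and saves the returned stream to a WAV file (again with assumed host, port, and output file, and illustrative helpers):

```python
import json
import socket
import wave


def send(f, event_type, data=None):
    extra = json.dumps(data or {}).encode("utf-8")
    header = {"type": event_type, "data_length": len(extra), "payload_length": 0}
    f.write(json.dumps(header).encode("utf-8") + b"\n" + extra)
    f.flush()


def recv(f):
    header = json.loads(f.readline())
    if header.get("data_length"):
        header.setdefault("data", {}).update(json.loads(f.read(header["data_length"])))
    payload = f.read(header["payload_length"]) if header.get("payload_length") else b""
    return header, payload


with socket.create_connection(("localhost", 10200)) as sock:  # assumed host/port
    f = sock.makefile("rwb")
    send(f, "synthesize", {"text": "Hello from Wyoming."})
    event, _ = recv(f)
    assert event["type"] == "audio-start", event["type"]
    fmt = event["data"]  # rate/width/channels of the stream that follows
    with wave.open("output.wav", "wb") as wav:                # assumed output file
        wav.setnchannels(fmt["channels"])
        wav.setsampwidth(fmt["width"])
        wav.setframerate(fmt["rate"])
        while True:
            event, payload = recv(f)
            if event["type"] == "audio-stop":
                break
            if event["type"] == "audio-chunk":
                wav.writeframes(payload)
```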
Wake word detection:

- → `detect` event with `names` of wake words to detect (optional)
- → `audio-start` (required)
- → `audio-chunk` (required)
  - Keep sending audio chunks until a `detection` is received
- ← `detection`
  - Sent for each wake word detection
- → `audio-stop` (optional)
  - Manually end audio stream
- ← `not-detected`
  - Sent after `audio-stop` if no detections occurred
Voice activity detection:

- → `audio-chunk` (required)
  - Send audio chunks until silence is detected
- ← `voice-started`
  - When speech starts
- ← `voice-stopped`
  - When speech stops
Intent recognition:

- → `recognize` (required)
- ← `intent` if successful
- ← `not-recognized` if not successful
Intent handling, for structured intents:

- → `intent` (required)
- ← `handled` if successful
- ← `not-handled` if not successful

For text only:

- → `transcript` with `text` to handle (required)
- ← `handled` if successful
- ← `not-handled` if not successful
Audio output:

- → `audio-start` (required)
- → `audio-chunk` (required)
  - One or more audio chunks
- → `audio-stop` (required)
- ← `played`
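This mirrors the sending half of the speech-to-text sketch above, but ends by waiting for `played`. A self-contained version (host, port, and audio file are assumptions; the helpers are illustrative):

```python
import json
import socket
import wave


def send(f, event_type, data=None, payload=b""):
    extra = json.dumps(data or {}).encode("utf-8")
    header = {"type": event_type, "data_length": len(extra), "payload_length": len(payload)}
    f.write(json.dumps(header).encode("utf-8") + b"\n" + extra + payload)
    f.flush()


def recv(f):
    header = json.loads(f.readline())
    if header.get("data_length"):
        f.read(header["data_length"])     # additional data (unused here)
    if header.get("payload_length"):
        f.read(header["payload_length"])  # payload (unused here)
    return header


with socket.create_connection(("localhost", 10601)) as sock:  # assumed host/port
    f = sock.makefile("rwb")
    with wave.open("chime.wav", "rb") as wav:                 # assumed audio file
        fmt = {"rate": wav.getframerate(), "width": wav.getsampwidth(),
               "channels": wav.getnchannels()}
        send(f, "audio-start", fmt)
        while chunk := wav.readframes(1024):
            send(f, "audio-chunk", fmt, payload=chunk)
        send(f, "audio-stop")
    while recv(f)["type"] != "played":
        pass  # wait for the service to confirm playback finished
```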