Smolagents: support ImageContent and AudioContent #61
Hey @HSGamer, thanks a lot for the contribution. It's a very interesting feature that we wanted to add. I will have a look as soon as possible!
# Conflicts: # uv.lock
self.skip_forward_signature_validation = True

- def forward(self, *args, **kwargs) -> str:
+ def forward(self, *args, **kwargs):
Maybe we could annotate the return type here as image, audio, or text.
Since Pillow and torchaudio are optional, I'm not sure it won't throw an error when the packages are not installed.
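One common way to address this concern is to import the optional packages only under `typing.TYPE_CHECKING` and refer to them through string forward references. The sketch below is an illustration of that pattern, not the code merged in this PR; the `ToolResult` alias, the `is_available` helper, and the placeholder `forward` body are hypothetical names introduced here.

```python
import importlib.util
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Seen only by static type checkers, so a missing optional
    # package cannot raise ImportError at runtime.
    from PIL.Image import Image
    from torch import Tensor

# Hypothetical alias for the three result kinds discussed above.
# The quoted names are forward references and are not resolved at import time.
ToolResult = Union[str, "Image", "Tensor"]


def is_available(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


def forward(*args, **kwargs) -> ToolResult:
    # Placeholder body for illustration; a real adapter would call the tool.
    return "text result"
```

With this layout the module imports cleanly whether or not Pillow or torch are installed, and `is_available("PIL")` can guard the branches that actually need them.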
grll left a comment:
Overall I think it's a very good change. I think it's great that it uses PIL and torchaudio, as Smolagents does. Sorry for taking so long to review. Could you please add tests and better documentation about the extra that needs to be installed?
@HSGamer we are almost there. Lint and tests are failing though. Could we also simplify the tests so they do not create the audio or the image every time, but instead create it once and commit it as a file in
@grll I think I fixed the failing tests, at least |
Thanks for the changes! For some reason the tests still fail in CI. I will have a look tomorrow.
* Smolagents: support ImageContent and AudioContent
* update uv lock
* add audio to test
* change the command in audio
* add a note about audio package in smolagents docs
* add test_image
* add test_audio
* add sample files
* use pytest-datadir to get the sample files
* assert to make sure the right image size
* Any result-type
* add an audio backend via soundfile package to make tests work
* fix missing pytest fixture for the datadir
* fix mypy warning no stubs
* fix format
* improve typing of the forward function
* fix type definition

---------

Co-authored-by: Guillaume Raille <guillaume.raille@gmail.com>
Smolagents has a method that dynamically handles the types of tool results (https://github.com/huggingface/smolagents/blob/fc73322658a2c261cf59d817660c6c88d510431b/src/smolagents/agent_types.py#L262-L280). It supports Text as str, Image as PIL.Image, and Audio as Tensor. This PR modifies the content to match the supported types.
This is useful for ToolCallingAgent since it supports all types of content. It's not useful for CodeAgent at the moment since it only supports Text, but this can be a preparation for when CodeAgent is upgraded to support all types.
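The mapping described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the adapter code from this PR: the `Content` dataclass is a hypothetical stand-in for an MCP content item, `to_smolagents_value` is a made-up name, and the sketch assumes that image and audio payloads arrive base64-encoded and that the installed torchaudio backend can read from a file-like object.

```python
import base64
from dataclasses import dataclass
from io import BytesIO


@dataclass
class Content:
    """Hypothetical stand-in for an MCP content item (text, image, or audio)."""
    type: str
    text: str = ""
    data: str = ""  # assumed to be a base64-encoded payload for image/audio


def to_smolagents_value(content: Content):
    """Map one content item to str, PIL.Image, or Tensor."""
    if content.type == "text":
        return content.text

    raw = BytesIO(base64.b64decode(content.data))

    if content.type == "image":
        from PIL import Image  # optional dependency, imported lazily

        return Image.open(raw)

    if content.type == "audio":
        import torchaudio  # optional dependency, imported lazily

        # Assumes the active audio backend accepts file-like objects.
        waveform, _sample_rate = torchaudio.load(raw)
        return waveform

    raise ValueError(f"Unsupported content type: {content.type}")
```

Importing Pillow and torchaudio inside the branches keeps text-only tools working when the optional extras are not installed; only image or audio results require them.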