Multimodal `ToolCallSummaryMessage` & `FunctionExecutionResult` Support #6381

fang-d · 2025-04-24T08:18:08Z

Currently, the return value of a function call seems to support only the str type. For VLMs (vision-language models), however, multimodal function calling is important: for example, it would let the model pick images from the file system on its own and analyze them. The content of ToolCallSummaryMessage and FunctionExecutionResult should therefore be str | autogen_core.Image | list[str | autogen_core.Image].
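
To make the proposed content type concrete, here is a minimal sketch. It does not use the library's classes: `Image` and `FunctionExecutionResult` below are simplified stand-ins defined only for this example, and `ToolContent` and `normalize_content` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Image:
    """Stand-in for an image type (assumption: holds raw bytes only)."""

    data: bytes
    mime_type: str = "image/png"


# The proposed content type: a string, an image, or a mixed list of both.
ToolContent = Union[str, Image, List[Union[str, Image]]]


@dataclass
class FunctionExecutionResult:
    """Stand-in result whose content is multimodal instead of str-only."""

    content: ToolContent
    call_id: str
    name: str
    is_error: bool = False


def normalize_content(content: ToolContent) -> List[Union[str, Image]]:
    """Return the content as a flat list so consumers handle one shape."""
    if isinstance(content, (str, Image)):
        return [content]
    return list(content)
```

A consumer such as a model client could then iterate over `normalize_content(result.content)` and map each string or image to the matching message part, without special-casing the three shapes of the union.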

ekzhu · 2025-04-25T22:56:41Z

You may want to take a look at the new Workbench feature that is coming in the next release: https://microsoft.github.io/autogen/dev/user-guide/core-user-guide/components/workbench.html. A tool called through a workbench can return an image.

1 reply

fang-d Apr 27, 2025
Author

Hi @ekzhu, thanks for your reply. This feature is awesome!

However, the default workbench, StaticWorkbench, still seems to assume that every tool returns a string (or an object that can be converted to one).

autogen/python/packages/autogen-core/src/autogen_core/tools/_static_workbench.py

Lines 55 to 64 in 63c791d

    
    try:
        result_future = asyncio.ensure_future(tool.run_json(arguments, cancellation_token))
        cancellation_token.link_future(result_future)
        result = await result_future
        is_error = False
    except Exception as e:
        result = str(e)
        is_error = True
    result_str = tool.return_value_as_string(result)
    return ToolResult(name=tool.name, result=[TextResultContent(content=result_str)], is_error=is_error)
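
For comparison, a workbench could map each return value to a typed content item instead of converting everything to a string. The following is only a sketch of that idea, not the StaticWorkbench implementation: `Image`, `TextContent`, `ImageContent`, and `to_result_items` are hypothetical stand-ins defined here.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass(frozen=True)
class Image:
    """Stand-in image type (assumption: raw bytes only)."""

    data: bytes


@dataclass(frozen=True)
class TextContent:
    """Stand-in for a text item in a tool result."""

    content: str


@dataclass(frozen=True)
class ImageContent:
    """Stand-in for an image item in a tool result."""

    content: Image


ResultItem = Union[TextContent, ImageContent]


def to_result_items(value: Any) -> List[ResultItem]:
    """Keep images as image items; fall back to str() for everything else."""
    if isinstance(value, Image):
        return [ImageContent(value)]
    if isinstance(value, (list, tuple)):
        items: List[ResultItem] = []
        for part in value:
            # Recurse so mixed lists of text and images keep their order.
            items.extend(to_result_items(part))
        return items
    return [TextContent(str(value))]
```

The string fallback keeps existing text-only tools working unchanged, while an image return value is no longer stringified.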

In addition, AssistantAgent flattens every tool result to text by calling result.to_text():

autogen/python/packages/autogen-agentchat/src/autogen_agentchat/agents/_assistant_agent.py

Lines 1300 to 1314 in 63c791d

    
    # Handle normal tool call using workbench.
    result = await workbench.call_tool(
        name=tool_call.name,
        arguments=arguments,
        cancellation_token=cancellation_token,
    )
    return (
        tool_call,
        FunctionExecutionResult(
            content=result.to_text(),
            call_id=tool_call.id,
            is_error=result.is_error,
            name=tool_call.name,
        ),
    )

As a user, I think it would be better if autogen_core.Image objects were supported by default in the next version.

fang-d · 2025-05-06T04:30:11Z

It seems that current OpenAI models do not support images in function-call results, so I will close this discussion.

0 replies

aFewThings · 2025-11-10T14:14:09Z

I'm leaving this comment because others may be struggling with this as I was. For an AutoGen agent to properly process the multimodal output of an MCP tool implemented with FastMCP, you need to use McpWorkbench. McpWorkbench handles multimodal output properly and does not assume that the tool's output is text.

1 reply

aFewThings Nov 10, 2025

However, with AssistantAgent the image object is converted to base64-encoded text when the tool call result is wrapped in FunctionExecutionResult, so the LLM never receives it as multimodal input, as @fang-d noted above.
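
The effect described here can be illustrated with a small sketch. It only mimics the behavior reported in this thread and is not the library's actual to_text() code: `Image` and `flatten_to_text` are hypothetical stand-ins.

```python
import base64
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Image:
    """Stand-in image type (assumption: raw bytes only)."""

    data: bytes


def flatten_to_text(items: List[Union[str, Image]]) -> str:
    """Join items into one string, base64-encoding any image bytes."""
    parts = []
    for item in items:
        if isinstance(item, Image):
            parts.append(base64.b64encode(item.data).decode("ascii"))
        else:
            parts.append(item)
    return "\n".join(parts)


flat = flatten_to_text(["chart", Image(b"hi")])
# flat is a plain str: the model receives the encoded bytes as ordinary
# text tokens rather than as an image part it can look at.
```

Once the result is a single string, nothing downstream can tell which part was an image, so the multimodal information has to be preserved before this step.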
