Fix parallel final answers #1482


Merged
merged 23 commits into main on Jun 27, 2025

Conversation


@aymeric-roucher aymeric-roucher commented Jun 24, 2025

Fixing #1481

Along the way:

  • improved tool call management to handle call ids and show observations in call_id order
  • added streaming of ToolCall objects
  • added systematic checking of input types, raising AgentToolCallError on mismatch

else:
final_answer = answers

# TODO: Make sure this is not a problem: https://github.com/huggingface/smolagents/pull/1255#pullrequestreview-2822036066
Collaborator Author

@albertvillanova WDYT? In #1255 (review), you had fixed an issue with return types by skipping the execution of the final answer tool if the output was not a dict or string.
But certain custom FinalAnswerTool implementations might very well expect an image or audio to perform post-processing on, so I think it's better to keep running the tool. Tests now pass; please make sure none of the usages you were thinking of has been broken by my changes.

Member

Yes, reviewing my previous modification, I realize I may have gone a bit too fast: the actual issue was indeed with the error message! Your fix using str instead of json.loads makes sense and addresses it properly.

Thanks for catching this and ensuring compatibility with broader FinalAnswerTool implementations. I've re-checked the relevant usages, and nothing appears to be broken by your changes.

@@ -1470,12 +1458,12 @@ def execute_tool_call(self, tool_name: str, arguments: dict[str, str] | str) ->
# Handle execution errors
if is_managed_agent:
error_msg = (
f"Error executing request to team member '{tool_name}' with arguments {json.dumps(arguments)}: {e}\n"
f"Error executing request to team member '{tool_name}' with arguments {str(arguments)}: {e}\n"
Collaborator Author

Changing to str so that no error is raised when the arguments are not serializable, e.g. PIL images or audio.

Member

That was the issue I found. Good solution.

Maybe we could improve the __str__ methods for AgentImage and AgentAudio, so we avoid having their memory address in the error message. Currently, we would get:

Error executing tool 'final_answer' with arguments {'answer': <smolagents.agent_types.AgentImage image mode= size=0x0 at 0x7EFCDD78E190>}: FileNotFoundError: [Errno 2] No such file or directory: 'path.png'
Please try again or use another tool
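The improvement suggested above could look like the following minimal sketch: override __str__ so error messages show a stable, readable label instead of the default repr with a memory address. The class here is a simplified stand-in mirroring smolagents.agent_types.AgentImage, not the actual implementation.

```python
# Hypothetical sketch: a readable __str__ for agent media types, so tool-call
# error messages don't embed the object's memory address.

class AgentImage:
    """Minimal stand-in for smolagents.agent_types.AgentImage (illustrative only)."""

    def __init__(self, value):
        self._value = value  # a path, PIL image, raw bytes, etc.

    def __str__(self):
        # Default object repr would include "at 0x7EFC..."; return a stable
        # description of the wrapped value instead.
        return f"AgentImage(source={type(self._value).__name__})"


img = AgentImage(value="path.png")
error_msg = f"Error executing tool 'final_answer' with arguments {{'answer': {img}}}"
```

With this, the error message above would read `{'answer': AgentImage(source=str)}` rather than exposing an address that changes between runs.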

answer = agent.run("Fake task.")
assert answer == "1CUSTOM2"
assert answer == ["1 and 2", "3 and 4"]
Collaborator Author

This tests that two final answer tool calls can run at once.

tool_arguments = tool_call.function.arguments
model_outputs.append(str(f"Called Tool: '{tool_name}' with arguments: {tool_arguments}"))
tool_calls.append(ToolCall(name=tool_name, arguments=tool_arguments, id=tool_call.id))
# Track final_answer separately, add others to parallel processing list
Collaborator Author

This is not needed anymore: in short, now that we have a ToolOutput object recording whether the tool output is a final answer, we can just yield ToolOutput objects and then process possible final answers in step_stream, which makes more sense since this is a step-level concern.
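The pattern described here can be sketched roughly as follows: each executed tool yields a ToolOutput tagged with whether it is a final answer, and the step-level consumer decides when to stop. The names ToolOutput and step_stream follow the discussion, but the bodies are simplified stand-ins, not the smolagents implementation.

```python
# Hedged sketch: uniform tool execution yielding ToolOutput objects, with
# final-answer handling done at the step level rather than per-tool.
from dataclasses import dataclass
from typing import Any, Iterator

@dataclass
class ToolOutput:
    id: str
    output: Any
    is_final_answer: bool

def execute_tools(calls) -> Iterator[ToolOutput]:
    # No special-casing of final_answer here: every tool is run the same way.
    for call_id, (name, func, args) in calls.items():
        yield ToolOutput(id=call_id, output=func(**args),
                         is_final_answer=(name == "final_answer"))

def step_stream(calls) -> Iterator[ToolOutput]:
    # The step-level loop inspects the flag and stops on a final answer.
    for tool_output in execute_tools(calls):
        yield tool_output
        if tool_output.is_final_answer:
            return

# Hypothetical calls: id -> (tool name, callable, arguments)
calls = {
    "call_1": ("web_search", lambda query: f"results for {query}", {"query": "x"}),
    "call_2": ("final_answer", lambda answer: answer, {"answer": "42"}),
}
outputs = list(step_stream(calls))
```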

yield ToolOutput(output=None, is_final_answer=False)

# Process final_answer call if present
if final_answer_call:
Collaborator Author

No need to split the logic: we process all tools the same way.

)

# Use stream_to_gradio to capture the output
outputs = list(
stream_to_gradio(
agent,
task="Test task",
additional_args=dict(image=AgentImage(value="path.png")),
additional_args=dict(image=PIL.Image.new("RGB", (100, 100))),
Collaborator Author

Since we now always run the final_answer tool, this test was trying to create an AgentImage from an image whose path is fake, which raised an error: instead, I create a new PIL image (which matches the real use case).
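The changed fixture can be reproduced in isolation: an in-memory PIL image needs no backing file, which is why it avoids the FileNotFoundError that a fake path triggers when the image is materialized. This assumes Pillow is installed; AgentImage itself is not exercised here.

```python
# Sketch of the new test fixture: a valid in-memory image with no path on disk.
import PIL.Image

image = PIL.Image.new("RGB", (100, 100))  # fully in-memory, no file access needed
```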

@aymeric-roucher aymeric-roucher marked this pull request as ready for review June 24, 2025 18:36
Member

@albertvillanova albertvillanova left a comment

Thanks! Good fix and refactoring!

Comment on lines 1 to 4
from smolagents import GradioUI, InferenceClientModel, ToolCallingAgent, WebSearchTool


agent = CodeAgent(
agent = ToolCallingAgent(
Member

Why this change?

Collaborator Author

This is just for testing, I will revert it before merging!

title="Output message of the LLM:",
level=LogLevel.DEBUG,
)

# Record model output
memory_step.model_output_message = chat_message
memory_step.model_output = chat_message.content
memory_step.model_output = str(chat_message.content)
Member

Note that chat_message.content might be None, so memory_step.model_output would be "None" in that case. Is this expected?
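One possible guard for the None case raised here: only stringify the content when it is present, so model_output stays None instead of the string "None". The chat_message and memory_step objects below are hypothetical stand-ins for the smolagents types.

```python
# Sketch: avoid turning a missing model output into the literal string "None".
from types import SimpleNamespace

chat_message = SimpleNamespace(content=None)     # stand-in for the chat message
memory_step = SimpleNamespace(model_output=None) # stand-in for the memory step

memory_step.model_output = (
    str(chat_message.content) if chat_message.content is not None else None
)
```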

Comment on lines 1421 to 1423
error_msg = check_tool_arguments(tool, arguments)
if error_msg:
raise AgentToolCallError(error_msg, self.logger)
Member

@albertvillanova albertvillanova Jun 25, 2025

What about raising directly the error from check_tool_arguments? And maybe renaming it to validate_tool_arguments?

Collaborator Author

The error needs to be given the argument self.logger; this is why I preferred to keep the error creation within the agent rather than passing around the logger object.
Good point, renaming it to validate_tool_arguments!

Member

Well, I think you'll agree that your proposal creates a somewhat awkward flow: having a function that either returns an error message or None, and relying on the caller to interpret and raise it, doesn't feel clean or consistent.

I understand the technical constraint you mentioned around needing access to self.logger. In my opinion, this points to a deeper architectural concern: the error generation is too tightly coupled to the agent's internal self.logger.

Perhaps this could be addressed in a future PR to improve separation of concerns: we should be able to decouple this logic.

Collaborator Author

Agreed! If you have ideas for decoupling, it would indeed be great to implement them!
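One way the decoupling discussed above might look: validate_tool_arguments raises a plain ValueError with no logger dependency, and the agent's call site catches it and re-raises its logger-aware AgentToolCallError. Only the validate_tool_arguments name comes from the thread; the schema format, AgentToolCallError body, and call site are hypothetical stand-ins.

```python
# Sketch: a logger-free validator, with logger attachment done by the agent.

def validate_tool_arguments(expected: dict, arguments: dict) -> None:
    """Raise ValueError if arguments don't match a simple {name: type} schema."""
    for name, expected_type in expected.items():
        if name not in arguments:
            raise ValueError(f"Missing required argument: '{name}'")
        if not isinstance(arguments[name], expected_type):
            raise ValueError(
                f"Argument '{name}' should be {expected_type.__name__}, "
                f"got {type(arguments[name]).__name__}"
            )

class AgentToolCallError(Exception):
    """Stand-in for the agent-side error that carries the agent's logger."""
    def __init__(self, message, logger=None):
        super().__init__(message)
        self.logger = logger  # the logger stays out of the validator entirely

# Agent-side call site: translate the plain error into the logger-aware one.
try:
    validate_tool_arguments({"query": str}, {"query": 123})
    err = None
except ValueError as e:
    err = AgentToolCallError(str(e), logger="agent-logger")
```

This keeps the validator reusable and testable on its own, while the agent remains the only place that knows about its logger.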

final_output = list(tool_outputs.values())[0].output
else:
# ToolCallingAgent can return several final answers in parallel
final_output = [tool_outputs[k].output for k in sorted(tool_outputs.keys())]
Member

Just a small observation here: by sorting tool_outputs.keys(), we're effectively ordering the final outputs by tool_call.id (which is what output.id mirrors). This might not always reflect the original order in which the tool calls were made. If the logical or intended sequence depends on the call order rather than the ID's lexicographic order, would it make sense to preserve that instead?

Member

@albertvillanova albertvillanova Jun 25, 2025

I checked it experimentally: call order may be different from call_id's lexicographic order:

>>> chat_message.tool_calls
[
    ChatMessageToolCall(
        function=ChatMessageToolCallFunction(
            arguments={'query': 'current age of Pope Leo XIV', 'filter_year': 2023}, name='web_search', description=None), 
        id='call_z6H1L4WqXbdfKvibbjZBAcIp', type='function'
    ), 
    ChatMessageToolCall(
        function=ChatMessageToolCallFunction(
            arguments={'query': 'current age of Andrew Garfield', 'filter_year': 2023}, name='web_search', description=None), 
        id='call_NhAIjB8HfSf8RgL8kSuIznrg', type='function'
    )
]
>>> list(sorted(tool_outputs.keys()))
[
    'call_NhAIjB8HfSf8RgL8kSuIznrg',  # Andrew Garfield
    'call_z6H1L4WqXbdfKvibbjZBAcIp'   # Pope Leo XIV
]

Member

@albertvillanova albertvillanova left a comment

The call order should be preserved when returning call results.
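The reviewer's point can be sketched as follows: order results by each id's position in chat_message.tool_calls rather than lexicographically, so that executor completion order and id formatting never leak into the final output. The objects below are simplified stand-ins for the smolagents types, reusing the ids from the experiment above.

```python
# Sketch: preserve the original request order of tool calls when assembling
# outputs, instead of sorting ids lexicographically.
from types import SimpleNamespace

# Order in which the model requested the calls (as in chat_message.tool_calls):
tool_calls = [
    SimpleNamespace(id="call_z6H1L4WqXbdfKvibbjZBAcIp"),  # first: Pope Leo XIV
    SimpleNamespace(id="call_NhAIjB8HfSf8RgL8kSuIznrg"),  # second: Andrew Garfield
]
# Outputs may be collected in any order from the ThreadPoolExecutor:
tool_outputs = {
    "call_NhAIjB8HfSf8RgL8kSuIznrg": "Andrew Garfield result",
    "call_z6H1L4WqXbdfKvibbjZBAcIp": "Pope Leo XIV result",
}

# Map each id to its request-order index, then sort outputs by that index.
order = {call.id: i for i, call in enumerate(tool_calls)}
final_output = [tool_outputs[k] for k in sorted(tool_outputs, key=order.__getitem__)]
```

This stays deterministic for tests (the key is the model's own call order) without introducing any extra call_order attribute.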

outputs[tool_output.id] = tool_output
yield tool_output

memory_step.tool_calls = [parallel_calls[k] for k in sorted(parallel_calls.keys())]
Member

The same here: you are sorting by lexicographical order, not by call order.

Member

We need a regression test for this.

Member

> How would you make sure that tests are consistent, if not ordering by an attribute of the tool call? Assign a specific call_order attribute? If so, is that really better than sorting by call_id?

Just to clarify, I wasn't suggesting introducing or using a custom call-order attribute, but preserving the original call order as represented in chat_message.tool_calls.

@@ -1738,7 +1703,7 @@ def to_dict(self) -> dict[str, Any]:
return agent_dict

@classmethod
def from_dict(cls, agent_dict: dict[str, Any], **kwargs) -> "CodeAgent":
def from_dict(cls, agent_dict: dict[str, Any], **kwargs) -> "MultiStepAgent":
Member

I think this change should be reverted.

Collaborator Author

Renaming the return type to MultiStepAgent is less clear about which type of agent this is, but it fixes a linting error; this is why I did it.

Collaborator Author

No strong opinion on this though, I can revert this if you think it's better

Member

While renaming the return type to MultiStepAgent does silence the linting error, it actually masks the underlying issue rather than resolving it:

  • The root cause of the error is actually in the type hint of the from_dict method in the MultiStepAgent abstract base class
  • By updating the subclass to return MultiStepAgent, we're introducing a second issue: CodeAgent.from_dict actually returns a CodeAgent instance, not a MultiStepAgent one (indeed, MultiStepAgent is abstract, so it can't be instantiated).

I'd suggest reverting the change so that CodeAgent.from_dict continues to return a CodeAgent instance. We can then address the root cause (the type hint in MultiStepAgent.from_dict) in a follow-up PR.

Collaborator Author

Agree, reverting the change!

@aymeric-roucher
Collaborator Author

@albertvillanova all tool calls are performed in the same ThreadPoolExecutor, so their order relative to chat_message.tool_calls could be reversed when performing the calls: I had the issue that tests wouldn't pass because of this. How would you make sure that tests are consistent, if not by ordering by an attribute of the tool call? Assign a specific call_order attribute? If so, is that really better than sorting by call_id?

@aymeric-roucher
Collaborator Author

aymeric-roucher commented Jun 25, 2025

I've reverted a change that let the agent return several final answers and aggregate them into a list.
This is because, upon further thought, I found that returning several final_answer calls in parallel did not really make sense: if the framework just accepts them all and aggregates them into a list, then the framework, rather than the agent, ends up deciding the final answer.
So I now just raise an error when there are several parallel calls to final_answer. The agent can then take one more step to aggregate its parallel final answers into a single consistent one.
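The revised behavior described here can be sketched as a small check: if more than one final_answer output arrives in a single step, raise instead of silently aggregating, so the agent itself must produce one consistent answer. ToolOutput, the helper name, and the ValueError are simplified stand-ins, not the exact smolagents code.

```python
# Sketch: reject multiple parallel final_answer calls in one step.
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolOutput:
    id: str
    output: Any
    is_final_answer: bool

def resolve_final_answer(tool_outputs: list[ToolOutput]) -> Any:
    """Return the single final answer, or None; raise on several in parallel."""
    final = [t for t in tool_outputs if t.is_final_answer]
    if len(final) > 1:
        raise ValueError(
            "Multiple final_answer calls in one step: the agent, not the "
            "framework, should decide on a single final answer."
        )
    return final[0].output if final else None

outputs = [
    ToolOutput(id="a", output="1 and 2", is_final_answer=True),
    ToolOutput(id="b", output="3 and 4", is_final_answer=True),
]
try:
    resolve_final_answer(outputs)
    raised = False
except ValueError:
    raised = True
```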

Member

@albertvillanova albertvillanova left a comment

Thanks for the fix and all the good refactoring!

The issue with the call order not being preserved has not been addressed (#1482 (review)), but this could be done in a subsequent PR.

Feel free to merge as it is.

@aymeric-roucher
Collaborator Author

@albertvillanova about call order: I've made an answer here: #1482 (comment). But indeed we can keep handling this in a follow-up PR!

@albertvillanova
Member

albertvillanova commented Jun 27, 2025

@aymeric-roucher I replied to your answer above here, in the dedicated thread: #1482 (comment)

@aymeric-roucher aymeric-roucher merged commit d6f9708 into main Jun 27, 2025
4 checks passed

Successfully merging this pull request may close these issues.

[BUG] Final answer is too greedy in ToolCallingAgent parallel calls