Processing image/ multi modal responses in function tool results? #787


Open
chiehmin-wei opened this issue May 29, 2025 · 2 comments
Labels
question Question about using the SDK

Comments

@chiehmin-wei

I have seen related discussion: #341
and a related PR: #654

But it seems like function tools don't support returning images as outputs yet.

I wonder what the best workaround for this would be, or whether supporting images in tool outputs would make sense for my use case?

For context, I'm building a PagerDuty alert root cause analysis agent with access to tools like this:

agent = Agent(
    name="SRE agent",  # placeholder name
    instructions="You are an expert SRE agent. Help me diagnose the root cause.",
    tools=[search_logs_on_elasticsearch, check_panel_on_grafana],
)

For the check_panel_on_grafana tool, since the time series data could be huge, I was thinking I'd first plot the data as an image and then feed the image into the LLM along with some descriptions (start time, end time, panel name, etc.).

I was thinking of just returning both the image and the text directly in the function output, but it seems that's not supported yet.

Is my best workaround something like this? Call the LLM directly inside the tool and return the results?

@function_tool
def check_panel_on_grafana():
    data = get_data_from_grafana()
    graph = plot_graph(data)
    description = "cool description"

    prompt = "describe the image as thoroughly as possible"
    # Pass the rendered graph plus its description to a direct model call
    result = call_chatgpt_directly(prompt, messages=[{"image": graph, "text": description}])

    return result

But I guess call_chatgpt_directly won't have context for all the previous actions taken by the agent so far, and since the tool only returns text, future actions won't get to see the actual image either.
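For reference, if you do go the direct-call route, the image has to be embedded in the message payload as a base64 data URL. A minimal sketch of building a Chat Completions-style multimodal user message (the helper name is an assumption, not SDK API):

```python
import base64

def build_image_message(image_bytes: bytes, description: str) -> dict:
    """Build a Chat Completions-style user message carrying text + an image."""
    # Chat Completions accepts images as data URLs (or public URLs)
    # inside an image_url content part.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": description},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }
```

This only solves the one-shot call, though; it doesn't put the image back into the agent's conversation history.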

@chiehmin-wei chiehmin-wei added the question Question about using the SDK label May 29, 2025
@Sabahat-Shakeel

Sabahat-Shakeel commented May 30, 2025

You're correct that current function_tool support in the OpenAI Agents SDK does not yet allow returning images directly as structured tool outputs. To summarize the constraints:

- function_tools cannot return image-type outputs directly.
- A manual call_chatgpt_directly(...) loses the agent's history/context.
- There is no structured way to keep an image in the agent's memory for later tools.

One workaround: save the image and upload it to a public or internal URL. Plot the graph, upload it to a temporary cloud file store (e.g., S3, Cloudflare R2, or file.io for throwaway uploads), then return the URL and description in the tool output.

@function_tool
def check_panel_on_grafana(start_time: str, end_time: str):
    data = get_data_from_grafana(start_time, end_time)
    image_path = plot_graph_and_save(data)  # Save locally

    image_url = upload_to_temp_storage(image_path)  # Upload to cloud
    description = f"Panel from {start_time} to {end_time}"

    return {
        "image_url": image_url,
        "description": description
    }
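A stand-in for the upload_to_temp_storage helper referenced above (its name and signature are assumptions): in production this would be an S3/R2 upload returning a presigned URL, but this sketch just copies the file into a fresh temp directory and returns a file:// URL so the rest of the pipeline has something to fetch:

```python
import shutil
import tempfile
from pathlib import Path

def upload_to_temp_storage(image_path: str) -> str:
    # Stand-in for a real uploader (S3, Cloudflare R2, file.io, ...):
    # copy the image into a fresh temp directory and return a file:// URL.
    dest = Path(tempfile.mkdtemp()) / Path(image_path).name
    shutil.copy(image_path, dest)
    return dest.as_uri()
```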

Connect the tools to the agent:

agent = Agent(
    name="agent name...",
    instructions="your system prompt",
    tools=[
        check_panel_on_grafana,
        analyze_grafana_panel,
        search_logs_on_elasticsearch,
    ],
)
# Run with the Runner class, e.g. Runner.run_sync(agent, "your input")

@chiehmin-wei
Author

I don't think providing the URL would work; the LLM wouldn't be able to see the image.
